Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:18322
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
In-Reply-To: <20050823152306.94806.qmail@web31802.mail.mud.yahoo.com>
References: <20050823152306.94806.qmail@web31802.mail.mud.yahoo.com>
Mime-Version: 1.0 (Apple Message framework v622)
Content-Type: text/plain; charset=US-ASCII; format=flowed
Message-ID: <10f8974e9d68bae15cfb7dfbd27ed5ba@gravitonic.com>
Content-Transfer-Encoding: 7bit
Cc: Tex Texin <tex@yahoo-inc.com>, internals@lists.php.net
Date: Tue, 23 Aug 2005 13:58:04 -0700
To: Rolland Santimano <rollandsantimano@yahoo.com>
Subject: Re: [PHP-DEV] PHP Unicode strings impl proposal
From: andrei@gravitonic.com (Andrei Zmievski)

On Aug 23, 2005, at 8:23 AM, Rolland Santimano wrote:
> [1] string substr_replace(string original, string new, int start[, int
> length])
> Returns string where original[start..length] is replaced with
> new. Input args can be arrays, in which case the operation is:
> substr_replace(original[i], new[i], start[i], length[i])
> Impl:
> The current impl is written in terms of memcpy(), after adjusting
> start & length correctly. With Unicode input, 'start' & 'length' may
> not be aligned with codepoint/grapheme boundaries. If args are mixed
> string types, convert to common type.

What do you mean, they may not be aligned with codepoint boundaries? We
have to make sure that they are. To do this, we need to use the
U16_FWD_N() macro to step forward over the number of codepoints indicated
by 'start', and then from that point do the same for 'length'. Once
that's done you will have the boundaries in terms of UChar*'s.
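
A minimal sketch of that boundary calculation, written in plain C so it
builds without ICU. The names uchar16, cp_forward() and cp_range_to_units()
are made up for illustration; uchar16 stands in for UChar, and the loop
does by hand roughly what the ICU forward-iteration macros do. Negative
'start'/'length' values are not handled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ICU's UChar (one UTF-16 code unit); an assumption made so
 * the sketch compiles without ICU headers. */
typedef uint16_t uchar16;

/* Advance *i past up to n code points in s[0..len), never stopping between
 * the two halves of a surrogate pair. */
static void cp_forward(const uchar16 *s, size_t len, size_t *i, size_t n)
{
    while (n > 0 && *i < len) {
        uchar16 c = s[(*i)++];
        /* lead surrogate followed by a trail surrogate: consume both units */
        if (c >= 0xD800 && c <= 0xDBFF && *i < len &&
            s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
            (*i)++;
        }
        n--;
    }
}

/* Translate (start, length), both counted in code points, into the code
 * unit range [*from, *to) that a memcpy()-based splice can then use. */
static void cp_range_to_units(const uchar16 *s, size_t len,
                              size_t start, size_t length,
                              size_t *from, size_t *to)
{
    size_t i = 0;

    cp_forward(s, len, &i, start);   /* skip 'start' code points */
    *from = i;
    cp_forward(s, len, &i, length);  /* then 'length' more */
    *to = i;
}
```

For the five-unit string 'a', U+1D400 (two units), 'b', 'c', a start of 1
and a length of 2 give the unit range [1, 4), so the surrogate pair is
never cut in half.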

> [2] int substr_count(string text, string token[, int start[, int
> length]])
> Returns the number of occurrences of token in text[start..length]
> Impl:
> The current impl is around php_memnstr() and can be extended for
> Unicode with zend_u_memnstr()

Same thing with regard to start and length applies here.

> [3] string strtok([string text, ]string separator)
> Tokenize string
> Impl:
> Current impl uses global state, in the form of char ptrs and a
> 256-char array. Mixed string type input would be converted to common
> type, and new global state would have to include initial type of
> separator. Tokenizing should honor base+combining sequences.

I think we need to flesh out more details here. We can't possibly keep 
a strtok table the size of the entire Unicode set. ICU has its own 
u_strtok_r() function, but its limitation is that it does not support 
surrogate pairs (which we should). As for honoring base+combining 
sequences, why should strtok() be any more special than strstr()?
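
One possible direction, sketched under assumptions: drop the lookup table
entirely and test each decoded codepoint against the separator string with
a linear scan. That costs O(text * separators) but needs no per-codepoint
table and treats a surrogate pair as a single separator. All names below
(uchar16, cp_next(), is_sep(), cp_strtok()) are hypothetical, and the state
is passed explicitly in *pos rather than kept in globals.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ICU's UChar; an assumption so the sketch builds without ICU. */
typedef uint16_t uchar16;

/* Decode the code point starting at s[*i], advancing *i by 1 or 2 units. */
static uint32_t cp_next(const uchar16 *s, size_t len, size_t *i)
{
    uint32_t c = s[(*i)++];

    if (c >= 0xD800 && c <= 0xDBFF && *i < len &&
        s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[(*i)++] - 0xDC00);
    }
    return c;
}

/* Linear scan of the separator string, one code point at a time. */
static int is_sep(uint32_t c, const uchar16 *sep, size_t sep_len)
{
    size_t j = 0;

    while (j < sep_len) {
        if (cp_next(sep, sep_len, &j) == c) {
            return 1;
        }
    }
    return 0;
}

/* Find the next token at or after *pos. On success returns 1 and sets the
 * code unit range [*tok_from, *tok_to); returns 0 when no tokens remain.
 * *pos is the only state and must start at 0. */
static int cp_strtok(const uchar16 *s, size_t len,
                     const uchar16 *sep, size_t sep_len,
                     size_t *pos, size_t *tok_from, size_t *tok_to)
{
    size_t i = *pos, prev;

    /* skip leading separators */
    while (i < len) {
        prev = i;
        if (!is_sep(cp_next(s, len, &i), sep, sep_len)) {
            i = prev;
            break;
        }
    }
    if (i >= len) {
        *pos = len;
        return 0;
    }

    *tok_from = i;
    while (i < len) {
        prev = i;
        if (is_sep(cp_next(s, len, &i), sep, sep_len)) {
            *tok_to = prev;  /* token ends before the separator */
            *pos = i;        /* resume after the separator */
            return 1;
        }
    }
    *tok_to = len;
    *pos = len;
    return 1;
}
```

If the linear scan turns out to be too slow for long separator lists, a
sorted array or a small hash of codepoints could replace is_sep() without
changing the rest.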

> [5] string str_pad(string text, int length[, string pad[, int
> pad_type]])
> Returns input string padded on the left and/or right (determined by
> pad_type) to specified length with pad string.
> Impl:
> The impl builds the output string by copying appropriate pad
> characters to the left and/or right of the input string.
> Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2
> (lengths in UChars), but 'pad' text is non-BMP (ie. 2 UChars), then
> the 'pad' text can't be added at either end. More generally, the 'pad'
> text can't be split in the middle of non-BMP codepts or base+combining
> sequences. If such a condition occurs, an error should be returned. Any
> other thoughts ?

We should not split padding strings in the middle of surrogate pairs. 
As for combining sequences, I would defer to Tex and see what he has to 
say. The input length parameter should indicate the number of 
codepoints to pad to, not the number of UChars.
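
A rough sketch of what counting in codepoints looks like for the
right-padding case only. The names (uchar16, cp_units(), cp_count(),
cp_pad_right()) are invented for illustration and are not existing PHP or
ICU functions. The pad string is consumed one whole codepoint at a time, so
a surrogate pair is either copied completely or not at all. Combining
sequences are deliberately not considered.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ICU's UChar; an assumption so the sketch builds without ICU. */
typedef uint16_t uchar16;

/* Number of code units (1 or 2) in the code point starting at s[i]. */
static size_t cp_units(const uchar16 *s, size_t len, size_t i)
{
    return (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) ? 2 : 1;
}

/* Length of s in code points. */
static size_t cp_count(const uchar16 *s, size_t len)
{
    size_t i = 0, n = 0;

    while (i < len) {
        i += cp_units(s, len, i);
        n++;
    }
    return n;
}

/* Copy text into out and append whole code points from pad, cycling through
 * pad, until the result is 'target' code points long. Returns the number of
 * code units written, or (size_t)-1 if out_cap is too small. */
static size_t cp_pad_right(const uchar16 *text, size_t text_len,
                           const uchar16 *pad, size_t pad_len,
                           size_t target, uchar16 *out, size_t out_cap)
{
    size_t have = cp_count(text, text_len);
    size_t o = 0, p = 0, k;

    if (text_len > out_cap) {
        return (size_t)-1;
    }
    for (k = 0; k < text_len; k++) {
        out[o++] = text[k];
    }
    while (pad_len > 0 && have < target) {
        size_t u = cp_units(pad, pad_len, p);

        if (o + u > out_cap) {
            return (size_t)-1;
        }
        for (k = 0; k < u; k++) {
            out[o++] = pad[p + k];  /* copy the whole code point */
        }
        p += u;
        if (p >= pad_len) {
            p = 0;                  /* wrap around to the start of pad */
        }
        have++;
    }
    return o;
}
```

Because the target is a codepoint count, Rolland's STR_PAD_BOTH case stops
being an error: "length + 2" means one pad codepoint on each side, whether
or not that codepoint takes two UChars.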

> [7] int levenshtein(string str1, string str2[, int ins_cost, int
> rep_cost, int del_cost])
> Calculate Levenshtein distance between str1 & str2.
>
> Q: Any gotchas in extending the Levenshtein algo for Unicode ? Should
> the ins/del/subst cost be expressed in graphemes or codepts ?

I think it can be fairly easily extended to Unicode strings, since the 
algorithm only cares about the insertion, deletion, or substitution of 
characters. We should once again work on the codepoint level.
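
To illustrate, here is a sketch that decodes both UTF-16 inputs to
codepoints first and then runs the usual single-row dynamic programming
loop on them. This is not the existing levenshtein() code; uchar16,
cp_decode() and cp_levenshtein() are hypothetical names, and the cost
arguments simply follow the order in the prototype quoted above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for ICU's UChar; an assumption so the sketch builds without ICU. */
typedef uint16_t uchar16;

/* Decode UTF-16 into code points. out must have room for len entries.
 * Returns the number of code points written. */
static size_t cp_decode(const uchar16 *s, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;

    while (i < len) {
        uint32_t c = s[i++];

        if (c >= 0xD800 && c <= 0xDBFF && i < len &&
            s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(s[i++] - 0xDC00);
        }
        out[n++] = c;
    }
    return n;
}

/* Cost of turning s1 into s2, comparing code points rather than code units.
 * Returns -1 if memory allocation fails. */
static long cp_levenshtein(const uchar16 *s1, size_t l1,
                           const uchar16 *s2, size_t l2,
                           long ins, long rep, long del)
{
    uint32_t *a = malloc((l1 + 1) * sizeof *a);
    uint32_t *b = malloc((l2 + 1) * sizeof *b);
    long *row = malloc((l2 + 1) * sizeof *row);
    long result = -1;
    size_t n1, n2, i, j;

    if (a && b && row) {
        n1 = cp_decode(s1, l1, a);
        n2 = cp_decode(s2, l2, b);

        for (j = 0; j <= n2; j++) {
            row[j] = (long)j * ins;          /* build s2 prefix from nothing */
        }
        for (i = 1; i <= n1; i++) {
            long diag = row[0];              /* d[i-1][j-1] */

            row[0] = (long)i * del;          /* delete i code points */
            for (j = 1; j <= n2; j++) {
                long up = row[j];            /* d[i-1][j] */
                long best = diag + (a[i - 1] == b[j - 1] ? 0 : rep);

                if (up + del < best) {
                    best = up + del;
                }
                if (row[j - 1] + ins < best) {
                    best = row[j - 1] + ins;
                }
                diag = up;
                row[j] = best;
            }
        }
        result = row[n2];
    }
    free(a);
    free(b);
    free(row);
    return result;
}
```

The difference from a UChar-level comparison shows up with non-BMP input:
replacing U+1D400 with 'b' is one substitution here, whereas comparing
code units would count a substitution plus a deletion.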

> The following functions generally work on ASCII input, and should be
> made Unicode-aware. However, should they be converted to process
> Unicode input ?
>
> [1] string addslashes(string text)
> [2] string stripslashes(string text)
> Escape single/double quotes & backslashes with backslashes

I don't see any problems with these two.

> [3] string addcslashes(string text, string charlist)
> [4] string stripcslashes(string text)
> Escape chars < 32 or > 126 with octal sequences, and escape characters
> from charlist with backslash.

Same here.

> [5] string strip_tags(string text[, string allowed_tags])
> Strip HTML/PHP tags from text

Should be ok, but I think we'll end up duplicating a large chunk of
code.

-Andrei