PHP Unicode strings impl proposal

20 years ago by Rolland Santimano

Comments / impl suggestions please.

TIA,
Rolland

[1] string substr_replace(string original, string new, int start[, int
length])
Returns string where original[start..length] is replaced with
new. Input args can be arrays, in which case the operation is:
substr_replace(original[i], new[i], start[i], length[i])
Impl:
The current impl is written in terms of memcpy(), after adjusting
start & length correctly. With Unicode input, 'start' & 'length' may
not be aligned with codepoint/grapheme boundaries. If args are mixed
string types, convert to common type.

[2] int substr_count(string text, string token[, int start[, int
length]])
Returns no of occurrences of token in text[start..length]
Impl:
The current impl is around php_memnstr() and can be extended for
Unicode with zend_u_memnstr()

[3] string strtok([string text, ]string separator)
Tokenize string
Impl:
Current impl uses global state, in the form of char ptrs and a
256-char array. Mixed string type input would be converted to common
type, and new global state would have to include initial type of
separator. Tokenizing should honor base+combining sequences.

[4] string strrev(string text)
Returns reversed string equivalent of input.
Impl:
The current impl walks the input string in reverse and copies it one
character at a time. This can be achieved using the U16_NEXT/U16_PREV
macros. Combining characters can be copied together using the
u_getCombiningClass() API.
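
A minimal sketch of the reversal described above, in Python rather than C. Python strings are already indexed by codepoint, so the U16_PREV step is implicit here; unicodedata.combining() plays the role of u_getCombiningClass(). The function name is hypothetical, and treating "combining class != 0" as the test for a combining character is an assumption (some marks have class 0 and would need a grapheme-aware check).

```python
import unicodedata

def reverse_keep_combining(text):
    """Reverse text by codepoint, keeping combining marks after their base."""
    clusters = []
    for ch in text:
        # A non-zero canonical combining class means ch attaches to the
        # preceding base character, so keep the two together.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))
```

For example, reverse_keep_combining("e\u0301x") yields "xe\u0301": the acute accent stays on the "e" instead of ending up on the "x".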

[5] string str_pad(string text, int length[, string pad[, int
pad_type]])
Returns input string padded on the left and/or right (determined by
pad_type) to specified length with pad string.
Impl:
The impl builds the output string by copying appropriate pad
characters to the left and/or right of the input string.

Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2
(lengths in UChars), but 'pad' text is non-BMP (ie. 2 UChars), then
the 'pad' text can't be added at either end. More generally, the 'pad'
text can't be split in the middle of non-BMP codepts or base+combining
sequences. If such a condn occurs, an error should be returned. Any
other thoughts ?

[6] int similar_text(string str1, string str2[, int percentage])
Returns no of common characters between str1 & str2.
Impl:
The current impl determines common characters by comparing characters
to generate common sequences. Comparisons for Unicode strings should
be done with codepoints.

[7] int levenshtein(string str1, string str2[, int ins_cost, int
rep_cost, int del_cost])
Calculate Levenshtein distance between str1 & str2.

Q: Any gotchas in extending the Levenshtein algo for Unicode ? Should
the ins/del/subst cost be expressed in graphemes or codepts ?

=================================================================

The foll funcns generally work on ASCII input, and should be made
Unicode-aware. However, should they be converted to process Unicode
input ?

[1] string addslashes(string text)
[2] string stripslashes(string text)
Escape single/double quotes & backslashes with backslashes

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)
Escape chars < 32 or > 126 with octal sequences, and escape characters
from charlist with backslash.

[5] string strip_tags(string text[, string allowed_tags])
Strip HTML/PHP tags from text

20 years ago by Andrei Zmievski

[1] string substr_replace(string original, string new, int start[, int
length])
Returns string where original[start..length] is replaced with
new. Input args can be arrays, in which case the operation is:
substr_replace(original[i], new[i], start[i], length[i])
Impl:
The current impl is written in terms of memcpy(), after adjusting
start & length correctly. With Unicode input, 'start' & 'length' may
not be aligned with codepoint/grapheme boundaries. If args are mixed
string types, convert to common type.

What do you mean, they may not be aligned with codepoint boundaries? We
have to make sure that they are. In order to do this, we need to use
U16_FWD() macro to iterate through the number of codepoints indicated
by 'start', and then from that point do the same for 'length'. Once
that's done you will have the boundaries in terms of UChar*'s.
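
The iteration described above can be sketched as follows. This is an illustration in Python, not the C implementation: a UTF-16 string is modeled as a list of code units, and fwd_n() is a hypothetical stand-in in the spirit of the U16_FWD macros. All function names are assumptions made for this sketch.

```python
def to_utf16(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def fwd_n(units, i, n):
    """Advance index i by up to n codepoints, stopping at the end of units."""
    while n > 0 and i < len(units):
        is_lead = 0xD800 <= units[i] <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A lead surrogate followed by a trail surrogate is one codepoint.
        i += 2 if (is_lead and has_trail) else 1
        n -= 1
    return i

def cp_range_to_units(units, start, length):
    """Map (start, length) in codepoints to (begin, end) code unit offsets."""
    begin = fwd_n(units, 0, start)
    end = fwd_n(units, begin, length)
    return begin, end
```

With the text "a", U+10400, "b", "c" (five code units), a start of 1 and a length of 2 codepoints map to the code unit range 1..4, which covers the whole surrogate pair plus "b".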

[2] int substr_count(string text, string token[, int start[, int
length]])
Returns no of occurrences of token in text[start..length]
Impl:
The current impl is around php_memnstr() and can be extended for
Unicode with zend_u_memnstr()

Same thing with regard to start and length applies here.
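
A sketch of how that could look for substr_count, again in Python with a UTF-16 string modeled as a list of code units. The helper names are hypothetical, and the plain code unit comparison stands in for zend_u_memnstr(). It assumes well-formed UTF-16, in which case a match can only begin on a codepoint boundary.

```python
def to_utf16(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def fwd_n(units, i, n):
    """Advance index i by up to n codepoints, stopping at the end of units."""
    while n > 0 and i < len(units):
        is_lead = 0xD800 <= units[i] <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_lead and has_trail) else 1
        n -= 1
    return i

def count_in_range(text, token, start=0, length=None):
    """Count non-overlapping occurrences of token; start/length are in codepoints."""
    units, tok = to_utf16(text), to_utf16(token)
    if not tok:
        raise ValueError("empty token")
    begin = fwd_n(units, 0, start)
    end = len(units) if length is None else fwd_n(units, begin, length)
    count, i = 0, begin
    while i + len(tok) <= end:
        if units[i:i + len(tok)] == tok:
            count += 1
            i += len(tok)  # skip past the match so occurrences do not overlap
        else:
            i += 1
    return count
```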

[3] string strtok([string text, ]string separator)
Tokenize string
Impl:
Current impl uses global state, in the form of char ptrs and a
256-char array. Mixed string type input would be converted to common
type, and new global state would have to include initial type of
separator. Tokenizing should honor base+combining sequences.

I think we need to flesh out more details here. We can't possibly keep
a strtok table the size of the entire Unicode set. ICU has its own
u_strtok_r() function, but its limitation is that it does not support
surrogate pairs (which we should). As for honoring base+combining
sequences, why should strtok() be any more special than strstr()?
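
One possible shape for this, sketched in Python: instead of a 256-entry table, the separators are kept as a set of codepoints, so a non-BMP separator is handled like any other. The Tokenizer class and its method names are hypothetical; the saved position plays the role of the global state in the current impl.

```python
class Tokenizer:
    """strtok-like tokenizer whose separators are a set of codepoints."""

    def __init__(self, text, separators):
        self.text = text
        self.pos = 0                  # saved state between calls
        self.seps = set(separators)   # set lookup replaces the 256-char table

    def next_token(self, separators=None):
        """Return the next token, or None when the text is exhausted."""
        if separators is not None:
            self.seps = set(separators)
        n = len(self.text)
        # Skip any leading separators.
        while self.pos < n and self.text[self.pos] in self.seps:
            self.pos += 1
        if self.pos >= n:
            return None
        start = self.pos
        while self.pos < n and self.text[self.pos] not in self.seps:
            self.pos += 1
        token = self.text[start:self.pos]
        if self.pos < n:
            self.pos += 1  # consume the separator that ended the token
        return token
```

Python strings iterate by codepoint, so surrogate pairs never need special handling here; a C version would have to decode each codepoint before the set lookup.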

[5] string str_pad(string text, int length[, string pad[, int
pad_type]])
Returns input string padded on the left and/or right (determined by
pad_type) to specified length with pad string.
Impl:
The impl builds the output string by copying appropriate pad
characters to the left and/or right of the input string.
Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2
(lengths in UChars), but 'pad' text is non-BMP (ie. 2 UChars), then
the 'pad' text can't be added at either end. More generally, the 'pad'
text can't be split in the middle of non-BMP codepts or base+combining
sequences. If such a condn occurs, an error should be returned. Any
other thoughts ?

We should not split padding strings in the middle of surrogate pairs.
As for combining sequences, I would defer to Tex and see what he has to
say. The input length parameter should indicate the number of
codepoints to pad to, not the number of UChars.
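
A minimal sketch of padding measured in codepoints, in Python, where string length and slicing are already codepoint-based. pad_both() is a hypothetical name, and the left/right split (left gets the smaller half) is an assumption about how STR_PAD_BOTH divides the padding. Combining sequences in the pad string can still be cut; that question is left open here, as it is in the thread.

```python
def pad_both(text, length, pad=" "):
    """Pad text on both sides to `length` codepoints using `pad`."""
    if not pad:
        raise ValueError("empty pad string")
    total = length - len(text)  # len() counts codepoints, not UChars
    if total <= 0:
        return text
    left = total // 2
    right = total - left

    def fill(n):
        # Repeat the pad string and cut it at a codepoint boundary.
        return (pad * (n // len(pad) + 1))[:n]

    return fill(left) + text + fill(right)
```

With a target length of 4 codepoints, text "ab" and the non-BMP pad U+10400, the result has one pad codepoint on each side: four codepoints, but six UChars.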

[7] int levenshtein(string str1, string str2[, int ins_cost, int
rep_cost, int del_cost])
Calculate Levenshtein distance between str1 & str2.

Q: Any gotchas in extending the Levenshtein algo for Unicode ? Should
the ins/del/subst cost be expressed in graphemes or codepts ?

I think it can be fairly easily extended to Unicode strings, since the
algorithm only cares about the insertion, deletion, or substitution of
characters. We should once again work on the codepoint level.
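
To illustrate why the unit of comparison matters, here is a sketch of the standard dynamic-programming Levenshtein distance in Python. It works on any sequence, so the same function can be run on codepoints (a Python string) or on UTF-16 code units (a list of ints). The to_utf16() helper is hypothetical and exists only for the comparison.

```python
def to_utf16(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def levenshtein(a, b, ins_cost=1, rep_cost=1, del_cost=1):
    """Edit distance from sequence a to sequence b, one row at a time."""
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * del_cost]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + del_cost,                          # delete ca
                cur[j - 1] + ins_cost,                       # insert cb
                prev[j - 1] + (0 if ca == cb else rep_cost)  # keep or replace
            ))
        prev = cur
    return prev[-1]
```

Replacing U+10400 with U+20001 is a single substitution at the codepoint level, but both surrogates differ, so the same edit costs 2 when measured in code units.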

The foll funcns generally work on ASCII input, and should be made
Unicode-aware. However, should they be converted to process Unicode
input ?

[1] string addslashes(string text)
[2] string stripslashes(string text)
Escape single/double quotes & backslashes with backslashes

I don't see any problems with these two.

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)
Escape chars < 32 or > 126 with octal sequences, and escape characters
from charlist with backslash.

Same here.

[5] string strip_tags(string text[, string allowed_tags])
Strip HTML/PHP tags from text

Should be ok, but I think we'll end up duplicating a large chunk of
code..

-Andrei

20 years ago by Rolland Santimano

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)
Escape chars < 32 or > 126 with octal sequences, and escape
characters from charlist with backslash.

Escaping chars/codepts with values > 126 is a pblm in Unicode
strings. With the 3-digit octal escape sequence, only codepts up to
0x1FF (octal 777) can be escaped.

One soln is to only escape values < 32 with the 3-digit octal
sequence. Or use hex sequences for escaping everything.
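
The limit, and the hex alternative, can be sketched like this in Python. Both function names are hypothetical, and the "\x{...}" output format is only an assumption for illustration; the thread does not settle on a hex syntax.

```python
def escape_octal(cp):
    """Escape a codepoint as a backslash plus exactly 3 octal digits."""
    if cp > 0o777:  # 0o777 == 0x1FF, the largest 3-digit octal value
        raise ValueError("codepoint does not fit in 3 octal digits")
    return "\\%03o" % cp

def escape_hex(cp):
    """Escape a codepoint as a variable-length hex sequence (assumed syntax)."""
    return "\\x{%X}" % cp
```

A variable-length hex form covers the full range up to U+10FFFF, at the cost of needing a delimiter so the parser knows where the digits end.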

Thoughts ?

Rolland
