Newsgroups: php.internals
From: andrei@gravitonic.com (Andrei Zmievski)
To: Rolland Santimano
Cc: Tex Texin, internals@lists.php.net
Date: Tue, 23 Aug 2005 13:58:04 -0700
Subject: Re: [PHP-DEV] PHP Unicode strings impl proposal
Message-ID: <10f8974e9d68bae15cfb7dfbd27ed5ba@gravitonic.com>
In-Reply-To: <20050823152306.94806.qmail@web31802.mail.mud.yahoo.com>

On Aug 23, 2005, at 8:23 AM, Rolland Santimano wrote:

> [1] string substr_replace(string original, string new, int start[, int
> length])
> Returns string where original[start..length] is replaced with
> new.
> Input args can be arrays, in which case the operation is:
> substr_replace(original[i], new[i], start[i], length[i])
> Impl:
> The current impl is written in terms of memcpy(), after adjusting
> start & length correctly. With Unicode input, 'start' & 'length' may
> not be aligned with codepoint/grapheme boundaries. If args are mixed
> string types, convert to common type.

What do you mean, they may not be aligned with codepoint boundaries? We
have to make sure that they are. In order to do this, we need to use the
U16_FWD() macro to iterate through the number of codepoints indicated by
'start', and then from that point do the same for 'length'. Once that's
done you will have the boundaries in terms of UChar*'s.

> [2] int substr_count(string text, string token[, int start[, int
> length]])
> Returns no of occurrences of token in text[start..length]
> Impl:
> The current impl is around php_memnstr() and can be extended for
> Unicode with zend_u_memnstr()

The same thing with regard to start and length applies here.

> [3] string strtok([string text, ]string separator)
> Tokenize string
> Impl:
> Current impl uses global state, in the form of char ptrs and a
> 256-char array. Mixed string type input would be converted to common
> type, and new global state would have to include initial type of
> separator. Tokenizing should honor base+combining sequences.

I think we need to flesh out more details here. We can't possibly keep a
strtok table the size of the entire Unicode set. ICU has its own
u_strtok_r() function, but its limitation is that it does not support
surrogate pairs (which we should). As for honoring base+combining
sequences, why should strtok() be any more special than strstr()?

> [5] string str_pad(string text, int length[, string pad[, int
> pad_type]])
> Returns input string padded on the left and/or right (determined by
> pad_type) to specified length with pad string.
> Impl:
> The impl builds the output string by copying appropriate pad
> characters to the left and/or right of the input string.
> Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2
> (lengths in UChars), but 'pad' text is non-BMP (ie. 2 UChars), then
> the 'pad' text can't be added at either end. More generally, the 'pad'
> text can't be split in the middle of non-BMP codepts or base+combining
> sequences. If such a condn occurs, an error should be returned. Any
> other thoughts ?

We should not split padding strings in the middle of surrogate pairs. As
for combining sequences, I would defer to Tex and see what he has to
say. The input length parameter should indicate the number of codepoints
to pad to, not the number of UChars.

> [7] int levenshtein(string str1, string str2[, int ins_cost, int
> rep_cost, int del_cost])
> Calculate Levenshtein distance between str1 & str2.
>
> Q: Any gotchas in extending the Levenshtein algo for Unicode ? Should
> the ins/del/subst cost be expressed in graphemes or codepts ?

I think it can be fairly easily extended to Unicode strings, since the
algorithm only cares about the insertion, deletion, or substitution of
characters. We should once again work on the codepoint level.

> The foll funcns generally work on ASCII input, and should be made
> Unicode-aware. However, should they be converted to process Unicode
> input ?
>
> [1] string addslashes(string text)
> [2] string stripslashes(string text)
> Escape single/double quotes & backslashes with backslashes

I don't see any problems with these two.

> [3] string addcslashes(string text, string charlist)
> [4] string stripcslashes(string text)
> Escape chars < 32 or > 126 with octal sequences, and escape characters
> from charlist with backslash.

Same here.

> [5] string strip_tags(string text[, string allowed_tags])
> Strip HTML/PHP tags from text

Should be ok, but I think we'll end up duplicating a large chunk of
code.

-Andrei
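
For reference, the boundary computation described above for
substr_replace() and substr_count() could be sketched roughly as follows.
This is only an illustration, not code from the PHP tree: UChar16,
fwd_codepoints() and codepoint_range() are hypothetical names, and the
loop is a hand-written stand-in for what ICU's U16_FWD_1 / U16_FWD_N
macros do, so that the example compiles without ICU. It assumes an
unpaired surrogate counts as one codepoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar16; /* hypothetical stand-in for ICU's UChar */

/* Lead (high) and trail (low) surrogate tests for UTF-16 code units. */
static int is_lead(UChar16 c)  { return (c & 0xFC00) == 0xD800; }
static int is_trail(UChar16 c) { return (c & 0xFC00) == 0xDC00; }

/*
 * Advance offset i by up to n codepoints without running past len.
 * A lead surrogate followed by a trail surrogate is stepped over as
 * one codepoint; an unpaired surrogate counts as one codepoint.
 * Returns the new offset in code units.
 */
static size_t fwd_codepoints(const UChar16 *s, size_t i, size_t len, size_t n)
{
    assert(s != NULL && i <= len);
    while (n > 0 && i < len) {
        UChar16 c = s[i++];
        if (is_lead(c) && i < len && is_trail(s[i])) {
            i++; /* consume the second half of the surrogate pair */
        }
        n--;
    }
    return i;
}

/*
 * Convert (start, length), both given in codepoints, into the half-open
 * code-unit range [*begin, *end). Both ends land on codepoint
 * boundaries, so a memcpy()-style replace cannot split a pair.
 */
static void codepoint_range(const UChar16 *s, size_t len,
                            size_t start, size_t length,
                            size_t *begin, size_t *end)
{
    *begin = fwd_codepoints(s, 0, len, start);
    *end   = fwd_codepoints(s, *begin, len, length);
}
```

With the five code units for "a", U+1D11E, "b", "c", asking for start=1
and length=2 gives the code-unit range [1, 4), which covers the whole
surrogate pair plus "b". The same codepoint counting would apply to the
str_pad() length argument.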