Newsgroups: php.internals
Message-ID: <20050823152306.94806.qmail@web31802.mail.mud.yahoo.com>
Date: Tue, 23 Aug 2005 08:23:06 -0700 (PDT)
From: rollandsantimano@yahoo.com (Rolland Santimano)
To: Andrei Zmievski, Tex Texin
Cc: internals@lists.php.net
Subject: PHP Unicode strings impl proposal

Comments / impl suggestions please.

TIA,
Rolland
--

[1] string substr_replace(string original, string new, int start[, int length])
Returns a string in which the 'length' characters of original beginning at 'start' are replaced with new.
Input args can be arrays, in which case the operation is applied element-wise: substr_replace(original[i], new[i], start[i], length[i])
Impl: The current impl is written in terms of memcpy(), after adjusting start & length correctly. With Unicode input, 'start' & 'length' may not be aligned with codepoint/grapheme boundaries. If args are mixed string types, convert to a common type.

[2] int substr_count(string text, string token[, int start[, int length]])
Returns the number of occurrences of token in the 'length' characters of text beginning at 'start'.
Impl: The current impl is built around php_memnstr() and can be extended for Unicode with zend_u_memnstr().

[3] string strtok([string text, ]string separator)
Tokenizes a string.
Impl: The current impl uses global state, in the form of char ptrs and a 256-char array. Mixed string type input would be converted to a common type, and the new global state would have to include the initial type of the separator. Tokenizing should honor base+combining sequences.

[4] string strrev(string text)
Returns the input string reversed.
Impl: The current impl walks the input string in reverse and copies it one character at a time. For Unicode this can be done using the U16_NEXT/U16_PREV macros. Combining characters can be kept together with their base character using the u_getCombiningClass() API.

[5] string str_pad(string text, int length[, string pad[, int pad_type]])
Returns the input string padded on the left and/or right (determined by pad_type) to the specified length with the pad string.
Impl: The impl builds the output string by copying the appropriate pad characters to the left and/or right of the input string.
Q: With STR_PAD_BOTH, let's say 'length' == input 'text' length + 2 (lengths in UChars), but the 'pad' text is non-BMP (i.e. 2 UChars). Each side then has room for only one UChar, so the 'pad' text can't be added at either end. More generally, the 'pad' text can't be split in the middle of non-BMP codepoints or base+combining sequences. If such a condition occurs, an error should be returned. Any other thoughts?
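
To make the str_pad question in [5] concrete, here is a rough sketch in Python rather than the C/ICU the real impl would use. The names utf16_len, fill and str_pad_both are made up for illustration. It assumes lengths are counted in UTF-16 code units (UChars), that STR_PAD_BOTH gives the odd extra unit to the right side, and that an error is raised when a pad code point would have to be cut. It only guards surrogate pairs; base+combining sequences would need an additional check.

```python
def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units (UChars): 2 per non-BMP code point."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in s)


def fill(pad: str, room: int) -> str:
    """Repeat whole code points of pad until exactly `room` UChars are used.

    Raises ValueError if the next code point would overshoot `room`, i.e. the
    pad would have to be cut inside a surrogate pair.
    """
    out, used, i = [], 0, 0
    while used < room:
        ch = pad[i % len(pad)]
        width = 2 if ord(ch) > 0xFFFF else 1
        if used + width > room:
            raise ValueError("pad would be split inside a code point")
        out.append(ch)
        used += width
        i += 1
    return "".join(out)


def str_pad_both(text: str, length: int, pad: str = " ") -> str:
    """Hypothetical STR_PAD_BOTH with lengths counted in UTF-16 code units."""
    if not pad:
        raise ValueError("pad must not be empty")
    room = length - utf16_len(text)
    if room <= 0:
        return text
    left = room // 2  # assumption: the odd extra unit goes to the right side
    return fill(pad, left) + text + fill(pad, room - left)
```

With a non-BMP pad such as U+1D11E, str_pad_both("ab", 6, pad) has two UChars of room per side and succeeds, while str_pad_both("ab", 4, pad) has one per side and raises the error, which is the case described in the question.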

[6] int similar_text(string str1, string str2[, int percentage])
Returns the number of common characters between str1 & str2.
Impl: The current impl determines common characters by comparing characters to generate common sequences. Comparisons for Unicode strings should be done on codepoints.

[7] int levenshtein(string str1, string str2[, int ins_cost, int rep_cost, int del_cost])
Calculates the Levenshtein distance between str1 & str2.
Q: Any gotchas in extending the Levenshtein algo for Unicode? Should the ins/del/subst costs be expressed in graphemes or codepoints?

=================================================================

The following functions generally work on ASCII input, and should be made Unicode-aware. However, should they be converted to process Unicode input?

[1] string addslashes(string text)
[2] string stripslashes(string text)
Escape (or unescape) single quotes, double quotes & backslashes with backslashes.

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)
Escape chars < 32 or > 126 with octal sequences, and escape characters from charlist with a backslash.

[5] string strip_tags(string text[, string allowed_tags])
Strips HTML/PHP tags from text.
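
On the levenshtein question ([7] in the first list), a small Python sketch may help show why the choice of unit matters. It is an illustration only, not the PHP impl: levenshtein here is a generic weighted edit distance over any sequence, and utf16_units is a made-up helper that exposes the UTF-16 code units of a string. Grapheme clusters would need a segmentation step that is not shown.

```python
import unicodedata


def levenshtein(a, b, ins_cost=1, rep_cost=1, del_cost=1):
    """Weighted edit distance turning sequence a into sequence b."""
    # prev[j] = cost of turning the empty prefix of a into b[:j]
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * del_cost]  # turning a[:i] into the empty sequence
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + del_cost,                         # delete x
                cur[j - 1] + ins_cost,                      # insert y
                prev[j - 1] + (0 if x == y else rep_cost),  # keep or replace
            ))
        prev = cur
    return prev[-1]


def utf16_units(s):
    """UTF-16 code units of s as a list of ints (hypothetical helper)."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


clef = "\U0001D11E"                 # one non-BMP code point, two UChars
by_codepoint = levenshtein("a", clef)
by_code_unit = levenshtein(utf16_units("a"), utf16_units(clef))

composed, decomposed = "\u00e9", "e\u0301"   # same grapheme, different code points
raw_distance = levenshtein(composed, decomposed)
nfc_distance = levenshtein(unicodedata.normalize("NFC", composed),
                           unicodedata.normalize("NFC", decomposed))
```

Here by_codepoint is 1 but by_code_unit is 2, and the two spellings of "é" are 2 edits apart by code point yet 0 apart after NFC normalization. So one gotcha is that the reported distance depends on both the unit of comparison and whether the inputs are normalized first.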