Newsgroups: php.internals
From: tokul@users.sourceforge.net ("Tomas Kuliavas")
To: "Stefan Walk"
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] PHP Unicode extension in PHP6
Date: Mon, 21 May 2007 08:53:51 +0300 (EEST)
Message-ID: <50806.195.22.180.233.1179726831.squirrel@avilys.eik.lt>
In-Reply-To: <4858f9d90705201443t7a649c80o98c7e566f8ff716f@mail.gmail.com>

> Disclaimer: I don't know much about the way unicode is implemented in
> php, i have only used it a bit, but i believe i can clear some things
> up here.
>
>> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
>> utf-8. ą
>>
>> <?php
>> var_dump("ą" == "\xC4\x85");
>> echo "ą\n";
>> echo "\xC4\x85";
>> ?>
>>
>> If script is written in utf-8, I expect bool(true) on var_dump() line.
>
> You expect wrong things. "\xC4\x85" is a unicode string containing two
> codepoints, those at 0xC4 and 0x85 (LATIN CAPITAL LETTER A WITH
> DIAERESIS and NEXT LINE (NEL)), while "ą" is a unicode string
> containing one code point (0x0105, LATIN SMALL LETTER A WITH OGONEK)
> (see
> http://www.unicode.org/charts/PDF/U0080.pdf and
> http://www.unicode.org/charts/PDF/U0100.pdf). Different strings, so
> comparison should return false. If you want to type bytes, use the
> "b" prefix: b"\xC4\x85", and compare that with the binary version of
> your string literal. var_dump(b"ą" == b"\xC4\x85"); should give you
> bool(true) if your encoding is utf-8.

Latin capital letter A with diaeresis is 00C4, not C4. I wrote two
8-bit values, not two 16-bit ones. The interpreter tries to outsmart me
and assumes that I want 00C4 when I write C4.

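The two readings of "\xC4\x85" can be compared directly on a byte-string
build of PHP, that is, one without unicode.semantics. The following is a
minimal sketch and not part of the original exchange. It assumes PHP 7.0
or later for the "\u{...}" escape, which emits the UTF-8 encoding of a
code point, and the variable names are made up for illustration.

```php
<?php
// Two bytes written with \x escapes: 0xC4 followed by 0x85.
$bytes = "\xC4\x85";

// U+0105 (LATIN SMALL LETTER A WITH OGONEK), written with the
// PHP 7+ code point escape, which emits the UTF-8 encoding.
$ogonek = "\u{0105}";

// U+00C4 and U+0085, the two code points named in the quoted reply.
$twoCodePoints = "\u{00C4}\u{0085}";

var_dump(bin2hex($ogonek));          // string(4) "c485"
var_dump($bytes === $ogonek);        // bool(true)
var_dump(bin2hex($twoCodePoints));   // string(8) "c384c285"
var_dump($bytes === $twoCodePoints); // bool(false)
```

Under byte semantics the \x escapes equal the UTF-8 form of U+0105. Read
as two code points, the same escapes would need four bytes in UTF-8.
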
http://www.php.net/language.types.string
---
\x[0-9A-Fa-f]{1,2} - the sequence of characters matching the regular
expression is a character in hexadecimal notation
---

One or two hexadecimal digits after x. This escape is used to write
8-bit values. You can't write 16-bit Unicode characters with one escape.

And again you are suggesting an unportable solution:

Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING in test5.php on line 2

I don't want to maintain a different script version for PHP6 with
unicode.semantics=on.

>> It
>> is bool(false), when unicode.semantics are turned on. Internal
>> SquirrelMail character set decoding functions write mapping tables in
>> hexadecimals or octals. In some cases they evaluate only byte value and
>> not whole symbol. Multibyte character set decoding can use recode, iconv
>> and mbstring, but most of single byte decoding is written in plain
>> string
>> functions and stores hex to html mapping tables in associative arrays.
>>
>> <?php
>> // example uses utf-8. similar code is used in iso-8859-2 -
>> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
>> // and is written in pcre.
>> $s1 = "ą";
>> $s2 = "\xC4\x85";
>> echo str_replace($s2,'&#261;',$s1);
>> ?>
>>
>> Expected result: &#261;
>> Got: ą
>
> Same thing. If you want binary replacements, use binary strings, not
> unicode strings.

mbstring.func_overload and unicode.semantics decisions must be made by
script writers and not by end users. That's why I asked for PHP_INI_ALL
level controls.

I'll wait for better documentation on the unicode.*_encoding options and
will see what I can do with them.

--
Tomas
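
The single-byte decoding described in the quoted text, with hex escapes
as keys and HTML entities as values in an associative array, could look
roughly like the sketch below. This is an illustration only and not
SquirrelMail's actual code. The function name decode_iso8859_2_sketch
and the three-entry table are hypothetical, and the type declarations
assume PHP 7.0 or later.

```php
<?php
/**
 * Hypothetical single-byte decoder sketch: maps ISO-8859-2 bytes to
 * numeric HTML entities through an associative array. Only three
 * entries are shown; a full table would cover the whole upper half
 * of the character set.
 */
function decode_iso8859_2_sketch(string $input): string
{
    $map = [
        "\xA1" => '&#260;', // U+0104 LATIN CAPITAL LETTER A WITH OGONEK
        "\xB1" => '&#261;', // U+0105 LATIN SMALL LETTER A WITH OGONEK
        "\xB3" => '&#322;', // U+0142 LATIN SMALL LETTER L WITH STROKE
    ];

    // strtr() with an array replaces each key wherever it occurs as a
    // byte sequence, which is the byte-level behaviour the table needs.
    return strtr($input, $map);
}

echo decode_iso8859_2_sketch("m\xB1ka"), "\n"; // m&#261;ka
```

A table like this works only if "\xB1" stays a single byte. If the
escape is read as code point U+00B1 instead, the keys no longer match
the bytes of ISO-8859-2 input.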