Newsgroups: php.internals
Message-ID: <48916.88.118.163.159.1179771974.squirrel@avilys.eik.lt>
In-Reply-To: <335A483A-55B1-4A1D-A2CF-A6DB0EDDFA5F@gravitonic.com>
Date: Mon, 21 May 2007 21:26:14 +0300 (EEST)
From: tokul@users.sourceforge.net ("Tomas Kuliavas")
To: "Andrei Zmievski"
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] PHP Unicode extension in PHP6

>> 0xC4 and 0x85 are hex codes for latin small letter a with ogonek in
>> utf-8. ą
>>
>> <?php
>> var_dump("ą" == "\xC4\x85");
>> echo "ą\n";
>> echo "\xC4\x85";
>> ?>
>>
>> If script is written in utf-8, I expect bool(true) on var_dump() line.
>
> var_dump("ą" == b"\xC4\x85");
>
> This will give you what you want, if the script is written in UTF-8
> and your runtime encoding is set to UTF-8.
>
>> <?php
>> // example uses utf-8. similar code is used in iso-8859-2 -
>> // iso-8859-16 decoding. utf-8 decoding does not need mapping tables
>> // and is written in pcre.
>> $s1 = "ą";
>> $s2 = "\xC4\x85";
>> echo str_replace($s2,'&#261;',$s1);
>> ?>
>>
>> Expected result: &#261;
>> Got: ą
>>
>> test setup (php6.0-200705190630) uses trimmed php.ini with only
>> unicode.semantics=on setting
>>
>> unicode.fallback_encoding - no value
>> unicode.filesystem_encoding - no value
>> unicode.http_input_encoding - no value
>> unicode.output_encoding - no value
>> unicode.runtime_encoding - no value
>> unicode.script_encoding - no value
>> unicode.semantics - On
>> unicode.stream_encoding - UTF-8
>
> Why didn't you set any encoding settings?

They are not documented, and I am deliberately testing configurations
that might break scripts. When I test for portability, I cannot assume
that the configuration is a sensible one. I can set an option with
ini_set() if I understand what the option does and it fixes the issue.
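
For illustration, here is a minimal sketch of the expectation behind the
first quoted example. It assumes a PHP build where strings are plain byte
sequences (no unicode.semantics) and PHP 7.0 or later for the "\u{...}"
escape. The variable names are only illustrative, and '&#261;' is assumed
to be the HTML numeric entity that the replacement is meant to produce.

```php
<?php
// Assumption: byte-string semantics (PHP 7.0+), not PHP6 unicode.semantics.
// "\u{0105}" is emitted as UTF-8 bytes, so the source file stays pure ASCII.
$ogonek = "\u{0105}";   // latin small letter a with ogonek
$bytes  = "\xC4\x85";   // the same character as hex-escaped UTF-8 bytes

var_dump($ogonek === $bytes);   // bool(true)
var_dump(strlen($bytes));       // int(2), length is counted in bytes
var_dump(bin2hex($ogonek));     // string(4) "c485"

// Byte-wise replacement finds the two-byte sequence and swaps in the entity.
echo str_replace($bytes, '&#261;', $ogonek), "\n";   // &#261;
```

Under byte semantics the hex escapes and the literal character are the
same two bytes, which is why the comparison and the replacement succeed.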
http://www.php.net/unicode

Do you have an updated version of the documentation that explains the
encoding settings and lists the available configuration values? Or am I
testing PHP6 too early, and you are still months or years away from
6.0.0 betas and RCs?

Could you implement a pseudo-encoding similar to the 'pass' encoding
used in mbstring? The current implementation does not give script
writers the control they need.

SquirrelMail scripts are not written in Unicode. They are in ASCII. If
an 8-bit value is used, it is always written in octal or hex notation.
These hex values do not all belong to one character set. In some cases
the scripts use raw byte values, for example when locating the first
byte of a UTF-8 sequence or looking for 0x80-0xFF bytes in a string. In
other cases the values are written in the source or target character
set. For example, the iso-8859-2 decoding function contains an array
with iso-8859-2 hex values mapped to HTML codes.

The code can't use raw 8-bit strings, because they might be corrupted by
a misconfigured editor used by a developer, and such corruption is very
hard to track down. 8-bit data can come only from user input (composed
emails and preferences, HTML forms, one common charset) and from the
IMAP server (received emails, lots of different charsets and encodings).

-- 
Tomas
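
For illustration, a minimal sketch of the two kinds of hex-notation usage
described above, assuming byte-string semantics (no unicode.semantics)
and PHP 7.0 or later. The function names and the three-entry table are
hypothetical and are not SquirrelMail's actual code. A real table would
cover the whole upper half of the character set.

```php
<?php
// Hypothetical helpers, not SquirrelMail's actual code.
// Assumption: strings are byte sequences (no unicode.semantics).

// Byte-value usage: does the string contain any 0x80-0xFF byte?
function has_8bit(string $s): bool
{
    // No /u modifier, so PCRE matches single bytes, not code points.
    return preg_match('/[\x80-\xFF]/', $s) === 1;
}

// Charset usage: iso-8859-2 byte values mapped to HTML numeric entities.
// Only three entries are shown.
function decode_iso8859_2(string $s): string
{
    $map = array(
        "\xA1" => '&#260;', // LATIN CAPITAL LETTER A WITH OGONEK
        "\xB1" => '&#261;', // LATIN SMALL LETTER A WITH OGONEK
        "\xA3" => '&#321;', // LATIN CAPITAL LETTER L WITH STROKE
    );
    // strtr() with an array replaces each key once, without rescanning.
    return strtr($s, $map);
}

var_dump(has_8bit("abc"));                  // bool(false)
var_dump(has_8bit("\xC4\x85"));             // bool(true)
echo decode_iso8859_2("\xB1b\xA3"), "\n";   // &#261;b&#321;
```

Both helpers depend on "\xNN" escapes meaning single bytes. The same
escape stands for a UTF-8 lead byte in the first function and for an
iso-8859-2 character in the second, so no single script encoding can
describe them both.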