From: tokul@users.sourceforge.net ("Tomas Kuliavas")
Subject: Re: [PHP-DEV] What is the use of "unicode.semantics" in PHP 6?
Date: Fri, 6 Jul 2007 21:14:44 +0300 (EEST)
To: internals@lists.php.net
Newsgroups: php.internals
Message-ID: <60304.78.61.224.253.1183745684.squirrel@avilys.eik.lt>
In-Reply-To: <468E7A62.4030703@zend.com>

>> --- test.php ---
>> <?php
>> $string1 = "ą";
>> $string2 = "\xC4\x85";
>> var_dump($string1 == $string2);
>
> How you expect one-character string to be equal to two-character string?

In PHP4/5 \xC4 and \x85 are not characters. They are bytes.

>> ą is in utf-8 (latin small letter a with ogonek, latin extended-a
>> range). It contains two bytes with 0xC4 0x85 values.
>
> It contains two bytes in the filesystem. It however contains one
> character in PHP. In unicode mode, bytes and characters are different
> things. You could make $string2 as binary and then convert it from utf-8
> to unicode, but without explicitly saying otherwise that string contains
> two characters - U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) and
> U+0085 (control character, no name). It doesn't mean escape sequences
> stop working, it means characters and bytes are no more the same. That's
> the price one has to pay for doing unicode.

I can't pay such a price. You are reducing the available coding options
and want me to rely on your functions, when existing code was doing fine
without unicode support. Your functions are not documented
(http://www.php.net/unicode) and don't provide a way to tell a 7bit
string from an 8bit one.
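
For context, the kind of byte-level check that existing PHP4/5 code
relies on can be sketched roughly as below. This is only a minimal
sketch, not code from this thread: is_8bit() is a hypothetical helper
name, and it assumes PHP4/5 semantics (a string is a sequence of bytes)
and a script file saved as UTF-8.

```php
<?php
// Sketch only: is_8bit() is a hypothetical helper, not a PHP built-in.
// Assumes PHP 4/5 byte semantics, where a string is a sequence of bytes.
function is_8bit($string)
{
    // Without the /u modifier PCRE matches byte by byte, so any byte
    // in the range 0x80-0xFF means the string is not 7-bit clean.
    return preg_match('/[\x80-\xFF]/', $string) === 1;
}

$string1 = "ą";         // literal, stored as UTF-8 in the source file
$string2 = "\xC4\x85";  // the same two bytes written as escape sequences

var_dump($string1 == $string2);    // byte-wise comparison of both strings
var_dump(strlen($string2));        // length counted in bytes
var_dump(is_8bit($string2));       // string contains high-bit bytes
var_dump(is_8bit("plain ascii"));  // string is 7-bit clean
```

No charset conversion is involved here; the check only looks at raw
byte values, which is exactly what stops being possible once escape
sequences are read as code points.
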
Theoretically I might call unicode_encode() with an ascii target, but
doing charset conversions just to detect 8bit is a hack and not a
solution.

If I take a look at ext/unicode/unicode.c, I see more PHP_FUNCTION
functions than are documented. I don't know the PHP6 release schedule.
If PHP6 is approaching the RC stage, maybe the docs can be updated to
describe these functions.

PHP provides an API for PHP script developers. The strongest part of an
API is good documentation. I shouldn't have to dig through C sources in
order to learn about available interpreter features. If you write code
now and document it later, you won't document it, or it will take some
time and lots of bug reports to sync the sources with the manual.

I think I'll be able to port scripts to PHP6 unicode.semantics=on.
Currently the only things I am not sure about are POP3 and IMAP streams
with data encoded in different character sets, and MIME Q encoding.

-- 
Tomas