Newsgroups: php.internals
Message-ID: <4821DD47.9030900@gravitonic.com>
Date: Wed, 07 May 2008 09:48:07 -0700
To: Tomas Kuliavas
CC: internals@lists.php.net
In-Reply-To: <60526.78.61.224.253.1209928511.nsm@avilys.eik.lt>
Subject: Re: [PHP-DEV] Removal of unicode_semantics
From: 
andrei@gravitonic.com (Andrei Zmievski)

Tomas Kuliavas wrote:
> If I remain silent, others will have arguments that "everybody agrees on
> removal of unicode_semantics".
>
> I write and maintain charset decoding and encoding functions.
> unicode_semantics breaks every mapping table and other functions that
> operate with binary 8bit strings.

Just curious, do these decoding/encoding functions do something that
Unicode support won't do?

> In slides by Andrei Zmievski Unicode symbols are written with \u. Why are
> they written with \x(hex) and \(octal) in current PHP6?

\x and \(octal) inside Unicode strings are assumed to specify Unicode
characters. This is one of the contention points, since a few people
have said that they should specify individual bytes rather than
characters, but in my opinion that's kind of dangerous, since it may
lead to broken or invalid Unicode strings.

> ---
> echo "\xC3\200";
> ---
> I am not writing U+00C3 and U+0080, I am writing U+00C0 in UTF-8.

This should work fine inside binary strings.

> I can bypass it by adding one line to every script that operates with
> binary strings, but where are warranties that you won't dump declare()
> support just like you dump unicode_semantics.

It won't get dumped. unicode_semantics is a BC/transition switch.
declare() is crucial to proper script parsing.

> What happens to your new
> Unicode aware string functions, if I lie about strings' charset to PHP
> interpreter?

You will get in trouble.

> mb_strlen can't calculate correct $string length even when I
> set correct charset in mb_strlen() arguments. If above code works as I
> want in PHP6 unicode_semantics=on, mb_strlen($string,'utf-8') returns 2
> and not 1.

I don't know what mbstring does or does not do with the
unicode_semantics switch, since it's meant to be deprecated.

-Andrei