Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:37447
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: unknown (pb1.pair.com: domain gmail.com does not designate 62.75.137.136 as permitted sender)
Message-ID: <481EBF1A.6040406@gmail.com>
Date: Mon, 05 May 2008 10:02:34 +0200
User-Agent: Thunderbird 2.0.0.14 (Windows/20080421)
MIME-Version: 1.0
To: Lester Caine <lester@lsces.co.uk>
CC: internals@lists.php.net
References: <4BD5A050-02F2-46BD-B867-FA8CA12FF1BD@macvicar.net>    <48988.78.61.224.253.1209918881.nsm@avilys.eik.lt>    <alpine.DEB.0.98.0805041848040.5353@kossu.ez.no> <60526.78.61.224.253.1209928511.nsm@avilys.eik.lt> <481EB410.1090804@lsces.co.uk>
In-Reply-To: <481EB410.1090804@lsces.co.uk>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Removal of unicode_semantics
From: stefan.walk@gmail.com (Stefan Walk)

Lester Caine wrote:
> That sounds like just the sort of edge case that Derick is suggesting 
> needs logging for fixing up. unicode_semantics=on is just another bodge 
> to make it happen rather than a solution. I think I understand your 
> description, and to my eyes it looks like a unicode bug that needs 
> addressing?

No, it's a misunderstanding of how things work that has been explained 
to Tomas countless times. A unicode string consists of codepoints, not 
of bytes. Having \xXX and \XXX insert bytes instead of codepoints does 
not make sense, because a) that would require a defined unicode 
encoding to be used, and b) even if one were defined, it would allow you 
to insert broken data into the unicode string, so it would not be a 
unicode string anymore, which is a no-no. If you want to do that sort of 
fiddling with binary details, use binary strings, not unicode strings.
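
To make the codepoint/byte distinction concrete, here is a minimal sketch 
in Python 3. It is only an analogy, not PHP: Python's str and bytes types 
stand in for unicode strings and binary strings, and the helper name 
is_valid_utf8 is made up for illustration.

```python
# Analogy only: Python 3 str ~ unicode string, bytes ~ binary string.

text = "\xe9"   # escape in a text string: one codepoint, U+00E9
raw = b"\xe9"   # same escape in a byte string: one raw byte, 0xE9

# A codepoint has no byte representation until an encoding is chosen,
# which is point a): the bytes differ from one encoding to the next.
codepoint_count = len(text)
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Point b): a lone 0xE9 byte is not well-formed UTF-8, so letting an
# escape write it into a text string would leave broken data behind.
lone_byte_is_valid = is_valid_utf8(raw)
```

The same escape yields one codepoint in the text string and one raw byte 
in the byte string, and only the byte string can legitimately hold the 
malformed sequence.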

Regards,
Stefan