Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:78088
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.212.170 as permitted sender)
User-Agent: K-9 Mail for Android
In-Reply-To: <543DAA29.8040701@gmail.com>
References: <543CE705.7030203@gmail.com> <4575A816-43F4-462D-8150-A2D35516D914@ajf.me> <543D64E5.8000706@gmail.com> <543D8528.1060605@gmail.com> <543D8FFA.8080408@gmail.com> <543DAA29.8040701@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="----4S5VX97WLE751FQO47QM9KYAOJLSO2"
Content-Transfer-Encoding: 8bit
Date: Wed, 15 Oct 2014 08:04:48 +0100
To: Aleksey Tulinov <aleksey.tulinov@gmail.com>,internals@lists.php.net
Message-ID: <68E97150-8840-4C31-B271-3E8C8BE933DB@gmail.com>
Subject: Re: [PHP-DEV] Unicode support
From: rowan.collins@gmail.com (Rowan Collins)

------4S5VX97WLE751FQO47QM9KYAOJLSO2
Content-Transfer-Encoding: 8bit
Content-Type: text/plain;
 charset=UTF-8


>Good point. That's what i meant by border-line case. Could you possibly
>point me to a specific example of such false positive? I'm interested
>in well-formed UTF-8 string. I believe "noël" test is ill-formed UTF-8
>and doesn't conform to shortest-form requirement.

You're confusing two concepts here: well-formed UTF-8 requires that each code point be encoded in the shortest possible byte sequence, but it places no requirements on which code points appear. Representing "ë" as two code points (a base "e" followed by a combining diaeresis) is perfectly valid Unicode, and is in fact required under NFD.
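
To make the distinction concrete, here is a minimal sketch in Python (used purely for illustration; the variable names are my own). It builds "noël" in composed (NFC) and decomposed (NFD) form with the standard unicodedata module, checks that both survive a strict UTF-8 round trip, and shows what a naive code-point reversal does to the decomposed form.

```python
import unicodedata

composed = "no\u00ebl"                               # "ë" as one code point, U+00EB
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by U+0308

# Both forms survive a strict UTF-8 round trip, so both are well-formed.
for form in (composed, decomposed):
    assert form.encode("utf-8").decode("utf-8", errors="strict") == form

composed_points = [f"U+{ord(c):04X}" for c in composed]
decomposed_points = [f"U+{ord(c):04X}" for c in decomposed]

print(composed_points)    # 4 code points
print(decomposed_points)  # 5 code points

# Reversing by code point leaves the diaeresis attached to the "l".
naive_reverse = decomposed[::-1]
print(unicodedata.normalize("NFC", naive_reverse) == composed[::-1])  # False
```

The two strings differ only in which code points they contain, not in how well those code points are encoded, so a well-formedness check cannot tell them apart.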

That "most" input sources would prefer the combined form seems like a weak assumption to base a library on; it only takes one popular third party routinely returning data in NFD for the problems to start showing up.

>> It's pretty meaningless to say you support Unicode, but only the easy
>> bits. You might as well just tag each string with one of the pages of
>> ISO-8859.
>>
>
>As far as i'm concerned Unicode specification does not require to 
>implement all annexes or even support entire character set to be 
>conformant. I think there are always trade-offs involved, depending on 
>what is more important for you.

Sure, but there are certain user expectations of what "Unicode support" means. Handling Korean characters in a meaningful way would definitely be on that list.

As I said at the top of my first post, the important thing is to capture what those requirements actually are, just as you'd choose which array functions were needed if you were adding "array support" to a language.

To put it a different way, in what situation would you actively want to know the number of code points in a string, rather than either the number of bytes in its UTF-8 representation, or the number of graphemes?
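
For reference, the three measurements can be compared side by side. This is a rough Python sketch, not a proposal for an API: `measure` is a hypothetical helper, and its grapheme count is a deliberate simplification that just skips combining marks rather than implementing the segmentation rules of UAX #29.

```python
import unicodedata

def measure(text):
    """Return (utf8_bytes, code_points, approx_graphemes) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    code_points = len(text)
    # Rough approximation only: count every code point that is not a
    # combining mark (general category Mn, Mc or Me). Real grapheme
    # segmentation follows UAX #29 and handles many more cases.
    approx_graphemes = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return utf8_bytes, code_points, approx_graphemes

nfc = unicodedata.normalize("NFC", "no\u00ebl")
nfd = unicodedata.normalize("NFD", "no\u00ebl")

print(measure(nfc))  # (5, 4, 4)
print(measure(nfd))  # (6, 5, 4)
```

The byte count matters for storage and the grapheme count matches what a user sees; the code point count is the one that shifts with normalization without corresponding to either. The approximation also shows why the hard cases matter: a decomposed Hangul syllable is made of jamo that are letters rather than combining marks, so this shortcut would count one syllable as three.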
------4S5VX97WLE751FQO47QM9KYAOJLSO2--