Newsgroups: php.internals
Message-ID: <543E6F6E.8080006@gmail.com>
Date: Wed, 15 Oct 2014 13:58:22 +0100
To: internals@lists.php.net
References: <543CE705.7030203@gmail.com> <4575A816-43F4-462D-8150-A2D35516D914@ajf.me> <543D64E5.8000706@gmail.com> <543D8528.1060605@gmail.com> <543D8FFA.8080408@gmail.com> <543DAA29.8040701@gmail.com> <68E97150-8840-4C31-B271-3E8C8BE933DB@gmail.com> <543E4626.3000102@gmail.com>
In-Reply-To: <543E4626.3000102@gmail.com>
Subject: Re: [PHP-DEV] Unicode support
From: rowan.collins@gmail.com (Rowan Collins)

Aleksey Tulinov wrote (on 15/10/2014):
> On 15/10/14 10:04, Rowan Collins wrote:
>
> Rowan,
>
>> As I said at the top of my first post, the important thing is to capture
>> what those requirements actually are. Just as you'd choose what array
>> functions were needed if you were adding "array support" to a language.
>>
>
> I'm sorry for not making myself clear. What I'm essentially saying is
> that I think the "noël" test is synthetic and impractical

I remain unconvinced on that, and it's just one example. There are
plenty of letter-plus-diacritic combinations which have no precomposed
form; otherwise there would be no need for combining diacritics to
exist in the first place.
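
To make that concrete, here is a minimal sketch in Java (chosen only
because the example quoted further down is in Java). The class and
method names are my own, and "n" with a diaeresis is just one
combination which has no precomposed code point:

```java
import java.text.Normalizer;

final class NoPrecomposedForm {
    // Number of code points left after normalising the input to NFC.
    static int nfcCodePoints(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // e + U+0308 composes to the single code point U+00EB
        System.out.println(nfcCodePoints("e\u0308")); // 1
        // n + U+0308 has no precomposed form, so NFC leaves both code points
        System.out.println(nfcCodePoints("n\u0308")); // 2
    }
}
```

So even a string which is fully in NFC can still contain combining
marks.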

> It's also solvable by requiring NFC strings at input, and this is
> not an implementation defect. I also believe that Hangul is most
> likely to be precomposed and will work all right.

Requiring a particular normal form on input is not something a
programming language can do: it has no control over where its strings
come from. The only way you can guarantee NFC is by performing the
normalisation.
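
As a rough illustration of "performing the normalisation", again in
Java and using its standard java.text.Normalizer class (the wrapper
class and method names are hypothetical):

```java
import java.text.Normalizer;

final class NfcOnInput {
    // Returns the NFC form of whatever arrived; input already in NFC is unchanged in value.
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l";  // 'e' followed by U+0308 COMBINING DIAERESIS
        String precomposed = "no\u00EBl";  // U+00EB LATIN SMALL LETTER E WITH DIAERESIS

        // The language cannot assume which of the two it has been handed
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        System.out.println(toNfc(decomposed).equals(precomposed));                    // true
    }
}
```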

> And I have another opinion on UTF-8 shortest form.

There's no need for opinion there; we can consult the standard:
http://www.unicode.org/versions/Unicode6.0.0/

 > D76 Unicode scalar value: Any Unicode code point except
 > high-surrogate and low-surrogate code points.
 >
 > D79 A Unicode encoding form assigns each Unicode scalar value to a
 > unique code unit sequence.
 >
 > D77 Code unit: The minimal bit combination that can represent a unit
 > of encoded text for processing or interchange. [...] The Unicode
 > Standard uses 8-bit code units in the UTF-8 encoding form [...]
 >
 > D85a Minimal well-formed code unit subsequence: A well-formed Unicode
 > code unit sequence that maps to a single Unicode scalar value.
 >
 > D92 UTF-8 encoding form: The Unicode encoding form that assigns each
 > Unicode scalar value to an unsigned byte sequence of one to four
 > bytes in length, as specified in Table 3-6 and Table 3-7.
 >
 > Before the Unicode Standard, Version 3.1, the problematic
 > “non-shortest form” byte sequences in UTF-8 were those where BMP
 > characters could be represented in more than one way. These sequences
 > are ill-formed, because they are not allowed by Table 3-7.

In short: UTF-8 defines a mapping between sequences of 8-bit "code
units" and abstract "Unicode scalar values". Every Unicode scalar value
maps to exactly one sequence of code units, and every Unicode scalar
value can be represented. Since "U+0308 COMBINING DIAERESIS" is a valid
Unicode scalar value, a UTF-8 string containing it can be perfectly
well-formed. The shortest-form rule only forbids alternative (longer)
encodings of the same Unicode scalar value; it says nothing about
whether a string is composed or decomposed.
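
A small sketch of the distinction, using Java's built-in UTF-8 decoder
as a stand-in for "a conforming decoder" (the helper name is mine, and
C0 AF is simply a well-known non-shortest encoding of "/"):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class Utf8ShortestForm {
    // True if Java's strict UTF-8 decoder accepts the bytes as well-formed.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input by throwing rather than substituting
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+0308 COMBINING DIAERESIS on its own encodes as the two bytes CC 88
        byte[] diaeresis = "\u0308".getBytes(StandardCharsets.UTF_8);
        System.out.println(isWellFormedUtf8(diaeresis));     // true

        // C0 AF is a two-byte (non-shortest) encoding of U+002F, which must be one byte
        byte[] overlongSlash = { (byte) 0xC0, (byte) 0xAF };
        System.out.println(isWellFormedUtf8(overlongSlash)); // false
    }
}
```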

There may be standards for interchange in particular situations which
enforce additional constraints, such as that all strings should be in
NFC, but whether such a standard applies, or has been implemented
correctly by whoever produced the string, is not something you can
rely on when defining string handling for an entire programming
language.

>
> That aside.
>
> I think requirements are what I was asking about. I'm assuming that
> your standpoint is that string modification routines are at least
> required to take into account entire characters, not only code points.
> Am I correct?

Yes, I think that at least some functions should be available which work 
on "characters" as users would define them, such as length and perhaps 
safe truncation.
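
For a feel of what such functions might look like, here is a sketch
in Java built on java.text.BreakIterator, whose "character instance"
treats a base letter plus its combining marks as one unit. The class
and method names are hypothetical, and this is not a proposal for how
PHP should implement it:

```java
import java.text.BreakIterator;

final class GraphemeOps {
    // Counts user-perceived characters, as delimited by BreakIterator.
    static int graphemeLength(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Keeps at most max user-perceived characters, never cutting one in half.
    static String truncate(String s, int max) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int end = 0;
        for (int i = 0; i < max; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                break;
            }
            end = next;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l";   // 'e' followed by U+0308
        System.out.println(decomposed.length());                       // 5 code units
        System.out.println(graphemeLength(decomposed));                // 4
        // A naive substring(0, 3) would return "noe" and drop the diaeresis
        System.out.println(truncate(decomposed, 3).equals("noe\u0308")); // true
    }
}
```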

>
> What is confusing me is that I think you're seeing it as a major
> implementation defect. To avoid arguable implementations, I've made
> a short example in Java:
>
> System.out.println(new StringBuffer("noël").reverse().toString());
>
> It does produce the string "l̈eon", as I would expect.

Why do you expect that? Is this a result which would ever be useful?

To be clear, I am suggesting that we aim to be the language which gets 
this right, where other languages get it wrong.

> Precomposed "noël" also works as I would expect, producing the string
> "lëon". What do you think, is this an implementation issue or solely
> a requirements issue?

Well, you can only define an implementation defect with respect to the 
original requirement. If the requirement was to reverse "characters", as 
most users would understand that term, then moving the diacritic to a 
different letter fails that requirement, because a user would not 
consider a diacritic a separate character.

If the requirement was to reverse code points, regardless of their 
meaning, then the implementation is fine, but I would argue that the 
requirement failed to capture what most users would actually want.
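
To show the two requirements side by side, here is a sketch which
reverses by user-perceived character rather than by code point. It is
in Java to match the quoted example; the class and method names are
mine, and it relies on java.text.BreakIterator to find the boundaries:

```java
import java.text.BreakIterator;

final class GraphemeReverse {
    // Reverses the order of user-perceived characters, keeping each one intact.
    static String reverseGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        StringBuilder out = new StringBuilder(s.length());
        int end = it.last();
        // Walk the boundaries backwards, copying one whole cluster at a time
        for (int start = it.previous(); start != BreakIterator.DONE; start = it.previous()) {
            out.append(s, start, end);
            end = start;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decomposed = "noe\u0308l";   // 'e' followed by U+0308

        // Code point reversal: the diaeresis ends up attached to the 'l'
        String byCodePoint = new StringBuilder(decomposed).reverse().toString();
        System.out.println(byCodePoint.equals("l\u0308eon"));                    // true

        // Cluster reversal: the diaeresis stays on the 'e'
        System.out.println(reverseGraphemes(decomposed).equals("le\u0308on")); // true
    }
}
```

Both functions are "correct" against some requirement; only the second
matches what a user would mean by reversing the characters.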

Regards,
-- 
Rowan Collins
[IMSoP]