Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:78061
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.215.49 as permitted sender)
Message-ID: <543D8528.1060605@gmail.com>
Date: Tue, 14 Oct 2014 23:18:48 +0300
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: internals@lists.php.net
References: <543CE705.7030203@gmail.com> <4575A816-43F4-462D-8150-A2D35516D914@ajf.me> <543D64E5.8000706@gmail.com>
In-Reply-To: <543D64E5.8000706@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] Unicode support
From: aleksey.tulinov@gmail.com (Aleksey Tulinov)

On 14/10/14 21:01, Rowan Collins wrote:

Rowan,

> As I've mentioned before, a lot of the time what people actually want to
> deal with is "grapheme clusters" - the kind of thing that you'd think of
> as a character if you were writing by hand. Most people, if asked the
> length of the string "noël", would answer 4, but there may be 5 code
> points. (That's not just a case of normalisation choices; most
> combinations of letter+diacritic have no single code point, that's why
> the combining forms exist.)
>

Very good point. I'll give another example: is there a substring "s" in 
the string "Maße"? In a case-sensitive search there is no such 
substring, but in a case-insensitive search "ß" folds into "ss" 
and the substring "s" appears.
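
To make this concrete, here is a minimal sketch. It is in Python rather 
than PHP purely for illustration, and it assumes that Python's 
str.casefold() is an acceptable stand-in for Unicode case folding:

```python
word = "Maße"
needle = "s"

# Case-sensitive: the string holds "M", "a", "ß", "e", so there is no "s".
case_sensitive = needle in word

# Case-insensitive: fold both sides first; "Maße" folds to "masse".
folded_word = word.casefold()
case_insensitive = needle.casefold() in folded_word

print(case_sensitive, case_insensitive, folded_word)
```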

This works both ways. For instance, if someone wants to split the string 
"MASSE" after "ß" in a case-insensitive manner, one approach might be: 1) 
find the position of "ß", which is +2; 2) add the length of "ß" and split 
the string at +3. The result would be two strings, "MAS" and "SE", with 
the match for "ß" cut in half.
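
The naive approach can be sketched like this (Python again, only as an 
illustration; the variable names are mine and the "position plus needle 
length" rule is an assumption about how such code would typically be 
written):

```python
haystack = "MASSE"
needle = "ß"

# Step 1: case-insensitive search by folding both strings.
# "MASSE" folds to "masse", "ß" folds to "ss", so the match starts at 2.
pos = haystack.casefold().find(needle.casefold())

# Step 2: split "after the needle" using the length of the original needle.
# "ß" is one code point, but it matched two code points in the haystack.
cut = pos + len(needle)
left, right = haystack[:cut], haystack[cut:]

print(pos, cut, left, right)
```

The offsets are computed on folded text but applied to the original 
string, which is why the split lands in the middle of the match.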

Back to combining characters: I like the idea of introducing graphemes, 
but I think a French person would write the word "noël" using a 
precomposed character. I'm using the French keyboard at 
https://translate.google.com/#fr/. "ë" is Shift + "^", then "e", and it 
produces precomposed U+00EB.
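
A small sketch of the two spellings of "noël", using Python's standard 
unicodedata module for illustration (the variable names are mine):

```python
import unicodedata

precomposed = "no\u00ebl"   # "ë" as the single code point U+00EB
decomposed = "noe\u0308l"   # "e" followed by U+0308 COMBINING DIAERESIS

# Same text on screen, different number of code points.
lengths = (len(precomposed), len(decomposed))

# NFC composes the pair back into U+00EB; NFD does the opposite.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(lengths, same_after_nfc, same_after_nfd)
```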

If a script doesn't have a precomposed equivalent, then the grapheme will 
always be in the same decomposed form and collation will work. Substring 
search will also work, because the needle will be decomposed in the same 
way as the haystack. Some borderline cases are possible, but are they 
really practical within the scope of Unicode support in a programming 
language?
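
As a rough sketch of what I mean by decomposing the needle and the 
haystack in the same way (Python for illustration only; contains() is a 
hypothetical helper, and the choice of NFD is an assumption):

```python
import unicodedata

def contains(haystack, needle, form="NFD"):
    """Substring test after bringing both strings to the same form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))

decomposed_word = "noe\u0308l"   # "noël" typed with a combining diaeresis
precomposed_needle = "\u00eb"    # "ë" as a single code point

# A raw code point comparison misses the match across the two forms.
raw_match = precomposed_needle in decomposed_word

# With both sides decomposed the same way, the match is found.
normalized_match = contains(decomposed_word, precomposed_needle)

# One borderline case: a bare "e" also matches, because code point
# search does not stop at grapheme boundaries.
bare_e_match = contains(decomposed_word, "e")

print(raw_match, normalized_match, bare_e_match)
```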

Any ideas?

P.S. Point about documentation taken.