Newsgroups: php.internals
From: aleksey.tulinov@gmail.com (Aleksey Tulinov)
To: internals@lists.php.net
Subject: Re: [PHP-DEV] Unicode support
Date: Tue, 14 Oct 2014 23:18:48 +0300
Message-ID: <543D8528.1060605@gmail.com>
In-Reply-To: <543D64E5.8000706@gmail.com>

On 14/10/14 21:01, Rowan Collins wrote:

Rowan,

> As I've mentioned before, a lot of the time what people actually want to
> deal with is "grapheme clusters" - the kind of thing that you'd think of
> as a character if you were writing by hand. Most people, if asked the
> length of the string "noël", would answer 4, but there may be 5 code
> points. (That's not just a case of normalisation choices; most
> combinations of letter+diacritic have no single code point, that's why
> the combining forms exist.)
>

Very good point. I'll give another example: is there a substring "s" in
the string "Maße"? In a case-sensitive search there is no such substring,
but in a case-insensitive search "ß" folds into "ss" and the substring
"s" appears.

This works both ways. For instance, if someone wants to split the string
"MASSE" after "ß" in a case-insensitive manner, one approach might be:
1) find the position of "ß", which is +2; 2) split the string at +3. The
result would be two strings, "MAS" and "SE", so the split lands in the
middle of the "SS" that matched "ß".

Back to combining characters: I like the idea of introducing graphemes,
but I think a French person would write the word "noël" using a
precomposed character. I'm using the French keyboard at
https://translate.google.com/#fr/. "ë" is Shift + "^", then "e", and it
produces the precomposed U+00EB.
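To make the "Maße" and "MASSE" examples concrete, here is a minimal
sketch. It is written in Python only because its built-in str.casefold()
does full case folding; it is not a proposal for a PHP API. The helper
name contains() is hypothetical. Reusing offsets from the folded string
on the original is an assumption that holds here only because folding
"MASSE" does not change its length.

```python
def contains(haystack, needle, ignore_case=False):
    """Substring test; with ignore_case, compare full case foldings."""
    if ignore_case:
        return needle.casefold() in haystack.casefold()
    return needle in haystack


print(contains("Maße", "s"))                    # False: case-sensitive
print(contains("Maße", "s", ignore_case=True))  # True: "ß" folds to "ss"

# The "MASSE" split: search in the folded text, then cut the original.
# Folded offsets are valid on the original only because folding "MASSE"
# leaves its length unchanged (assumption specific to this input).
text, needle = "MASSE", "ß"
pos = text.casefold().find(needle.casefold())   # offset of the match: 2

naive_cut = pos + len(needle)              # needle is 1 code point long
folded_cut = pos + len(needle.casefold())  # the match is 2 code points long

print(text[:naive_cut], text[naive_cut:])    # MAS SE
print(text[:folded_cut], text[folded_cut:])  # MASS E
```

The two cuts differ because the needle and the text it matches have
different lengths, which is why "split after the match" needs the length
of the matched span rather than the length of the needle.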
If a script doesn't have a precomposed equivalent, then the grapheme will
always be in the same decomposed form and collation will work. Substring
search will also work, because the needle will be decomposed in the same
way as the haystack.

There are some borderline cases possible, but are they really practical
in the scope of Unicode support in a programming language? Any ideas?

P.S. Point about documentation taken.
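A small sketch of the point about the needle and haystack sharing one
form, again in Python purely for illustration. The helper name
normalized_find() is hypothetical, and choosing NFD as the common form is
an assumption; NFC would serve equally well as long as both sides use the
same one.

```python
import unicodedata


def normalized_find(haystack, needle, form="NFD"):
    """Find needle in haystack after bringing both to the same form.

    The returned offset refers to the normalized haystack, not the input.
    """
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    return h.find(n)


precomposed = "no\u00ebl"   # "ë" as the single code point U+00EB
decomposed = "noe\u0308l"   # "e" followed by combining diaeresis U+0308

print(len(precomposed), len(decomposed))     # 4 5

# Without normalization the two spellings of "ë" do not match.
print(decomposed.find("\u00eb"))             # -1

# With both sides decomposed the same way, the search succeeds.
print(normalized_find(decomposed, "\u00eb"))   # 2
print(normalized_find(precomposed, "e\u0308"))  # 2

# One borderline case: a plain "e" matches inside the decomposed form
# but not inside the precomposed one.
print("e" in decomposed, "e" in precomposed)   # True False
```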