Newsgroups: php.internals
Message-ID: <543D8FFA.8080408@gmail.com>
Date: Tue, 14 Oct 2014 22:04:58 +0100
From: rowan.collins@gmail.com (Rowan Collins)
To: internals@lists.php.net
In-Reply-To: <543D8528.1060605@gmail.com>
Subject: Re: [PHP-DEV] Unicode support

On 14/10/2014 21:18, Aleksey Tulinov wrote:
> Back to combining characters, i dig the idea of introducing graphemes,
> but i think French person would write word "noël" using precomposed
> character. I'm using French keyboard at
> https://translate.google.com/#fr/. "ë" is Shift + "^", then "e", it
> produces precomposed U+00EB.

You don't even need to rely on the input method using the combined form: Unicode includes an algorithm for normalisation to this form (where such composites are coded), known as NFC.

> If script doesn't have precomposed equivalent, then this grapheme will
> always be in the same decomposed form and collation will work.
> Substring search will also work, because needle will be decomposed in
> the same way as haystack.

No, it won't. You won't get false negatives as long as both strings are normalised to the same form (whether that is NFC or NFD), but you will get false positives. For instance, searching for the substring "e" would not match a combined ë, but it would match an uncombined sequence with "e" at its base (e.g. an "e" whose diacritics have no precomposed form).
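
The false-positive case can be sketched in a few lines of Python using the standard unicodedata module. This is only an illustration under stated assumptions: the sample strings, and the choice of "e" followed by a combining ring above (U+030A) as a sequence with no precomposed form, are not taken from the thread.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalise to NFC: composed wherever a composite is coded."""
    return unicodedata.normalize("NFC", text)


def nfd(text: str) -> str:
    """Normalise to NFD: fully decomposed."""
    return unicodedata.normalize("NFD", text)


# The same word entered two ways (assumed sample data).
typed_decomposed = "noe\u0308l"   # "e" followed by COMBINING DIAERESIS
typed_precomposed = "no\u00ebl"   # precomposed U+00EB

# No false negatives: both inputs normalise to the same NFC string.
same_after_nfc = nfc(typed_decomposed) == nfc(typed_precomposed)

# After NFC the "e" is absorbed into U+00EB, so a search for "e" misses it...
match_in_composed = "e" in nfc(typed_decomposed)

# ...but "e" + COMBINING RING ABOVE has no precomposed form, so NFC leaves
# the bare "e" in place and the same search reports a match.
ring_e = nfc("e\u030a")
match_in_uncomposed = "e" in ring_e

# Under NFD the base letter is always exposed, so "e" matches consistently.
match_after_nfd = "e" in nfd(typed_precomposed)
```

Here same_after_nfc is True, match_in_composed is False, and both match_in_uncomposed and match_after_nfd are True: whether a plain "e" matches depends on which composites happen to be coded, not on what the reader sees.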
Normalising to NFD (fully decomposed) would at least mean that "e" consistently matched all graphemes with "e" at their base, but it is not a lossless operation, so performing it implicitly is probably not a good idea.

All of which ignores the questions of length and string reversal, which I think are much more important in this respect.

> There are some border-line cases possible, but are they really
> practical in a scope of Unicode support in a programming language?

As I understand it, the entirety of the Korean writing system is an "edge case" in this respect - in its decomposed form it uses up to 3 code points for each grapheme, and cutting one of those graphemes apart leaves you with gibberish.

It's pretty meaningless to say you support Unicode, but only the easy bits. You might as well just tag each string with one of the pages of ISO-8859.

-- 
Rowan Collins
[IMSoP]
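
The length, reversal and Korean points above can be sketched with Python's standard unicodedata module. The sample strings and the helper naive_clusters are hypothetical, introduced only for illustration; the helper merely attaches combining marks to the preceding character and is a crude stand-in for real grapheme segmentation (it does not handle Hangul jamo, for instance).

```python
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: a rough approximation of grapheme clusters,
    not a full segmentation algorithm.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


# "noël" in decomposed form: n, o, e, U+0308, l
word = unicodedata.normalize("NFD", "no\u00ebl")

code_point_length = len(word)                 # counts code points
cluster_length = len(naive_clusters(word))    # counts what a reader sees

# Reversing code points moves the diaeresis onto the "l".
reversed_code_points = word[::-1]
# Reversing clusters keeps the diaeresis on the "e".
reversed_clusters = "".join(reversed(naive_clusters(word)))

# One Hangul syllable block (U+D55C) decomposes into three jamo.
syllable = "\ud55c"
jamo = unicodedata.normalize("NFD", syllable)
jamo_count = len(jamo)
# Cutting after the first code point leaves a lone leading consonant.
truncated = jamo[:1]
```

With these inputs the decomposed word is 5 code points but 4 clusters, the code-point reversal yields "l" + U+0308 + "eon" rather than "lëon", and the single Korean syllable becomes 3 code points whose first element on its own is no longer the syllable.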