Message-ID: <543DAA29.8040701@gmail.com>
Date: Wed, 15 Oct 2014 01:56:41 +0300
To: internals@lists.php.net
In-Reply-To: <543D8FFA.8080408@gmail.com>
Subject: Re: [PHP-DEV] Unicode support
From: aleksey.tulinov@gmail.com (Aleksey Tulinov)

On 15/10/14 00:04, Rowan Collins wrote:

Rowan,

>> Back to combining characters, i dig the idea of introducing graphemes,
>> but i think French person would write word "noël" using precomposed
>> character. I'm using French keyboard at
>> https://translate.google.com/#fr/. "ë" is Shift + "^", then "e", it
>> produces precomposed U+00EB.
>
> You don't even need to rely on the input method using the combined form,
> Unicode includes an algorithm for normalisation to this form (where such
> composites are coded), known as NFC.
>

The problem with NFC is that it is not only composition, but decomposition + reordering + re-composition. I know about the NFC quick check, but the issue is what happens when the check fails and the string needs transformation: that would be very challenging, if not impossible, to do while keeping the string immutable and without introducing an internal representation of that string. An internal representation and string modifications bring overhead, which might eventually render the implementation unusable for a range of applications.

On the other hand, language-specific characters which can be precomposed are likely to be precomposed.

>> If script doesn't have precomposed equivalent, then this grapheme will
>> always be in the same decomposed form and collation will work.
>> Substring search will also work, because needle will be decomposed in
>> the same way as haystack.
>
> No, it won't. You won't get false negatives as long as both strings are
> normalised to the same form (whether that is NFC or NFD), but you will
> get false positives. For instance, searching for the substring "e" would
> not match a combined ë, but it would match an uncombined sequence with e
> at its base (e.g. with two diacritics).
>
> Normalising to NFD (fully de-composed) would at least mean that "e"
> consistently matched all graphemes with "e" at their base, but is not a
> lossless operation, so performing it implicitly is probably not a good
> idea.
>

Good point. That's what I meant by a borderline case. Could you possibly point me to a specific example of such a false positive? I'm interested in a well-formed UTF-8 string. I believe the "noël" test is ill-formed UTF-8 and doesn't conform to the shortest-form requirement.

> It's pretty meaningless to say you support Unicode, but only the easy
> bits. You might as well just tag each string with one of the pages of
> ISO-8859.
>

As far as I'm concerned, the Unicode specification does not require implementing all annexes, or even supporting the entire character set, in order to be conformant. I think there are always trade-offs involved, depending on what is more important for you.
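
For illustration only, the kind of false positive described in the quoted text can be sketched in a few lines of Python using the standard unicodedata module. Python is used here just because it is easy to try; nothing in this thread proposes it. The sample strings are assumptions chosen for the example: "e" followed by COMBINING LOW LINE (U+0332) is used as a sequence that, as far as I know, has no precomposed equivalent. The sketch also runs the NFC check mentioned above (unicodedata.is_normalized, which needs Python 3.8 or later) and round-trips both strings through UTF-8.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC-normalised form of s."""
    return unicodedata.normalize("NFC", s)


# "noël" written as "e" + COMBINING DIAERESIS (U+0308), i.e. decomposed.
decomposed = "noe\u0308l"
# NFC composes the pair into the single code point U+00EB.
composed = nfc(decomposed)

# Hypothetical sample: "e" + COMBINING LOW LINE (U+0332). This pair is
# assumed to have no precomposed form, so NFC leaves it as two code points.
underlined = nfc("note\u0332")

needle = nfc("e")

# Code-point substring search on NFC-normalised strings.
match_composed = needle in composed      # bare "e" is absent here
match_underlined = needle in underlined  # matches the base of a larger grapheme

# NFC check: fails for the decomposed input, passes once normalised.
needs_work = not unicodedata.is_normalized("NFC", decomposed)
already_nfc = unicodedata.is_normalized("NFC", underlined)

# Both normalised strings survive a strict UTF-8 encode/decode round trip.
round_trip_ok = all(
    s.encode("utf-8").decode("utf-8") == s for s in (composed, underlined)
)

print(match_composed, match_underlined, needs_work, already_nfc, round_trip_ok)
```

In this sketch the search for "e" does not match the composed "noël" but does match the underlined sample, even though both strings are in NFC. Whether that counts as a false positive depends on whether the search is meant to operate on code points or on graphemes.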