Newsgroups: php.internals
From: rowan.collins@gmail.com (Rowan Collins)
To: Aleksey Tulinov, internals@lists.php.net
Date: Wed, 15 Oct 2014 08:04:48 +0100
Subject: Re: [PHP-DEV] Unicode support
Message-ID: <68E97150-8840-4C31-B271-3E8C8BE933DB@gmail.com>

>Good point. That's what i meant by border-line case. Could you possibly
>point me to a specific example of such false positive? I'm interested in
>well-formed UTF-8 string. I believe "noël" test is ill-formed UTF-8 and
>doesn't conform to shortest-form requirement.

You're confusing two concepts here: well-formed UTF-8 represents any single code point with the smallest number of bytes, but it makes no requirements about what code points are represented. Representing "ë" as two code points (a base "e" followed by a combining diaeresis) is perfectly valid Unicode, and would in fact be required under NFD.

That "most" input sources would prefer the combined form seems like a weak assumption to base a library on; it only takes one popular third party routinely returning data in NFD for the problems to start showing up.

>> It's pretty meaningless to say you support Unicode, but only the easy
>> bits. You might as well just tag each string with one of the pages of
>> ISO-8859.
>>
>
>As far as i'm concerned Unicode specification does not require to
>implement all annexes or even support entire character set to be
>conformant. I think there are always trade-offs involved, depending on
>what is more important for you.

Sure, but there are certain user expectations of what "Unicode support" means. Handling Korean characters in a meaningful way would definitely be on that list.

As I said at the top of my first post, the important thing is to capture what those requirements actually are, just as you'd choose which array functions were needed if you were adding "array support" to a language.

To put it a different way, in what situation would you actively want to know the number of code points in a string, rather than either the number of bytes in its UTF-8 representation, or the number of graphemes?
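
To make the three counts concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the proposed PHP API). The helper names counts and approx_graphemes are hypothetical, and the grapheme count is a deliberately crude approximation rather than the segmentation rules of UAX #29.

```python
import unicodedata

# "noël" in precomposed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", "noe\u0308l")
nfd = unicodedata.normalize("NFD", nfc)


def counts(s):
    """Return (code points, UTF-8 bytes) for a string. Hypothetical helper."""
    return len(s), len(s.encode("utf-8"))


def approx_graphemes(s):
    """Crude grapheme estimate: count code points that are not combining marks.

    Assumption for illustration only; real segmentation follows UAX #29.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# Both forms are well-formed UTF-8; they differ in code points and bytes,
# but a reader sees four characters either way.
print(counts(nfc), approx_graphemes(nfc))  # (4, 5) 4
print(counts(nfd), approx_graphemes(nfd))  # (5, 6) 4

# Korean: one precomposed Hangul syllable decomposes into three jamo under NFD.
hangul_nfd = unicodedata.normalize("NFD", "\ud55c")
print(counts(hangul_nfd))            # (3, 9)
# The jamo are not combining marks, so the crude estimate miscounts here.
print(approx_graphemes(hangul_nfd))  # 3, although a reader sees one character
```

The last two lines show why counting code points, or even skipping combining marks, is not enough for Korean: the decomposed syllable is one user-perceived character, yet neither shortcut reports it as such.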