Newsgroups: php.internals
From: rowan.collins@gmail.com (Rowan Collins)
Date: Wed, 15 Oct 2014 13:58:22 +0100
To: internals@lists.php.net
Subject: Re: [PHP-DEV] Unicode support
Message-ID: <543E6F6E.8080006@gmail.com>
In-Reply-To: <543E4626.3000102@gmail.com>

Aleksey Tulinov wrote (on 15/10/2014):
> On 15/10/14 10:04, Rowan Collins wrote:
>
> Rowan,
>
>> As I said at the top of my first post, the important thing is to capture
>> what those requirements actually are. Just as you'd choose what array
>> functions were needed if you were adding "array support" to a language.
>>
>
> I'm sorry for not making myself clear. What i'm essentially saying is
> that i think "noël" test is synthetic and impractical

I remain unconvinced of that, and it's just one example. Plenty of
base-plus-diacritic combinations have no precomposed form; otherwise
there would be no need for combining diacritics to exist in the first
place.

> it's also solvable with requirement of NFC strings at input and this
> is not implementation defect. I also believe that Hangul is most
> likely to be precomposed and will work alright.

Requiring a particular normal form on input is not something a
programming language can do. The only way you can guarantee NFC form is
by performing the normalisation.

> And i have another opinion on UTF-8 shortest-form.

There's no need for opinion there; we can consult the standard.
http://www.unicode.org/versions/Unicode6.0.0/

> D76 Unicode scalar value: Any Unicode code point except high-surrogate
> and low-surrogate code points.

> D77 Code unit: The minimal bit combination that can represent a unit
> of encoded text for processing or interchange. [...] The Unicode
> Standard uses 8-bit code units in the UTF-8 encoding form [...]

> D79 A Unicode encoding form assigns each Unicode scalar value to a
> unique code unit sequence.

> D85a Minimal well-formed code unit subsequence: A well-formed Unicode
> code unit sequence that maps to a single Unicode scalar value.

> D92 UTF-8 encoding form: The Unicode encoding form that assigns each
> Unicode scalar value to an unsigned byte sequence of one to four bytes
> in length, as specified in Table 3-6 and Table 3-7.

> Before the Unicode Standard, Version 3.1, the problematic
> “non-shortest form” byte sequences in UTF-8 were those where BMP
> characters could be represented in more than one way. These sequences
> are ill-formed, because they are not allowed by Table 3-7.

In short: UTF-8 defines a mapping from sequences of 8-bit "code units"
to abstract "Unicode scalar values". Every Unicode scalar value maps to
a single unique sequence of code units, and every Unicode scalar value
can be represented.

Since "U+0308 COMBINING DIAERESIS" is a valid Unicode scalar value, a
UTF-8 string representing that value can be well-formed. The
shortest-form rule only excludes alternative byte sequences for the same
Unicode scalar value; it does not restrict which scalar values a string
may contain.

There may be standards for interchange in particular situations which
enforce additional constraints, such as that all strings should be in
NFC, but the applicability or correct implementation of such standards
is not something that you can use to define handling in an entire
programming language.

>
> That aside.
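
To make the shortest-form point concrete, here is a minimal sketch in
Java (the language of the quoted example below). It assumes that the
JDK's strict UTF-8 decoder is an acceptable stand-in for Table 3-7; the
class and method names are hypothetical. It checks two byte sequences
that would both decode to U+0308 under a naive decoder: the two-byte
shortest form, and a three-byte non-shortest form.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ShortestFormCheck {
    /** True if the JDK's UTF-8 decoder accepts the bytes when told to report errors. */
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+0308 COMBINING DIAERESIS in its shortest (and only valid) form.
        byte[] shortest = {(byte) 0xCC, (byte) 0x88};
        // The same scalar value padded out to three bytes: non-shortest form.
        byte[] overlong = {(byte) 0xE0, (byte) 0x8C, (byte) 0x88};

        System.out.println(isWellFormedUtf8(shortest)); // true
        System.out.println(isWellFormedUtf8(overlong)); // false
    }
}
```

The lone combining mark is accepted and only the padded encoding is
rejected, which is the distinction drawn above: shortest form is about
how a scalar value is encoded, not about which scalar values appear.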
>
> I think requirements is what i was asking about, i'm assuming that
> your standpoint is that string modification routines are at least
> required to take into account entire characters, not only code points.
> Am i correct?

Yes, I think that at least some functions should be available which work
on "characters" as users would define them, such as length and perhaps
safe truncation.

>
> What is confusing me is that i think you're seeing it as a major
> implementation defect. To avoid arguable implementations, i've made
> short example in Java:
>
> System.out.println(new StringBuffer("noël").reverse().toString());
>
> It does produce string "l̈eon" as i would expect.

Why do you expect that? Is this a result which would ever be useful?

To be clear, I am suggesting that we aim to be the language which gets
this right, where other languages get it wrong.

> Precomposed "noël" also works as i would expect producing string
> "lëon". What do you think, is this implementation issue or solely
> requirements issue?

Well, you can only define an implementation defect with respect to the
original requirement.

If the requirement was to reverse "characters", as most users would
understand that term, then moving the diacritic to a different letter
fails that requirement, because a user would not consider a diacritic a
separate character.

If the requirement was to reverse code points, regardless of their
meaning, then the implementation is fine, but I would argue that the
requirement failed to capture what most users would actually want.

Regards,

-- 
Rowan Collins
[IMSoP]
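
As a rough sketch of the two requirements contrasted above, the Java
below compares reversal by code point with reversal by user-perceived
character. It assumes that `java.text.BreakIterator`'s character
instance is a reasonable approximation of "characters as users would
define them" (the exact boundary rules depend on the JDK version); the
class and method names are hypothetical. It also shows that NFC only
helps where a precomposed form exists.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class GraphemeReverse {
    /** Splits text into user-perceived characters (base letter plus combining marks). */
    static List<String> graphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    /** Reverses the order of user-perceived characters, keeping each one intact. */
    static String reverseGraphemes(String text) {
        List<String> parts = graphemes(text);
        Collections.reverse(parts);
        return String.join("", parts);
    }

    /** Reverses code points, as StringBuffer.reverse() does in the quoted example. */
    static String reverseCodePoints(String text) {
        return new StringBuilder(text).reverse().toString();
    }

    /** Normalises to NFC; this is the only way to guarantee the input is in NFC. */
    static String nfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String noel = "noe\u0308l"; // decomposed: 'e' followed by U+0308

        // Code point reversal moves the diaeresis onto the 'l'.
        System.out.println(reverseCodePoints(noel).equals("l\u0308eon")); // true
        // Grapheme reversal keeps the diaeresis on the 'e'.
        System.out.println(reverseGraphemes(noel).equals("le\u0308on"));  // true

        System.out.println(noel.length());           // 5 (UTF-16 code units)
        System.out.println(graphemes(noel).size());  // 4 (user-perceived characters)

        // 'q' with a diaeresis has no precomposed form, so NFC leaves two code points.
        String q = "q\u0308";
        System.out.println(nfc(q).length());         // 2
        System.out.println(graphemes(q).size());     // 1
    }
}
```

Normalising to NFC first would make the "noël" case come out right under
code point reversal, but not the "q" case, which is why normalisation
alone does not satisfy a requirement stated in terms of characters.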