Newsgroups: php.internals
From: nicolas.grekas+php@gmail.com (Nicolas Grekas)
To: Aleksey Tulinov, Rowan Collins
Cc: "internals@lists.php.net"
Date: Thu, 16 Oct 2014 19:03:04 +0200
Subject: Re: [PHP-DEV] Unicode support

Hello,

I think that Rowan is right: PHP users need to manipulate grapheme
clusters first (and code points in some rare situations). The fact that
most of us live in a world where NFC composes all our characters only
hides this reality.

A typical use case is a template engine: nearly all string manipulations
there need grapheme awareness: cutting strings to get an excerpt,
inserting a space between every "character", changing the case, etc.
This is a typical use case for a PHP app.

Another use case is implementing text indexing in PHP: you need to
normalize before indexing, handle case folding, and thus think in terms
of graphemes. I'm not sure this is frequent in PHP though.

As already said, alongside grapheme clusters, we should also deal with
string matching: collations are out of scope, but normalization and case
folding are in. Please do not forget the Turkish alphabet (dotted and
dotless i) either... This is required IMHO to get what users expect from
str_replace, strpos, strcmp, etc.

I wrote a quite successful PHP lib to deal with this in PHP:
https://github.com/nicolas-grekas/Patchwork-UTF8

My experience from this is the following:

- dealing with grapheme clusters in current PHP is OK with the
  grapheme_*() functions, but these require intl.
  It would be great to have them (or an equivalent) in core,
- NFC normalization of all input is required to deal with string
  comparisons, so having Normalizer in core looks required as well,
- almost everybody uses mbstring when dealing with UTF-8 strings, but
  almost all cases should use a grapheme_*() function instead.

>> To be clear, I am suggesting that we aim to be the language which gets
>> this right, where other languages get it wrong.
>
> Thank you for explaining this. I also think it could do better. I think
> Unicode-aware strrev() shouldn't be too complicated to do.

Perl 6 identified the issue very well and invented what they call "NFG",
which is NFC + dynamic internal code points for non-composable grapheme
clusters:
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html

Maybe worth looking at?

Cheers,
Nicolas
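
To make the difference between bytes, code points and grapheme clusters
concrete, here is a minimal sketch of the mbstring vs. grapheme_*() and
NFC points made above. It assumes PHP 7.0 or later (for the "\u{...}"
escape, which postdates this thread) with the intl and mbstring
extensions loaded. The sample strings are illustrative only and do not
come from the original discussion.

```php
<?php
// Assumption: PHP >= 7.0 with ext/intl and ext/mbstring enabled.

// "é" written two ways: precomposed (NFC) and "e" + combining acute (NFD).
$nfc = "\u{00E9}";
$nfd = "e\u{0301}";

// The two strings look identical on screen but are different byte sequences.
var_dump($nfc === $nfd);              // bool(false)

// Three different "lengths" for the decomposed form.
var_dump(strlen($nfd));               // int(3)  bytes
var_dump(mb_strlen($nfd, 'UTF-8'));   // int(2)  code points
var_dump(grapheme_strlen($nfd));      // int(1)  grapheme clusters

// Normalizing both sides to NFC before comparing gives the expected result.
$same = Normalizer::normalize($nfc, Normalizer::FORM_C)
    === Normalizer::normalize($nfd, Normalizer::FORM_C);
var_dump($same);                      // bool(true)

// Cutting an excerpt from decomposed "cafés":
$word = "cafe\u{0301}s";

// mb_substr() counts code points, so it separates the accent from its base.
$byCodePoint = mb_substr($word, 0, 4, 'UTF-8');   // "cafe"

// grapheme_substr() counts clusters, so the accent stays attached.
$byGrapheme = grapheme_substr($word, 0, 4);       // "cafe\u{0301}"

var_dump($byCodePoint === "cafe");                // bool(true)
var_dump($byGrapheme === "cafe\u{0301}");         // bool(true)
```

With NFC input the mb_* and grapheme_* results would agree for this
particular word, which is why the problem tends to stay hidden until
decomposed text or a non-composable cluster shows up.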