Newsgroups: php.internals
From: nicolas.grekas+php@gmail.com (Nicolas Grekas)
Date: Mon, 27 May 2013 10:33:51 +0200
Subject: Re: [PHP-DEV] Proposal for better UTF-8 handling
To: Pierre Joye
Cc: Rouven Weßling, PHP internals
References: <61BC4F17-86D9-4CBD-B185-58A2D4AFAE5F@rouvenwessling.de>

Btw, I hit a bug in grapheme_substr() that got no attention:
https://bugs.php.net/bug.php?id=62759

There is also https://bugs.php.net/bug.php?id=61860, which is still
waiting for a fix.

Nicolas

On Mon, May 27, 2013 at 8:40 AM, Pierre Joye wrote:

> hi!
>
> On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling wrote:
> > Hi Internals!
> >
> > First let me introduce myself: my name is Rouven Weßling, I'm a
> > student at RWTH Aachen University and I'm one of the maintainers of
> > the Joomla! Framework (née Platform). I've been following the
> > internals list for a few months and have been brushing up my C
> > skills for the past couple of months so I can start contributing.
> >
> > To me one of the most annoying things about working with PHP is the
> > (lack of) Unicode support. In Joomla! we've been discussing
> > switching from PHP UTF-8 to Patchwork UTF-8 for our UTF-8 handling
> > needs. Both are libraries abstracting the multibyte extension and
> > supplementing it with a number of functions. They also provide
> > userland replacements for when multibyte is not available
> > (Patchwork will also use iconv and intl if available). All of this
> > is a huge pain.
> >
> > To ease this situation I'd like to make a new start at better
> > Unicode support for PHP, this time focusing on UTF-8 as the
> > dominant web encoding. As a first step I'd like to propose adding a
> > set of functions for handling UTF-8 strings. This should keep
> > applications from implementing these algorithms in PHP (also many
> > of these are quite a bit faster, see benchmark results below). Once
> > the algorithms are in place I'd like to look into creating a class
> > for Unicode strings and eventually Python-like Unicode literals.
> >
> > Before I write an RFC I'd like to get some feedback on what you
> > think about adding the following functions to PHP 5.6 (possibly
> > more to follow): utf8_is_valid, utf8_strlen, utf8_substr,
> > utf8_strpos, utf8_strrpos, utf8_str_split, utf8_strrev,
> > utf8_recover, utf8_chr, utf8_ord, string_is_ascii.
> >
> > Most of them (exceptions are utf8_chr, utf8_is_valid, utf8_recover
> > and string_is_ascii) are currently written in a way that they emit
> > a warning when they encounter invalid UTF-8 and return null. This
> > should encourage applications to check their input with
> > utf8_is_valid and either stop further processing or fall back to
> > utf8_recover to get a valid string. This should improve security
> > since there are attack vectors when malformed sequences get
> > interpreted as another encoding.
> >
> > You can find the code I've written so far here:
> > https://github.com/realityking/pecl-utf8
> >
> > You can find benchmark results here:
> > http://realityking.github.io/pecl-utf8/results.html
>
> Without judging your extension, I wonder if you have looked at the
> intl extension, for the PHP core parts. There are also some
> extensions in PECL to deal with non-ASCII strings.
>
> I have always promoted intl usage as it handles UTF-8 and other
> encodings very well, and for everything needed to fully support
> Unicode, its data is kept updated and the APIs are very stable. It
> has also been available since PHP 5.3, which makes it a very good
> choice to begin with.
>
> Cheers,
> --
> Pierre
>
> @pierrejoye | http://www.libgd.org
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php