Newsgroups: php.internals
From: pierre.php@gmail.com (Pierre Joye)
To: Rouven Weßling
Cc: PHP internals
Date: Mon, 27 May 2013 08:40:17 +0200
Subject: Re: [PHP-DEV] Proposal for better UTF-8 handling
In-Reply-To: <61BC4F17-86D9-4CBD-B185-58A2D4AFAE5F@rouvenwessling.de>

hi!

On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling wrote:
> Hi Internals!
>
> First let me introduce myself, my name is Rouven Weßling, I'm a student at RWTH Aachen University and I'm one of the maintainers of the Joomla! Framework (née Platform). I've been following the internals list for a few months and started brushing of my C skills for the past couple of months so I can start contributing.
>
> To me one of the most annoying things about working with PHP is the (lack of) unicode support. In Joomla! we've been discussing switching from PHP UTF-8 to Patchwork UTF-8 for our needs of handling UTF-8. Both are libraries abstracting the multibyte extension and supplementing it with a number of functions. They also provide userland replacements for when multibyte is not available (Patchwork will also use iconv and intl if available). All of this is a huge pain.
>
> To ease this situation I'd like to make a new start at better unicode support for PHP, this time focusing on UTF-8 as the dominant web encoding. As a first step I'd like to propose adding a set of functions for handling UTF-8 strings. This should keep applications from implementing these algorithms in PHP (also many of these are quite a bit faster, see benchmark results below). Once the algorithms are in place I'd like to look into creating a class for unicode strings and eventually Python like unicode literals.
>
> Before I write an RFC I'd like to get some feedback what you think about adding the following functions to PHP 5.6 (possibly more to follow): utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos, utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord, string_is_ascii.
>
> Most of them (exceptions are utf8_chr, utf8_is_valid, utf8_recover and string_is_ascii) are currently written in a way that they emit a warning when they encounter invalid UTF-8 and return with null. This should encourage applications to check their input with utf8_is_valid and either stop further processing or to fall back to utf8_recover to get a valid string. This should improve security since there are attack vectors when malformed sequences get interpreted as another encoding.
>
> You can find the code I've written so far here: https://github.com/realityking/pecl-utf8
> You can find benchmark results here: http://realityking.github.io/pecl-utf8/results.html

Without judging your extension, I wonder whether you have looked at the intl extension, on the PHP core side. There are also some extensions in PECL that deal with non-ASCII strings.

I have always promoted intl because it handles UTF-8 and other encodings very well and covers everything needed to fully support Unicode. Its data is kept up to date and its APIs are very stable. It has also been available since PHP 5.3, which makes it a very good place to start.

Cheers,
--
Pierre

@pierrejoye | http://www.libgd.org