Newsgroups: php.internals
Message-ID: <61BC4F17-86D9-4CBD-B185-58A2D4AFAE5F@rouvenwessling.de>
Date: Fri, 24 May 2013 03:17:40 +0200
To: internals@lists.php.net
Subject: Proposal for better UTF-8 handling
From: me@rouvenwessling.de (Rouven Weßling)

Hi Internals!

First, let me introduce myself: my name is Rouven Weßling. I'm a student at RWTH Aachen University and one of the maintainers of the Joomla! Framework (née Platform). I've been following the internals list for a few months and have spent the past couple of months brushing up on my C skills so I can start contributing.
To me, one of the most annoying things about working with PHP is the (lack of) Unicode support. In Joomla! we've been discussing switching from PHP UTF-8 to Patchwork UTF-8 for our UTF-8 handling. Both are libraries that abstract the multibyte extension and supplement it with a number of functions. They also provide userland replacements for when multibyte is not available (Patchwork will also use iconv and intl if available). All of this is a huge pain.

To ease this situation, I'd like to make a new start at better Unicode support for PHP, this time focusing on UTF-8 as the dominant web encoding. As a first step, I'd like to propose adding a set of functions for handling UTF-8 strings. This should keep applications from having to implement these algorithms in PHP (many of the native versions are also quite a bit faster, see the benchmark results below). Once the algorithms are in place, I'd like to look into creating a class for Unicode strings and eventually Python-like Unicode literals.

Before I write an RFC, I'd like to get some feedback on what you think about adding the following functions to PHP 5.6 (possibly more to follow): utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos, utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord, string_is_ascii.

Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover and string_is_ascii) are currently written so that they emit a warning and return null when they encounter invalid UTF-8. This should encourage applications to check their input with utf8_is_valid and either stop further processing or fall back to utf8_recover to get a valid string. This should improve security, since there are attack vectors that rely on malformed sequences being interpreted as another encoding.
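To make the validate-first behaviour a bit more concrete, here is a rough C sketch of the idea. It is not taken from the pecl-utf8 code linked below; the names sketch_utf8_is_valid, sketch_utf8_strlen and utf8_seq_len are made up for illustration, and returning -1 merely stands in for "emit a warning and return null". It rejects overlong encodings, surrogates and code points above U+10FFFF, which are the kinds of malformed input mentioned above.

```c
#include <stddef.h>

/* Hypothetical helper (not from pecl-utf8): returns the byte length (1-4)
 * of the well-formed UTF-8 sequence starting at s, given n available bytes,
 * or 0 if the sequence is malformed or truncated. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0) return 0;
    unsigned char c = s[0];

    if (c < 0x80) return 1;                       /* ASCII */
    if (c >= 0xC2 && c <= 0xDF) {                 /* 2-byte sequence */
        return (n >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    }
    if (c >= 0xE0 && c <= 0xEF) {                 /* 3-byte sequence */
        if (n < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        if (c == 0xE0 && s[1] < 0xA0) return 0;   /* overlong encoding */
        if (c == 0xED && s[1] > 0x9F) return 0;   /* UTF-16 surrogate */
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {                 /* 4-byte sequence */
        if (n < 4 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
                  || (s[3] & 0xC0) != 0x80) return 0;
        if (c == 0xF0 && s[1] < 0x90) return 0;   /* overlong encoding */
        if (c == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
        return 4;
    }
    return 0;  /* stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF */
}

/* Counts code points; returns -1 on invalid input (assumed stand-in for
 * the proposed "warning + null" behaviour). */
long sketch_utf8_strlen(const char *str, size_t len)
{
    const unsigned char *s = (const unsigned char *)str;
    long count = 0;
    size_t i = 0;

    while (i < len) {
        size_t step = utf8_seq_len(s + i, len - i);
        if (step == 0) return -1;
        i += step;
        count++;
    }
    return count;
}

/* Returns 1 if the whole buffer is well-formed UTF-8, 0 otherwise. */
int sketch_utf8_is_valid(const char *str, size_t len)
{
    return sketch_utf8_strlen(str, len) >= 0;
}
```

With this sketch, a caller would check sketch_utf8_is_valid first and only then rely on the length; an ISO-8859-1 encoded "Weßling" (byte 0xDF followed by "l") is rejected instead of being silently miscounted.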
You can find the code I've written so far here:
https://github.com/realityking/pecl-utf8

You can find benchmark results here:
http://realityking.github.io/pecl-utf8/results.html

Best regards
Rouven