Newsgroups: php.internals
From: tyra3l@gmail.com (Ferenc Kovacs)
Date: Fri, 24 May 2013 17:26:59 +0200
Subject: Re: [PHP-DEV] Proposal for better UTF-8 handling
To: Nikita Popov
Cc: Rouven Weßling, PHP internals
References: <61BC4F17-86D9-4CBD-B185-58A2D4AFAE5F@rouvenwessling.de>

On Fri, May 24, 2013 at 3:09 PM, Nikita Popov wrote:

> On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling wrote:
>
> > Hi Internals!
> >
> > First let me introduce myself, my name is Rouven Weßling, I'm a student
> > at RWTH Aachen University and I'm one of the maintainers of the Joomla!
> > Framework (née Platform). I've been following the internals list for a
> > few months and started brushing up my C skills for the past couple of
> > months so I can start contributing.
> >
> > To me one of the most annoying things about working with PHP is the
> > (lack of) unicode support. In Joomla! we've been discussing switching
> > from PHP UTF-8 to Patchwork UTF-8 for our needs of handling UTF-8. Both
> > are libraries abstracting the multibyte extension and supplementing it
> > with a number of functions. They also provide userland replacements for
> > when multibyte is not available (Patchwork will also use iconv and intl
> > if available). All of this is a huge pain.
> >
> > To ease this situation I'd like to make a new start at better unicode
> > support for PHP, this time focusing on UTF-8 as the dominant web
> > encoding. As a first step I'd like to propose adding a set of functions
> > for handling UTF-8 strings. This should keep applications from
> > implementing these algorithms in PHP (also many of these are quite a bit
> > faster, see benchmark results below). Once the algorithms are in place
> > I'd like to look into creating a class for unicode strings and
> > eventually Python-like unicode literals.
> >
> > Before I write an RFC I'd like to get some feedback what you think about
> > adding the following functions to PHP 5.6 (possibly more to follow):
> > utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos,
> > utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord,
> > string_is_ascii.
> >
> > Most of them (exceptions are utf8_chr, utf8_is_valid, utf8_recover and
> > string_is_ascii) are currently written in a way that they emit a warning
> > when they encounter invalid UTF-8 and return null. This should encourage
> > applications to check their input with utf8_is_valid and either stop
> > further processing or fall back to utf8_recover to get a valid string.
> > This should improve security since there are attack vectors when
> > malformed sequences get interpreted as another encoding.
> >
> > You can find the code I've written so far here:
> > https://github.com/realityking/pecl-utf8
> > You can find benchmark results here:
> > http://realityking.github.io/pecl-utf8/results.html
> >
> > Best regards
> > Rouven
> >
>
> We already have a lot of functions for multibyte string handling. Let me
> list a few:
>
> * The str* functions. Most of them are safe for usage with UTF8.
>   Exceptions are basically everything where you manually provide an
>   offset, e.g. writing substr($str, 0, 100) is not safe.
>   substr($str, 0, strpos($str, 'xyz')) on the other hand is.
> * The mb* functions. They work with various encodings and usually make
>   use of character offsets and lengths rather than byte offsets and
>   lengths. They are not necessary most of the time, but useful for the
>   aforementioned substr call with hardcoded offsets.
> * The Intl extension. This gives you *real* unicode support, as in
>   collations, locales, transliteration, etc.
> * The grapheme* functions, which are also part of intl. They work with
>   grapheme cluster offsets and lengths.
>
> Anyway, my point is that just adding *yet another* set of string
> functions won't solve anything, just make things even more complicated
> than they already are. I'm not strictly opposed to adding more functions
> if they are necessary, but one has to be aware of what there already is
> and how the new functions will integrate.
>
> Nikita
>

did you just forget the pcre functions with the /u modifier?!?! :P

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu
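
To make the byte-offset vs. character-offset point from the quoted list concrete, along with the /u modifier, here is a minimal sketch. It assumes the mbstring and pcre extensions are loaded; the sample string and all variable names are made up for illustration and are not taken from the proposed extension.

```php
<?php
// Hypothetical sample: "Weßling née xyz" spelled with explicit UTF-8 bytes,
// so the result does not depend on the encoding of this source file.
$str = "We\xC3\x9Fling n\xC3\xA9e xyz";

// str* functions count bytes; mb* functions count code points.
$byteLen = strlen($str);              // 17
$charLen = mb_strlen($str, 'UTF-8');  // 15

// A hardcoded byte offset can cut a multibyte sequence in half ...
$cut = substr($str, 0, 3);                       // "We" plus the first byte of "ß"
$cutIsValid = mb_check_encoding($cut, 'UTF-8');  // false

// ... while a byte offset returned by strpos() lands on a character
// boundary, provided haystack and needle are both valid UTF-8.
$prefix = substr($str, 0, strpos($str, 'xyz'));        // "Weßling née "
$prefixIsValid = mb_check_encoding($prefix, 'UTF-8');  // true

// mb_substr() takes character offsets, so the hardcoded 3 is safe here.
$mbCut = mb_substr($str, 0, 3, 'UTF-8');  // "Weß"

// With /u, PCRE treats pattern and subject as UTF-8: "." matches a whole
// code point, and matching against invalid UTF-8 makes preg_match()
// return false instead of a match count.
$pcreLen = preg_match_all('/./u', $str);  // 15
$pcreOnCut = preg_match('/./u', $cut);    // false
```

Note that the preg_match() call with /u also works as a rough validity check, which overlaps with what the proposed utf8_is_valid would do.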