Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:67493
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.223.171 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <CAF+90c__m2zsi9ag32TAP+2yCFOj8gf_z4oNUxEoNCsJi-K1OQ@mail.gmail.com>
References: <61BC4F17-86D9-4CBD-B185-58A2D4AFAE5F@rouvenwessling.de>
	<CAF+90c__m2zsi9ag32TAP+2yCFOj8gf_z4oNUxEoNCsJi-K1OQ@mail.gmail.com>
Date: Fri, 24 May 2013 17:26:59 +0200
Message-ID: <CAH-PCH4w2YTwvHdPFFC-TygSBQpssv6woQJaoa3JVks1egq44g@mail.gmail.com>
To: Nikita Popov <nikita.ppv@gmail.com>
Cc: Rouven Weßling <me@rouvenwessling.de>,
	PHP internals <internals@lists.php.net>
Content-Type: multipart/alternative; boundary=089e011827581bd8e204dd786efa
Subject: Re: [PHP-DEV] Proposal for better UTF-8 handling
From: tyra3l@gmail.com (Ferenc Kovacs)

--089e011827581bd8e204dd786efa
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Fri, May 24, 2013 at 3:09 PM, Nikita Popov <nikita.ppv@gmail.com> wrote:

> On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <me@rouvenwessling.de> wrote:
>
> > Hi Internals!
> >
> > First let me introduce myself: my name is Rouven Weßling, I'm a student at
> > RWTH Aachen University and I'm one of the maintainers of the Joomla!
> > Framework (née Platform). I've been following the internals list for a few
> > months and have been brushing up my C skills for the past couple of months
> > so I can start contributing.
> >
> > To me one of the most annoying things about working with PHP is the (lack
> > of) Unicode support. In Joomla! we've been discussing switching from PHP
> > UTF-8 to Patchwork UTF-8 for our needs of handling UTF-8. Both are
> > libraries abstracting the multibyte extension and supplementing it with a
> > number of functions. They also provide userland replacements for when
> > multibyte is not available (Patchwork will also use iconv and intl if
> > available). All of this is a huge pain.
> >
> > To ease this situation I'd like to make a new start at better Unicode
> > support for PHP, this time focusing on UTF-8 as the dominant web encoding.
> > As a first step I'd like to propose adding a set of functions for handling
> > UTF-8 strings. This should keep applications from implementing these
> > algorithms in PHP (also many of these are quite a bit faster, see benchmark
> > results below). Once the algorithms are in place I'd like to look into
> > creating a class for Unicode strings and eventually Python-like Unicode
> > literals.
> >
> > Before I write an RFC I'd like to get some feedback on what you think about
> > adding the following functions to PHP 5.6 (possibly more to follow):
> > utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos,
> > utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord,
> > string_is_ascii.
> >
> > Most of them (exceptions are utf8_chr, utf8_is_valid, utf8_recover and
> > string_is_ascii) are currently written so that they emit a warning and
> > return null when they encounter invalid UTF-8. This should
> > encourage applications to check their input with utf8_is_valid and either
> > stop further processing or fall back to utf8_recover to get a valid
> > string. This should improve security, since there are attack vectors when
> > malformed sequences get interpreted as another encoding.
> >
> > You can find the code I've written so far here:
> > https://github.com/realityking/pecl-utf8
> > You can find benchmark results here:
> > http://realityking.github.io/pecl-utf8/results.html
> >
> > Best regards
> > Rouven
> >
>
> We already have a lot of functions for multibyte string handling. Let me
> list a few:
>
>  * The str* functions. Most of them are safe for use with UTF-8.
> Exceptions are basically everything where you manually provide an offset,
> e.g. writing substr($str, 0, 100) is not safe. substr($str, 0, strpos($str,
> 'xyz')) on the other hand is.
>  * The mb* functions. They work with various encodings and usually make use
> of character offsets and lengths rather than byte offsets and lengths. They
> are not necessary most of the time, but useful for the aforementioned
> substr call with hardcoded offsets.
>  * The Intl extension. This gives you *real* Unicode support, as in
> collations, locales, transliteration, etc.
>  * The grapheme* functions, which are also part of intl. They work with
> grapheme cluster offsets and lengths.
>
> Anyway, my point is that just adding *yet another* set of string functions
> won't solve anything, just make things even more complicated than they
> already are. I'm not strictly opposed to adding more functions if they are
> necessary, but one has to be aware of what there already is and how the new
> functions will integrate.
>
> Nikita
>

Did you just forget the PCRE functions with the /u modifier?!?!
:P
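
For reference, a rough sketch of what the /u modifier already gives you with
nothing but the core preg_* functions. The sample string and variable names
are illustrative assumptions of mine, not anything taken from the proposal or
from the pecl-utf8 code:

```php
<?php
// "é" and "ö" are two bytes each in UTF-8, so byte and character offsets differ.
$str = "h\xC3\xA9llo w\xC3\xB6rld"; // "héllo wörld"

// Byte-based cut: substr() splits "é" in the middle, leaving invalid UTF-8.
$bytes = substr($str, 0, 2);

// Character-based cut: with /u, "." matches one code point (s lets it match newlines too).
preg_match('/^.{0,2}/us', $str, $m);
$chars = $m[0];

// Code point count: number of matches of "." under /u.
$length = preg_match_all('/./us', $str);

// Validity check: with /u, PCRE refuses malformed subjects and preg_match() returns false.
$bytesValid = preg_match('//u', $bytes) === 1;
$charsValid = preg_match('//u', $chars) === 1;

// Split into code points.
$split = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

var_dump($length, $bytesValid, $charsValid);
```

Note this works on code points, not grapheme clusters, so combining sequences
still need the grapheme* functions mentioned above.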

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu

--089e011827581bd8e204dd786efa--