Proposal for better UTF-8 handling

12 years ago by me@rouvenwessling.de

Hi Internals!

First let me introduce myself: my name is Rouven Weßling, I'm a student at RWTH Aachen University, and I'm one of the maintainers of the Joomla! Framework (née Platform). I've been following the internals list for a few months and have spent the past couple of months brushing up on my C skills so I can start contributing.

To me, one of the most annoying things about working with PHP is the (lack of) Unicode support. In Joomla! we've been discussing switching from PHP UTF-8 to Patchwork UTF-8 for our UTF-8 handling needs. Both are libraries that abstract the multibyte extension and supplement it with a number of functions. They also provide userland replacements for when multibyte is not available (Patchwork will also use iconv and intl if available). All of this is a huge pain.

To ease this situation I'd like to make a fresh start at better Unicode support for PHP, this time focusing on UTF-8 as the dominant web encoding. As a first step I'd like to propose adding a set of functions for handling UTF-8 strings. This should save applications from implementing these algorithms in PHP (many of the native versions are also quite a bit faster; see the benchmark results below). Once the algorithms are in place I'd like to look into creating a class for Unicode strings and eventually Python-like Unicode literals.

Before I write an RFC I'd like to get some feedback on what you think about adding the following functions to PHP 5.6 (possibly more to follow): utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos, utf8_strrpos, utf8_str_split, utf8_strrev, utf8_recover, utf8_chr, utf8_ord, string_is_ascii.

Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover and string_is_ascii) are currently written so that they emit a warning and return null when they encounter invalid UTF-8. This should encourage applications to check their input with utf8_is_valid and either stop further processing or fall back to utf8_recover to get a valid string. This should improve security, since there are attack vectors in which malformed sequences get interpreted as another encoding.
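
The validate-then-recover workflow described above can be sketched with functions that already exist in ext/mbstring. This is only an illustration under assumptions: the sketch_* names are hypothetical stand-ins, not the pecl-utf8 implementation, and the typed signatures assume PHP 7 or later.

```php
<?php
// Hypothetical stand-ins for the proposed utf8_is_valid() and
// utf8_recover(); they are NOT the pecl-utf8 code and need ext/mbstring.

function sketch_utf8_is_valid(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

function sketch_utf8_recover(string $s): string
{
    // Converting UTF-8 to UTF-8 makes mbstring replace each malformed
    // sequence with its configured substitute character.
    return mb_convert_encoding($s, 'UTF-8', 'UTF-8');
}

function sketch_safe_length(string $input): int
{
    // Check first, then either reject the input or recover it.
    // This sketch recovers instead of stopping.
    if (!sketch_utf8_is_valid($input)) {
        $input = sketch_utf8_recover($input);
    }
    return mb_strlen($input, 'UTF-8');
}

$valid   = "We\xC3\x9Fling"; // "Weßling", ß written as bytes C3 9F
$invalid = "\xC3\x28";       // lead byte followed by a non-continuation byte
```

An application that prefers to stop processing would return an error in the branch where the check fails instead of calling the recover helper.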

You can find the code I've written so far here: https://github.com/realityking/pecl-utf8
You can find benchmark results here: http://realityking.github.io/pecl-utf8/results.html

Best regards
Rouven

12 years ago by Martin Keckeis

Hello Rouven,

the lack of "good" UTF-8 support is a long-standing topic in PHP, and
improvements (at least I think) are very welcome here!

I'm currently using the multibyte extension (the "mb_" functions) and I'm
generally happy with it. For me it's no problem to use this extension on a
custom webserver. The biggest problem I have had with the extension is that
not every standard string function has an mb_ counterpart.
I think the most famous missing one is mb_str_replace.

Something to think about:
Why not combine your work with the mb_ extension? For emitting a warning
you could use a configuration option, set either in the ini file or by
calling a function.

I would rather have one complete "mb/utf-8" library than yet another one.
As you have already written, there are already some out there, and for core
I would currently prefer "mb_" because those functions have been available
since PHP 4 and are stable.

12 years ago by Nikita Popov

On Fri, May 24, 2013 at 3:17 AM, Rouven Weßling <me@rouvenwessling.de> wrote:

> Hi Internals!
>
> First let me introduce myself: my name is Rouven Weßling, I'm a student
> at RWTH Aachen University, and I'm one of the maintainers of the Joomla!
> Framework (née Platform). I've been following the internals list for a
> few months and have spent the past couple of months brushing up on my C
> skills so I can start contributing.
>
> To me, one of the most annoying things about working with PHP is the
> (lack of) Unicode support. In Joomla! we've been discussing switching
> from PHP UTF-8 to Patchwork UTF-8 for our UTF-8 handling needs. Both are
> libraries that abstract the multibyte extension and supplement it with a
> number of functions. They also provide userland replacements for when
> multibyte is not available (Patchwork will also use iconv and intl if
> available). All of this is a huge pain.
>
> To ease this situation I'd like to make a fresh start at better Unicode
> support for PHP, this time focusing on UTF-8 as the dominant web
> encoding. As a first step I'd like to propose adding a set of functions
> for handling UTF-8 strings. This should save applications from
> implementing these algorithms in PHP (many of the native versions are
> also quite a bit faster; see the benchmark results below). Once the
> algorithms are in place I'd like to look into creating a class for
> Unicode strings and eventually Python-like Unicode literals.
>
> Before I write an RFC I'd like to get some feedback on what you think
> about adding the following functions to PHP 5.6 (possibly more to
> follow): utf8_is_valid, utf8_strlen, utf8_substr, utf8_strpos,
> utf8_strrpos, utf8_str_split, utf8_strrev, utf8_recover, utf8_chr,
> utf8_ord, string_is_ascii.
>
> Most of them (the exceptions are utf8_chr, utf8_is_valid, utf8_recover
> and string_is_ascii) are currently written so that they emit a warning
> and return null when they encounter invalid UTF-8. This should encourage
> applications to check their input with utf8_is_valid and either stop
> further processing or fall back to utf8_recover to get a valid string.
> This should improve security, since there are attack vectors in which
> malformed sequences get interpreted as another encoding.
>
> You can find the code I've written so far here:
> https://github.com/realityking/pecl-utf8
> You can find benchmark results here:
> http://realityking.github.io/pecl-utf8/results.html
>
> Best regards
> Rouven

We already have a lot of functions for multibyte string handling. Let me
list a few:

The str* functions. Most of them are safe for use with UTF-8. The
exceptions are basically everything where you manually provide an offset,
e.g. writing substr($str, 0, 100) is not safe. substr($str, 0, strpos($str,
'xyz')) on the other hand is.

The mb* functions. They work with various encodings and usually make use
of character offsets and lengths rather than byte offsets and lengths. They
are not necessary most of the time, but useful for the aforementioned
substr call with hardcoded offsets.

The Intl extension. This gives you real Unicode support, as in
collations, locales, transliteration, etc.

The grapheme* functions, which are also part of intl. They work with
grapheme cluster offsets and lengths.
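
The difference between byte offsets, character offsets and grapheme cluster offsets listed above can be illustrated with a short sketch. It assumes ext/mbstring is loaded; the grapheme line is guarded because ext/intl may be missing. The sample strings and variable names are made up for illustration.

```php
<?php
// "Weßling", with ß written as its two UTF-8 bytes (C3 9F).
$str = "We\xC3\x9Fling";

$byteLength = strlen($str);             // 8: strlen counts bytes
$charLength = mb_strlen($str, 'UTF-8'); // 7: mb_strlen counts code points

// A hard-coded byte offset can cut a multibyte character in half.
$cut = substr($str, 0, 3);                      // "We" plus the first byte of ß
$cutIsValid = mb_check_encoding($cut, 'UTF-8'); // false

// An offset obtained from strpos() is safe: in valid UTF-8 a valid needle
// can only match at a character boundary.
$prefix = substr($str, 0, strpos($str, 'ling'));      // "Weß"
$prefixIsValid = mb_check_encoding($prefix, 'UTF-8'); // true

// mb_substr() takes character offsets, so hard-coded values are fine.
$firstThree = mb_substr($str, 0, 3, 'UTF-8'); // "Weß"

// "e" followed by a combining acute accent: two code points, one grapheme.
$accented = "e\xCC\x81";
$codePoints = mb_strlen($accented, 'UTF-8'); // 2
$clusters = function_exists('grapheme_strlen')
    ? grapheme_strlen($accented) // 1 when ext/intl is available
    : null;
```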

Anyway, my point is that just adding yet another set of string functions
won't solve anything, just make things even more complicated than they
already are. I'm not strictly opposed to adding more functions if they are
necessary, but one has to be aware of what there already is and how the new
functions will integrate.

Nikita

12 years ago by Ferenc Kovacs

Did you just forget the pcre functions with the /u modifier?!?!
:P
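
For readers unfamiliar with it, the /u modifier makes the preg_* functions treat both pattern and subject as UTF-8. A minimal sketch follows; the sample strings and variable names are illustrative only.

```php
<?php
$str = "We\xC3\x9Fling"; // "Weßling", ß written as bytes C3 9F

// Split into characters (code points) rather than bytes.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

// Count code points; the s modifier lets "." match newlines as well.
$count = preg_match_all('/./su', $str);

// With /u, a malformed subject makes the call fail instead of matching.
$result = preg_match('//u', "We\xC3"); // truncated two-byte sequence
$error  = preg_last_error();           // PREG_BAD_UTF8_ERROR
```

Because preg_match() returns false for malformed input under /u, it is sometimes used as a validity check when mbstring is not available.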

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu

12 years ago by Adam Harvey

And that's without even touching PECL. :)

I agree with Nikita — I'm not against adding more Unicode/charset
handling functions if they make sense (and I haven't looked at the
code for this particular proposal yet), particularly if they'd be part
of a default build, but enough water has hopefully passed under the
bridge since the PHP 6 days that it might be time to canvass ideas on
a less piecemeal approach to character set handling and
internationalisation for PHP 5.5+1 or PHP 5.5+2.

Adam

12 years ago by Stas Malyshev

Hi!

> I agree with Nikita - I'm not against adding more Unicode/charset
> handling functions if they make sense (and I haven't looked at the
> code for this particular proposal yet), particularly if they'd be part
> of a default build, but enough water has hopefully passed under the

Did you mean "would not be part of the default build"? Because having
yet another way of handling UTF-8 (also based on yet another separate
library, so with potential for incompatibilities and quirks) doesn't look
like a good idea. Having yet another PECL ext is not a big deal, but
having yet another way by default would certainly only create confusion.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Adam Harvey

> > I agree with Nikita - I'm not against adding more Unicode/charset
> > handling functions if they make sense (and I haven't looked at the
> > code for this particular proposal yet), particularly if they'd be part
> > of a default build, but enough water has hopefully passed under the
>
> Did you mean "would not be part of the default build"? Because having
> yet another way of handling UTF-8 (also based on yet another separate
> library, so with potential for incompatibilities and quirks) doesn't look
> like a good idea. Having yet another PECL ext is not a big deal, but
> having yet another way by default would certainly only create confusion.

I did mean would — one issue with much of our internationalisation
code is that it's in extensions (intl, iconv, mbstring) that are
inconsistently deployed by shared hosting providers. Having some basic
conversion and string handling functions that could be available in
ext/standard might not be a bad thing.

I do agree that having yet another set of functions with their own
behaviours isn't ideal, though.

Adam

12 years ago by Pierre Joye

hi Adam!

> > > I agree with Nikita - I'm not against adding more Unicode/charset
> > > handling functions if they make sense (and I haven't looked at the
> > > code for this particular proposal yet), particularly if they'd be part
> > > of a default build, but enough water has hopefully passed under the
> >
> > Did you mean "would not be part of the default build"? Because having
> > yet another way of handling UTF-8 (also based on yet another separate
> > library, so with potential for incompatibilities and quirks) doesn't look
> > like a good idea. Having yet another PECL ext is not a big deal, but
> > having yet another way by default would certainly only create confusion.
>
> I did mean would - one issue with much of our internationalisation
> code is that it's in extensions (intl, iconv, mbstring) that are
> inconsistently deployed by shared hosting providers. Having some basic
> conversion and string handling functions that could be available in
> ext/standard might not be a bad thing.

We will still need to use some external Unicode data. I would never go
with our own data set; that's something we won't be able to maintain.
By the way, that is why intl is very good: the APIs stay the same, it
only adds new functions without changing existing ones, and you get
Unicode data updates for free when updating the library. And I would
like to enable it by default, or make it a required extension at some point.

Cheers,

Pierre

@pierrejoye | http://www.libgd.org

12 years ago by Stas Malyshev

Hi!

> I did mean would - one issue with much of our internationalisation
> code is that it's in extensions (intl, iconv, mbstring) that are
> inconsistently deployed by shared hosting providers. Having some basic

Shared hosting providers are completely capable of building their PHP
offerings the way they want. Adding yet another - fourth? fifth? sixth? -
way of doing string operations is not going to change anything. If the
problem is hosting providers, it should be handled at that point, not in
PHP core.

> conversion and string handling functions that could be available in
> ext/standard might not be a bad thing.

The same argument would apply to any functionality that is useful to
anybody - it should all be in ext/standard, or some shared hosting
provider could build PHP without it. Obviously, that's not a good
argument, and if a hosting provider does not provide common modules,
choose another provider - there are hundreds of others just a simple
search away.

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

12 years ago by Pierre Joye

hi!

Without judging your extension, I wonder if you have looked at the
intl extension for the PHP core parts. There are also some extensions
in PECL for dealing with non-ASCII strings.

I have always promoted intl, as it handles UTF-8 and other encodings
very well and covers everything needed to fully support Unicode; its
data is kept up to date and its APIs are very stable. It has also been
available since PHP 5.3, which makes it a very good choice to begin with.

Cheers,

Pierre

@pierrejoye | http://www.libgd.org

12 years ago by Nicolas Grekas

Btw, I hit a bug on grapheme_substr() that got no attention:
https://bugs.php.net/bug.php?id=62759

There is also https://bugs.php.net/bug.php?id=61860 that waits for a fix.

Nicolas
