Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:72697
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain gmail.com designates 209.85.216.54 as permitted sender)
MIME-Version: 1.0
Date: Thu, 20 Feb 2014 06:54:21 +0100
Message-ID: <CAEZPtU4zzN=Xxs03XbBGEM2UTVe3jXH5TAaXUqjsVkzm3kOOyg@mail.gmail.com>
To: PHP internals <internals@lists.php.net>
Content-Type: text/plain; charset=UTF-8
Subject: [php6] Unicode support, options?
From: pierre.php@gmail.com (Pierre Joye)

hi,

Unicode still remains one of the top requested features in PHP.

However, as Rasmus and others stated earlier, it is not a trivial job.
Some of the key points we need to take care of are:

- UTF-8 storage
- UTF-8 support for almost all (if not all) existing string APIs
- Performance

As of today, I have not found any library covering at least two of
these key points.
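
To make the second key point concrete, here is a minimal sketch in C of
why byte-oriented string functions are not enough once strings are
stored as UTF-8. The helper name utf8_codepoint_count is hypothetical
(it is not an existing PHP, ICU or utf8proc function), and it assumes
the input is already valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: count the code points in
 * a NUL-terminated string that is assumed to be valid UTF-8. */
size_t utf8_codepoint_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Continuation bytes have the form 10xxxxxx; every other byte
         * starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "héllo" (where "é" is encoded as 0xC3 0xA9),
strlen() reports 6 bytes while the helper reports 5 code points. Every
existing string API that works with offsets or lengths has to decide
which of the two it means.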

Please keep in mind that I am by no means a Unicode expert, and this
summary is what I gathered by reading the documentation and discussion
archives of ICU and other projects. Experiments still have to be
done. However, I would rather discuss the options before going wild
with an implementation (a huge task, even for basic feature coverage).

If any of the following statements is wrong or inaccurate, please
correct it. I will keep a dedicated wiki page to summarize the
discussions and options about Unicode support.

* ICU:
U_CHARSET_IS_UTF8 forces ICU to use UTF-8 as its default charset. It is
an ICU compile-time setting; it is not possible to set it at PHP
configure time. This means that users would have to create their own
ICU build. Alternatively, we could bundle ICU, but this would be awkward
and a maintenance nightmare for both PHP and the distros.

Alternatively, UText can be used to wrap UTF-8 strings. The APIs
accepting UText allow almost everything we need. However, the drawback
is that a UTF-8 UText is read-only. Any operation altering its content
will require duplication, clones or conversions. That may kill all the
gains we get from using UTF-8 only.
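
As a rough illustration of that cost, here is a sketch in plain C (it
does not use ICU and is not how UText works internally) of what
"altering the content" tends to mean for UTF-8 data: replacing one code
point can change the byte length, so the result ends up in a new
buffer. The names utf8_seq_len and utf8_replace_at are hypothetical,
and the input is assumed to be valid UTF-8:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 sequence that starts
 * with lead byte b. Assumes b is a valid lead byte. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Hypothetical helper: return a newly allocated copy of src in which
 * the code point at position `index` is replaced by the UTF-8 sequence
 * `repl`. Returns NULL if index is out of range or allocation fails.
 * The caller must free() the result. */
char *utf8_replace_at(const char *src, size_t index, const char *repl)
{
    const char *p = src;
    size_t old_len, repl_len, src_len, prefix;
    char *out;

    assert(src != NULL && repl != NULL);

    /* Walk code point by code point; there is no O(1) indexing. */
    while (index > 0 && *p != '\0') {
        p += utf8_seq_len((unsigned char)*p);
        index--;
    }
    if (*p == '\0') {
        return NULL;
    }

    old_len = utf8_seq_len((unsigned char)*p);
    repl_len = strlen(repl);
    src_len = strlen(src);
    prefix = (size_t)(p - src);

    /* The new byte length may differ from the old one, hence the copy. */
    out = malloc(src_len - old_len + repl_len + 1);
    if (out == NULL) {
        return NULL;
    }
    memcpy(out, src, prefix);
    memcpy(out + prefix, repl, repl_len);
    /* Copy the tail, including the terminating NUL. */
    memcpy(out + prefix + repl_len, p + old_len,
           src_len - prefix - old_len + 1);
    return out;
}
```

Replacing the last code point of "cafe" with "é" grows the string from
4 to 5 bytes, so even a one-character edit turns into a scan plus an
allocation and a copy.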

U_CHARSET_IS_UTF8 is very appealing, but having to bundle ICU is
actually a show stopper. Asking users to custom-build ICU is not an
option either. I do not know whether the distros would be ready to
provide two different builds of ICU, and doing so may cause a lot of
issues for all projects using ICU.

* utf8proc:
utf8proc is very attractive, small and relatively fast. I see it as a
good starting point. However, its features cover only a small part of
what PHP needs. It is easy to bundle but would require a fork and a lot
of work to add all the missing features.

* librope:
Same comments as for utf8proc, with even fewer features.

I would like to begin discussing our options now. I am not
asking to go into all the implementation details from a userland point
of view (like u"some text", or whether to add new APIs) but only to see
what we can do internally to work with UTF-8 strings.

Thoughts, comments or ideas?



Links & references
https://github.com/josephg/librope
http://userguide.icu-project.org/strings/utf-8


Cheers,
-- 
Pierre

@pierrejoye | http://www.libgd.org