Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:72746
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain marc-bennewitz.de from 80.237.132.171 cause and error)
Message-ID: <5307ADB4.6010608@marc-bennewitz.de>
Date: Fri, 21 Feb 2014 20:49:08 +0100
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: internals@lists.php.net
References: <CAEZPtU4zzN=Xxs03XbBGEM2UTVe3jXH5TAaXUqjsVkzm3kOOyg@mail.gmail.com>
In-Reply-To: <CAEZPtU4zzN=Xxs03XbBGEM2UTVe3jXH5TAaXUqjsVkzm3kOOyg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] [php6] Unicode support, options?
From: php@marc-bennewitz.de (Marc Bennewitz)

hi,

I've been a PHP developer for a long time but have only a little
knowledge of C/C++, so I don't know some of the deeper internals of the
engine.

From my perspective the internal datatype "string" should be a binary
string (byte array), and only in a specific context should this binary
string be interpreted as a more specialized string. As I understand it,
this is already the case about 90% of the time.
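
A minimal C sketch of this idea, assuming a hypothetical byte_string
struct (not an actual engine type): the same bytes yield one length when
treated as raw bytes and another when read in a UTF-8 context. The UTF-8
view assumes valid input and does no validation.

```c
#include <stddef.h>

/* Hypothetical illustration type: raw bytes plus a length.
 * This is NOT a real PHP engine structure. */
typedef struct {
    const unsigned char *data;
    size_t len;
} byte_string;

/* Context-free view: the length is simply the number of bytes. */
static size_t bytes_length(byte_string s)
{
    return s.len;
}

/* UTF-8 context: count code points by skipping continuation bytes,
 * which all have the bit pattern 10xxxxxx.
 * Assumes the input is already valid UTF-8. */
static size_t utf8_length(byte_string s)
{
    size_t count = 0;
    for (size_t i = 0; i < s.len; i++) {
        if ((s.data[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the six bytes of "h\xC3\xA9llo", bytes_length() gives 6 while
utf8_length() gives 5; the stored data is identical in both cases.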

Unicode support (and other encodings) could be provided by a String
class, as in Java, that implements the magic method "__toString" to
return the raw binary string. We already have "(binary)" as an alias for
"(string)".

This should be almost compatible with current behavior and provide a
very clean API as sugar.

Only the places where the current string type is not handled as a binary
string in the absence of context would need to be updated, such as
var_dump("1e1" == "10");. In contrast, var_dump("1e1" == 10); should
work as before, because the integer operand would switch the binary
string into the context of a numeric (ASCII) string.
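
The distinction can be sketched in C as follows. The helper names
equal_as_bytes and equal_to_number are hypothetical, and strtod() is
only a stand-in for PHP's own numeric-string rules, which differ in
detail. The point is that two strings are compared byte by byte, and
only a numeric operand triggers numeric interpretation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* String vs. string with no context: compare the raw bytes only.
 * Lengths are passed explicitly because binary strings may contain NUL. */
static bool equal_as_bytes(const char *a, size_t alen,
                           const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/* String vs. integer: the integer puts the string into a numeric context.
 * Assumption: s is NUL-terminated and must parse completely as a number;
 * strtod() approximates, but does not reproduce, PHP's parsing rules. */
static bool equal_to_number(const char *s, long n)
{
    char *end;
    double value = strtod(s, &end);
    if (end == s || *end != '\0') {
        return false; /* not a complete numeric string */
    }
    return value == (double)n;
}
```

With these helpers, "1e1" and "10" are different as bytes, while "1e1"
compared against the integer 10 is still equal.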

Thoughts?

Marc

On 20.02.2014 06:54, Pierre Joye wrote:
> hi,
> 
> Unicode still remains one of the top requested features in PHP.
> 
> However, as Rasmus and others stated earlier, it is not a trivial job.
> Some of the key points we need to take care of are:
> 
> - UTF-8 storage
> - UTF-8 support for almost (if not all) existing string APIs
> - Performance
> 
> As of today, I did not find any library covering at least two of these
> key points.
> 
> Please keep in mind that I am by no means a Unicode expert, and this
> summary is what I gathered by reading the documentation and discussion
> archives of ICU and other projects. Experiments still have to be
> done. However, I would rather discuss the options before going wild
> with an implementation (a huge task, even for basic feature coverage).
> 
> If any of the following statements is wrong or inaccurate, please
> correct it. I will keep a dedicated wiki page to summarize the
> discussions and options about Unicode support.
> 
> * ICU:
> U_CHARSET_IS_UTF8 allows forcing ICU to use UTF-8 by default. It is an
> ICU compile-time setting. It is not possible to set it at PHP
> configure time. It means that users would have to create their own
> build. Alternatively, we could bundle ICU, but this would be awkward,
> a maintenance nightmare for both PHP and the distros.
> 
> Alternatively, UText can be used to create UTF-8 strings. APIs
> accepting UText allow almost everything we need. However, the downside
> is that a UTF-8 UText is read-only. Any operation altering its content
> will require duplication, clones or conversions. That may kill all the
> gains we got from using UTF-8 only.
> 
> U_CHARSET_IS_UTF8 is very appealing, but bundling ICU is actually a
> show stopper. Asking users to custom-build ICU is not an option
> either. I do not know whether the distros would be ready to provide
> two different builds of ICU; it may add a lot of issues for all
> projects using ICU.
> 
> * utf8proc
> utf8proc is very attractive, small and relatively fast. I see it as a
> good starting point. However, its features cover only a small part of
> what PHP needs. It is easy to bundle but would require a fork and a lot
> of work to add all the missing features.
> 
> * librope
> Same comments as for utf8proc, with even fewer features.
> 
> I would like to begin discussing our options now. I am not
> asking to get into all the implementation details from a userland point
> of view (like u"some text" or adding new APIs or not) but only to see
> what we can do internally to work with UTF-8 strings.
> 
> Thoughts, comments or ideas?
> 
> 
> 
> Links & references
> https://github.com/josephg/librope
> http://userguide.icu-project.org/strings/utf-8
> 
> 
> Cheers,
>