Newsgroups: php.internals
From: php@marc-bennewitz.de (Marc Bennewitz)
To: internals@lists.php.net
Date: Fri, 21 Feb 2014 20:49:08 +0100
Subject: Re: [PHP-DEV] [php6] Unicode support, options?

hi,

I have been a PHP developer for a long time but have only a little
knowledge of C/C++, so I don't know some of the really internal parts of
the engine.

From my perspective the internal datatype "string" should be a binary
string (byte array), and only in a specific context should this binary
string be interpreted as a more specialized string. In my understanding
this is already the case about 90% of the time.

Unicode support (and others) could be done as a String class, like it's
done in Java, implementing the magic method "__toString" to get the raw
binary string.
- We already have "(binary)" as an alias for "(string)".

This should be almost compatible with current behavior and provide a
very clean API as sugar.

Only the places where the current string type is not handled as a binary
string without context would need to be updated, like

var_dump("1e1" == "10");

which currently compares both operands as numbers. But

var_dump("1e1" == 10);

should work as before, because the integer operand would switch the
binary string into the context of a numeric (ASCII) string.

Thoughts?

Marc

On 20.02.2014 06:54, Pierre Joye wrote:
> hi,
>
> Unicode still remains one of the top requested features in PHP.
>
> However, as Rasmus and others stated earlier, it is not a trivial job.
> Some of the key points we need to take care of are:
>
> - UTF-8 storage
> - UTF-8 support for almost all (if not all) existing string APIs
> - Performance
>
> As of today, I did not find any library covering at least two of these
> key points.
>
> Please keep in mind that I am by no means a Unicode expert, and this
> summary is what I gathered by reading the ICU and other projects'
> documentation and discussion archives. Experiments still have to be
> done. However, I prefer to discuss the options before going wild
> with an implementation (a huge task, even for basic feature coverage).
>
> If one of the following statements is wrong or inaccurate, please fix
> it. I will keep a dedicated wiki page to summarize the discussions and
> options about Unicode support.
>
> * ICU:
> U_CHARSET_IS_UTF8 allows forcing ICU to use UTF-8 by default. It is an
> ICU compile-time setting. It is not possible to set it at PHP
> configure time.
> It means that users will have to create their own
> build. Alternatively we can bundle ICU, but this will be awkward, a
> maintenance nightmare for both PHP and the distros.
>
> Alternatively, UText can be used to create UTF-8 strings. APIs accepting
> UText allow almost everything we need. However, the downside is that
> a UTF-8 UText is read-only. Any operation altering its content will
> require duplication, clones or conversions. That may kill all the gains
> we got from using UTF-8 only.
>
> U_CHARSET_IS_UTF8 is very appealing, but bundling ICU is actually a
> show stopper. Asking users to custom-build ICU is not an option
> either. I do not know if the distros will be ready to provide two
> different builds of ICU either; it may cause a lot of issues for all
> projects using ICU.
>
> * UTF8proc
> utf8proc is very attractive, small and relatively fast. I see it as a
> good starting point. However, its features cover only a very small part
> of what PHP needs. It is easy to bundle but will require a fork and a
> lot of work to add all the missing features.
>
> * librope
> Same comments as for utf8proc, with even fewer features.
>
> I would like to begin discussing our options now. I am not
> asking to get into all the implementation details from a userland point
> of view (like u"some text" or adding new APIs or not) but only to see
> what we can do internally to work with UTF-8 strings.
>
> Thoughts, comments or ideas?
>
>
>
> Links & references
> https://github.com/josephg/librope
> http://userguide.icu-project.org/strings/utf-8
>
>
> Cheers,
>
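
To make the String-class idea from the top of this mail a bit more concrete, here is a minimal userland sketch. It is only an illustration under assumptions: the class name Utf8String and its length() method are hypothetical (nothing like this exists in the engine), and validation and counting are done with PCRE's /u modifier rather than ICU or utf8proc. The point is just that the stored value stays a plain byte string and only the wrapper interprets it as UTF-8.

```php
<?php
// Hypothetical sketch: Utf8String is NOT an existing PHP class.
final class Utf8String
{
    /** @var string Raw bytes; the engine-level string stays a byte array. */
    private $bytes;

    public function __construct($bytes)
    {
        // With the /u modifier, preg_match() returns false (not 1) when
        // the subject is not valid UTF-8.
        if (preg_match('//u', $bytes) !== 1) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->bytes = $bytes;
    }

    /** Number of Unicode code points, not bytes. */
    public function length()
    {
        // /u: match code points; /s: let "." also match newlines.
        return preg_match_all('/./us', $this->bytes);
    }

    /** Back to the raw binary string, as proposed via __toString. */
    public function __toString()
    {
        return $this->bytes;
    }
}

$s = new Utf8String("h\xC3\xA9llo"); // "héllo" encoded as UTF-8
var_dump(strlen((string) $s));       // int(6): bytes of the raw string
var_dump($s->length());              // int(5): code points
```

The (string) cast goes through __toString, so existing byte-oriented functions such as strlen() keep working on the raw value, while anything Unicode-aware lives on the wrapper.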