Newsgroups: php.internals
Date: Thu, 20 Feb 2014 06:54:21 +0100
To: PHP internals
Subject: [php6] Unicode support, options?
From: pierre.php@gmail.com (Pierre Joye)

hi,

Unicode remains one of the most requested features in PHP. However, as Rasmus and others stated earlier, it is not a trivial job. Some of the key points we need to take care of are:

- UTF-8 storage
- UTF-8 support for almost all (if not all) existing string APIs
- Performance

As of today, I have not found any library covering at least two of these key points.

Please keep in mind that I am by no means a Unicode expert; this summary is what I gathered from reading the documentation and discussion archives of ICU and other projects. Experiments still have to be done. However, I would rather discuss the options before going wild with an implementation (a huge task, even for basic feature coverage).

If any of the following statements is wrong or inaccurate, please correct it. I will keep a dedicated wiki page to summarize the discussions and options about Unicode support.

* ICU

U_CHARSET_IS_UTF8 forces ICU to use UTF-8 by default. It is an ICU compile-time setting; it is not possible to set it at PHP configure time. This means that users would have to create their own ICU build. Alternatively we could bundle ICU, but this would be awkward: a maintenance nightmare for both PHP and the distros.

Alternatively, UText can be used to create UTF-8 strings. APIs accepting UText allow almost everything we need. The drawback is that a UTF-8 UText is read-only. Any operation altering its content will require duplication, clones or conversions. That may kill all the gains we get from using UTF-8 only.

U_CHARSET_IS_UTF8 is very appealing, but having to bundle ICU is a show stopper. Asking users to custom-build ICU is not an option either. I do not know whether the distros would be ready to provide two different builds of ICU; it may cause a lot of issues for all projects using ICU.

* utf8proc

utf8proc is very attractive, small and relatively fast. I see it as a good starting point. However, its features cover only a very small part of what PHP needs. It is easy to bundle but would require a fork and a lot of work to add all the missing features.

* librope

Same comments as for utf8proc, with even fewer features.

I would like to start discussing our options now. I am not asking to get into all the implementation details from a userland point of view (like u"some text", or whether to add new APIs) but only to see what we can do internally to work with UTF-8 strings.

Thoughts, comments or ideas?

Links & references:
https://github.com/josephg/librope
http://userguide.icu-project.org/strings/utf-8

Cheers,
-- 
Pierre

@pierrejoye | http://www.libgd.org
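
To illustrate the "UTF-8 support for existing string APIs" point, here is a minimal C sketch of why byte-oriented functions cannot simply be reused once strings are stored as UTF-8. It is not taken from ICU, utf8proc or librope; the helper name utf8_count_codepoints is hypothetical, and the code assumes the input is already valid UTF-8 (no validation is attempted).

```c
#include <stddef.h>

/* Hypothetical helper, not part of any library named in this thread.
 * Counts code points in a NUL-terminated string, ASSUMING valid UTF-8:
 * every byte that is not a continuation byte (bit pattern 10xxxxxx)
 * starts a new code point. */
static size_t utf8_count_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" ("héllo" encoded as UTF-8), strlen() reports 6 bytes while the helper above reports 5 code points. Any existing API that takes or returns offsets and lengths would need to decide which of the two it means.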