From: derick@php.net (Derick Rethans)
To: PHP Developers Mailing List
Date: Wed, 31 Aug 2005 17:00:43 +0200 (CEST)
Subject: ICU and Locale/Collations
Newsgroups: php.internals

Hello!

I've been looking at using locale and collation now we have ICU.
Please let me know if you have any comments:

Locale Functions with Unicode
=============================

Introduction
------------

The Unicode design document states that all current functions that can make use
of a locale (such as strtoupper) are not going to be implemented in a
locale-aware way. Although this will work for most situations, it might break
BC in a few. One (popular) example is upper-casing the name Hans Blix with
strtoupper() while a Turkish locale is active.

In PHP 4 and 5 this returns (when viewing in iso-8859-9):

::

	HANS BLİX

Where in PHP 6 this currently returns:

::

	HANS BLIX

The string returned for PHP 4 and 5 is the correct one for Turkish. See also
note 1.

Locale Dependent Functions
--------------------------

There are other functions that deal with the locale settings, some in a
different way. Below is a list of these functions and how they use the system
locale.

Array Sorting Functions
~~~~~~~~~~~~~~~~~~~~~~~

All the array sorting functions accept a flag "SORT_LOCALE_STRING" that changes
the sorting of array keys/values from a binary compare to a locale-based
compare. This uses the function strcoll(), which relies on the system's locale.

String Functions
~~~~~~~~~~~~~~~~

str_word_count
	Uses the system locale to determine which characters make up a word.

strnatcasecmp, strnatcmp
	Use the locale to upper and lower case letters, and to determine if
	something is a digit or not.

strcmp, strncmp
	Currently do not use any locale, but perhaps they can make use of it, e.g.
	in the ß vs ss case.

strcasecmp, strncasecmp
	Use the system locale to lower case letters so that they can be matched
	case-insensitively. See also note 2.

strtolower, strtoupper
	Both make use of locale properties for characters to lower/upper case them
	properly.

ucfirst, ucwords
	Use character properties to upper and lower case the first letters of
	words.

Other Functions
~~~~~~~~~~~~~~~

localeconv
	Uses the system locale to return information about this locale.

money_format
	Uses the system locale to format a number as a monetary number.

Problems with System Locales
----------------------------

There are a number of problems with having to rely on the locale information
that is available on different platforms / installations. Locale information:

- can be different for each platform
- might not be available depending on platform and installation
- does not have a common identifier on different platforms

ICU Locales and Collators
-------------------------

As ICU provides us with a platform and installation independent way of dealing
with locales and collation rules, we can use this to get rid of the current
dependency on system locales. There are three ways in which we can upgrade our
functions to use ICU locales:

1. We simply make them use the default locale, as set by icu_loc_set_default(),
   and the default collator (as set by a future icu_coll_set_default()).
2. We add a new parameter to the functions specifying which locale to use.
3. We create new functions that are locale and collation dependent (by using
   the default locale/collation).

Each of those three options has pros and cons.

Modifying Functions to Use ICU Locales
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pro:

- No additional programming needed by users as the current functions would
  "just work like expected". For people that do not care about locales, nothing
  will really change, as the current default locale should be "C" or "POSIX".
- No ugly API for our string handling functions.

con:

- It might break BC in some cases.

Adding a New Argument to Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

pro:

- Doesn't break BC

con:

- Additional work for programmers for every function call.
- Ugly API because of the passing of the locale name.

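To make the trade-off between the first two options more concrete, here is a toy sketch in C of the two API shapes. Everything in it is an assumption made for illustration: the toy_* names are hypothetical, the case mapping only handles ASCII input plus the Turkish dotted I, and a real implementation would delegate to ICU rather than hard-code anything.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* All toy_* names are hypothetical; they only illustrate the API shapes. */

static const char *toy_default_locale = "C";

/* Option 1 support: a process-wide default locale, in the spirit of
 * icu_loc_set_default(). */
void toy_set_default_locale(const char *locale)
{
    toy_default_locale = locale;
}

/* Option 2: the caller names the locale on every call.
 * Upper-cases ASCII input into dst, written as UTF-8. In a locale whose name
 * starts with "tr", 'i' becomes U+0130 (bytes C4 B0) instead of 'I'.
 * Returns 0 on success, -1 if dst is too small. */
int toy_upper_in(const char *src, const char *locale, char *dst, size_t dst_size)
{
    int turkish = strncmp(locale, "tr", 2) == 0;
    size_t out = 0;

    if (dst_size == 0) {
        return -1;
    }
    for (; *src != '\0'; src++) {
        if (turkish && *src == 'i') {
            if (out + 2 >= dst_size) {
                return -1;
            }
            dst[out++] = '\xC4';
            dst[out++] = '\xB0';
        } else {
            if (out + 1 >= dst_size) {
                return -1;
            }
            dst[out++] = (char)toupper((unsigned char)*src);
        }
    }
    dst[out] = '\0';
    return 0;
}

/* Option 1: the signature stays as it is today; the locale is implicit. */
int toy_upper(const char *src, char *dst, size_t dst_size)
{
    return toy_upper_in(src, toy_default_locale, dst, dst_size);
}
```

With option 1, existing callers of toy_upper() keep working unchanged, but their result changes once someone calls toy_set_default_locale("tr_TR"), which is the BC risk. With option 2, nothing changes behind the caller's back, at the cost of passing the locale name on every call.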
Create New Functions
~~~~~~~~~~~~~~~~~~~~

pro:

- Doesn't break BC
- No ugly API

con:

- Additional work for programmers as they need to replace the current functions
  with the upgraded ones.
- It is crucial that the new functions can not be disabled, because of
  portability.
- We need to come up with a good prefix for those.
- The new functions need to work when Unicode semantics are turned off.

Discussion
----------

Both the first and third options would in my opinion be acceptable, where I
would prefer the first one, as it gives as little headache as possible for
users to start using locales. This approach would work well for the String
Functions.

For the array sorting functions, I would prefer that the current
"SORT_LOCALE_STRING" simply starts using the ICU collation functionality, as
it's a relatively new flag. Another solution would be to create a new flag for
this, "SORT_ICU_LOCALE_STRING", that makes the sorting functions use the
collation functionality provided by ICU.

For the Other Functions we should create a new function to format numbers in a
locale-aware way, as it would be very hard to make the current money_format
compatible with ICU and still give the full possibilities of ICU's number
formatting functionality.

Other Functions' Implementation
-------------------------------

i18n_format_number($number, $type [, $custom_format])
	A wrapper around ICU's unum.h C-API
	(http://icu.sourceforge.net/apiref/icu4c/unum_8h.html) that allows you to
	format numbers in locale specific ways.

i18n_parse_number($number, $type [, $custom_format])
	A wrapper around the number parsing routines from unum.h

Notes:
------

1. For some reason, in PHP 6, the strtoupper() function *does* make use of the
   locale though: by setting the locale with icu_loc_set_default("tr_TR"), the
   PHP 6 example gives the correct result:

   ::

	HANS BLİX

2. The function zend_u_binary_strncmp doesn't compare anything binary, as it
   uses U16_NEXT.
   Why do we still call it u_binary_strncmp?

regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org