Newsgroups: php.internals
Date: Tue, 28 Jul 2009 17:41:51 +0900
From: mozo@mozo.jp (Moriyoshi Koizumi)
To: php-dev
Subject: Re: Alternative mbstring implementation using ICU
In-Reply-To: <4A6C6496.7060603@mozo.jp>

I set up an RFC page for this on wiki.php.net. Here it is:

http://wiki.php.net/rfc/altmbstring

Moriyoshi

2009/7/26 Moriyoshi Koizumi:
> Hi there,
>
> I have almost finished an alternative implementation of mbstring that uses
> ICU instead of the exotic libmbfl, in the hope of replacing the current one
> for 5.4 (and possibly 6.0).
>
> Although there are admittedly some known incompatibilities that would need
> extra libraries to resolve, besides a number of missing functions that
> were intentionally removed for simplicity's sake, the frequently used
> functions are fully usable and more compliant with the standard (e.g.
> case-insensitive matches).
>
> Any comments are appreciated.
>
> The source is available at the following location:
>
> http://github.com/moriyoshi/mbstring-ng/
>
>
> Implemented functions:
>
> - mb_convert_encoding()
> - mb_detect_encoding()
> - mb_ereg()
> - mb_ereg_replace()
> - mb_internal_encoding()
> - mb_list_encodings()
> - mb_output_handler()
> - mb_parse_str()
> - mb_preferred_mime_name()
> - mb_regex_set_options()
> - mb_split()
> - mb_strcut()
> - mb_strimwidth()
> - mb_stripos()
> - mb_stristr()
> - mb_strlen()
> - mb_strpos()
> - mb_strripos()
> - mb_strrpos()
> - mb_strstr()
> - mb_strtolower()
> - mb_strtotitle()
> - mb_strtoupper()
> - mb_strwidth()
> - mb_substr()
> - mb_substr_count()
>
> Removed functions and the reasons behind their removal:
>
> - mb_check_encoding()
>  Not as useful as advertised. Validating a string against an encoding
>  is the same as filtering it through the converter with the same value
>  supplied for the input and output encoding. Thus just use
>  mb_convert_encoding().
>
> - mb_convert_case()
>  Use mb_strtoupper(), mb_strtolower() and mb_strtotitle().
>
> - mb_convert_kana()
>  This can't be standard-compliant. In addition, part of the
>  functionality is already covered by the Normalizer class of the intl
>  extension, so we need to carefully reconsider what is actually needed
>  here.
>
> - mb_convert_variables()
>  This can be implemented as a script.
>
> - mb_decode_mimeheader(), mb_encode_mimeheader()
>  Not standard-compliant.
>
> - mb_decode_numericentity()
>  Removed in favor of html_entity_decode().
>
> - mb_encode_numericentity()
>  Removed in favor of htmlentities() and htmlspecialchars().
>
> - mb_encoding_aliases()
>  Just unnecessary.
>
> - mb_ereg_match()
>  Use mb_ereg().
>
> - mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(),
>  mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and
>  mb_ereg_search_setpos()
>  I have rarely heard of a script that actively uses these functions.
>  They rely on internal state that is not visible to users, which most
>  likely causes confusion when that state carries over between function
>  calls. They need to be reimplemented as a class.
>
> - mb_eregi()
>  Use mb_regex_set_options() and mb_ereg().
>
> - mb_eregi_replace()
>  I wonder why this function was added in the first place, because
>  passing the 'i' option to mb_ereg_replace() works in the same way.
>
> - mb_detect_order(), mb_get_info(), mb_http_input(), mb_http_output(),
>  mb_language() and mb_substitute_character()
>  ini_set() and ini_get() are your friends, I guess...
>
> - mb_regex_encoding()
>  It is really confusing that the current mbstring allows two different
>  default encodings, one applied to the regex functions and one to the
>  rest. Those settings are unified in the alternative version, so this
>  is no longer necessary.
>
> - mb_send_mail()
>  The behavior of this function relies on the pseudo-locale setting
>  called "mbstring.language", which supports just a limited set of
>  possible locales. As not everyone can benefit from the function and
>  most significant applications implement their own mail functions, I
>  suppose this is no longer wanted.
>
> - mb_strrchr()
>  Use mb_strrpos().
>
> - mb_strrichr()
>  Use mb_strripos().
>
>
> Known limitations and incompatibilities:
>
> - mb_detect_encoding() doesn't work well anymore due to the
>  inaccuracy of ICU's encoding detection facility.
>
> - The request encoding translator now takes advantage of the SAPI
>  filter, therefore the name parts of the query components are no
>  longer converted.
>
> - The group reference placeholders for mb_ereg_replace() are now
>  $0, $1, $2... instead of \0, \1, \2. This can be avoided if we
>  don't use uregex_replaceAll() and implement our own.
>
> - ILP64 :-p
>
>
> Regards,
> Moriyoshi
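
Two of the replacements suggested in the removal list (mb_convert_encoding() in place of mb_check_encoding(), and mb_strrpos() in place of mb_strrchr()) can be sketched in userland PHP. This is only an illustration under stated assumptions: the helper names is_valid_encoding() and strrchr_via_strrpos() are hypothetical, and the code is written against the stock mbstring functions, not tested against mbstring-ng.

```php
<?php
/**
 * Hypothetical stand-in for mb_check_encoding(): convert the string from
 * $encoding to the same $encoding and see whether it survives unchanged.
 * Assumption: the converter replaces or drops ill-formed sequences, so an
 * invalid input never round-trips to identical bytes. Encodings with
 * non-reversible mappings could report false negatives.
 */
function is_valid_encoding($bytes, $encoding)
{
    return mb_convert_encoding($bytes, $encoding, $encoding) === $bytes;
}

/**
 * Hypothetical stand-in for mb_strrchr(): find the last occurrence of
 * $needle with mb_strrpos() and return the rest of the haystack from there.
 * Returns false when the needle does not occur.
 */
function strrchr_via_strrpos($haystack, $needle, $encoding = 'UTF-8')
{
    $pos = mb_strrpos($haystack, $needle, 0, $encoding);
    if ($pos === false) {
        return false;
    }
    // A null length means "up to the end of the string".
    return mb_substr($haystack, $pos, null, $encoding);
}

// Usage: "\xC3\xA9" is a well-formed UTF-8 sequence, "\xC3\x28" is not.
var_dump(is_valid_encoding("caf\xC3\xA9", 'UTF-8'));
var_dump(is_valid_encoding("caf\xC3\x28", 'UTF-8'));
var_dump(strrchr_via_strrpos("a/b/\xC3\xA7.txt", '/'));
```

The position returned by mb_strrpos() is counted in characters, which is why it is paired with mb_substr() and not with substr().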
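
The remark that mb_convert_variables() "can be implemented as a script" might look roughly like the sketch below. The function name convert_variables() is hypothetical, and the sketch is deliberately narrower than the original function: it assumes a single, known source encoding (no detection from a candidate list) and only handles strings and nested arrays, not objects.

```php
<?php
/**
 * Hypothetical userland replacement for mb_convert_variables().
 * Converts each variable passed by reference from $from to $to,
 * descending into arrays. Non-string values are left untouched.
 */
function convert_variables($to, $from, &...$vars)
{
    foreach ($vars as &$var) {
        if (is_string($var)) {
            $var = mb_convert_encoding($var, $to, $from);
        } elseif (is_array($var)) {
            // array_walk_recursive() visits every leaf; the callback takes
            // the leaf by reference so it can be rewritten in place.
            array_walk_recursive($var, function (&$value) use ($to, $from) {
                if (is_string($value)) {
                    $value = mb_convert_encoding($value, $to, $from);
                }
            });
        }
    }
    unset($var); // break the last reference left behind by foreach
}

// Usage: ISO-8859-1 input converted to UTF-8 in place.
$name = "caf\xE9";
$rows = array('city' => "Z\xFCrich", 'n' => 3, 'tags' => array("na\xEFve"));
convert_variables('UTF-8', 'ISO-8859-1', $name, $rows);
var_dump($name, $rows);
```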