Newsgroups: php.internals
Xref: news.php.net php.internals:45098
From: mozo@mozo.jp (Moriyoshi Koizumi)
To: php-dev
Date: Sun, 26 Jul 2009 23:13:42 +0900
Message-ID: <4A6C6496.7060603@mozo.jp>
Subject: Alternative mbstring implementation using ICU

Hi there,

I have almost finished an alternative implementation of mbstring that uses ICU instead of the exotic libmbfl, in the hope of replacing the current one for 5.4 (and possibly 6.0). Admittedly, there are some known incompatibilities that need extra libraries to resolve, and a number of functions have been intentionally removed for simplicity's sake. Still, the frequently used functions are fully usable and more compliant with the standard (e.g. case-insensitive matches).

Any comments are appreciated. The source is available at the following location:

http://github.com/moriyoshi/mbstring-ng/

Implemented functions:

- mb_convert_encoding()
- mb_detect_encoding()
- mb_ereg()
- mb_ereg_replace()
- mb_internal_encoding()
- mb_list_encodings()
- mb_output_handler()
- mb_parse_str()
- mb_preferred_mime_name()
- mb_regex_set_options()
- mb_split()
- mb_strcut()
- mb_strimwidth()
- mb_stripos()
- mb_stristr()
- mb_strlen()
- mb_strpos()
- mb_strripos()
- mb_strrpos()
- mb_strstr()
- mb_strtolower()
- mb_strtotitle()
- mb_strtoupper()
- mb_strwidth()
- mb_substr()
- mb_substr_count()

Removed functions and the reasons behind their removal:

- mb_check_encoding()
  Not as usable as it is advertised, period.
  First of all, validating an encoding is the same as filtering the input through the converter with the same value supplied for the input and output encodings. So just use mb_convert_encoding().
- mb_convert_case()
  Use mb_strtoupper(), mb_strtolower() and mb_strtotitle().
- mb_convert_kana()
  This can't be standard-compliant. In addition, part of the functionality is already covered by the Normalizer of the intl extension, so we need to carefully reconsider what is actually needed here.
- mb_convert_variables()
  This can be implemented as a script.
- mb_decode_mimeheader(), mb_encode_mimeheader()
  Not standard-compliant.
- mb_decode_numericentity()
  Removed in favor of html_entity_decode().
- mb_encode_numericentity()
  Removed in favor of htmlentities() and htmlspecialchars().
- mb_encoding_aliases()
  Simply unnecessary.
- mb_ereg_match()
  Use mb_ereg().
- mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(), mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and mb_ereg_search_setpos()
  I have rarely heard of a script that actively uses these functions. They rely on internal state that is not visible to users, which most likely causes confusion when it is carried across function calls. They need to be reimplemented as a class.
- mb_eregi()
  Use mb_regex_set_options() and mb_ereg().
- mb_eregi_replace()
  I wonder why this function was added in the first place, because giving the 'i' option to mb_ereg_replace() works the same way.
- mb_detect_order(), mb_get_info(), mb_http_input(), mb_http_output(), mb_language() and mb_substitute_character()
  ini_set() and ini_get() are your friends, I guess...
- mb_regex_encoding()
  It is really confusing that the current mbstring allows two different default encodings, one applied to the regex functions and one to the rest. These settings are unified in the alternative version, so this function is no longer necessary.
- mb_send_mail()
  The behavior of this function relies on the pseudo-locale setting called "mbstring.language", which supports only a limited set of locales. As not everyone can benefit from the function and most significant applications implement their own mail functions, I suppose it is no longer wanted.
- mb_strrchr()
  Use mb_strrpos().
- mb_strrichr()
  Use mb_strripos().

Known limitations and incompatibilities:

- mb_detect_encoding() no longer works well, due to the inaccuracy of ICU's encoding detection facility.
- The request encoding translator now takes advantage of the SAPI filter, so the name parts of the query components are no longer converted.
- The group reference placeholders for mb_ereg_replace() are now $0, $1, $2... instead of \0, \1, \2. This can be avoided if we implement our own replacement instead of using uregex_replaceAll().
- ILP64 :-p

Regards,
Moriyoshi
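
Appendix (illustrative only): the note on mb_check_encoding() above says that validation amounts to running the input through the converter with the same encoding on both sides. A minimal sketch of that idea follows. The helper name is_valid_encoding() is hypothetical and is not part of mbstring or mbstring-ng, and the sketch assumes the converter replaces or drops ill-formed sequences, as the stock mbstring does; whether the ICU-based version behaves identically is not confirmed here.

```php
<?php
/**
 * Hypothetical helper (not an mbstring function): report whether $bytes
 * is well-formed in $encoding by round-tripping it through the converter.
 *
 * Assumption: ill-formed sequences are substituted or removed during
 * conversion, so the output differs from the input when it is invalid.
 */
function is_valid_encoding($bytes, $encoding)
{
    // Same encoding for input and output: valid data passes through unchanged.
    return mb_convert_encoding($bytes, $encoding, $encoding) === $bytes;
}

// "café" as well-formed UTF-8.
var_dump(is_valid_encoding("caf\xC3\xA9", 'UTF-8')); // bool(true)

// "café" as ISO-8859-1 bytes, which is ill-formed when read as UTF-8.
var_dump(is_valid_encoding("caf\xE9", 'UTF-8'));     // bool(false)
```

The comparison is byte-for-byte, so it only says whether the converter altered the data; it does not say where the offending sequence is.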