Newsgroups: php.internals
From: mozo@mozo.jp (Moriyoshi Koizumi)
Date: Mon, 3 Aug 2009 17:55:39 +0900
Subject: Re: [PHP-DEV] Re: Alternative mbstring implementation using ICU
To: Stanislav Malyshev
Cc: php-dev
In-Reply-To: <4A738624.1@zend.com>
References: <4A6C6496.7060603@mozo.jp> <4A71DA47.8080809@zend.com> <4A731DE2.2060206@zend.com> <4A738624.1@zend.com>

On Sat, Aug 1, 2009 at 9:02 AM, Stanislav Malyshev wrote:
> Hi!
>
>> They calculate the total width of a string based on "east asian width"
>> property, which is still valid to give a rough measurement of the
>> rendered string.
>
> OK, I guess if it's some kind of special calculation that doesn't follow
> from others it should be preserved, there are tons of such special
> functions in PHP.
>
>>> That's a common problem, IIRC PHP 6 converters have configurable error
>>> modes for that. Don't unicode_set_error_handler() and
>>> unicode_set_error_mode() do what you want?
>>
>> I guess it isn't what I want. If my understanding is correct, a
>> handler set by unicode_set_error_handler() merely deals with the
>> aftermath and cannot interact with the converter. There are good
>
> That depends. For some error modes, it says to converter to replace
> invalid chars with some other char or skip it. You can't however now
> specify custom mappings (I'm not sure ICU allows that, but maybe it can
> be simulated...). Here the question is - is it really worth to keep whole
> separate conversion system for just this, or can it be done with standard
> conversion, possibly somewhat tweaked?

It can be done through conversion error handlers. Within the handler, you
can append an encoded form of the code point to the output buffer for such
unassigned characters.

And yes, it is worth providing a separate conversion system. You might not
be aware of it, but there are several distinct character sets, each of
which is often represented with a specific encoding scheme. Shift_JIS is
one such encoding.

>> In addition to these, shouldn't there be any case where one have to
>> manipulate Unicode strings on per-coded-character-basis rather than
>> per-grapheme-basis just like substr() in PHP6?
>
> In PHP 6 right now it's actually the only case, grapheme functions not
> even ported to PHP 6 yet (I know, not good) - but that's what regular str*
> functions should be doing, right?

What I am mainly interested in is 5.4, or whatever release comes before 6.

BTW, it would have been much better if there had been some coordination
between the developers of the mbstring and intl extensions.

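To illustrate the conversion error handler idea above (appending an encoded
form of the code point when the target charset has no mapping), here is a
minimal sketch. It uses Python's codec error callbacks purely as a stand-in;
it is not the PHP 6 or ICU API. The handler name "custom_fallback", the
CUSTOM_MAP table and the to_shift_jis() helper are all made up for the
example.

```python
import codecs

# Hypothetical custom mappings for characters the target charset lacks.
CUSTOM_MAP = {
    "\u20ac": "EUR",  # EURO SIGN has no mapping in Python's shift_jis codec
}


def custom_fallback(exc):
    """Error callback: called by the encoder for each unencodable run."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Use the custom mapping if there is one; otherwise append an encoded
    # form of the code point (a numeric character reference here).
    replacement = "".join(
        CUSTOM_MAP.get(ch, "&#x%X;" % ord(ch)) for ch in bad
    )
    # Returning exc.end tells the encoder to resume after the bad run.
    return replacement, exc.end


codecs.register_error("custom_fallback", custom_fallback)


def to_shift_jis(text):
    """Encode text as Shift_JIS, routing unmapped characters to the handler."""
    return text.encode("shift_jis", errors="custom_fallback")


if __name__ == "__main__":
    print(to_shift_jis("5\u20ac"))   # custom mapping applied
    print(to_shift_jis("\u00e9"))    # falls back to a numeric reference
```

The point is only that the handler takes part in the conversion itself,
deciding what goes into the output, rather than being notified afterwards.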
Moriyoshi

> --
> Stanislav Malyshev, Zend Software Architect
> stas@zend.com   http://www.zend.com/
> (408)253-8829   MSN: stas@zend.com
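
For reference, the width calculation quoted at the top of this message
(summing a per-character width derived from the "east asian width" property)
can be sketched roughly as below. This is an approximation under stated
assumptions: it counts Wide and Fullwidth characters as two columns and
everything else as one, skips combining marks, and uses Python's unicodedata
module. The function name display_width() is made up, and the result will
not necessarily agree with mbstring's own width tables.

```python
import unicodedata


def display_width(text):
    """Rough column count for text, based on the East Asian Width property."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        # "W" (Wide) and "F" (Fullwidth) occupy two columns in a
        # fixed-pitch East Asian font; treat everything else as one.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


if __name__ == "__main__":
    for sample in ("abc", "\u65e5\u672c\u8a9e", "\uff71", "\uff21"):
        print(repr(sample), display_width(sample))
```

As the quoted text says, this gives only a rough measurement of the rendered
string; actual rendering depends on the font and the terminal.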