Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:45169
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain mozo.jp from 209.85.217.228 cause and error)
MIME-Version: 1.0
In-Reply-To: <4A731DE2.2060206@zend.com>
References: <4A6C6496.7060603@mozo.jp> <cd1fb7540907280141t5c1dcc02q7bf05c668f2b74bf@mail.gmail.com> 
	<cd1fb7540907290208u331072dbne20b2ba6c0ee59e4@mail.gmail.com> 
	<4A71DA47.8080809@zend.com> <cd1fb7540907310124q2697e324j930f11f6fef88fee@mail.gmail.com> 
	<4A731DE2.2060206@zend.com>
Date: Sat, 1 Aug 2009 07:57:23 +0900
Message-ID: <cd1fb7540907311557l67e58315u7e498fad441b1c6@mail.gmail.com>
To: Stanislav Malyshev <stas@zend.com>
Cc: php-dev <internals@lists.php.net>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Re: Alternative mbstring implementation using ICU
From: mozo@mozo.jp (Moriyoshi Koizumi)

Hi,

On Sat, Aug 1, 2009 at 1:37 AM, Stanislav Malyshev<stas@zend.com> wrote:
> Hi!
>
>>> mb_str* - shouldn't you in 6 just convert them to unicode and do all
>>> string
>>> operations with Unicode strings? Also, in 5 isn't there some intersection
>>> with grapheme_* functions?
>>
>> mb_strwidth() and mb_strimwidth() are not covered.
>
> True. I wonder what this function is useful for?

They calculate the total display width of a string based on the Unicode
"East Asian Width" property, which is still a valid way to get a rough
measurement of the rendered string.
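
To illustrate the idea, here is a rough Python sketch. It is not
mbstring's actual implementation: the name str_width() and the choice to
count "ambiguous" and combining characters as one column are my own
assumptions.

```python
import unicodedata

def str_width(s):
    """Approximate display width in columns.

    Wide ('W') and Fullwidth ('F') characters count as 2 columns;
    everything else (including ambiguous 'A') counts as 1.
    """
    width = 0
    for ch in s:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width
```

With this, str_width("abc") is 3, while three kanji such as
"\u65e5\u672c\u8a9e" give 6.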

>
>>> mb_output_handler - shouldn't setting the proper encoding in 6 do the
>>> same job?
>>> mb_convert_encoding - don't we already have a number of functions that do
>>> encoding conversions?
>>
>> I don't think It can gracefully handle characters that have no
>> corresponding entries in the target character set. I'm even thinking
>
> That's a common problem, IIRC PHP 6 converters have configurable error modes
> for that. Don't unicode_set_error_handler() and unicode_set_error_mode() do
> what you want?

I don't think that is what I want. If my understanding is correct, a
handler set by unicode_set_error_handler() merely deals with the
aftermath and cannot interact with the converter.  There are good
reasons to support user-supplied mappings of characters in the PUA
(Private Use Area) to one of the legacy encodings such as Shift_JIS,
rather than just replacing such characters with placeholders.
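
To make concrete what I mean by "interacting with the converter", here
is a sketch in Python rather than PHP, since Python's codec error
handlers may hand replacement bytes back to the encoder. The table
PUA_TO_SJIS, the handler name "pua_mapping", and the byte value chosen
for U+E000 are all hypothetical and only for illustration.

```python
import codecs

# Hypothetical user-supplied table: PUA code point -> Shift_JIS bytes.
PUA_TO_SJIS = {
    0xE000: b"\xf0\x40",  # illustrative value only
}

def pua_mapping_handler(exc):
    """Encoding error handler that consults PUA_TO_SJIS before giving up."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    ch = exc.object[exc.start]
    mapped = PUA_TO_SJIS.get(ord(ch))
    if mapped is None:
        # No user mapping for this character: let the error propagate.
        raise exc
    # Return the mapped bytes and resume after the offending character.
    return mapped, exc.start + 1

codecs.register_error("pua_mapping", pua_mapping_handler)

def to_sjis(s):
    """Encode to Shift_JIS, applying the user mapping to unmappable characters."""
    return s.encode("shift_jis", errors="pua_mapping")
```

Here the handler supplies the output bytes itself instead of merely
being told that a conversion failed, which is the kind of hook I am
asking for.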

In addition to these, aren't there cases where one has to
manipulate Unicode strings on a per-coded-character basis rather than
on a per-grapheme basis, just as substr() does in PHP 6?
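
The difference between the two can be sketched in Python as follows.
Note that rough_clusters() is a hypothetical helper that only attaches
combining marks to the preceding character. It is a crude stand-in for
real grapheme segmentation, not an implementation of it.

```python
import unicodedata

def rough_clusters(s):
    """Group each combining mark with the character before it.

    Crude approximation of grapheme clusters, for illustration only.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' followed by COMBINING ACUTE ACCENT, then "tude"
s = "e\u0301tude"

first_code_point = s[:1]             # just 'e', the accent is cut off
first_cluster = rough_clusters(s)[0]  # 'e' plus the combining accent
```

The string has six coded characters but only five clusters, so the two
kinds of operation give different results and both have their uses.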

Regards,
Moriyoshi