Hi there,
I have almost finished an alternative implementation of mbstring that uses
ICU instead of the exotic libmbfl, in the hope of replacing the current one
for 5.4 (and possibly 6.0).
Admittedly, there are some known incompatibilities that would need extra
libraries to resolve, and a number of functions have been intentionally
removed for simplicity's sake. Still, the frequently used functions are
fully usable and more compliant with the standard (e.g. case-insensitive
matches).
Any comments are appreciated.
The source is ready in the following location:
http://github.com/moriyoshi/mbstring-ng/
Implemented functions:
- mb_convert_encoding()
- mb_detect_encoding()
- mb_ereg()
- mb_ereg_replace()
- mb_internal_encoding()
- mb_list_encodings()
- mb_output_handler()
- mb_parse_str()
- mb_preferred_mime_name()
- mb_regex_set_options()
- mb_split()
- mb_strcut()
- mb_strimwidth()
- mb_stripos()
- mb_stristr()
- mb_strlen()
- mb_strpos()
- mb_strripos()
- mb_strrpos()
- mb_strstr()
- mb_strtolower()
- mb_strtotitle()
- mb_strtoupper()
- mb_strwidth()
- mb_substr()
- mb_substr_count()
Removed functions and the reasons behind their removal:
- mb_check_encoding()
  Not as useful as advertised. Validating a string against an encoding
  is the same as filtering it through the converter with the same value
  for the input and output encoding, so just use mb_convert_encoding().
- mb_convert_case()
  Use mb_strtoupper(), mb_strtolower() and mb_strtotitle().
- mb_convert_kana()
  This can't be standard-compliant. In addition, part of the
  functionality is already covered by the Normalizer of the intl
  extension, so we need to carefully reconsider what is actually needed
  here.
- mb_convert_variables()
  This can be implemented as a script.
- mb_decode_mimeheader(), mb_encode_mimeheader()
  Not standard-compliant.
- mb_decode_numericentity()
  Removed in favor of html_entity_decode().
- mb_encode_numericentity()
  Removed in favor of htmlentities() and htmlspecialchars().
- mb_encoding_aliases()
  Simply unnecessary.
- mb_ereg_match()
  Use mb_ereg().
- mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(),
  mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and
  mb_ereg_search_setpos()
  I have rarely heard of a script that actively uses these functions.
  They rely on internal state that is not visible to users, which most
  likely causes confusion when they are used across function calls.
  They need to be reimplemented as a class.
- mb_eregi()
  Use mb_regex_set_options() and mb_ereg().
- mb_eregi_replace()
  I wonder why this function was added in the first place, because
  passing the 'i' option to mb_ereg_replace() works the same way.
- mb_detect_order(), mb_get_info(), mb_http_input(), mb_http_output(),
  mb_language() and mb_substitute_character()
  ini_set() and ini_get() are your friends, I guess...
- mb_regex_encoding()
  It is really confusing that the current mbstring allows two different
  default encodings, one applied to the regex functions and one to the
  rest. Those settings are unified in the alternative version, so this
  is no longer necessary.
- mb_send_mail()
  The behavior of this function relies on the pseudo-locale setting
  called "mbstring.language", which supports just a limited set of
  possible locales. As not everyone can benefit from the function and
  most significant applications implement their own mail functions, I
  suppose this is no longer wanted.
- mb_strrchr()
  Use mb_strrpos().
- mb_strrichr()
  Use mb_strripos().
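
To make two of the suggested replacements concrete, here is a minimal sketch written against the stock mbstring functions. The helper names are hypothetical and not part of mbstring or mbstring-ng, and the ICU-based rewrite may behave differently in edge cases.

```php
<?php
// Sketch only: both helper names are made up for illustration.

// Stand-in for mb_check_encoding(): a string is treated as valid when
// converting it from an encoding to that same encoding changes nothing.
function is_valid_in(string $s, string $encoding): bool
{
    return mb_convert_encoding($s, $encoding, $encoding) === $s;
}

// Stand-in for mb_strrchr(): locate the last occurrence with mb_strrpos()
// and return everything from that position onward (false if not found).
function last_occurrence_onward(string $haystack, string $needle, string $encoding = 'UTF-8')
{
    $pos = mb_strrpos($haystack, $needle, 0, $encoding);
    return $pos === false ? false : mb_substr($haystack, $pos, null, $encoding);
}
```

The round trip works because the converter replaces or drops byte sequences that are not valid in the given encoding, so an invalid input never comes back unchanged.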
Known limitations and incompatibilities:
- mb_detect_encoding() doesn't work well anymore due to the inaccuracy
  of ICU's encoding detection facility.
- The request encoding translator now takes advantage of the SAPI
  filter, so the name parts of the query components are no longer
  converted.
- The group reference placeholders for mb_ereg_replace() are now
  $0, $1, $2... instead of \0, \1, \2. This can be avoided if we
  don't use uregex_replaceAll() and implement our own.
- ILP64 :-p
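
For the placeholder incompatibility, existing replacement strings could in principle be translated mechanically. The sketch below uses a hypothetical helper (not part of either implementation) that rewrites \0..\9 into $0..$9. It deliberately ignores literal "$" or "\" characters already present in the replacement text, which would need extra escaping.

```php
<?php
// Hypothetical helper: rewrite \0..\9 group references into $0..$9.
// Literal "$" and "\" in the replacement text are NOT handled here.
function to_dollar_placeholders(string $replacement): string
{
    return preg_replace_callback(
        '/\\\\([0-9])/',               // a backslash followed by one digit
        function (array $m): string {
            return '$' . $m[1];        // same group number, "$" prefix
        },
        $replacement
    );
}
```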
Regards,
Moriyoshi
I set up an RFC page for this on wiki.php.net. Here it goes:
http://wiki.php.net/rfc/altmbstring
Moriyoshi
2009/7/26 Moriyoshi Koizumi mozo@mozo.jp:
> <snipped>
Isn't there any interest in this? If you think PHP 6 is gonna cover
all of the functionality that the allegedly crufty mbstring currently
provides, that is almost wrong :-p
Moriyoshi
> I set up an RFC page for this on wiki.php.net. Here it goes:
> http://wiki.php.net/rfc/altmbstring
I'm just waiting for you to just commit it.. :)
--Jani
> Isn't there any interest in this? If you think PHP 6 is gonna cover
> all of the functionality that the allegedly crufty mbstring currently
> provides, that is almost wrong :-p
> Moriyoshi
>> I set up an RFC page for this on wiki.php.net. Here it goes:
>> http://wiki.php.net/rfc/altmbstring
Hi!
> Isn't there any interest in this? If you think PHP 6 is gonna cover
> all of the functionality that the allegedly crufty mbstring currently
> provides, that is almost wrong :-p
Could you please explain why PHP6 doesn't provide what mbstring is
doing? I.e, let's go over the functions:
mb_parse_str - since detecting encoding doesn't work per RFC, what is
the usefulness of this function? Wouldn't PHP 6 do the same with correct
charset?
mb_str* - shouldn't you in 6 just convert them to unicode and do all
string operations with Unicode strings? Also, in 5 isn't there some
intersection with grapheme_* functions?
mb_output_handler - shouldn't setting the proper encoding in 6 do the
same job?
mb_convert_encoding - don't we already have a number of functions that
do encoding conversions?
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
> Hi!
>> Isn't there any interest in this? If you think PHP 6 is gonna cover
>> all of the functionality that the allegedly crufty mbstring currently
>> provides, that is almost wrong :-p
> Could you please explain why PHP6 doesn't provide what mbstring is doing?
> I.e, let's go over the functions:
> mb_parse_str - since detecting encoding doesn't work per RFC, what is the
> usefulness of this function? Wouldn't PHP 6 do the same with correct
> charset?
As for this one, you have a point.
> mb_str* - shouldn't you in 6 just convert them to unicode and do all string
> operations with Unicode strings? Also, in 5 isn't there some intersection
> with grapheme_* functions?
mb_strwidth() and mb_strimwidth() are not covered.
> mb_output_handler - shouldn't setting the proper encoding in 6 do the same job?
> mb_convert_encoding - don't we already have a number of functions that do encoding conversions?
I don't think it can gracefully handle characters that have no
corresponding entries in the target character set. I'm even thinking
of adding a class interface dedicated to encoding conversion, with
which one can deal with such characters in a user-supplied handler.
Regards,
Moriyoshi
>> mb_str* - shouldn't you in 6 just convert them to unicode and do all string
>> operations with Unicode strings? Also, in 5 isn't there some intersection
>> with grapheme_* functions?
> mb_strwidth() and mb_strimwidth() are not covered.
I should have also mentioned the grapheme_* functions. Yes, there might
be some overlap, and I even think grapheme_* provide better support for
Unicode string manipulation, but it would be a bit better if they
accepted an arbitrary encoding as an argument.
Regards,
Moriyoshi
Hi!
> support for Unicode string manipulation, but it would actually be
> better a bit if they supported arbitrary encoding as arguments.
There's no such thing as "encoding" in PHP 6 - all strings are Unicode.
If you've got binary data in some encoding, I think the recommended way
for PHP 6 would be to convert it to Unicode string and then apply the
string functions. I know that probably would mean some slowdown but I
understand that's how PHP 6 is supposed to work.
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Hi!
>> mb_str* - shouldn't you in 6 just convert them to unicode and do all string
>> operations with Unicode strings? Also, in 5 isn't there some intersection
>> with grapheme_* functions?
> mb_strwidth() and mb_strimwidth() are not covered.
True. I wonder what these functions are useful for?
>> mb_output_handler - shouldn't setting the proper encoding in 6 do the same job?
>> mb_convert_encoding - don't we already have a number of functions that do encoding conversions?
> I don't think It can gracefully handle characters that have no
> corresponding entries in the target character set. I'm even thinking
That's a common problem; IIRC the PHP 6 converters have configurable error
modes for that. Don't unicode_set_error_handler() and
unicode_set_error_mode() do what you want?
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
Hi,
> Hi!
>>> mb_str* - shouldn't you in 6 just convert them to unicode and do all
>>> string operations with Unicode strings? Also, in 5 isn't there some
>>> intersection with grapheme_* functions?
>> mb_strwidth() and mb_strimwidth() are not covered.
> True. I wonder what this function is useful for?
They calculate the total width of a string based on the "East Asian
Width" property, which is still a valid way to get a rough measurement
of the rendered string.
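
As a rough illustration of the difference between counting characters and counting display columns, here is a small sketch using the stock mbstring functions. The sample string is made up, and the ICU-based rewrite is not guaranteed to produce identical widths for every character.

```php
<?php
// Three wide CJK characters followed by three ASCII letters.
$s = "\u{65E5}\u{672C}\u{8A9E}abc";

$chars = mb_strlen($s, 'UTF-8');   // number of code points: 6
$width = mb_strwidth($s, 'UTF-8'); // wide characters count as 2 columns: 9

// Trim to at most 5 display columns, without a trailing marker.
$cut = mb_strimwidth($s, 0, 5, '', 'UTF-8');
```

This is the kind of measurement that matters when laying out text in a fixed-width context such as a terminal or a plain-text mail.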
>>> mb_output_handler - shouldn't setting the proper encoding in 6 do the
>>> same job?
>>> mb_convert_encoding - don't we already have a number of functions that do
>>> encoding conversions?
>> I don't think It can gracefully handle characters that have no
>> corresponding entries in the target character set. I'm even thinking
> That's a common problem, IIRC PHP 6 converters have configurable error modes
> for that. Don't unicode_set_error_handler() and unicode_set_error_mode() do
> what you want?
I guess it isn't what I want. If my understanding is correct, a
handler set by unicode_set_error_handler() merely deals with the
aftermath and cannot interact with the converter. There are good
reasons to support user-supplied mappings of characters in the PUA to
one of the legacy encodings such as Shift_JIS, rather than just
replacing such characters with placeholders.
In addition, shouldn't there be cases where one has to manipulate
Unicode strings on a per-coded-character basis rather than a
per-grapheme basis, just like substr() in PHP 6?
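
The distinction between the two granularities can be seen with a base letter plus a combining mark. This is a sketch against PHP 5.3+/7+ with both mbstring and intl loaded, not against PHP 6; the sample string is an assumption chosen for illustration.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme.
$s = "e\u{0301}";

$codePoints = mb_strlen($s, 'UTF-8');            // per coded character: 2
$graphemes  = grapheme_strlen($s);               // per grapheme: 1

$firstCodePoint = mb_substr($s, 0, 1, 'UTF-8');  // just "e", accent cut off
$firstGrapheme  = grapheme_substr($s, 0, 1);     // "e" together with its accent
```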
Regards,
Moriyoshi
Hi!
> They calculate the total width of a string based on "east asian width"
> property, which is still valid to give a rough measurement of the
> rendered string.
OK, I guess if it's some kind of special calculation that doesn't follow
from the others, it should be preserved; there are tons of such special
functions in PHP.
>> That's a common problem, IIRC PHP 6 converters have configurable error modes
>> for that. Don't unicode_set_error_handler() and unicode_set_error_mode() do
>> what you want?
> I guess it isn't what I want. If my understanding is correct, a
> handler set by unicode_set_error_handler() merely deals with the
> aftermath and cannot interact with the converter. There are good
That depends. For some error modes, it tells the converter to replace
invalid chars with some other char or to skip them. However, you can't
currently specify custom mappings (I'm not sure ICU allows that, but
maybe it can be simulated...). The question here is: is it really worth
keeping a whole separate conversion system just for this, or can it be
done with the standard conversion, possibly somewhat tweaked?
> In addition to these, shouldn't there be any case where one have to
> manipulate Unicode strings on per-coded-character-basis rather than
> per-grapheme-basis just like substr() in PHP6?
In PHP 6 right now it's actually the only case; the grapheme functions
are not even ported to PHP 6 yet (I know, not good). But that's what the
regular str* functions should be doing, right?
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
> Hi!
>> They calculate the total width of a string based on "east asian width"
>> property, which is still valid to give a rough measurement of the
>> rendered string.
> OK, I guess if it's some kind of special calculation that doesn't follow
> from others it should be preserved, there are tons of such special functions
> in PHP.
>>> That's a common problem, IIRC PHP 6 converters have configurable error
>>> modes for that. Don't unicode_set_error_handler() and
>>> unicode_set_error_mode() do what you want?
>> I guess it isn't what I want. If my understanding is correct, a
>> handler set by unicode_set_error_handler() merely deals with the
>> aftermath and cannot interact with the converter. There are good
> That depends. For some error modes, it says to converter to replace invalid
> chars with some other char or skip it. You can't however now specify custom
> mappings (I'm not sure ICU allows that, but maybe it can be simulated...).
> Here the question is - is it really worth to keep whole separate conversion
> system for just this, or can it be done with standard conversion, possibly
> somewhat tweaked?
It can be done through conversion error handlers. You can append an
encoded form of a codepoint for such unassigned characters to the
buffer within the handler.
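
As one possible shape for such a handler, here is a sketch built on the intl extension's UConverter class, which was added to PHP later than this thread and is not what mbstring-ng itself uses. The class name FallbackConverter and the mapping table are hypothetical; the typed signature assumes PHP 8.

```php
<?php
// Assumes PHP 8 with the intl extension loaded.
class FallbackConverter extends UConverter
{
    /** @var array<int,string> hypothetical map: code point => bytes in the target encoding */
    private array $fallbacks;

    public function __construct(string $to, string $from, array $fallbacks)
    {
        parent::__construct($to, $from);
        $this->fallbacks = $fallbacks;
    }

    // Called by the converter when a code point cannot be written to the target.
    public function fromUCallback(int $reason, array $source, int $codePoint, int &$error): array|string|int|null
    {
        if ($reason === UConverter::REASON_UNASSIGNED && isset($this->fallbacks[$codePoint])) {
            $error = U_ZERO_ERROR;                 // tell the converter the problem is handled
            return $this->fallbacks[$codePoint];   // bytes appended to the output buffer
        }
        // Anything not in the table falls back to the default behavior.
        return parent::fromUCallback($reason, $source, $codePoint, $error);
    }
}

// Usage: map U+00F1 to a plain "n" when converting to ASCII.
$conv = new FallbackConverter('ascii', 'utf-8', [0x00F1 => 'n']);
$out = $conv->convert("Espa\u{00F1}ol");
```

For the Shift_JIS case discussed here, the table would instead hold PUA code points mapped to the vendor-specific byte sequences, which the caller has to supply.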
And yes, it's worth providing a separate conversion system. You might
not be aware of it, but there are several sets of different character
sets, each of which is often represented with a specific encoding
scheme. Shift_JIS is one of those.
>> In addition to these, shouldn't there be any case where one have to
>> manipulate Unicode strings on per-coded-character-basis rather than
>> per-grapheme-basis just like substr() in PHP6?
> In PHP 6 right now it's actually the only case, grapheme functions not even
> ported to PHP 6 yet (I know, not good) - but that's what regular str*
> functions should be doing, right?
What I am mainly interested in is 5.4, or whatever comes before 6. BTW,
it would have been much better if there had been some coordination
between the developers of the mbstring and intl extensions.
Moriyoshi
Hi!
> It can be done through conversion error handlers. You can append an
> encoded form of a codepoint for such unassigned characters to the
> buffer within the handler.
OK, if so we may want to add an implementation of this behavior to our
ICU support.
> And yes, it's worth providing separate conversion system. You might
> not be aware of it, but there are several sets of different character
> sets, each of which is often represented with a specific encoding
> scheme. Shift_JIS is one of those.
I'm not sure I understand. There are tons of character sets, etc., but as
I understand it, the ICU conversion routines handle them, including
Shift_JIS. Isn't that true?
> What I am mainly interested in is 5.4, or something that will come
> before 6. BTW, it would be much better if there had been a sort of
> coordination between the developers of mbstring and intl extension.
I'm not sure what will happen with 5.4 etc., but I'd certainly be glad to
help as much as I can with anything regarding the intl extension. Do you
have some specific things that need to be done?
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
>> And yes, it's worth providing separate conversion system. You might
>> not be aware of it, but there are several sets of different character
>> sets, each of which is often represented with a specific encoding
>> scheme. Shift_JIS is one of those.
> I'm not sure I understand. There are tons of character sets, etc. but as I
> understand ICU conversion routines handle them, including Shift_JIS - isn't
> it true?
Coded character sets and character encoding schemes are different
concepts. As for the specific case I mentioned, there are a number of
variants of the character set that is commonly represented as
Shift_JIS, and ICU doesn't support all of those.
>> What I am mainly interested in is 5.4, or something that will come
>> before 6. BTW, it would be much better if there had been a sort of
>> coordination between the developers of mbstring and intl extension.
> I'm not sure what will happen about 5.4 etc. but sure I'd be glad to help as
> much as I could with anything regarding intl extension. DO you have some
> specific things that need to be done?
This is just one of my ideas, but if the intl extension eventually gains
enough functionality to let one write emulated mbstring functions in
userland, that would sound very attractive to me.
Moriyoshi
> I set up an RFC page for this on wiki.php.net. Here it goes:
> http://wiki.php.net/rfc/altmbstring
So.. Shouldn't we try to get this into PHP5.4?
-Hannes
Hi!
>> I set up an RFC page for this on wiki.php.net. Here it goes:
>> http://wiki.php.net/rfc/altmbstring
> So.. Shouldn't we try to get this into PHP5.4?
Is it ready? Maybe have it as a PECL extension?
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227
> Hi!
>>> I set up an RFC page for this on wiki.php.net. Here it goes:
>>> http://wiki.php.net/rfc/altmbstring
>> So.. Shouldn't we try to get this into PHP5.4?
> Is it ready? Maybe have it as a PECL extension?
I was under the impression it was a drop-in-replacement, so a pecl ext
doesn't make much sense.
As for it being 100%, probably not - but there are still several months to go.
-Hannes
Hi!
> I was under the impression it was a drop-in-replacement, so a pecl ext
> doesn't make much sense.
I'd say that if it's a drop-in, a PECL ext may still make sense: you don't
compile in the old one but use the PECL one instead. Wouldn't that work?
--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227
On Tue, May 31, 2011 at 10:42 AM, Stas Malyshev smalyshev@sugarcrm.com wrote:
> Hi!
>> I was under the impression it was a drop-in-replacement, so a pecl ext
>> doesn't make much sense.
> I'd say if it's a drop-in PECL ext may still make sense - you don't compile
> in the old one but use the PECL one instead. Wouldn't it work?
If you have to recompile PHP without mbstring to use the replacement, then I
think it doesn't really matter whether you can install this through PECL or
not.
Of course, it's better to have this in the PECL repo than not to have it at
all.
Tyrael
2009/7/26 Moriyoshi Koizumi mozo@mozo.jp:
> mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(),
> mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and
> mb_ereg_search_setpos()
> I rarely heard a script that actively uses these functions. They
> involve an internal state that is not visible to users, and thus it
> most likely causes confusion when used across the function calls.
> Need to be reimplemented as a class.
I actually do use these. ;)
Probably, it will make sense to implement a completely new oniguruma
extension instead of keeping it as a part of mb_?
--
Alexey Zakhlestin
http://www.milkfarmsoft.com/
2009/7/26 Moriyoshi Koizumi mozo@mozo.jp:
>> mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(),
>> mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and
>> mb_ereg_search_setpos()
>> I rarely heard a script that actively uses these functions. They
>> involve an internal state that is not visible to users, and thus it
>> most likely causes confusion when used across the function calls.
>> Need to be reimplemented as a class.
> I actually do use these. ;)
> Probably, it will make sense to implement a completely new oniguruma
> extension instead of keeping it as a part of mb_?
I'm planning to reimplement them as a single SPL iterator. As I noted,
I also created a separate oniguruma extension that you can browse at
http://github.com/moriyoshi/php-oniguruma/
Regards,
Moriyoshi
> I also created a separate oniguruma extension that you can browse at
> http://github.com/moriyoshi/php-oniguruma/
cool! will take a look at it
--
Alexey Zakhlestin
http://www.milkfarmsoft.com/
> <snipped>
> Implemented functions:
> mb_ereg()
> mb_ereg_replace()
As the ereg functions are deprecated in 5.3, are these still needed?
> <snipped>
> Regards,
> Moriyoshi
--
Niel Archer
niel.archer (at) blueyonder.co.uk
>> Implemented functions:
>> mb_ereg()
>> mb_ereg_replace()
> as ereg functions are deprecated in 5.3, are these still needed?
These have nothing in common with "those" ereg functions. They are
based on the Oniguruma regex library:
http://www.geocities.jp/kosako3/oniguruma/
--
Alexey Zakhlestin
http://www.milkfarmsoft.com/
>>> Implemented functions:
>>> mb_ereg()
>>> mb_ereg_replace()
>> as ereg functions are deprecated in 5.3, are these still needed?
> these have nothing in common with "those" ereg functions. these are
> based on onuguruma regex library
> http://www.geocities.jp/kosako3/oniguruma/
I find Oniguruma to be, in general, a pared-down and less-useful
version of the PCRE we already have. Given that PCRE has full support
for UTF-8, and that there's nothing you can do with Oniguruma that you
can't also practically do with PCRE (to the best of my knowledge), I
think it would be best for PHP to standardize on a single regexp
library, rather than offering competing and confusing options. Killing
off POSIX syntax was a step in that direction, and I see no reason not
to take the rest of the steps.
If Oniguruma were offered as a PECL extension, I would think that
perfectly reasonable, but I don't think it belongs in core.
-- Gwynne
>> Implemented functions:
>> mb_ereg()
>> mb_ereg_replace()
> as ereg functions are deprecated in 5.3, are these still needed?
The mb_ereg_XXX() functions have nothing to do with the plain ereg
functions. They are named that way purely for historical reasons.
Moriyoshi