Alternative mbstring implementation using ICU

16 years ago by Moriyoshi Koizumi

Hi there,

I have almost finished an alternative implementation of mbstring that
uses ICU instead of the exotic libmbfl, in the hope of replacing the
current one in 5.4 (and possibly 6.0).

Admittedly, there are some known incompatibilities that would need
extra libraries to resolve, as well as a number of functions that were
intentionally removed for simplicity's sake. Still, the frequently used
functions are fully usable and more compliant with the standard (e.g.
case-insensitive matches).

Any comments are appreciated.

The source is ready in the following location:

http://github.com/moriyoshi/mbstring-ng/

Implemented functions:

mb_convert_encoding()
mb_detect_encoding()
mb_ereg()
mb_ereg_replace()
mb_internal_encoding()
mb_list_encodings()
mb_output_handler()
mb_parse_str()
mb_preferred_mime_name()
mb_regex_set_options()
mb_split()
mb_strcut()
mb_strimwidth()
mb_stripos()
mb_stristr()
mb_strlen()
mb_strpos()
mb_strripos()
mb_strrpos()
mb_strstr()
mb_strtolower()
mb_strtotitle()
mb_strtoupper()
mb_strwidth()
mb_substr()
mb_substr_count()

Removed functions and the reasons for removing them:

mb_check_encoding()
Not as useful as advertised, period. First of all, validating a string
against an encoding is just the same as filtering it through the
converter with the same value for the input and output encoding. So
just use mb_convert_encoding().

mb_convert_case()
Use mb_strtoupper(), mb_strtolower() and mb_strtotitle().

mb_convert_kana()
This can't be standards-compliant. In addition, part of its
functionality is already covered by the Normalizer of the intl
extension, so we need to reconsider carefully what is actually needed
here.

mb_convert_variables()
This can be implemented as a script.

mb_decode_mimeheader(), mb_encode_mimeheader()
Not standards-compliant.

mb_decode_numericentity()
Removed in favor of html_entity_decode().

mb_encode_numericentity()
Removed in favor of htmlentities() and htmlspecialchars().

mb_encoding_aliases()
Simply unnecessary.

mb_ereg_match()
Use mb_ereg().

mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(),
mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and
mb_ereg_search_setpos()
I have rarely heard of a script that actively uses these functions.
They involve internal state that is not visible to users, which is
likely to cause confusion when it is carried across function calls.
They need to be reimplemented as a class.

mb_eregi()
Use mb_regex_set_options() and mb_ereg().

mb_eregi_replace()
I wonder why this function was added in the first place, because
passing the 'i' option to mb_ereg_replace() works the same way.

mb_detect_order(), mb_get_info(), mb_http_input(), mb_http_output(),
mb_language() and mb_substitute_character()
ini_set() and ini_get() are your friends, I guess...

mb_regex_encoding()
It is really confusing that the current mbstring allows two different
encoding defaults, one applied to the regex functions and one to the
rest. These settings are unified in the alternative version, so this
function is no longer necessary.

mb_send_mail()
The behavior of this function relies on the pseudo-locale setting
called "mbstring.language", which supports only a limited set of
locales. As not everyone can benefit from the function and most
significant applications implement their own mail functions, I suppose
it is no longer wanted.

mb_strrchr()
Use mb_strrpos().

mb_strrichr()
Use mb_strripos().
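
The argument given for dropping mb_check_encoding(), that validation is the same as running the converter with identical input and output encodings, can be sketched outside PHP. The following is a minimal illustration in Python rather than PHP; the helper names `is_valid_encoding` and `survives_round_trip` are hypothetical and are not part of mbstring or of the proposed replacement.

```python
def is_valid_encoding(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly as `encoding`."""
    try:
        # A strict decode plays the role of "filtering through the
        # converter": ill-formed input raises instead of being replaced.
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def survives_round_trip(data: bytes, encoding: str) -> bool:
    """Return True if converting `encoding` -> `encoding` keeps `data` intact."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        return False
```

An unknown encoding name raises LookupError here rather than returning False, and stateful encodings may legitimately fail the byte-for-byte round trip even when the input is valid, so the second helper is stricter than plain validation.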

Known limitations and incompatibilities:

mb_detect_encoding() doesn't work well anymore due to the
inaccuracy of ICU's encoding detection facility.

The request encoding translator now takes advantage of the SAPI
filter, so the name parts of the query components are no longer
converted.

The group reference placeholders for mb_ereg_replace() are now
$0, $1, $2... instead of \0, \1, \2. This can be avoided if we
don't use uregex_replaceAll() and implement our own.

ILP64 :-p
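
One way to live with the placeholder change would be to translate old-style replacement strings before handing them to the regex engine. The sketch below is in Python for illustration only. `to_dollar_refs` is a hypothetical helper, and it assumes that in the new syntax a backslash escapes the following character and that a literal `$` therefore needs escaping; the exact escaping rules of the old and new implementations should be checked before relying on it.

```python
DIGITS = set("0123456789")


def to_dollar_refs(replacement: str) -> str:
    """Rewrite \\0..\\9 group references as $0..$9 (illustrative only)."""
    out = []
    i = 0
    while i < len(replacement):
        ch = replacement[i]
        nxt = replacement[i + 1] if i + 1 < len(replacement) else ""
        if ch == "\\":
            if nxt in DIGITS:
                out.append("$" + nxt)   # \1 becomes $1
                i += 2
            elif nxt == "\\":
                out.append("\\\\")      # escaped backslash stays escaped
                i += 2
            else:
                out.append("\\\\")      # lone backslash: keep it literal
                i += 1
        elif ch == "$":
            out.append("\\$")           # literal dollar must now be escaped
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `to_dollar_refs(r"\2-\1")` yields `$2-$1`.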

Regards,
Moriyoshi

16 years ago by Moriyoshi Koizumi

I set up an RFC page for this on wiki.php.net. Here it is:
http://wiki.php.net/rfc/altmbstring

Moriyoshi

16 years ago by Moriyoshi Koizumi

Isn't there any interest in this? If you think PHP 6 is gonna cover
all of the functionality that the allegedly crufty mbstring currently
provides, that is mostly wrong :-p

Moriyoshi

16 years ago by Jani Taskinen

I'm just waiting for you to just commit it.. :)

--Jani

16 years ago by Stanislav Malyshev

Hi!

> Aren't there any interests on this? If you think PHP 6 is gonna cover
> all of the functionality that allegedly-cruft mbstring currently
> provides, that is almost wrong :-p

Could you please explain why PHP 6 doesn't provide what mbstring is
doing? I.e., let's go over the functions:

mb_parse_str - since detecting the encoding doesn't work per RFC, what
is the usefulness of this function? Wouldn't PHP 6 do the same with the
correct charset?

mb_str* - shouldn't you in 6 just convert them to Unicode and do all
string operations with Unicode strings? Also, in 5 isn't there some
overlap with the grapheme_* functions?

mb_output_handler - shouldn't setting the proper encoding in 6 do the
same job?

mb_convert_encoding - don't we already have a number of functions that
do encoding conversions?

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

16 years ago by Moriyoshi Koizumi

> Hi!

>> Aren't there any interests on this? If you think PHP 6 is gonna cover
>> all of the functionality that allegedly-cruft mbstring currently
>> provides, that is almost wrong :-p

> Could you please explain why PHP6 doesn't provide what mbstring is doing?
> I.e, let's go over the functions:
>
> mb_parse_str - since detecting encoding doesn't work per RFC, what is the
> usefulness of this function? Wouldn't PHP 6 do the same with correct
> charset?

On this one, you have a point.

> mb_str* - shouldn't you in 6 just convert them to unicode and do all string
> operations with Unicode strings? Also, in 5 isn't there some intersection
> with grapheme_* functions?

mb_strwidth() and mb_strimwidth() are not covered.

> mb_output_handler - shouldn't setting the proper encoding in 6 do the same job?
> mb_convert_encoding - don't we already have a number of functions that do encoding conversions?

I don't think it can gracefully handle characters that have no
corresponding entries in the target character set. I'm even thinking
of adding a class interface dedicated to encoding conversion, with
which one can deal with such characters in a user-supplied handler.

Regards,
Moriyoshi

16 years ago by Moriyoshi Koizumi

>> mb_str* - shouldn't you in 6 just convert them to unicode and do all string
>> operations with Unicode strings? Also, in 5 isn't there some intersection
>> with grapheme_* functions?

> mb_strwidth() and mb_strimwidth() are not covered.

I should also have mentioned the grapheme_* functions. Yes, there may
be some overlap, and I even think grapheme_* provides better support
for Unicode string manipulation, but it would be somewhat better if
those functions accepted an arbitrary encoding as an argument.

Regards,
Moriyoshi

16 years ago by Stanislav Malyshev

Hi!

> support for Unicode string manipulation, but it would actually be
> better a bit if they supported arbitrary encoding as arguments.

There's no such thing as "encoding" in PHP 6 - all strings are Unicode.
If you've got binary data in some encoding, I think the recommended way
for PHP 6 would be to convert it to a Unicode string and then apply the
string functions. I know that would probably mean some slowdown, but I
understand that's how PHP 6 is supposed to work.
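
The convert-first workflow described here can be sketched in Python, whose `str` type is also Unicode-only, as a stand-in for the planned PHP 6 behavior. `first_chars` is a hypothetical helper and the sample text is made up for illustration.

```python
def first_chars(raw: bytes, encoding: str, count: int) -> bytes:
    """Take the first `count` characters of encoded data."""
    text = raw.decode(encoding)           # binary data -> Unicode string
    return text[:count].encode(encoding)  # operate, then convert back


raw = "日本語のテキスト".encode("shift_jis")
print(len(raw))                    # length in bytes
print(len(raw.decode("shift_jis")))  # length in characters
```

The two conversions on every call are the slowdown mentioned above; an encoding-aware function such as mb_substr() avoids exposing them to the caller.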

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

16 years ago by Stanislav Malyshev

Hi!

>> mb_str* - shouldn't you in 6 just convert them to unicode and do all string
>> operations with Unicode strings? Also, in 5 isn't there some intersection
>> with grapheme_* functions?

> mb_strwidth() and mb_strimwidth() are not covered.

True. I wonder what these functions are useful for?

>> mb_output_handler - shouldn't setting the proper encoding in 6 do the same job?
>> mb_convert_encoding - don't we already have a number of functions that do encoding conversions?

> I don't think It can gracefully handle characters that have no
> corresponding entries in the target character set. I'm even thinking

That's a common problem; IIRC the PHP 6 converters have configurable
error modes for that. Don't unicode_set_error_handler() and
unicode_set_error_mode() do what you want?

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

16 years ago by Moriyoshi Koizumi

Hi,

> Hi!

>>> mb_str* - shouldn't you in 6 just convert them to unicode and do all
>>> string operations with Unicode strings? Also, in 5 isn't there some
>>> intersection with grapheme_* functions?

>> mb_strwidth() and mb_strimwidth() are not covered.

> True. I wonder what this function is useful for?

They calculate the total width of a string based on the "East Asian
Width" property, which is still a valid way to get a rough measurement
of the rendered string.
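
A rough idea of such a width calculation, sketched in Python with the standard `unicodedata` module. This is only an approximation under stated assumptions: Wide and Fullwidth characters count as two columns and everything else, including Ambiguous characters, as one. The function names are hypothetical, and the actual rules used by mb_strwidth() may differ.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Columns occupied by one character: 2 for Wide/Fullwidth, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def display_width(text: str) -> int:
    """Approximate rendered width of `text` in terminal columns."""
    return sum(char_width(ch) for ch in text)


def trim_to_width(text: str, max_width: int) -> str:
    """Longest prefix of `text` that fits in `max_width` columns."""
    out, used = [], 0
    for ch in text:
        width = char_width(ch)
        if used + width > max_width:
            break
        out.append(ch)
        used += width
    return "".join(out)
```

`trim_to_width` is a simplified analogue of mb_strimwidth() without the trim-marker argument.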

>>> mb_output_handler - shouldn't setting the proper encoding in 6 do the
>>> same job?
>>> mb_convert_encoding - don't we already have a number of functions that do
>>> encoding conversions?

>> I don't think It can gracefully handle characters that have no
>> corresponding entries in the target character set. I'm even thinking

> That's a common problem, IIRC PHP 6 converters have configurable error modes
> for that. Don't unicode_set_error_handler() and unicode_set_error_mode() do
> what you want?

I guess it isn't what I want. If my understanding is correct, a
handler set by unicode_set_error_handler() merely deals with the
aftermath and cannot interact with the converter. There are good
reasons to support user-supplied mappings from characters in the PUA
to one of the legacy encodings such as Shift_JIS, rather than just
replacing such characters with placeholders.

In addition, aren't there cases where one has to manipulate Unicode
strings on a per-coded-character basis rather than a per-grapheme
basis, just like substr() in PHP 6?
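
The distinction between the two granularities can be made concrete with a small Python sketch. Python strings index by code point, so plain slicing shows the per-coded-character behavior. `approx_graphemes` is a hypothetical and deliberately crude grouping that only attaches combining marks to the preceding character; it is not the full segmentation that the grapheme_* functions perform.

```python
import unicodedata


def approx_graphemes(text: str) -> list:
    """Group each character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(word))                    # counts code points
print(len(approx_graphemes(word)))  # counts user-perceived characters
print(word[:4])                     # code-point slicing drops the accent
```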

Regards,
Moriyoshi

16 years ago by Stanislav Malyshev

Hi!

> They calculate the total width of a string based on "east asian width"
> property, which is still valid to give a rough measurement of the
> rendered string.

OK, I guess if it's some kind of special calculation that doesn't follow
from the others it should be preserved; there are tons of such special
functions in PHP.

>> That's a common problem, IIRC PHP 6 converters have configurable error modes
>> for that. Don't unicode_set_error_handler() and unicode_set_error_mode() do
>> what you want?

> I guess it isn't what I want. If my understanding is correct, a
> handler set by unicode_set_error_handler() merely deals with the
> aftermath and cannot interact with the converter. There are good

That depends. For some error modes, it tells the converter to replace
invalid chars with some other char or to skip them. You can't, however,
currently specify custom mappings (I'm not sure ICU allows that, but
maybe it can be simulated...). The question here is: is it really worth
keeping a whole separate conversion system just for this, or can it be
done with the standard conversion, possibly somewhat tweaked?

> In addition to these, shouldn't there be any case where one have to
> manipulate Unicode strings on per-coded-character-basis rather than
> per-grapheme-basis just like substr() in PHP6?

In PHP 6 right now it's actually the only case; the grapheme functions
are not even ported to PHP 6 yet (I know, not good). But that's what the
regular str* functions should be doing, right?

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

16 years ago by Moriyoshi Koizumi

> Hi!

>> They calculate the total width of a string based on "east asian width"
>> property, which is still valid to give a rough measurement of the
>> rendered string.

> OK, I guess if it's some kind of special calculation that doesn't follow
> from others it should be preserved, there are tons of such special functions
> in PHP.

>>> That's a common problem, IIRC PHP 6 converters have configurable error
>>> modes for that. Don't unicode_set_error_handler() and
>>> unicode_set_error_mode() do what you want?

>> I guess it isn't what I want. If my understanding is correct, a
>> handler set by unicode_set_error_handler() merely deals with the
>> aftermath and cannot interact with the converter. There are good

> That depends. For some error modes, it says to converter to replace invalid
> chars with some other char or skip it. You can't however now specify custom
> mappings (I'm not sure ICU allows that, but maybe it can be simulated...).
> Here the question is - is it really worth to keep whole separate conversion
> system for just this, or can it be done with standard conversion, possibly
> somewhat tweaked?

It can be done through conversion error handlers. Within the handler,
you can append the encoded form of a code point for such an unassigned
character to the output buffer.
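
The handler-based approach described above can be illustrated with Python's codec error handlers, which play a similar role to the ICU converter callbacks being discussed. This is a sketch, not the proposed PHP API. The handler name "pua-to-sjis", the function `pua_handler` and the mapping table are all hypothetical, and the byte value in the table is chosen purely for illustration.

```python
import codecs

# Hypothetical user-supplied mapping from private-use code points to
# Shift_JIS byte sequences (illustrative values only).
PUA_TO_SJIS = {
    "\ue000": b"\xf0\x40",
}


def pua_handler(err):
    """Emit user-mapped bytes for characters the codec cannot encode."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    unmapped = err.object[err.start:err.end]
    # Fall back to '?' for anything the user table does not cover.
    out = b"".join(PUA_TO_SJIS.get(ch, b"?") for ch in unmapped)
    return out, err.end   # bytes to append, and where to resume


codecs.register_error("pua-to-sjis", pua_handler)

encoded = "A\ue000B".encode("shift_jis", errors="pua-to-sjis")
```

Because the handler returns raw bytes, the converter copies them into the output as-is, which is the "interact with the converter" behavior that a plain after-the-fact error handler cannot provide.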

And yes, it's worth providing a separate conversion system. You might
not be aware of it, but there are several families of differing
character sets, each of which is often represented with one specific
encoding scheme. Shift_JIS is one of those.

>> In addition to these, shouldn't there be any case where one have to
>> manipulate Unicode strings on per-coded-character-basis rather than
>> per-grapheme-basis just like substr() in PHP6?

> In PHP 6 right now it's actually the only case, grapheme functions not even
> ported to PHP 6 yet (I know, not good) - but that's what regular str*
> functions should be doing, right?

What I am mainly interested in is 5.4, or something that will come
before 6. BTW, it would have been much better if there had been some
coordination between the developers of mbstring and the intl extension.

Moriyoshi

16 years ago by Stanislav Malyshev

Hi!

> It can be done through conversion error handlers. You can append an
> encoded form of a codepoint for such unassigned characters to the
> buffer within the handler.

OK, if so we may want to add an implementation of this behavior to our
ICU support.

> And yes, it's worth providing separate conversion system. You might
> not be aware of it, but there are several sets of different character
> sets, each of which is often represented with a specific encoding
> scheme. Shift_JIS is one of those.

I'm not sure I understand. There are tons of character sets, etc., but
as I understand it the ICU conversion routines handle them, including
Shift_JIS - isn't that true?

> What I am mainly interested in is 5.4, or something that will come
> before 6. BTW, it would be much better if there had been a sort of
> coordination between the developers of mbstring and intl extension.

I'm not sure what will happen about 5.4 etc., but I'd certainly be glad
to help as much as I can with anything regarding the intl extension. Do
you have some specific things that need to be done?

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

16 years ago by Moriyoshi Koizumi

>> And yes, it's worth providing separate conversion system. You might
>> not be aware of it, but there are several sets of different character
>> sets, each of which is often represented with a specific encoding
>> scheme. Shift_JIS is one of those.

> I'm not sure I understand. There are tons of character sets, etc. but as I
> understand ICU conversion routines handle them, including Shift_JIS - isn't
> it true?

Coded character sets and character encoding schemes are different
concepts. As for the specific case I mentioned, there are a number of
variants of the character set that is commonly represented as
Shift_JIS, and ICU doesn't support all of those.
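
The variant problem is easy to observe even outside ICU. Python ships several codecs that are all loosely called "Shift_JIS", and the sketch below compares two of them: `shift_jis` (JIS X 0208 based) and `cp932` (the Microsoft variant). It is an illustration in Python and says nothing about which variants ICU or mbstring cover. `can_encode` is a hypothetical helper.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


WAVE = b"\x81\x60"                       # one byte sequence...
jis_reading = WAVE.decode("shift_jis")   # ...read as U+301C WAVE DASH
ms_reading = WAVE.decode("cp932")        # ...read as U+FF5E FULLWIDTH TILDE

circled_one = "\u2460"                   # vendor extension character
print(can_encode(circled_one, "cp932"), can_encode(circled_one, "shift_jis"))
```

The same bytes map to different code points, and some characters exist in one variant only, so a single converter labeled "Shift_JIS" cannot serve every data source.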

>> What I am mainly interested in is 5.4, or something that will come
>> before 6. BTW, it would be much better if there had been a sort of
>> coordination between the developers of mbstring and intl extension.

> I'm not sure what will happen about 5.4 etc. but sure I'd be glad to help as
> much as I could with anything regarding intl extension. DO you have some
> specific things that need to be done?

This is just one of my ideas, but if the intl extension eventually gains
enough functionality to let one write emulated mbstring functions in
userland, that would sound very attractive to me.

Moriyoshi

14 years ago by Hannes Magnusson

> I set up a RFC page for this in wiki.php.net. Here it goes:
> http://wiki.php.net/rfc/altmbstring

So... shouldn't we try to get this into PHP 5.4?

-Hannes

14 years ago by Stas Malyshev

Hi!

>> I set up a RFC page for this in wiki.php.net. Here it goes:
>> http://wiki.php.net/rfc/altmbstring

> So.. Shouldn't we try to get this into PHP5.4?

Is it ready? Maybe have it as a PECL extension?

Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

14 years ago by Hannes Magnusson

> Hi!

>>> I set up a RFC page for this in wiki.php.net. Here it goes:
>>> http://wiki.php.net/rfc/altmbstring

>> So.. Shouldn't we try to get this into PHP5.4?

> Is it ready? Maybe have it as a PECL extension?

I was under the impression it was a drop-in replacement, so a PECL ext
doesn't make much sense.
As for it being 100% ready, probably not - but there are still several
months to go.

-Hannes

14 years ago by Stas Malyshev

Hi!

> I was under the impression it was a drop-in-replacement, so a pecl ext
> doesn't make much sense.

I'd say that even if it's a drop-in, a PECL ext may still make sense:
you don't compile in the old one but use the PECL one instead. Wouldn't
that work?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

14 years ago by Ferenc Kovacs

On Tue, May 31, 2011 at 10:42 AM, Stas Malyshev <smalyshev@sugarcrm.com> wrote:

> Hi!

>> I was under the impression it was a drop-in-replacement, so a pecl ext
>> doesn't make much sense.

> I'd say if it's a drop-in PECL ext may still make sense - you don't compile
> in the old one but use the PECL one instead. Wouldn't it work?

If you have to recompile PHP without mbstring to use the replacement,
then I think it doesn't really matter whether you can install this
through PECL or not.
Of course, it's better to have this in the PECL repo than not to have it
at all.

Tyrael

16 years ago by Alexey Zakhlestin

2009/7/26 Moriyoshi Koizumi <mozo@mozo.jp>:

> mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(),
> mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and
> mb_ereg_search_setpos()
> I rarely heard a script that actively uses these functions. They
> involve an internal state that is not visible to users, and thus it
> most likely causes confusion when used across the function calls.
> Need to be reimplemented as a class.

I actually do use these. ;)
Perhaps it would make sense to implement a completely new oniguruma
extension instead of keeping it as a part of mb_?

--
Alexey Zakhlestin
http://www.milkfarmsoft.com/

16 years ago by Moriyoshi Koizumi

> 2009/7/26 Moriyoshi Koizumi <mozo@mozo.jp>:

>> mb_ereg_search(), mb_ereg_search_getpos(), mb_ereg_search_getregs(),
>> mb_ereg_search_init(), mb_ereg_search_pos(), mb_ereg_search_regs() and
>> mb_ereg_search_setpos()
>> I rarely heard a script that actively uses these functions. They
>> involve an internal state that is not visible to users, and thus it
>> most likely causes confusion when used across the function calls.
>> Need to be reimplemented as a class.

> I actually do use these. ;)
> Probably, it will make sense to implement a completely new oniguruma
> extension instead of keeping it as a part of mb_?

I'm planning to reimplement them as a single SPL iterator. As I noted,
I have also created a separate oniguruma extension that you can browse at
http://github.com/moriyoshi/php-oniguruma/
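
To show what moving the hidden search state into an object might look like, here is a minimal sketch in Python rather than PHP. The class name `RegexSearch` and its attributes are hypothetical; this is not the planned SPL iterator, only an illustration of keeping the pattern, subject and position on an instance instead of in global state.

```python
import re


class RegexSearch:
    """Iterate over successive matches, keeping the position explicit."""

    def __init__(self, pattern: str, subject: str, flags: int = 0):
        self._regex = re.compile(pattern, flags)
        self.subject = subject
        self.pos = 0   # visible counterpart of the hidden search position

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos > len(self.subject):
            raise StopIteration
        match = self._regex.search(self.subject, self.pos)
        if match is None:
            raise StopIteration
        # Resume after the match; step one further on an empty match so
        # the iteration always makes progress.
        self.pos = match.end() if match.end() > match.start() else match.end() + 1
        return match
```

Two searches created this way do not interfere with each other, which is the main problem with the function-based interface being removed.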

Regards,
Moriyoshi

16 years ago by Alexey Zakhlestin

> I also created a separate oniguruma extension that you can browse at
> http://github.com/moriyoshi/php-oniguruma/

Cool! I will take a look at it.

--
Alexey Zakhlestin
http://www.milkfarmsoft.com/

16 years ago by Niel Archer

> Implemented functions:
>
> mb_ereg()
> mb_ereg_replace()

As the ereg functions are deprecated in 5.3, are these still needed?

--
Niel Archer
niel.archer (at) blueyonder.co.uk

16 years ago by Alexey Zakhlestin

>> Implemented functions:
>>
>> mb_ereg()
>> mb_ereg_replace()

> as ereg functions are deprecated in 5.3, are these still needed?

These have nothing in common with "those" ereg functions. These are
based on the Oniguruma regex library:
http://www.geocities.jp/kosako3/oniguruma/

--
Alexey Zakhlestin
http://www.milkfarmsoft.com/

16 years ago by Gwynne Raskind

>>> Implemented functions:
>>>
>>> mb_ereg()
>>> mb_ereg_replace()

>> as ereg functions are deprecated in 5.3, are these still needed?

> these have nothing in common with "those" ereg functions. these are
> based on oniguruma regex library
> http://www.geocities.jp/kosako3/oniguruma/

I find Oniguruma to be, in general, a pared-down and less-useful
version of the PCRE we already have. Given that PCRE has full support
for UTF-8, and that there's nothing you can do with Oniguruma that you
can't also practically do with PCRE (to the best of my knowledge), I
think it would be best for PHP to standardize on a single regexp
library, rather than offering competing and confusing options. Killing
off POSIX syntax was a step in that direction, and I see no reason not
to take the rest of the steps.

If Oniguruma were offered as a PECL extension, I would think that
perfectly reasonable, but I don't think it belongs in core.

-- Gwynne

16 years ago by Moriyoshi Koizumi

>> Implemented functions:
>>
>> mb_ereg()
>> mb_ereg_replace()

> as ereg functions are deprecated in 5.3, are these still needed?

The mb_ereg_XXX() functions have nothing to do with the plain ereg
functions. They are named so purely for historical reasons.

Moriyoshi
