iconv vs. mbstring

5 years ago by Christoph M. Becker

Hi all,

we still have 2 bundled extensions for working with strings in different
encodings: ext/mbstring and ext/iconv. While working on bug #79200[1],
I've noticed that the implementation of many of the iconv_*() functions
is rather suboptimal. This is mostly because iconv() is meant only for
character encoding conversion. ext/iconv layers several other useful
string functions on top of it, but cannot really optimize them, because
the extension knows nothing about the character encodings involved.

For instance, iconv_strlen() is basically implemented by converting the
input string to UCS-4, and then simply counting the UCS-4 characters. On
the other hand, mb_strlen() makes use of length tables (where
appropriate), and as such does not even need to convert the string in
many typical cases. Some quick benchmarks on getting the string length
of UTF-8 strings show that mb_strlen() is roughly 10 times faster than
iconv_strlen(). Now it would be trivially possible to improve the
iconv_strlen() implementation by converting a larger number of
characters in one go (instead of currently up to two only[2]), which
would make the function much faster (roughly 3 to 4 times for a 1024
character buffer), but still mb_strlen() would obviously beat that.
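
To make the difference concrete, here is a minimal PHP sketch of the two counting strategies described above. The helper names are hypothetical and this is not the actual C code of either extension: one function walks the string using a lead-byte length table and never converts anything, the other converts to UCS-4 and counts 4-byte units. It assumes valid UTF-8 input and that ext/iconv is loaded.

```php
<?php
// Hypothetical helper: byte width of a UTF-8 character, derived from its
// lead byte. This plays the role of a "length table".
function utf8_char_width(int $leadByte): int
{
    if ($leadByte >= 0xF0) {
        return 4;
    }
    if ($leadByte >= 0xE0) {
        return 3;
    }
    if ($leadByte >= 0xC0) {
        return 2;
    }
    return 1; // ASCII (or a stray continuation byte, counted as one unit)
}

// Count characters by hopping from lead byte to lead byte; no conversion.
function utf8_length_by_table(string $s): int
{
    $count = 0;
    $n = strlen($s);
    for ($i = 0; $i < $n; $i += utf8_char_width(ord($s[$i]))) {
        $count++;
    }
    return $count;
}

// Count characters by converting to UCS-4 first, then dividing the byte
// length by 4 (roughly the idea attributed to iconv_strlen() above).
function utf8_length_by_conversion(string $s): int
{
    $ucs4 = iconv('UTF-8', 'UCS-4LE', $s);
    if ($ucs4 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return intdiv(strlen($ucs4), 4);
}

$text = "Grüße aus Köln";
var_dump(utf8_length_by_table($text) === utf8_length_by_conversion($text));
```

The second variant has to allocate and fill a buffer four times the character count before it can answer, which is the overhead the table-based walk avoids.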

The situation for the other iconv_*() functions is similar, more or
less. However, it seems that iconv() can be much faster than
mb_convert_encoding(). Quick benchmarks show a factor of 2 to 3.
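
For anyone who wants to reproduce such a comparison, a rough timing harness could look like the sketch below. The helper time_calls(), the sample text and the iteration count are my own assumptions, not the benchmark used for the numbers above, and results will vary with the iconv implementation and PHP version.

```php
<?php
// Hypothetical helper: run $fn $iterations times and return elapsed
// wall-clock time in milliseconds (hrtime() needs PHP >= 7.3).
function time_calls(callable $fn, int $iterations): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e6;
}

// Sample input; every character here also exists in ISO-8859-1.
$input = str_repeat("Grüße aus Köln. ", 1000);

$viaIconv = function () use ($input) {
    return iconv('UTF-8', 'ISO-8859-1', $input);
};
$viaMbstring = function () use ($input) {
    return mb_convert_encoding($input, 'ISO-8859-1', 'UTF-8');
};

printf("iconv():               %.2f ms\n", time_calls($viaIconv, 1000));
printf("mb_convert_encoding(): %.2f ms\n", time_calls($viaMbstring, 1000));
```

Note the different argument order: iconv() takes the source encoding first, mb_convert_encoding() takes the target encoding first.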

So I wonder whether we would be better off unbundling ext/iconv, but
moving the iconv() function (and possibly the convert.iconv.* stream
filter) into ext/standard. It shouldn't be hard to update code which
uses any of the iconv_*() functions to use the respective mb_*()
functions, and users who can't do this, or don't want to for whatever
reason, could still use the iconv package available from PECL. However,
users who switch to mbstring would likely get better performance for
their applications.
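
As a rough illustration of what such an update might involve, the sketch below puts a few iconv_*() calls next to their mbstring counterparts. The sample string is made up, and this only covers well-formed input; behaviour on malformed input may differ between the two extensions and would need checking case by case.

```php
<?php
$s   = "Grüße aus Köln";
$enc = 'UTF-8';

// Length: same argument order in both extensions.
$lenIconv = iconv_strlen($s, $enc);
$lenMb    = mb_strlen($s, $enc);

// Substring: same argument order in both extensions.
$subIconv = iconv_substr($s, 0, 5, $enc);
$subMb    = mb_substr($s, 0, 5, $enc);

// First occurrence: same argument order in both extensions.
$posIconv = iconv_strpos($s, 'Köln', 0, $enc);
$posMb    = mb_strpos($s, 'Köln', 0, $enc);

// Last occurrence: iconv_strrpos() has no offset parameter, while
// mb_strrpos() expects one before the encoding.
$rposIconv = iconv_strrpos($s, 'ü', $enc);
$rposMb    = mb_strrpos($s, 'ü', 0, $enc);

var_dump($lenIconv === $lenMb, $subIconv === $subMb);
var_dump($posIconv === $posMb, $rposIconv === $rposMb);
```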

For core developers, that would obviously save the time currently spent
maintaining both extensions.

For users learning PHP, and also for new code, it would be beneficial to
not have to decide which of these extensions to use; if they need
character encoding conversion, iconv() would be preferable; for more
general string functionality, it would be ext/mbstring.

Thoughts?

[1] https://bugs.php.net/79200
[2] https://github.com/php/php-src/blob/php-7.4.3/ext/iconv/iconv.c#L714

--
Christoph M. Becker

5 years ago by Aleksander Machniak

> For users learning PHP, and also for new code, it would be beneficial to
> not have to decide which of these extensions to use; if they need
> character encoding conversion, iconv() would be preferable; for more
> general string functionality, it would be ext/mbstring.

In my experience, iconv does not support all the charsets that mbstring
does, e.g. UTF7-IMAP or ISO-2022-JP-MS.

Also, I have a case in which iconv_* functions were much, much slower
than mbstring. See the wordwrap implementation in
https://github.com/roundcube/roundcubemail/blob/master/program/lib/Roundcube/rcube_mime.php#L589

I did not do any performance comparison of the iconv() function itself,
and I'm not sure it should be considered preferable. I saw a lot of
performance improvements in mbstring in the last year or so. Does anyone
have a performance comparison for charset conversion cases?

--
Aleksander Machniak
Kolab Groupware Developer [https://kolab.org]
Roundcube Webmail Developer [https://roundcube.net]

PGP: 19359DC1 # Blog: https://kolabian.wordpress.com

5 years ago by Rowan Tommins

> Also, I have a case in which iconv_* functions were much much slower
> than mbstring. See wordwrap implementation in
> https://github.com/roundcube/roundcubemail/blob/master/program/lib/Roundcube/rcube_mime.php#L589

Either I misread, or you did: I thought that's exactly what Christoph was
saying, that iconv_* functions will be slower because they basically do
convert-process-unconvert rather than having implementations for each
encoding. So mb_strlen will always be faster than iconv_strlen.

iconv() vs mb_convert_encoding() doesn't have the same penalty. It seems
quite plausible that a library dedicated to converting charsets would be
more optimised for that job than a single function in a larger library
mainly focussed on working with one charset at a time.

Regards,

Rowan Tommins
[IMSoP]

5 years ago by Christoph M. Becker

>> Also, I have a case in which iconv_* functions were much much slower
>> than mbstring. See wordwrap implementation in
>> https://github.com/roundcube/roundcubemail/blob/master/program/lib/Roundcube/rcube_mime.php#L589
>
> Either I misread, or you did: I thought that's exactly what Christoph was
> saying, that iconv_* functions will be slower because they basically do
> convert-process-unconvert rather than having implementations for each
> encoding. So mb_strlen will always be faster than iconv_strlen.
>
> iconv() vs mb_convert_encoding() doesn't have the same penalty. It seems
> quite plausible that a library dedicated to converting charsets would be
> more optimised for that job than a single function in a larger library
> mainly focussed on working with one charset at a time.

I should add that all my testing was done solely with a current libiconv
version (on Windows); I don't know how libc's iconv() and other
implementations might perform.

--
Christoph M. Becker

5 years ago by Dan Ackroyd

> for more
> general string functionality, it would be ext/mbstring.
>
> Thoughts?

Related to this discussion, please could someone remind me why the
mbstring extension is an extension and not part of core PHP?

I realise at the time it was introduced, UTF-8 was far less widely
used: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg

But now UTF-8 is pretty much the default for the vast majority of
projects, so does that decision to keep it as an optional extension
still hold up?

cheers
Dan
Ack

5 years ago by Rowan Tommins

> Related to this discussion, please could someone remind me why the
> mbstring extension is an extension and not part of core PHP?
>
> I realise at the time it was introduced, UTF-8 was far less widely
> used: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg
>
> But now UTF-8 is pretty much the default for the vast majority of
> projects, so does that decision to keep it as an optional extension
> still hold up?

From what I can make out, mbstring was not actually built for Unicode
string-handling, but for what we would now consider "legacy encodings".
Its original niche seems to have been support for various Japanese text
encodings, and UTF-8 support was added relatively late.

That has some implications for its design:

- every function takes encoding as a parameter, and defaults to a
  run-time global setting
- on the other hand, there is no support for locales in functions which
  would benefit, e.g. mb_convert_case, mb_stripos
- Unicode is treated as just another character encoding, so there is no
  support for concepts like normalisation, graphemes, character
  properties, etc
- instead, there are lots of niche functions for CJK languages like
  mb_convert_kana and mb_strwidth
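
A small sketch of the first point, assuming ext/mbstring is loaded: the same call gives different answers depending on the global default, unless the encoding is passed explicitly. The variable names are only for illustration.

```php
<?php
$s = "é"; // two bytes in UTF-8: 0xC3 0xA9

$previous = mb_internal_encoding(); // remember the global default

mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($s);            // bytes read as one UTF-8 character

mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($s);          // same bytes read as two characters

$explicit = mb_strlen($s, 'UTF-8'); // explicit encoding ignores the global

mb_internal_encoding($previous);    // restore the global state

var_dump($asUtf8, $asLatin1, $explicit);
```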

It also includes some things which probably wouldn't pass review if
proposed today:

- a lot of global state, with combined get-or-set functions like
  mb_detect_order(), mb_substitute_character(), etc
- mb_send_mail seems oddly specific, and has its own concept of
  "language" not shared by anything else
- there's an entire regex implementation, with its own API and some
  compatibility with the removed ereg_* functions; I believe the preg_*
  functions included in core already support UTF-8

For handling of Unicode, ext/intl is generally superior, with a more
structured API based on Unicode-specific concepts, rather than
attempting to map them to concepts used in older character encodings.
There may be a need for a more user-friendly subset of this (a "UString"
class is a common suggestion), but it shouldn't look like ext/mbstring,
IMHO.
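
To illustrate the kind of Unicode-specific concept meant here, the following sketch compares code point counting with grapheme counting and normalisation. It assumes both ext/mbstring and ext/intl are loaded; the sample string is made up.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: renders as one "é".
$decomposed = "e\u{0301}";

// mbstring counts code points, so the combining mark is a second "character".
$codePoints = mb_strlen($decomposed, 'UTF-8');

// ext/intl counts user-perceived characters (grapheme clusters).
$graphemes = grapheme_strlen($decomposed);

// ext/intl can also normalise to the precomposed form (NFC).
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

var_dump($codePoints, $graphemes, $composed === "\u{00E9}");
```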

I believe both extensions require fairly large external libraries, which
probably justifies them being optional. From what I've read, ICU, which
ext/intl is built on, would have been bundled with PHP 6, but its size
and performance contributed to the failure of that project.

Regards,

--
Rowan Tommins (né Collins)
[IMSoP]

5 years ago by Tomas Kuliavas

2020.03.08 16:08 Dan Ackroyd wrote:

> On Tue, 3 Mar 2020 at 22:17, Christoph M. Becker cmbecker69@gmx.de
> wrote:
>
>> for more
>> general string functionality, it would be ext/mbstring.
>>
>> Thoughts?
>
> Related to this discussion, please could someone remind me why the
> mbstring extension is an extension and not part of core PHP?
>
> I realise at the time it was introduced, UTF-8 was far less widely
> used: https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg
>
> But now UTF-8 is pretty much the default for the vast majority of
> projects, so does that decision to keep it as an optional extension
> still hold up?

A majority is not 100%. Your German folders in webmail are not in UTF-8.
Even if an application operates in UTF-8, it must still be able to work
with data that is not UTF-8, and the PHP functions that changed their
default from ASCII/ISO-8859-1 to UTF-8 were not designed to operate in
the real world.

The reason why the mbstring functions are not the base string functions
is the same reason why PHP 6 does not exist.

--
Tomas
