[Discussion] Should All String Functions Become Multi-Byte Safe?

11 months ago by tim@bastelstu.be

Hi

> It seems like there's still a lot of string functions that assume that
> a character is a single byte, and these may actually work as expected
> when dealing with Latin characters, but may fail unexpectedly if a
> sequence is more than one byte.

PHP's strings are byte-strings containing arbitrary sequences of bytes.
Unless you specifically select functions that interpret the byte-strings
as something else, you get a byte-string interpretation. There is
nothing unexpected about that.
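
A minimal sketch of what the byte-string interpretation looks like in practice, assuming the mbstring extension is loaded. The variable names are illustrative only.

```php
<?php
// "é" written as an escape so the bytes are unambiguous: U+00E9 is C3 A9 in UTF-8.
$s = "\u{00E9}";

$byteLength = strlen($s);              // counts bytes
$firstByte  = bin2hex($s[0]);          // string offsets address single bytes
$charLength = mb_strlen($s, 'UTF-8');  // opt-in character interpretation

var_dump($byteLength); // int(2)
var_dump($firstByte);  // string(2) "c3"
var_dump($charLength); // int(1)
```

Nothing here is converted or validated; the mb_* call simply reads the same bytes under a different interpretation.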

> Are there any use cases for PHP where single-byte characters are
> the norm?

Dealing with binary formats.

> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.

The premise is false. Everything on the Internet is byte-strings (also
called "octet-string").

You might be interested in https://externals.io/message/119149#119149.

Best regards
Tim Düsterhus

11 months ago by Bilge

Are we going back to PHP 6?

11 months ago by Anton Smirnov

To mbstring.func_overload

11 months ago by Anton Smirnov

Hi Nick,

As a developer who often deals with binary data (like bencode, ipv6
addresses and my own hacks for multibyte arithmetic) I would prefer that
functions and syntaxes that allow me to work with bytes keep working
with bytes, not characters or code points. So the closest solution would
be separate binary/text strings, but then we have PHP6 all over again.
Maybe this time it might work in some form, who knows.
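
To illustrate the kind of code this concerns, here is a small sketch that assumes PHP was built with IPv6 support. The helper name bencode_string is made up for the example and is not taken from any library.

```php
<?php
// A bencode string is "<length in bytes>:<raw bytes>", so the length must be a byte count.
// Hypothetical helper, written only for this sketch.
function bencode_string(string $bytes): string
{
    return strlen($bytes) . ':' . $bytes;
}

// inet_pton() packs an IPv6 address into a 16-byte binary string.
$packed  = inet_pton('2001:db8::1');
$encoded = bencode_string($packed);

var_dump(strlen($packed));        // int(16)
var_dump(substr($encoded, 0, 3)); // string(3) "16:"
var_dump(bin2hex($packed));       // string(32) "20010db8000000000000000000000001"
```

If strlen() or substr() started counting characters instead of bytes, the length prefix and the slice above would no longer describe the packed data.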

> HTML 5 was adopted in 2014, over ten years ago. HTML 5 only supports
> the UTF-8 multi-byte character encoding.
>
> It seems like there's still a lot of string functions that assume that
> a character is a single byte, and these may actually work as expected
> when dealing with Latin characters, but may fail unexpectedly if a
> sequence is more than one byte.
>
> Are there any use cases for PHP where single-byte characters are
> the norm?
>
> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.
>
> The WHATWG Encoding Standard:
>
> https://encoding.spec.whatwg.org/
>
> Also, according to Mozilla, "[The meta charset] attribute declares the
> document's character encoding. If the attribute is present, its value
> must be an ASCII case-insensitive match for the string "utf-8", because
> UTF-8 is the only valid encoding for HTML5 documents."
>
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#charset

11 months ago by youkidearitai

Hi Nick

I'm confused about what "multibyte safe" means.

Normally, PHP's string type is binary:
https://www.php.net/manual/en/language.types.string.php

If you want to work with multibyte characters, you can use the mbstring functions.
(Does "multibyte safe" refer to the mbstring functions?)
https://www.php.net/manual/en/book.mbstring.php

I don't think there is a single consistent solution, because multibyte
characters raise many separate questions.

Regards
Yuya

--

Yuya Hamada (tekimen)

11 months ago by Alain D D Williams

On Mon, Aug 12, 2024 at 1:42, Anton Smirnov sandfox@sandfox.me wrote:

> I'm confused what is "multibyte safe".

I think that he means that the bytes are only valid UTF-8 sequences.

This would mean that some byte sequences would not be allowed.

-1 to this idea.

--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 https://www.phcomp.co.uk/
Parliament Hill Computers. Registration Information: https://www.phcomp.co.uk/Contact.html
#include <std_disclaimer.h>

11 months ago by Larry Garfield

Some background and history, for those not familiar...

After PHP 5.2, there was a huge effort to move PHP to using Unicode internally. It was to be released as PHP 6. Unfortunately, it ran into a whole host of problems, among them:

- It tried to use UTF-16 internally, as there were good libraries for it, but it was much, much slower than was acceptable.
- It required rewriting basically everything.
- Trying to support two string variants at the same time (because binary strings are still very useful) in almost the same syntax turned out to be, um, kinda hard.

After a number of years of work, it was eventually concluded that it was a dead end. So the non-Unicode-related bits of what would have been PHP 6 got renamed to PHP 5.3 and released to much fanfare, kicking off the PHP Renaissance Era.

When PHP 5.6+1 was released, there was a vote to decide if it should be called 6 or 7. 7 won, mainly on the grounds that a number of very stupid book publishers had released "PHP 6" books in anticipation of PHP 6's release that were now completely useless and misleading. So we skipped 6 entirely, and PHP 6-compatibility is a running joke among those who have been around a while.

Fortunately, the vast majority of single-byte strings are ASCII, and ASCII is, by design, a strict subset of UTF-8, so in practice the lack of native UTF-8 strings rarely causes an issue.

Trying to introduce Unicode strings to the language now as a native type would... probably break just as much if not more. If anything it's probably harder today than it was in 2008, because the engine and the body of existing code that must not break have both grown considerably.

A much better approach would be something like this RFC from Derick a few years ago:

https://wiki.php.net/rfc/unicode_text_processing

If you need something today, then Symfony has a user-space approximation of it:

https://symfony.com/doc/current/string.html

--Larry Garfield

11 months ago by Nick Lockheart

I think that when people think of "strings", they think of human
readable text.

I wasn't suggesting that unicode strings be a native type, but rather
that functions that have "string" in the name should be UTF-8 safe.

There's a lot of pitfalls here, and I don't think the documentation
clearly calls out which functions are OK to use with UTF-8 and which
ones may cause unexpected surprises.

The compatibility between ASCII and UTF-8 for Latin characters is both
a curse and a blessing. An application may work fine in testing, but
then break when a user submits an emoji.
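
A short sketch of that failure mode, assuming the mbstring extension is available. The input string and the truncation lengths are made up for illustration.

```php
<?php
$input = "Hi \u{1F600}"; // "Hi " is 3 bytes, U+1F600 is 4 bytes in UTF-8

// Byte-based truncation cuts through the emoji's 4-byte sequence.
$cut = substr($input, 0, 5);
var_dump(mb_check_encoding($cut, 'UTF-8'));  // bool(false)

// Code-point-based truncation keeps the sequence whole.
$safe = mb_substr($input, 0, 4, 'UTF-8');
var_dump(mb_check_encoding($safe, 'UTF-8')); // bool(true)
var_dump($safe === $input);                  // bool(true)
```

With ASCII-only test data both calls behave identically, which is why the problem tends to surface only in production.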

It seems like it would be good to have a set of functions, each for an
intended use case, that behave in accordance with their intended usage.

For example:

Math and number functions for calculations; string functions for human
readable text (which are UTF-8 safe), and byte functions for binary
processing that are binary safe.

Using the functions for certain use cases right now requires knowing
the internals of the function, where developers should be able to rely
on the name to know that it will work for a specific use case.

For many functions, the manual doesn't specify if it is safe for multi-
byte characters or not.

ltrim doesn't mention multi-byte:

https://www.php.net/manual/en/function.ltrim.php

The trim page doesn't mention it either, except there is a user
contributed note at the bottom: "Note that trim() is not aware of
Unicode points that represent whitespace (e.g., in the General
Punctuation block), except, of course, for the ones mentioned in this
page. There is no Unicode-specific trim function in PHP at the time of
writing (July 2023), but you can try some examples of trims using
multibyte strings posted on the comments for the mbstring extension:
https://www.php.net/manual/en/ref.mbstring.php".
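
As a rough illustration of the gap that note describes, the sketch below compares trim() with a regex-based alternative. The helper unicode_trim is hypothetical (it is not a PHP built-in), and the choice of \p{Z} as "whitespace" is an assumption that may not suit every application.

```php
<?php
// Hypothetical helper for this sketch; not part of PHP.
function unicode_trim(string $text): string
{
    // \p{Z} matches Unicode separator characters; /u makes PCRE read the subject as UTF-8.
    // preg_replace() returns null on invalid UTF-8, so fall back to the input unchanged.
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $text) ?? $text;
}

$padded = "\u{2003}hello\u{2003}"; // EM SPACE (U+2003) on both sides

var_dump(trim($padded) === $padded);         // bool(true): trim() leaves it untouched
var_dump(unicode_trim($padded) === 'hello'); // bool(true)
```

PHP 8.4 adds mb_trim(), mb_ltrim() and mb_rtrim() to mbstring, which may make a helper like this unnecessary on newer versions.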

So what I would propose is:

(1) All string functions should state in the official man page if they
are safe for UTF-8 or not.

(2) Functions intended for working with text should be made UTF-8 safe.

(3) Functions intended for processing binary should be added if
necessary, and should be named something like "binary" or "byte".

11 months ago by Ayesh Karunaratne

> There's a lot of pitfalls here, and I don't think the documentation
> clearly calls out which functions are OK to use with UTF-8 and which
> ones may cause unexpected surprises.
>
> The compatibility between ASCII and UTF-8 for Latin characters is both
> a curse and a blessing. An application may work fine in testing, but
> then break when a user submits an emoji.

[snip]

> (1) All string functions should state in the official man page if they
> are safe for UTF-8 or not.

https://github.com/php/doc-en is our official documentation source.
It is open source, and towards the end of each year, before the PHP
major version release, the team and contributors put a tremendous
amount of work into updating the documentation to match the latest new
features, deprecations, etc. Contributions are always welcome,
including ones that warn about certain functions not being
multi-byte safe.

> (2) Functions intended for working with text should be made UTF-8 safe.

Generally speaking, all functions that deal with strings are in fact
UTF-8 safe because UTF-8 strings are also a sequence of bytes, just
like the other strings are. The problems occur only if you try to
modify or inspect the text in a way that expects how it should be
handled as human readable text.
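
A small sketch of that distinction, assuming mbstring is loaded. The sample text is made up for illustration.

```php
<?php
$text = "na\u{00EF}ve caf\u{00E9}"; // "naïve café" in UTF-8

// Byte-wise replacement works: the needle's byte sequence only occurs where "é" occurs.
$replaced = str_replace("\u{00E9}", 'e', $text);
var_dump($replaced === "na\u{00EF}ve cafe");   // bool(true)

// Inspecting positions is where the interpretation starts to matter.
var_dump(strpos($text, 'caf'));                // int(7): byte offset, "ï" takes 2 bytes
var_dump(mb_strpos($text, 'caf', 0, 'UTF-8')); // int(6): code point offset
```

Both offsets are correct. They only cause trouble when a byte offset is passed to a function that expects a character offset, or the other way around.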

Take the text "å" for example. What is the length of the string?

strlen("a\xCC\x8A"); // 3
mb_strlen("a\xCC\x8A"); // 2
grapheme_strlen("a\xCC\x8A"); // 1

The correct length of the string above (a\xCC\x8A) is... well, all of them:

- strlen is useful if you validate the length of user input
  before saving it to a database field with a varchar limit, or to
  avoid exceeding an index length.
- mb_strlen is useful if you want to count how many Unicode
  code points are used in that string. The mbstring extension
  recognizes "\xCC\x8A" as a single code point. It will only consider
  up to 4 bytes per character, because UTF-8 limits a code point to
  4 bytes.
- grapheme_strlen counts the actual human-perceived characters
  (grapheme clusters), which is what you should really be using if you
  are formatting text for a specific length.
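
The numbers also depend on how the text is normalized. The sketch below is an illustration that assumes both the intl and mbstring extensions are installed. It measures the same visible character in decomposed form and in precomposed (NFC) form.

```php
<?php
$decomposed = "a\xCC\x8A"; // "a" followed by U+030A COMBINING RING ABOVE
$composed   = Normalizer::normalize($decomposed, Normalizer::FORM_C); // U+00E5

foreach ([$decomposed, $composed] as $s) {
    printf(
        "%s bytes=%d code_points=%d graphemes=%d\n",
        bin2hex($s),
        strlen($s),
        mb_strlen($s, 'UTF-8'),
        grapheme_strlen($s)
    );
}
// 61cc8a bytes=3 code_points=2 graphemes=1
// c3a5 bytes=2 code_points=1 graphemes=1
```

Only the grapheme count stays the same across both forms, which is one reason it is the better fit for display-oriented length checks.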

It's also important to understand and appreciate that a lot of PHP
functionality today has been there for a very long time. You can't
simply change a critical function like strpos this late in a
programming language. See the excellent reply Larry made about what
happened the time PHP tried to do exactly what you are suggesting.

Replacing all strlen calls in a code base with mb_strlen or
grapheme_strlen is not a good idea because they serve a different
requirement to strlen, and they should only be used intentionally
where necessary. The latter functions also have to inspect the strings
sequentially because UTF-8 is not fixed-length. This is quite slow and
it adds up when you process thousands of strings.

> (3) Functions intended for processing binary should be added if
> necessary, and should be named something like "binary" or "byte".

We are already doing it, just the other way around. See mb_* and
grapheme_* functions: All of them are purposefully built to support
those features, and are clearly named as such.

The rest of the functions consistently consider all strings as a
sequence of bytes.

This naming pattern is arguably the correct way, because the majority
of functions do not need to care whether the strings they deal with
need to be human-perceived characters or not. For example,
base64_encode/decode functions, file_(get|put)_contents,
pack/unpack, etc will work with any string regardless of their
UTF-8 correctness. Why should those strings need to be UTF-8 formatted
in the first place?
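
For illustration, a sketch with two bytes that are not valid UTF-8, assuming mbstring is loaded for the validity check. The byte values are arbitrary.

```php
<?php
// pack('n', ...) writes an unsigned 16-bit big-endian integer: "\xFF\xFE".
$binary = pack('n', 0xFFFE);

var_dump(mb_check_encoding($binary, 'UTF-8')); // bool(false): not valid UTF-8

// Byte-oriented functions round-trip it unchanged.
$roundTrip = base64_decode(base64_encode($binary), true);
var_dump($roundTrip === $binary);              // bool(true)
var_dump(unpack('n', $binary)[1] === 0xFFFE);  // bool(true)
```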

11 months ago by Anton Smirnov

> So what I would propose is:
>
> (1) All string functions should state in the official man page if they
> are safe for UTF-8 or not.

Reasonable but see below

> (2) Functions intended for working with text should be made UTF-8 safe.

Define precisely UTF-8 safe. Also, what about BC breaks here?

> (3) Functions intended for processing binary should be added if
> necessary, and should be named something like "binary" or "byte".

That would require renaming and deprecating most of the standard string
library, I guess no one would agree to that.

But generally they are already named differently, str* are binary, mb_*
and grapheme_* are text-oriented

11 months ago by Rowan Tommins [IMSoP]

> It seems that if everything on the Internet is multi-byte encoded now,
> then all of the PHP string functions should be multi-byte safe.

The phrase "multibyte safe" may have made sense about 30 years ago, when it was thought that a "universal character set" could just be a "wide ASCII", encoding a straightforward list of characters, just more of them.

Modern Unicode is so much more than that, because the world's writing systems don't all work the same way. Should strlen() measure bytes, code points, or graphemes? Should strtoupper() accept a locale, so it can handle cases like Turkish "dotless i" where "I" is not the uppercase of "i"? And so on, and so on.
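
A sketch of the Turkish case, assuming the mbstring extension. It shows only the default, locale-independent mapping and does not attempt a locale-aware one.

```php
<?php
$dotlessSmallI  = "\u{0131}"; // ı LATIN SMALL LETTER DOTLESS I
$dottedCapitalI = "\u{0130}"; // İ LATIN CAPITAL LETTER I WITH DOT ABOVE

// Default Unicode case mapping, with no locale involved.
var_dump(mb_strtoupper('i', 'UTF-8') === 'I');             // bool(true)
var_dump(mb_strtoupper($dotlessSmallI, 'UTF-8') === 'I');  // bool(true)

// Turkish orthography expects "i" to uppercase to "İ"; the call above cannot express that.
var_dump(mb_strtoupper('i', 'UTF-8') === $dottedCapitalI); // bool(false)
```

Two different lowercase letters collapse to the same uppercase letter here, so the default mapping loses information for Turkish text even though it is fully "multibyte safe".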

I've seen plenty of languages boast that they are "Unicode aware", but few actually engage with the question of what that means. Often they equate "character" with "code point" and stop there, which leads to results that are just as useless to most of the world as if they'd equated it with "byte".

Regards,
Rowan Tommins
[IMSoP]

11 months ago by Daniel Haber

Feels appropriate to link to this:
"The Absolute Minimum Every Software Developer Must Know About Unicode
in 2023 (Still No Excuses!)"
https://tonsky.me/blog/unicode/

11 months ago by youkidearitai

Hi, there

> Feels appropriate to link to this:
> "The Absolute Minimum Every Software Developer Must Know About Unicode
> in 2023 (Still No Excuses!)"
> https://tonsky.me/blog/unicode/

I agree with the quoted site.
However, in programming, there are times when you want to operate on
bytes, code points, or grapheme clusters.
UTF-8 can't solve everything; what matters is which of these the
program needs to work with (byte processing, character processing, etc).

Also, other character encodings are still important, mainly for CJK text.
Character sets involve many considerations.
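
As one illustration of why the encoding has to be stated explicitly, the sketch below converts the same three characters between UTF-8 and Shift_JIS. It assumes an mbstring build that supports the 'SJIS' encoding name.

```php
<?php
$utf8 = "\u{65E5}\u{672C}\u{8A9E}"; // "日本語"
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

var_dump(strlen($utf8));            // int(9): 3 bytes per character in UTF-8
var_dump(strlen($sjis));            // int(6): 2 bytes per character in Shift_JIS
var_dump(mb_strlen($sjis, 'SJIS')); // int(3): same character count once the encoding is named

// Converting back restores the original bytes.
var_dump(mb_convert_encoding($sjis, 'UTF-8', 'SJIS') === $utf8); // bool(true)
```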

Regards
Yuya

--

Yuya Hamada (tekimen)