After reading Rowan's last message, it feels appropriate to link to this:
"The Absolute Minimum Every Software Developer Must Know About Unicode in 
2023 (Still No Excuses!)"
I have only skimmed that article, but it looks good.  For PHP, though, 
the following answer given in the article is very important:
| with a 98% probability, it’s UTF-8.
So we still have to support non-UTF-8 encodings. Not all the world's a VAX.
Christoph
Hi,
Currently, PHP strings are binary safe (thus can store any encoding). I 
generally think of PHP strings as being an array of bytes vs. a "string" 
you are familiar with in other languages. The name is unfortunate in that 
regard, but working with them is straightforward (imagine having an actual 
array of bytes in PHP and trying to work on them).
Robert Landers 
Software Engineer 
Utrecht NL
Currently, PHP strings are binary safe (thus can store any encoding).
I generally think of PHP strings as being an array of bytes vs. a
"string" you are familiar with in other languages. The name is
unfortunate in that regard, but working with them is straightforward
(imagine having an actual array of bytes in PHP and trying to work on
them).
PHP was the first language I learned to program in, followed by 
JavaScript.
At that point, and for many years thereafter, I never thought of 
strings as anything more than chunks of human readable text.
It wasn't until I started to learn C++ that my understanding of strings 
changed. It was no longer "text", but a sequence of bytes.
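To make the "text" versus "bytes" distinction concrete, here is a minimal sketch. It assumes ext/mbstring is loaded, and the variable names are only illustrative.

```php
<?php
// "héllo" written with an escape so the bytes are unambiguous: é is U+00E9,
// which UTF-8 encodes as the two bytes 0xC3 0xA9.
$text = "h\u{00E9}llo";

$byteCount  = strlen($text);              // counts bytes
$charCount  = mb_strlen($text, 'UTF-8');  // counts code points (needs ext/mbstring)
$secondByte = bin2hex($text[1]);          // string offsets index bytes, not characters

printf("%d bytes, %d code points, byte 1 = 0x%s\n", $byteCount, $charCount, $secondByte);
// prints "6 bytes, 5 code points, byte 1 = 0xc3"
```

The same string gives two different lengths depending on which function is asked, which is exactly the surprise described above.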
The problem is, when people start out learning a higher-level language 
like PHP, they don't start by understanding what the computer actually 
stores in memory, or how data structures are represented internally.
They write:
echo "hello world!";
...and give no thought to how the letters they typed become the letters 
that come out.
Character encoding doesn't cross your mind until a spammer tries to 
paste foreign characters into a contact form and your application 
crashes.
And when you try to learn more, the most you find is advice to slap a 
few "utf-8" stickers in some places like the HTML response header, the 
charset meta tag, and to use some mb_* internal/output encoding 
functions.
I've been looking for the past few weeks now, and I've asked on some 
community groups as well, and I have been unable to find a good, 
comprehensive security-minded guide for dealing with multi-byte 
characters and character attacks in PHP.
There's general guidance from the Unicode Consortium on what should be 
done, but no guides on how to implement their security recommendations 
in PHP.
One report is: https://www.unicode.org/reports/tr36
There are several things in their guide.
They recommend that illegal byte sequences not be deleted as this can 
create an attack vector where two bytes that fit together are split by 
an illegal sequence, that, once removed, puts the two bytes back 
together to make something new, after the program has checked for 
dangerous characters:
https://www.unicode.org/reports/tr36/#SecureEncodingConversion
In PHP, you should be able to do that with:
$ScrubbedBody = mb_scrub($_POST['body'], 'UTF-8');
But there's a pitfall here!
By default, mb_scrub and several other PHP conversion functions 
replace illegal byte sequences with a ? instead of U+FFFD, the 
designated replacement character.
A question mark is an important character with special meaning, and the 
default implementation of mb_scrub will allow an attacker to put a 
? anywhere they want by inserting illegal bytes where they want a 
question mark inserted.
To get the correct behavior, a developer must know to call:
mb_substitute_character(0xFFFD); 
$ScrubbedBody = mb_scrub($_POST['body'], 'UTF-8');
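Because mb_substitute_character() changes a process-wide setting, one option is to wrap the two calls so the previous value is restored afterwards. The following is only a sketch under that assumption; scrub_utf8() is a hypothetical helper name, not an existing PHP function.

```php
<?php
/**
 * Hypothetical helper: scrub a string to valid UTF-8, replacing each illegal
 * sequence with U+FFFD instead of "?", without leaving the global
 * substitute character changed for the rest of the request.
 */
function scrub_utf8(string $input): string
{
    // With no argument this returns the current setting: an int code point,
    // or one of the strings "none", "long", "entity".
    $previous = mb_substitute_character();

    mb_substitute_character(0xFFFD);
    try {
        return mb_scrub($input, 'UTF-8');
    } finally {
        // Runs even though the try block returns, so the old setting comes back.
        mb_substitute_character($previous);
    }
}

$scrubbedBody = scrub_utf8("caf\xFF latte"); // 0xFF can never appear in UTF-8
```

Restoring the old value matters if other code in the same request relies on the default, for example when converting to an encoding that has no U+FFFD.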
There are also some Unicode Consortium recommendations on sets of 
characters that should be stripped from user input.
https://www.unicode.org/reports/tr36/#Recommendations_General
The report says:
"Private use characters must be avoided in identifiers, except in 
closed environments. There is no predicting what either the visual 
display or the programmatic interpretation will be on any given 
machine, so this can obviously lead to security problems."
They go on to say, "What is true for private use characters is doubly 
true of unassigned code points. Secure systems will not use them: any 
future Unicode Standard could assign those codepoints to any new 
character. This is especially important in the case of certification."
But how do we remove these private use characters and unassigned code 
points using PHP?
You can use mb_ereg or preg with /u to remove character ranges, 
but this is clumsy at best.
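For reference, the preg route might look roughly like the sketch below. It leans on PCRE's Unicode general categories (Co for private use, Cn for unassigned); the helper name is hypothetical, and which code points count as "unassigned" depends on the Unicode version of the PCRE library PHP was built against. Replacing with U+FFFD by default rather than deleting is my own choice here, following the same reasoning the report gives for illegal byte sequences.

```php
<?php
/**
 * Hypothetical helper: replace private-use (Co) and unassigned (Cn) code
 * points. Pass '' as $replacement to delete them instead.
 */
function replace_private_and_unassigned(string $utf8, string $replacement = "\u{FFFD}"): string
{
    // The /u modifier makes PCRE treat subject and pattern as UTF-8.
    $result = preg_replace('/[\p{Co}\p{Cn}]/u', $replacement, $utf8);

    // With /u, preg_replace() returns null when the subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $result;
}

// U+E000 is the first code point of the BMP private use area.
echo replace_private_and_unassigned("a\u{E000}b", ''), "\n"; // prints "ab"
```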
The guide warns against trying to restrict characters by language, and 
recommends using a "writing system" instead:
https://www.unicode.org/reports/tr36/#Language_Based_Security
"Creating "safe character sets" is an important goal in a security 
context, and it would appear that the characters used in a language is 
an obvious choice. However, because of the indeterminate set of 
characters used for a language, it is typically more effective to move 
to the higher level, the script, which can be more easily specified and 
tested."
While I could probably hack together an array of regular expressions 
for identifying white-listed (language) scripts, this seems like 
something that should be built-in as a single function.
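As a rough illustration of the regular-expression hack, PCRE script properties such as \p{Latin} can be combined into one character class. This is only a sketch: uses_only_scripts() is a hypothetical name, and the caller has to remember to allow Common (digits, punctuation, spaces) and Inherited (combining marks) where appropriate.

```php
<?php
/**
 * Hypothetical helper: true if every code point in $utf8 belongs to one of
 * the listed Unicode scripts. Script names are PCRE property names such as
 * "Latin", "Cyrillic", "Common", "Inherited".
 */
function uses_only_scripts(string $utf8, array $scripts): bool
{
    $class = '';
    foreach ($scripts as $script) {
        // Only letters and underscores, so nothing can escape the \p{...} wrapper.
        if (preg_match('/\A[A-Za-z_]+\z/', $script) !== 1) {
            throw new InvalidArgumentException("Bad script name: $script");
        }
        $class .= '\p{' . $script . '}';
    }

    // preg_match() returns false for invalid UTF-8 under /u, so the strict
    // comparison also rejects malformed input.
    return preg_match('/\A[' . $class . ']*\z/u', $utf8) === 1;
}

var_dump(uses_only_scripts("Hello, world", ['Latin', 'Common']));              // bool(true)
var_dump(uses_only_scripts("Hello \u{041F}\u{0440}", ['Latin', 'Common']));    // bool(false)
```

The second call fails because U+041F and U+0440 are Cyrillic letters, which is the kind of mixed-script input a whitelist is meant to catch.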
In any application that reflects text back to other users, securely 
processing incoming Unicode is as important to stopping XSS attacks as 
PDO prepared statements are to stopping SQL injection.
As for the second recommendation, removing "unassigned code points", I 
have not even started to work out how to do this with PHP.
Since Unicode presents a security concern, I think it is important that 
function behavior with regard to Unicode be well documented, and also, 
that we have some functions that are easy to use to properly handle the 
complexities of Unicode security.
One report is: https://www.unicode.org/reports/tr36
There are several things in their guide.
They recommend that illegal byte sequences not be deleted as this can
create an attack vector where two bytes that fit together are split by
an illegal sequence, that, once removed, puts the two bytes back
together to make something new, after the program has checked for
dangerous characters:
https://www.unicode.org/reports/tr36/#SecureEncodingConversion
In PHP, you should be able to do that with:
$ScrubbedBody = mb_scrub($_POST['body'], 'UTF-8');
I suggest to validate, not to sanitize.  If a malicious user submits 
illegal UTF-8, just reject the request right away.  Regular users 
shouldn't even notice this.
Christoph
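A validate-and-reject approach could be sketched as follows, assuming ext/mbstring. The function name and the commented-out request handling are hypothetical and only show where such a check might sit.

```php
<?php
/**
 * Hypothetical helper: return the input unchanged if it is valid UTF-8,
 * otherwise throw so the caller can reject the whole request.
 */
function require_valid_utf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $input;
}

// Possible use in a form handler (sketch only):
// try {
//     $body = require_valid_utf8($_POST['body'] ?? '');
// } catch (InvalidArgumentException $e) {
//     http_response_code(400);
//     exit;
// }
```

Nothing is rewritten on this path, so there is no substitute character to choose and no way for scrubbing to glue bytes back together.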
After reading Rowan's last message, it feels appropriate to link to this:
"The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)"
https://tonsky.me/blog/unicode/
Good read! Thank you for mentioning it.
A really standout paragraph from that link is:
"IMO, the whole situation is a shame. Unicode should be 
in the stdlib of every language by default. It’s the lingua 
franca of the internet! It’s not even new: we’ve been living 
with Unicode for 20 years now."
I'll just leave that right here. :-)
-Mike
P.S. Channeling Larry and Derick: https://wiki.php.net/rfc/unicode_text_processing
A really standout paragraph from that link is:
"IMO, the whole situation is a shame. Unicode should be
in the stdlib of every language by default. It’s the lingua
franca of the internet! It’s not even new: we’ve been living
with Unicode for 20 years now."
I actually think that paragraph rather ignores everything else the 
article has just explained. "Putting Unicode in the stdlib" is an 
incredibly difficult task, and it's not entirely clear what it should 
even mean.
In PHP, we have ext/intl, built around a library called ICU, developed 
by the Unicode consortium. Unfortunately, it only exposes a small 
selection of ICU's functions, e.g. there's nothing for locale-based case 
folding of whole strings. The ext/intl documentation is also very 
patchy, and the actual ICU documentation isn't always much better.
The main reason it's not mandatory for all builds of PHP, just 
"bundled", is that the sheer complexity of Unicode means that the 
library is rather large - somebody (Rasmus, I think?) joked that relying 
on it for PHP 6 would have made PHP a small library attached to the side 
of ICU.
We also have the "mbstring" extension, which was not designed around 
Unicode, but was originally built for various encodings popular in Japan 
20+ years ago. It doesn't have the databases of codepoint information 
that ICU does, so can't answer questions like "what script does this 
code point belong to?" or "what is the uppercase equivalent of this 
grapheme, assuming a Turkish locale?"
-- 
Rowan Tommins 
[IMSoP]
A really standout paragraph from that link is:
"IMO, the whole situation is a shame. Unicode should be
in the stdlib of every language by default. It’s the lingua
franca of the internet! It’s not even new: we’ve been living
with Unicode for 20 years now."
I actually think that paragraph rather ignores everything else the article has just explained.
You and I had different takeaways then.
and it's not entirely clear what it should even mean.
I cannot speak for the author of the article, but I thought I had implied strongly enough what it would mean to me. Evidently I did not, so I will be explicit:
Pursue this RFC: https://wiki.php.net/rfc/unicode_text_processing
The main reason it's not mandatory for all builds of PHP, just "bundled", is that the sheer complexity of Unicode means that the library is rather large
Let me see if I understand your argument correctly? You are asserting that Unicode is "too complex" to be handled in the standard library so that complexity should instead be shouldered individually by each and every PHP developer who needs to work with Unicode text in PHP, which is many PHP developers if not eventually most. Is that your argument?
Imagine if PHP had taken the position that "It is too complex, so we'll just make userland developers deal with it" regarding cryptography and encryption? Or regular expressions? Or image processing? Or time and date manipulation? Or network and socket programming?
"Putting Unicode in the stdlib" is an incredibly difficult task, and it's not entirely clear what it should even mean.
...
somebody (Rasmus, I think?) joked that relying on it for PHP 6 would have made PHP a small library attached to the side of ICU.
You are comparing apples and oranges.
Putting Unicode into an existing language and integrating with built-in data types in a backward compatible manner is a MUCH bigger lift than "putting Unicode into a standard library." The latter is just providing functions and/or an object and methods for the majority of tasks needed to process Unicode text.
PHP already has some functions for Unicode in the standard library as have been mentioned, but not enough to reasonably handle most Unicode text-related tasks. A Unicode text processing class with the existing RFC as a starting point could unify that functionality and fill in the missing gaps.
BTW, I have done a significant amount of work with Unicode in Go (which handles code points natively, but unfortunately not graphemes) and handling Unicode effectively is not that hard. The rules are many, but they are straightforward. Certainly it is not harder than cryptography and encryption, which PHP addresses in core.
We also have the "mbstring" extension, which was not designed around Unicode, but was originally built for various encodings popular in Japan 20+ years ago. It doesn't have the databases of codepoint information that ICU does, so can't answer questions like "what script does this code point belong to?" or "what is the uppercase equivalent of this grapheme, assuming a Turkish locale?"
Interesting historical factoid, but how is that really relevant to including Unicode into the standard library?
-Mike
Let me see if I understand your argument correctly? You are asserting that Unicode is "too complex" to be handled in the standard library so that complexity should instead be shouldered individually by each and every PHP developer who needs to work with Unicode text in PHP, which is many PHP developers if not eventually most. Is that your argument?
Not really, no. I'm definitely in favour of including more Unicode-based string handling functionality, by improving and extending ext/intl, or coming up with new convenience wrappers for common tasks.
What I'm always sceptical of is the idea that you could ever consider such functionality "complete", or that "Unicode support" can ever be a single deliverable, rather than an ongoing aspiration. (And consequently, I'm sceptical of any language which says it has achieved that.)
I also think "Unicode support" is probably the wrong angle to approach from; it leads to features like IntlChar, which technically provides access to tons of data from the Unicode standard, but practically has no use for 99% of PHP developers. Instead we should be talking about "internationalisation support", of which handling different writing systems is one (fairly big) part.
For instance, I would welcome proposals like "here's some functions for handling locale-specific case folding and normalisation-based matching", "here's some functions for limiting the storage size of a string without producing garbage characters", etc. As well as related things which aren't just about text encoding, like "here's some functions for working with locale-specific date formatting" (or even just "here's some documentation for how you're supposed to use ext/intl's date classes").
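For the "limit the storage size without producing garbage characters" case, mbstring already offers a building block in mb_strcut(), which cuts by bytes but backs off to a character boundary. The sketch below assumes ext/mbstring; note that it works on code points rather than grapheme clusters, so it can still separate a base letter from a following combining mark.

```php
<?php
// "héllo": the é (U+00E9) occupies bytes 1 and 2 of the UTF-8 string.
$text = "h\u{00E9}llo";

// substr() cuts blindly at the byte limit and can split a character in half.
$naive = substr($text, 0, 2);
var_dump(mb_check_encoding($naive, 'UTF-8'));   // bool(false)

// mb_strcut() also takes a byte length, but never ends mid-character.
$safe = mb_strcut($text, 0, 2, 'UTF-8');
var_dump($safe);                                // string(1) "h"
```

A convenience wrapper that did the same at grapheme boundaries would be the kind of task-focused function described above.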
We also have the "mbstring" extension, ...
Interesting historical factoid, but how is that really relevant to including Unicode into the standard library?
I was just summarising the current situation, to work out where we could go next. Any attempt to extend string handling functionality is likely to build on either ext/intl or ext/mbstring, so it's useful to understand how they differ.
Regards, 
Rowan Tommins 
[IMSoP]
I wanted to reply generally to this and not to any person in 
particular, as I'm the one who started the thread.
I used the rather broad title "Should All String Functions Become 
Multi-Byte Safe" because there are many smaller related topics, but my 
intention was to discuss multi-byte in general, and see if there was 
some consensus on action items that could have a more limited scope/RFC 
for that task.
My overall intent and goal was to make PHP safer against multi-byte 
attacks by providing developers with tools that could become best 
practices for dealing with user input strings, the same way we had 
mysql_real_escape_string, and then PDO prepared statements for SQL.
There's a lot of potential pitfalls for dealing with Unicode input, and 
there are some best practices per the Unicode Consortium that I'm not 
sure how to implement in PHP, and it seems that since everyone needs 
them, they might be better as a shared library in core.
For example, there should be a function that removes unassigned code 
points.
There should also be a function that removes "scripts" (as defined by 
Unicode).
We should have an easy way to remove private use code points (unless 
you're running a Star Trek fan site and really do need Klingon).
And the default replacement character for mb_scrub shouldn't be ?.
Each of these and other ideas could be part of an RFC, or we could 
brainstorm a Unicode built-in class that handles lots of the common use 
cases.
Having a team-built and audited Unicode class would benefit almost 
everyone using PHP.
Hi Nick,
My suggestion — take it or leave it — is to create a GitHub repo for your own RFCs and start writing your RFC there "in the open." Add the code for your implementation to the repo, add a discussion forum to allow really interested parties to participate, and send an invite on this list to those who are really interested to discuss, comment on the RFC, and even offer PRs.
Then when everyone participating at your repo thinks the RFC is fully-baked, bring it back to the list here to discuss.
Doing it that way, unlike just discussing on the list, gives comments made in the forum a place to be captured and converted into text and an implementation visible for everyone to see, and really motivated people can even submit PRs to your RFC in order to spread the load of writing a good RFC.
#jmtcw #fwiw
-Mike
I used the rather broad title "Should All String Functions Become
Multi-Byte Safe" because there are many smaller related topics, but my
intention was to discuss multi-byte in general
I think it was probably not the best choice, because it seems like what you're specifically interested in is mostly not about existing functions, and not particularly about encodings being more than one byte wide.
For instance, even good old 7-bit-per-character ASCII contains control characters you might want help sanitising out; and plenty of 8-bit-per-character encodings include more than one script, even more than one writing direction (e.g. ISO 8859-8 Latin/Hebrew).
But, the specific topic of safe input handling is definitely an interesting one. And focussing on Unicode, rather than every possible encoding (multibyte or not) makes sense in modern usage.
There's a lot of potential pitfalls for dealing with Unicode input, and
there are some best practices per the Unicode Consortium
It's worth looking into whether the ICU library has explicit functions to help with those recommendations (if you can navigate its slightly patchy documentation). Since most of ext/intl is just a thin wrapper on that library, that could make our lives a lot easier.
For example, there should be a function that removes unassigned code
points.
There should also be a function that removes "scripts" (as defined by
Unicode).
We should have an easy way to remove private use code points (unless
you're running a Star Trek fan site and really do need Klingon).
These all seem like good ideas. I think you can do at least some of it with regular expressions, but dedicated functions have potential to be both easier to use and more efficient.
And the default replacement character for mb_scrub shouldn't be ?.
This is trickier, and where mixing the terms "multibyte" and "Unicode" actually matters. The mbstring extension supports a number of different text encodings, most of which don't have a dedicated replacement character to use. It also has the ability to set the default in global state with mb_substitute_character() so it's not immediately obvious how a different default could be applied based on the specified encoding. (I'm not a fan of that API design, but it's what we've got!)
Each of these and other ideas could be part of an RFC, or we could
brainstorm a Unicode built-in class that handles lots of the common use
cases.
I don't think a single class that tries to "do Unicode" makes sense; it would be like having a "maths class" that contains methods for anything dealing with numbers.
In fact, I think the group of functions you're suggesting are a great illustration of what I was saying in my last message to Rob: they make perfect sense as standalone features, and don't need any grand plan to "have Unicode in core" before we proceed with them.
Regards, 
Rowan Tommins 
[IMSoP]
Let me see if I understand your argument correctly? You are asserting that Unicode is "too complex" to be handled in the standard library so that complexity should instead be shouldered individually by each and every PHP developer who needs to work with Unicode text in PHP, which is many PHP developers if not eventually most. Is that your argument?
Not really, no. I'm definitely in favour of including more Unicode-based string handling functionality, by improving and extending ext/intl, or coming up with new convenience wrappers for common tasks.
Your prior reply came across to me as a closed-end slamming-the-door on the topic because of "what we can't do," which is why I commented on it.
This most recent reply OTOH was an example of exploring "what we can do" to improve PHP. So kudos for this reply; big plus.
What I'm always sceptical of is the idea that you could ever consider such functionality "complete", or that "Unicode support" can ever be a single deliverable, rather than an ongoing aspiration. (And consequently, I'm sceptical of any language which says it has achieved that.)
To be clear I don't think anything can ever be considered "complete" unless you are talking something as limited in scope as numeric addition. And even then, supporting imaginary numbers might arise.
So while it is not a problem to be explicit about it even if redundant, having it be an underlying reason to argue against useful functionality — if it is ever used in that way — seems counter-productive. Nothing precludes a follow-up RFC after an initially successful RFC is implemented and shipped.
I also think "Unicode support" is probably the wrong angle to approach from; it leads to features like IntlChar, which technically provides access to tons of data from the Unicode standard, but practically has no use for 99% of PHP developers. Instead we should be talking about "internationalisation support", of which handling different writing systems is one (fairly big) part.
I am not sure I agree with you that adding Unicode support is the wrong angle, per se.
A strong argument could be made that Unicode support is a necessary but not sufficient building block for "internationalization support." IOW, if you want to get to the latter it is probably a lot easier to start with the former as the scope of the latter is by-nature larger. After all, perfect is the enemy of the good and waiting for a full-press effort for internationalization support could well push off Unicode support long down the road.
Still, if "full" internationalization support can be achieved in the shorter term it would be bikeshedding for me to argue against it.
-Mike
I am not sure I agree with you that adding Unicode support is the wrong angle, per se.
A strong argument could be made that Unicode support is a necessary but not sufficient building block for "internationalization support." IOW, if you want to get to the latter it is probably a lot easier to start with the former as the scope of the latter is by-nature larger. After all, perfect is the enemy of the good and waiting for a full-press effort for internationalization support could well push off Unicode support long down the road.
Again, that's not really what I intended to say, but I'm probably not expressing myself clearly.
I was thinking about the way we frame the conversation, the words we focus on, and how that shapes the conversation.
The example that keeps coming to mind is password_hash/password_verify. It seems to me that for years, the conversation was framed around "cryptographically safe hashing functions", and teaching users why and how to use powerful but confusing functions like hash() and crypt(). Then it got reframed from the point of view of a web developer wanting to implement logins, and we ended up with fantastically easy to use functions.
In the same way, I think "Unicode support" should be the awkward background work that we do because we're trying to solve specific problems involving text.
In the case of this thread, I think the actual user story is "I want to allow users to enter a wide range of characters, but restrict them in contextually appropriate ways to ensure various types of safety and security". The implementation of that involves a lot of technicalities about how Unicode works, but ideally we want to find meaningful abstractions of those technicalities, not just require every user to understand them.
(PS I think I accidentally called you Rob just now; sorry!) 
Rowan Tommins 
[IMSoP]
We are in no real disagreement there.
-Mike
P.S. I do think we could reach the same end-goal by taking either direction since Unicode support is a building block of solving specific problems involving text, and thus needs to happen either way.
But, as I implied earlier, whichever road takes us there works for me so no need for me to further bikeshed it, as long as the road we take will not result in a dead-end.
On Sat, Aug 17, 2024 at 9:17, Mike Schinkel <mike@newclarity.net> wrote:
We are in no real disagreement there.
-Mike
P.S. I do think we could reach the same end-goal by taking either direction since Unicode support is a building block of solving specific problems involving text, and thus needs to happen either way.
But, as I implied earlier, whichever road takes us there works for me so no need for me to further bikeshed it, as long as the road we take will not result in a dead-end.
Hi, internals
I added grapheme_str_split in PHP 8.4.
I added it because I thought the grapheme function family was missing it.
I think the grapheme functions still lack functionality,
and I would like to add more of them.
PHP's Unicode support is still not sufficient,
and I would like to strengthen it.
In a while, I plan to propose an RFC for Unicode functions.
Regards 
Yuya
--
Yuya Hamada (tekimen)
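For readers who have not used the grapheme functions, a small sketch of the difference between splitting by code point and by grapheme cluster follows. It assumes ext/mbstring, and it guards the grapheme_str_split() call because that function is only available from PHP 8.4 with ext/intl.

```php
<?php
// 'e' followed by U+0301 COMBINING ACUTE ACCENT:
// two code points that display as one character.
$text = "e\u{0301}";

$codePoints = mb_str_split($text, 1, 'UTF-8');
echo count($codePoints), " code points\n";          // prints "2 code points"

// Only call grapheme_str_split() where it exists (PHP 8.4+ with ext/intl).
if (function_exists('grapheme_str_split')) {
    $graphemes = grapheme_str_split($text);
    echo count($graphemes), " grapheme cluster(s)\n";
}
```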