Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:124878
Precedence: bulk
MIME-Version: 1.0
References: <8a60a5d76bf3bbdda821160c6141b45914a33b98.camel@ageofdream.com>
 <410a8188-06bf-439f-bdab-c47e73d1db70@app.fastmail.com> <c00a7b2b80e83877ef94ebdf6fb9d5c9a8f977ce.camel@ageofdream.com>
In-Reply-To: <c00a7b2b80e83877ef94ebdf6fb9d5c9a8f977ce.camel@ageofdream.com>
Date: Mon, 12 Aug 2024 05:39:20 +0700
Message-ID: <CANLbj-rRr6WEG3WuOVjmnDjERy0oy3QKF9fiRCY0RR43K0c8NQ@mail.gmail.com>
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become
 Multi-Byte Safe?
To: Nick Lockheart <lists@ageofdream.com>
Cc: internals@lists.php.net
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
From: ayesh@php.watch (Ayesh Karunaratne)

> There's a lot of pitfalls here, and I don't think the documentation
> clearly calls out which functions are OK to use with UTF-8 and which
> ones may cause unexpected surprises.
>
> The compatibility between ASCII and UTF-8 for Latin characters is both
> a curse and a blessing. An application may work fine in testing, but
> then break when a user submits an emoji.
>
> [snip]
>
> (1) All string functions should state in the official man page if they
> are safe for UTF-8 or not.


https://github.com/php/doc-en is where the source of our official
documentation lives. It is open source, and towards the end of each
year, before the major PHP version release, the team and contributors
put a tremendous amount of work into updating the documentation to
match the latest features, deprecations, etc. Contributions are always
welcome, including ones that warn about certain functions not being
multi-byte safe.

>
>
> (2) Functions intended for working with text should be made UTF-8 safe.
>

Generally speaking, all functions that deal with strings are in fact
UTF-8 safe because UTF-8 strings are also a sequence of bytes, just
like any other string. Problems occur only when you modify or inspect
the text in a way that depends on how it should be handled as
human-readable text.
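
As a rough illustration of that distinction, here is a minimal sketch.
The sample string and variable names are my own, and `preg_match('//u', ...)`
is used only as a convenient way to check whether a string is valid UTF-8:

```php
<?php
// "naïve café" written out as UTF-8 bytes (sample text, chosen for illustration).
$text = "na\xC3\xAFve caf\xC3\xA9";

// Byte-wise search/replace and splitting are fine: the needles are complete
// UTF-8 sequences, so the results are still valid UTF-8.
$replaced = str_replace("\xC3\xA9", 'e', $text); // "naïve cafe"
$words    = explode(' ', $text);                 // ["naïve", "café"]

// Byte-wise reversal treats the text as "characters" and breaks it:
// the two-byte sequences end up in the wrong order.
$reversed = strrev($text);

// preg_match() with the /u modifier returns false for invalid UTF-8 subjects.
$replacedIsValid = preg_match('//u', $replaced) === 1; // true
$reversedIsValid = preg_match('//u', $reversed) === 1; // false
```

The functions did exactly the same kind of work in both cases (moving
bytes around); only the second operation assumed something about
human-readable characters.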

Take the _text_ "å" (an "a" followed by a combining ring, U+030A) for
example. What is the length of the string?

```php
strlen("a\xCC\x8A"); // 3
mb_strlen("a\xCC\x8A"); // 2
grapheme_strlen("a\xCC\x8A"); // 1
```

The correct length of the string above (`a\xCC\x8A`) is... well, all of them:

 - `strlen` is useful if you validate the length of user input
before saving it to a database field with a `varchar` limit, or to
avoid exceeding an index length.
 - `mb_strlen` is useful if you want to count how many Unicode
code points are used in that string. The mbstring extension knows
that "\xCC\x8A" is a single code point. It only ever considers up to
4 bytes per code point, because UTF-8 encodes a code point in at most
4 bytes.
 - `grapheme_strlen` counts the actual human-perceived characters
(grapheme clusters), which is what you should really be using if you
are formatting text for a specific length.
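
The same three levels show up when you cut text to a length. A minimal
sketch, assuming the mbstring and intl extensions are both loaded (the
sample string and variable names are mine):

```php
<?php
// "a" + U+030A COMBINING RING ABOVE, followed by "b".
$text = "a\xCC\x8Ab";

// Byte-based: cuts through the middle of the two-byte combining mark,
// leaving a string that is no longer valid UTF-8.
$bytes = substr($text, 0, 2);                        // "a\xCC"
$bytesValid = mb_check_encoding($bytes, 'UTF-8');    // false

// Code-point based: valid UTF-8, but the ring is separated from its letter.
$codePoints = mb_substr($text, 0, 1, 'UTF-8');       // "a"

// Grapheme based: keeps the whole user-perceived character together.
$graphemes = grapheme_substr($text, 0, 1);           // "a\xCC\x8A"
```

None of the three results is wrong as such; they answer different
questions, which is why the choice has to be made on purpose.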

It's also important to understand and appreciate that a lot of PHP
functionality today has been there for a very long time. You can't
simply change a critical function like `strpos` this late in a
programming language. See the excellent reply Larry made about what
happened the time PHP tried to do exactly what you are suggesting.

Replacing all `strlen` calls in a code base with `mb_strlen` or
`grapheme_strlen` is not a good idea because they serve a different
purpose than `strlen`, and they should only be used intentionally
where necessary. The latter functions also have to inspect the strings
sequentially because UTF-8 is not fixed-length. This is quite slow and
it adds up when you process thousands of strings.

> (3) Functions intended for processing binary should be added if
> necessary, and should be named something like "binary" or "byte".


We are already doing it, just the other way around. See `mb_*` and
`grapheme_*` functions: All of them are purposefully built to support
those features, and are clearly named as such.

The rest of the functions consistently consider all strings as a
sequence of bytes.

This naming pattern is arguably the correct way, because the majority
of functions do not need to care whether the strings they deal with
contain human-readable text or not. For example, the
`base64_encode`/`decode` functions, `file_(get|put)_contents`,
`pack`/`unpack`, etc. work with any string regardless of its
UTF-8 validity. Why should those strings need to be UTF-8 formatted
in the first place?
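
To make that concrete, here is a small sketch with a payload that is
deliberately not valid UTF-8 (the bytes are arbitrary, picked only for
illustration):

```php
<?php
// Arbitrary binary payload: 0xFF and 0xFE never occur in valid UTF-8,
// and the string also contains a NUL byte.
$binary = "\xFF\xFE\x00\x80";

// preg_match() with /u returns false for subjects that are not valid UTF-8.
$isUtf8 = preg_match('//u', $binary) === 1;   // false

// Byte-oriented functions do not care: length and round trips are exact.
$length    = strlen($binary);                                // 4
$roundTrip = base64_decode(base64_encode($binary), true);    // same bytes back
$fields    = unpack('nfirst/nsecond', $binary);              // two 16-bit big-endian ints
```

If these functions insisted on UTF-8 input, none of this would work.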