Newsgroups: php.internals
Date: Mon, 12 Aug 2024 05:39:20 +0700
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
To: Nick Lockheart
Cc: internals@lists.php.net
From: ayesh@php.watch (Ayesh Karunaratne)

> There's a lot of pitfalls here, and I don't think the documentation
> clearly calls out which functions are OK to use with UTF-8 and which
> ones may cause unexpected surprises.
>
> The compatibility between ASCII and UTF-8 for Latin characters is both
> a curse and a blessing. An application may work fine in testing, but
> then break when a user submits an emoji.
>
> [snip]
>
> (1) All string functions should state in the official man page if they
> are safe for UTF-8 or not.

https://github.com/php/doc-en is where our official documentation source lives. It is open source, and often towards the end of the year, before the PHP major version release, the team and contributors put a tremendous amount of work into updating the documentation to match the latest new features, deprecations, etc. Contributions are always welcome, including ones that warn about certain functions not being multi-byte safe.

> (2) Functions intended for working with text should be made UTF-8 safe.

Generally speaking, all functions that deal with strings are in fact UTF-8 safe, because UTF-8 strings are also a sequence of bytes, just like any other string. Problems occur only when you modify or inspect a string in a way that depends on how it reads as human-readable text.

Take the _text_ "å" for example, written here as the letter `a` followed by the combining ring above (U+030A, bytes `\xCC\x8A`). What is the length of the string?

```php
strlen("a\xCC\x8A"); // 3
mb_strlen("a\xCC\x8A"); // 2
grapheme_strlen("a\xCC\x8A"); // 1
```

The correct length of the string above (`a\xCC\x8A`) is... well, all of them:

- `strlen` is useful if you validate the length of user input before saving it to a database field with a `varchar` limit, or to avoid exceeding an index length.
- `mb_strlen` is useful if you want to count how many Unicode code points are used in that string.
  The mbstring extension recognizes that "\xCC\x8A" is a single code point. However, it will only consider up to 4 bytes per character, because UTF-8 encodes a code point in at most 4 bytes.
- `grapheme_strlen` counts the actual human-perceived characters (grapheme clusters), which is what you should really be using if you are formatting text for a specific length.

It's also important to understand and appreciate that a lot of PHP functionality today has been there for a very long time. You can't simply change the behavior of a critical function like `strpos` this late in a programming language's life. See the excellent reply Larry made about what happened the time PHP tried to do exactly what you are suggesting.

Replacing all `strlen` calls in a code base with `mb_strlen` or `grapheme_strlen` is not a good idea, because they serve a different requirement from `strlen` and should only be used intentionally, where necessary. The latter functions also have to inspect strings sequentially, because UTF-8 is a variable-length encoding. This is comparatively slow, and it adds up when you process thousands of strings.

> (3) Functions intended for processing binary should be added if
> necessary, and should be named something like "binary" or "byte".

We are already doing it, just the other way around. See the `mb_*` and `grapheme_*` functions: all of them are purpose-built to support those features, and are clearly named as such. The rest of the functions consistently treat all strings as a sequence of bytes.

This naming pattern is arguably the correct way around, because the majority of functions do not need to care whether the strings they deal with are made of human-perceived characters or not. For example, the `base64_encode`/`base64_decode` functions, `file_(get|put)_contents`, `pack`/`unpack`, etc. will work with any string regardless of its UTF-8 correctness. Why should those strings need to be valid UTF-8 in the first place?
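
As a minimal sketch of that last point, assuming nothing beyond core PHP, the snippet below pushes a deliberately malformed UTF-8 byte sequence through a few byte-oriented functions and checks that every byte survives. The variable names and the sample bytes are made up for illustration, and the `preg_match('//u', ...)` call is just one core-only way to test UTF-8 validity.

```php
<?php
// Illustrative input: the bytes 0xFF and 0xFE never occur in well-formed UTF-8.
$blob = "\xFF\xFE\x00binary";

// With the /u modifier, preg_match() returns false on malformed UTF-8,
// so "=== 1" is only true for valid UTF-8 input.
$isValidUtf8 = preg_match('//u', $blob) === 1;

// Byte-oriented functions do not care about the encoding at all.
$viaBase64 = base64_decode(base64_encode($blob), true);
$viaPack   = pack('C*', ...array_values(unpack('C*', $blob)));

var_dump($isValidUtf8);          // bool(false)
var_dump($viaBase64 === $blob);  // bool(true)
var_dump($viaPack === $blob);    // bool(true)
var_dump(strlen($blob));         // int(9), a plain byte count
```

None of these calls needed the input to be text, which is why making them encoding-aware would add cost without adding correctness.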