Newsgroups: php.internals
From: mike@newclarity.net (Mike Schinkel)
To: "Rowan Tommins [IMSoP]"
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?
Date: Fri, 16 Aug 2024 14:44:22 -0400

> On Aug 12, 2024, at 4:25 PM, Rowan Tommins [IMSoP] <imsop.php@rwec.co.uk> wrote:
>
> On 12/08/2024 17:37, Mike Schinkel wrote:
>> A really standout paragraph from that link is:
>>
>> "IMO, the whole situation is a shame. Unicode should be
>> in the stdlib of every language by default. It’s the lingua
>> franca of the internet! It’s not even new: we’ve been living
>> with Unicode for 20 years now."
>
> I actually think that paragraph rather ignores everything else the article has just explained.

You and I had different takeaways, then.

> and it's not entirely clear what it should even mean.

I cannot speak for the author of the article, but I thought I had implied strongly enough what it would mean to me. Evidently I did not, so I will be explicit:

Pursue this RFC: https://wiki.php.net/rfc/unicode_text_processing
> The main reason it's not *mandatory* for all builds of PHP, just "bundled", is that the sheer complexity of Unicode means that the library is rather large

Let me see if I understand your argument correctly. You are asserting that Unicode is "too complex" to be handled in the standard library, so that complexity should instead be shouldered individually by each and every PHP developer who needs to work with Unicode text in PHP, which is many PHP developers, if not eventually most. Is that your argument?

Imagine if PHP had taken the position that "It is too complex, so we'll just make userland developers deal with it" regarding cryptography and encryption? Or regular expressions? Or image processing? Or time and date manipulation? Or network and socket programming?

> "Putting Unicode in the stdlib" is an incredibly difficult task, and it's not entirely clear what it should even mean.
> ...
> somebody (Rasmus, I think?) joked that relying on it for PHP 6 would have made PHP a small library attached to the side of ICU.

You are comparing apples and oranges.

Putting Unicode into an existing *language* and integrating it with built-in data types in a backward-compatible manner is a MUCH bigger lift than "putting Unicode into a standard library." The latter is just providing functions and/or an object and methods for the majority of tasks needed to process Unicode text.

PHP already has some functions for Unicode in the standard library, as has been mentioned, but not enough to reasonably handle most Unicode text-related tasks. A Unicode text processing class, with the existing RFC as a starting point, could unify that functionality and fill in the gaps.

BTW, I have done a significant amount of work with Unicode in Go (which handles code points natively, but unfortunately not graphemes), and handling Unicode effectively is not *that* hard. The rules are many, but they are straightforward. Certainly it is not harder than cryptography and encryption, which PHP addresses in core.
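
To make the code point versus grapheme distinction concrete, here is a minimal Go sketch using only the standard library. The helper name `countCodePoints` and the sample string are illustrative assumptions, not anything taken from the RFC or from PHP. It shows that Go counts and iterates code points directly, while a single user-perceived character built from a base letter plus a combining mark still comes out as two code points:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countCodePoints is a hypothetical helper for this example: it returns
// the number of Unicode code points (runes) in s, not bytes or graphemes.
func countCodePoints(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	// "e" followed by U+0301 (combining acute accent). A reader sees one
	// character, but the string holds two code points in three UTF-8 bytes.
	s := "e\u0301"

	fmt.Println(len(s))             // 3 (bytes)
	fmt.Println(countCodePoints(s)) // 2 (code points)

	// Ranging over a string yields code points, never grapheme clusters.
	for offset, r := range s {
		fmt.Printf("byte offset %d: %U\n", offset, r)
	}
}
```

Grouping those two code points into one grapheme cluster takes the segmentation rules from the Unicode standard, which is the kind of functionality a standard-library text class would be expected to provide.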

> We also have the "mbstring" extension, which was *not* designed around Unicode, but was originally built for various encodings popular in Japan 20+ years ago. It doesn't have the databases of codepoint information that ICU does, so can't answer questions like "what script does this code point belong to?" or "what is the uppercase equivalent of this grapheme, assuming a Turkish locale?"

Interesting historical factoid, but how is that really relevant to including Unicode in the standard library?

-Mike