Hi,
I have just published an initial draft of the "Unicode Text Processing"
RFC, a proposal to have performant unicode text processing always
available to PHP users, by introducing a new "Text" class.
You can find it at:
https://wiki.php.net/rfc/unicode_text_processing
I'm looking forwards to hearing your opinions, additions, and
suggestions — the RFC specifically asks for these in places.
cheers,
Derick
--
https://derickrethans.nl | https://xdebug.org | https://dram.io
Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news
mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug
Hey Derick, Hey all.
Thanks for tackling this immense topic.
I see a few challenges in the approach. My first question was: Why do we
need a new implementation of the ICU library? Creating a userland
implementation that wraps the currently existing mb-string and ICU
functions into a class that allows better usability shouldn't add that
much of a performance penalty. And including the mb-string and the intl
extension by default wouldn't hurt.
That way there would be no added maintenance burden on the core developers.
In addition to that it looked to me that there are multiple things mixed
up in this Text-class. If we want a Text-class to handle Unicode strings
in a better way, why does the string itself need to be Locale-aware? The
string itself is a collection of Unicode-Codepoints referencing
Characters and Graphemes. Does the string itself need to be aware of a
locale to aid in sorting? It needs to be aware of the internal
normalization form for character-comparison for sure. But I would rather
see a Normalizer handle normalization of the Text-content instead of the
Text-class handling that itself. Similarly I'd see the Transliteration
done by a separate class, which then looks strongly similar to the
Intl-extension. That brings me back to the question: Do we really need
a second Intl-extension in the core?
I'm ambivalent about this. On the one hand it could make some things for
sure easier. On the other hand it adds burden onto the core-developers
that could be avoided by providing the intl (and mb-string) extension by
default instead of having to add them separately, and then finding a group
of people willing to build a userland implementation.
And yes, I know the intl-extension is everything but easy to use.
Especially in the quirky edge-cases regarding Transliteration and
Normalization. But the issue usually isn't using it but finding the
appropriate documentation on the ICU page. Helping the ICU project
improve that documentation would also be a huge benefit to all those
trying to use the Intl-extension right now.
But that's just my 0.02€
Cheers
Andreas
--
,,,
(o o)
+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl |
| mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
| https://andreas.heigl.org |
+---------------------------------------------------------------------+
| https://hei.gl/appointmentwithandreas |
+---------------------------------------------------------------------+
| GPG-Key: https://hei.gl/keyandreasheiglorg |
+---------------------------------------------------------------------+
Hi
I see a few challenges in the approach. My first question was: Why do we
need a new implementation of the ICU library? Creating a userland[…]
I'm ambivalent about this. On the one hand it could make some things for
sure easier. On the other hand it adds burden onto the core-developers
that could be avoided by providing the intl (and mb-string) extension by
default instead of having to add them separately. And then find a group
of people willing to build a userland implementation.
Because a programming language needs a standard library, otherwise one
could just use JavaScript and pull in a dependency for 'is-odd' or
left-padding.
The biggest advantage this proposal has compared to ext/intl is that it
adds a new data type. If you receive a 'Text' object then you are
guaranteed to have valid Unicode/UTF-8 inside of it.
It also provides an OO API around text/string processing functionality,
which is something users have desired for quite some time already
("scalar objects").
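To make the "valid by construction" argument concrete, here is a minimal userland sketch. The class name 'ValidatedText' and the function 'shout()' are hypothetical and are not the RFC's API; the sketch only assumes that validation happens once, in the constructor, so that anything receiving the object can skip re-checking.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the proposed type; NOT the RFC's Text class.
final class ValidatedText
{
    private string $utf8;

    public function __construct(string $bytes)
    {
        // preg_match() with the /u modifier returns false on malformed UTF-8.
        if (preg_match('//u', $bytes) !== 1) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->utf8 = $bytes;
    }

    public function __toString(): string
    {
        return $this->utf8;
    }
}

// A function that type-hints on the object never needs to re-validate.
function shout(ValidatedText $text): string
{
    return (string) $text . '!';
}
```

A plain 'string' parameter cannot give that guarantee, because any byte sequence is a valid 'string'.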
The addition of a new data type is also a reason why this cannot
usefully be implemented in userland alone: It would require every
developer to standardize on a single userland implementation, as
otherwise you need bridges to convert between the different
representations of various userland libraries (or need to round-trip
through the standard 'string' type), which I consider to be a
non-starter for something as fundamental as text processing. Both
because it adds complexity and because it will kill performance.
As the RFC notes, an explicit design goal is to keep the API simple and
focused, so I don't expect much ongoing maintenance burden here.
Especially if all the heavy lifting is off-loaded to ICU. Any
convenience functionality can then be provided in userland based on
the building blocks provided by PHP itself, with the benefit that
userland libraries are going to be fully interoperable because they all
use the standard 'Text' type that is guaranteed to be available [1].
Best regards
Tim Düsterhus
[1] The 'Text' class should likely be made final, because folks might
otherwise rely on a specific userland extension, preventing actual
interoperability.
[1] The 'Text' class should likely be made final, because folks might
otherwise rely on a specific userland extension, preventing actual
interoperability.
I'm fond of final classes but in here I think it adds burden to core
developers. As you said it yourself having a Type within PHP will help
interoperability. Having this type be final will hurt interoperability
because everyone's wrapper will be different. This may lead to the
community requesting more changes to core.
Hi
[1] The 'Text' class should likely be made final, because folks might
otherwise rely on a specific userland extension, preventing actual
interoperability.
I'm fond of final classes but in here I think it adds burden to core
developers. As you said it yourself having a Type within PHP will help
interoperability. Having this type be final will hurt interoperability
because everyone's wrapper will be different. This may lead to the
community requesting more changes to core.
The wrappers may be different, but they would all be expected to provide
a method to retrieve the underlying 'Text' object that every wrapper
would know how to use, without the need to actually convert between
different representations.
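As a rough illustration of that wrapper idea: the names 'CoreText', 'SlugHelper', 'HtmlHelper' and 'unwrap()' below are made up for this sketch and do not appear in the RFC. The point is only that two unrelated wrappers can exchange the same underlying object.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a final core type; not the RFC's actual class.
final class CoreText
{
    public function __construct(private string $utf8) {}
    public function __toString(): string { return $this->utf8; }
}

// Two unrelated userland wrappers (names are illustrative only).
final class SlugHelper
{
    public function __construct(private CoreText $text) {}
    public function unwrap(): CoreText { return $this->text; }
    public function slug(): string
    {
        return strtolower(str_replace(' ', '-', (string) $this->text));
    }
}

final class HtmlHelper
{
    public function __construct(private CoreText $text) {}
    public function unwrap(): CoreText { return $this->text; }
    public function escaped(): string
    {
        return htmlspecialchars((string) $this->text, ENT_QUOTES, 'UTF-8');
    }
}

// Moving from one wrapper to the other hands over the same object;
// no conversion between representations takes place.
$slug = new SlugHelper(new CoreText('Tom & Jerry'));
$html = new HtmlHelper($slug->unwrap());
```

Both wrappers end up holding the identical 'CoreText' instance, which is the interoperability property being argued for.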
Furthermore if the class is not final, then every future (method)
addition to the class API would be a possible breaking change, if the
method is already in use in a userland subclass. For the kind of methods
you would add to the 'Text' class there usually is only one reasonable
name (there's only so many possible names for "replace", "pad", "split"
or similar). Thus any future additions would result in headaches,
because either (1) userland breaks, or (2) some less-than-great method
name needs to be used, or (3) the addition is not going to happen at
all, because (1) and (2) are both unacceptable.
Then there's also the point about subclasses not necessarily being a
drop-in replacement (either because methods are overridden if not final,
or because the immutability is violated). We already had a similar
discussion recently with regard to readonly classes, but I'm not sure a
gentleman's agreement to not create "incompatible" subclasses is strong
enough of a guarantee for something that is as basic a building
block as the 'string' type itself [1].
Best regards
Tim Düsterhus
[1] Java's 'String' class is final, I would assume for the same reasons,
see also: https://softwareengineering.stackexchange.com/q/97437.
[1] The 'Text' class should likely be made final, because folks might
otherwise rely on a specific userland extension, preventing actual
interoperability.
Yes, I intended to do this, but forgot to include it. I've updated the
RFC.
cheers,
Derick
I appreciate the work behind this RFC. Although I can't comment much on the
utility itself as I never really used ext-intl
or much of UTF-8/16 stuff,
I think PHP lacks A LOT of built-in classes to make matters simpler and
welcoming.
Sure, anybody in userland could write a class, but that leads to lots of
implementations and an overwhelming amount of choices to be made. If this
class can cover 80% of the use-case, folks can extend it to build their
remaining 20% advanced use-case and PHP becomes easier. I really look
forward to more basic utility-classes built-in and ready to go.
--
Marco Deleu
Hi
Some first remarks:
replaceText():
In the replaceText() section the description refers to the non-existent
parameters '$maxReplacements' and '$collator'. Also the second paragraph
in that section looks like a word is missing there.
getPositionOfFirstOccurrence():
I agree this is too long. How about:
- findOffset()
- findOffsetLast()
And for returnFromFirstOccurence():
- startingWith()
- startingWithLast()
firstToTitle():
I don't see how this differs from toTitle() and/or firstToUpper(). An
example would likely be helpful.
wordsToTitle():
Likewise.
The return of many methods is not explicitly listed. I'd put it
everywhere to make it 100% clear that all those functions return 'Text'.
Best regards
Tim Düsterhus
Some first remarks:
replaceText():
In the replaceText() section the description refers to a non-existent
parameters '$maxReplacements' and '$collator'. Also the second paragraph in
that section looks like a word is missing there.
I will update this. The '$collator' would be an argument to the
constructor of the Text/$search argument, not the method itself.
getPositionOfFirstOccurrence():
I agree this is too long. How about:
- findOffset()
- findOffsetLast()
And for returnFromFirstOccurence():
- startingWith()
- startingWithLast()
I have included these as suggested names. I suspect we'll get more :-)
firstToTitle():
I don't see how this differs from toTitle() and/or firstToUpper(). An example
would likely be helpful.
It doesn't differ from firstToUpper, so I will remove it.
wordsToTitle():
Likewise.
It doesn't differ either from toTitle, so I will remove it.
I have also added examples for the others.
The return of many methods is not explicitly listed. I'd put it everywhere to
make it 100% clear that all those functions return 'Text'.
I will update that.
cheers,
Derick
Hi
getPositionOfFirstOccurrence():
I agree this is too long. How about:
- findOffset()
- findOffsetLast()
And for returnFromFirstOccurence():
- startingWith()
- startingWithLast()
I have included these as suggested names. I suspect we'll get more :-)
You accidentally put 'startingWithLast' into the 'contains' section. But
thinking about the name a little more: 'At' instead of 'With' might be a
little more appropriate, because $a->startingWith($b) could also mean
'$b . $a' … which brings me to:
- How is concatenation expected to work? Will '.' be overloaded or
should folks use:
\Text::join([
    $prefix,
    '-',
    $suffix,
], '');
or similar? Perhaps an explicit \Text::concat(…) method should be provided?
- As I've noted in the discussion for List\unique: I think iterators
should not be second-class citizens to arrays. Thus I propose to change
\Text::join() to:
/** @param iterable<\Text|string> $elements */
public static function join(iterable $elements, \Text|string $separator,
string $collator = null): \Text
This would then allow stuff like:
Text::join($someText->getWordIterator(), '-')
to insert '-' in-between each word.
- Inversely: Should \Text->split() only guarantee 'iterable', instead
of 'array' as its return type?
- How are equality comparisons expected to work? Will '==' be
overloaded? Should users use 'compareWith(…) === 0'? Should an
'equals()' method be provided?
Best regards
Tim Düsterhus
Hi
getPositionOfFirstOccurrence():
I agree this is too long. How about:
- findOffset()
- findOffsetLast()
And for returnFromFirstOccurence():
- startingWith()
- startingWithLast()
I have included these as suggested names. I suspect we'll get more :-)
You accidentally put 'startingWithLast' into the 'contains' section.
Oops, fixed.
But thinking about the name a little more: 'At' instead of 'With'
might be a little more appropriate, because $a->startingWith($b) could
also mean '$b . $a' …
I've added that as a suggested name too.
which brings me to:
- How is concatenation expected to work? Will '.' be overloaded or should
folks use:
\Text::join([
    $prefix,
    '-',
    $suffix,
], '');
or similar? Perhaps an explicit \Text::concat(…) method should be provided?
I guess we can overload the . operator too, but I would still also add
an explicit method.
I've added that to the RFC.
- As I've noted in the discussion for List\unique: I think iterators should
not be second-class citizen to arrays. Thus I propose to change \Text::join()
to:
/** @param iterable<\Text|string> $elements */
public static function join(iterable $elements, \Text|string $separator,
string $collator = null): \Text
This would then allow stuff like:
Text::join($someText->getWordIterator(), '-')
to insert '-' in-between each word.
Indeed, there is no reason why it should be a simple array, and I've
updated the RFC accordingly.
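For illustration, joining over any iterable can be sketched in userland roughly as follows. The function names 'join_iterable()' and 'words()' are hypothetical, and plain UTF-8 strings stand in for 'Text' objects; a generator plays the role of getWordIterator().

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical free function mirroring the proposed join() signature,
 * operating on plain UTF-8 strings.
 *
 * @param iterable<string|Stringable> $elements
 */
function join_iterable(iterable $elements, string $separator): string
{
    $result = '';
    $first = true;
    foreach ($elements as $element) {
        if (!$first) {
            $result .= $separator;
        }
        $result .= (string) $element;
        $first = false;
    }
    return $result;
}

// Stand-in for a word iterator: yields words lazily.
function words(string $text): Generator
{
    foreach (preg_split('/\s+/u', trim($text)) as $word) {
        yield $word;
    }
}
```

Because the loop only uses foreach, arrays and generators are handled by the same code path.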
- Inversely: Should \Text->split() only guarantee 'iterable', instead of
'array' as its return type?
That is trickier, as a return value can't be a union. I suppose we could
signal that we return an iterable, but start by always returning an
array. That way if we want to expand this to an actual iterator we can,
without breaking LSP.
Before making that change, I'd like to hear some other opinions on this.
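A small sketch of what "declare iterable, return an array for now" could look like, using hypothetical helper functions rather than the RFC's split(). Both versions share a signature, and a caller that only iterates works with either.

```php
<?php
declare(strict_types=1);

// Hypothetical helper; the declared return type is the wider "iterable".
function split_words(string $text): iterable
{
    // Today: an eagerly built array.
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Possible later version with the same signature, now lazy.
function split_words_lazy(string $text): iterable
{
    foreach (preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        yield $word;
    }
}

// Callers that only rely on "iterable" work with either version.
function count_words(iterable $words): int
{
    $count = 0;
    foreach ($words as $_) {
        $count++;
    }
    return $count;
}
```

The trade-off is that callers who pass the result to array-only functions such as count() or array_map() would break once the lazy version ships, so the wider type only helps if people code against it.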
- How is equality comparisons expected to work? Will '==' be overloaded?
Should users use 'compareWith(…) === 0'? Should an 'equals()' method be
provided?
'==' will be overloaded. It's mentioned in the 'compareWith' method.
I have added an equals method, as I think that makes sense.
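The relationship between the two methods could be sketched like this. 'ComparableText' is a hypothetical class, and byte-wise strcmp() stands in for real collation, which this sketch does not attempt. Userland code cannot overload '==', so only the method form is shown.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch; byte-wise comparison stands in for real collation.
final class ComparableText
{
    public function __construct(private string $utf8) {}

    // Returns -1, 0 or 1. strcmp() is used instead of <=> on the strings,
    // because <=> would compare numeric-looking strings as numbers.
    public function compareWith(ComparableText $other): int
    {
        return strcmp($this->utf8, $other->utf8) <=> 0;
    }

    public function equals(ComparableText $other): bool
    {
        return $this->compareWith($other) === 0;
    }
}
```

With a locale-aware implementation, equals() would inherit whatever notion of equivalence compareWith() uses.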
cheers,
Derick
| As the implementation requires ICU, this would also mean that PHP will
| depend on the ICU library.
Our current stance is that a minimal PHP should be buildable without
requiring any "non-standard" libraries; this is the reason why we bundle
PCRE. If we wanted to stick with that policy, we would need to bundle
ICU, which might not be the best idea – it's generally not great to have
bundled libraries which are still maintained outside of php-src, and
especially for such huge libraries.
--
Christoph M. Becker
On Thu, Dec 15, 2022 at 4:56 PM Christoph M. Becker cmbecker69@gmx.de
wrote:
| As the implementation requires ICU, this would also mean that PHP will
| depend on the ICU library.
Our current stance is that a minimal PHP should be buildable without
requiring any "non-standard" libraries; this is the reason why we bundle
PCRE. If we wanted to stick with that policy, we would need to bundle
ICU, what might not be the best idea – it's generally not great to have
bundled libraries which are still maintained outside of php-src, and
especially for such huge libraries.
I agree with this. Bundling ICU doesn't seem like a good idea. Wouldn't it
be better to base this on something smaller that can be bundled and does the job?
For example NJS and QuickJS use their own implementations which seem to be
fine. Especially https://github.com/bellard/quickjs/blob/master/libunicode.c
seems like something that we could fork and maintain potentially.
Cheers
Jakub
On Thu, Dec 15, 2022 at 4:56 PM Christoph M. Becker cmbecker69@gmx.de
wrote:
| As the implementation requires ICU, this would also mean that PHP will
| depend on the ICU library.
Our current stance is that a minimal PHP should be buildable without
requiring any "non-standard" libraries; this is the reason why we
bundle PCRE. If we wanted to stick with that policy, we would need
to bundle ICU, what might not be the best idea – it's generally not
great to have bundled libraries which are still maintained outside
of php-src, and especially for such huge libraries.
I agree with this. Bundling ICU doesn't seem like a good idea.
Wouldn't be better to base on something smaller that can be bundled
and does the job? For example NJS and QuickJS use their own
implementations which seem to be fine. Especially
https://github.com/bellard/quickjs/blob/master/libunicode.c seems like
something that we could fork and maintain potentially.
I have no intentions of bundling ICU. That'd be a crazy thing to do.
Instead, the current proposal is to make PHP depend on libicu. I realise
that this is against our current stance, but considering that 1. most
(if not all) Linux distributions ignore our bundled libraries anyway as
per their policies; 2. libicu is pretty much available everywhere; and
3. I am not proposing to require the latest and greatest, I believe we
can safely rely on it being available.
I'm not opposed to using something other than ICU. Most of the other
Unicode-related libraries that I had a quick look at provide only a
small subset (just character properties, or graphemes), and none of
them also takes care of collation/locales and transliteration. I am also
wary about some of these libraries' development and future-proofness.
ICU won't have these problems.
cheers,
Derick
A few quick thoughts:
The constructor will also convert the given text to Unicode Canonical Form.
By this do you mean Normalization Form C (NFC)? "Unicode Canonical Form"
isn't a phrase I'm familiar with.
Assuming so, are modified texts (e.g. via join, replaceText, reverse)
re-normalized?
The constructor will also strip out a BOM (Byte-Order-Mark) character, if present.
This is also known as ZWNBSP (Zero Width No-Break Space). Will only a
leading instance be stripped? If so, how can someone search for it (or a
substring beginning with it) given that:
If an argument to any of the methods is listed as string|Text, passing in a string value will have the same semantics as replacing the passed value with new Text($string).
and all the search methods take string|Text $search.
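The concern can be reproduced with plain byte strings. The helper 'strip_leading_bom()' below is hypothetical and assumes that only a single leading U+FEFF is removed; whether the RFC's constructor behaves that way is exactly the open question.

```php
<?php
declare(strict_types=1);

const UTF8_BOM = "\xEF\xBB\xBF"; // U+FEFF encoded as UTF-8

// Hypothetical helper: strips one leading BOM only, as the question assumes.
function strip_leading_bom(string $utf8): string
{
    return str_starts_with($utf8, UTF8_BOM) ? substr($utf8, 3) : $utf8;
}

$haystack = strip_leading_bom('ab' . UTF8_BOM . 'cd'); // inner U+FEFF is kept
$needle   = strip_leading_bom(UTF8_BOM . 'cd');        // leading U+FEFF is lost

// The needle that was asked for is present in the haystack ...
$foundRaw = str_contains($haystack, UTF8_BOM . 'cd');

// ... but after the implicit conversion the needle is just "cd", so it
// also matches text that never contained U+FEFF at all.
$matchesPlain = str_contains('abcd', $needle);
```

Under that assumption a search for "U+FEFF followed by cd" silently becomes a search for "cd".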
Why is this being introduced directly into PHP core rather than first an
extension where it's easier to shake out the interface and behavior?
Obviously, hurdle one is making the ICU library a requirement for building
PHP. I'd almost make that its own milestone in this project with the
introduction of the Text class as a separate follow-on. A very casual
(IANAL) read of the ICU license doesn't seem to make this a problem, so it
may be more of a question of whether we put this on people wanting to build
PHP. ICU is pretty widely available and used, so I also don't see this as
a major stumbling block.
Question 2 is that class. I know folks have been clamoring for a String
class for some time and this actually fills that niche quite well. A part
of me wonders if we can overload it a little to provide a pseudo locale of
"binary" so that users can, optionally, treat it like a more generalized
String class in specific cases, storing a normal char* zend_string under
the hood in that case. Possibly as a specialization tree.
/* names as examples only */
interface Stringy { /* define all those APIs */ }
class Text implements Stringy { /* ... */ }
class BinaryString implements Stringy { /* ... */ }
I think you'd get a lot more buy-in from the folks who worry that UTF16 is
overhead they don't want, but who do like the idea of an OOPy string. It
also provides a migration path to avoid having to rethink byte vs grapheme
conversions up front, instead deferring that part of a migration till later.
Overall, I'm more positive on this than negative, and I eagerly await the
rest of this thread.
-Sara
Question 2 is that class. I know folks have been clamoring for a
String class for some time and this actually fills that niche quite
well. A part of me wonders if we can overload it a little to provide
a pseudo locale of "binary" so that users can, optionally, treat it
like a more generalized String class in specific cases, storing a
normal char* zend_string under the hood in that case. Possibly as a
specialization tree.
An alternative could be to just have this as an implementation detail,
in case the associated locale/collation is C/root. Then nobody needs to
worry about it, but it would mean implementing everything twice. Which
I am not too keen on, especially because we have such a wide array of
operations on strings already.
cheers,
Derick
Hi
An alternative could be to just have this as an implementation detail,
in case the associated locale/collation is C/root. Then nobody needs to
worry about it, but it would mean implementing everything twice. Which
I am not too keen on, especially because we have such a wide array of
operations on strings already.
I'd rather not see this either, because if a 'Text' object may contain
binary data, the type safety is lost and users cannot rely on "'Text'
implies valid UTF-8" (see sibling thread).
Best regards
Tim Düsterhus
Hey
I rather not see this either, because if a 'Text' object may contain
binary data, the type safety is lost and users cannot rely on "'Text'
implies valid UTF-8" (see sibling thread).
Does Text contain valid UTF-8? Or valid Unicode? As IIRC the idea was to
internally use UTF-16 as encoding.
In the end the internal encoding should be irrelevant to the user as
long as we can assert that __toString() returns a Unicode-String in a
valid encoding. And I'm with you that UTF-8 might be the best choice for
that.
Cheers
Andreas
Hi
I rather not see this either, because if a 'Text' object may contain
binary data, the type safety is lost and users cannot rely on "'Text'
implies valid UTF-8" (see sibling thread).
Does Text contain valid UTF-8? Or valid Unicode? As IIRC the idea was to
internally use UTF-16 as encoding.
In the end the internal encoding should be irrelevant to the user as
long as we can assert that __toString() returns a Unicode-String in a
valid encoding. And I'm with you that UTF-8 might be the best choice for
that.
The RFC already specifies that the inputs (__construct()) and outputs
(__toString()) must/will be UTF-8 strings in
https://wiki.php.net/rfc/unicode_text_processing#basics.
So for all intents and purposes "'Text' implies valid UTF-8" is what
this guarantees, because the internal representation will not be visible
to the user.
Best regards
Tim Düsterhus
As others have said already, thank you for taking a stab at this
important topic. I agree that it would be a really useful feature for
the language, but it's also a really difficult one to get right. Here
are my initial thoughts...
Design Process
Rather than designing the whole class "on paper", I think this really
needs to be built as a prototype, where we can build up documentation
and tests, plug variations into some real life scenarios, and have
separate discussions about different details. If we limit ourselves
initially to features already exposed by ext/intl (I think everything
proposed so far is?), a prototype doesn't even need to be an extension,
it can be in pure PHP. Then once the design is finalised, you have a
ready-made polyfill for older PHP versions, and a set of tests for the
native version :)
We might also want to do some general investigation of what other
languages and frameworks provide, and which decisions have proven good
or bad in practice.
Lossy Transforms
Automatic normalisation and stripping of BOMs seems useful, but it
immediately rules out use of this class for anything where you want to
get back what you put in. For instance, if an ORM used Text instances
for strings in data models, it would generate extra Update queries on
the database even when the string wasn't otherwise changed. I think it
would be better to make this easy but explicit.
UTF-8 on the outside, UTF-16 on the inside
I know this will be a very common combination, but it feels odd that an
application which actually wanted to work with UTF-16 would need to
perform round-trips through UTF-8 just to use this class. It should at
least be possible to specify the encoding on input and output.
Ruby takes an interesting approach where strings are tagged with their
current binary encoding, and only converted to another form if actually
required. If your input layer says "$name = new Text($_GET['name'],
'Windows-1252');" and your output layer says "echo
$name->asBytes('Windows-1252');" the overhead of converting to UTF-16
can be skipped entirely, unless something in between says "$name =
$name->wordsToUpper()". This also removes another source of lossy
transformation, since some encoding conversions aren't perfectly
reversible (e.g. the source encoding has more than one byte sequence
mapped to the same Unicode code point).
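A rough sketch of the "tag the bytes, convert only on demand" idea follows. The class 'TaggedText' and its injected converter are hypothetical; the converter here is a stub that only counts calls, where a real one might delegate to iconv() or mb_convert_encoding().

```php
<?php
declare(strict_types=1);

// Hypothetical sketch of an encoding-tagged string; not part of the RFC.
final class TaggedText
{
    /** @param callable(string, string, string): string $convert (bytes, from, to) */
    public function __construct(
        private string $bytes,
        private string $encoding,
        private $convert
    ) {}

    public function asBytes(string $encoding): string
    {
        if ($encoding === $this->encoding) {
            return $this->bytes; // same tag: hand back the original bytes
        }
        return ($this->convert)($this->bytes, $this->encoding, $encoding);
    }
}

$calls = 0;
$converter = function (string $bytes, string $from, string $to) use (&$calls): string {
    $calls++;
    // Stub: a real converter would transcode here; this one only records
    // that a conversion was requested and returns the bytes unchanged.
    return $bytes;
};

$name = new TaggedText("caf\xE9", 'Windows-1252', $converter);
$out = $name->asBytes('Windows-1252'); // no conversion needed
```

Reading the value back in its original encoding returns the input bytes untouched, which is the lossless round trip described above.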
Internationalisation
Having locale and collation as state on the object, rather than
parameters on relevant methods, feels like muddling responsibilities. It
makes it hard to reason about what exactly some of the methods will do:
Can I trust that this object will give me a sensible result from
compareWith, or has it been assigned a collation somewhere else? What
exactly will be the definition of "replace" or "contains" for this pair
of objects?
How users will work with these also needs careful thought - your first
listed design goal is "keep it simple", but under locales and
Internationalisation is the worrying sentence "This will require
extensive documentation". This is one of those places where "doing it
right" is really hard to combine with "making it easy", because language
is inherently complex, but users will expect a simple answer to "how do
I make it case-insensitive?"
Allowing other abstractions
I 100% approve of your use of grapheme clusters, rather than code
points, as the primary unit; so many implementations get that wrong.
However, when interacting with other systems, reasoning about bytes (or
sometimes even codepoints) is essential.
One function that I would really like to see, for instance, is a
grapheme-aware version of mb_strcut, to solve tasks like: "encode this
abstract Unicode string as UTF-16BE, truncated to at most 200 bytes,
without breaking apart any grapheme clusters".
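One possible shape for such a function, as a rough sketch only. It assumes valid UTF-8 input and ext/mbstring, and uses PCRE's \X (extended grapheme cluster) in place of ICU's break iterator. The function name truncateGraphemeSafe is hypothetical:

```php
<?php
// Hypothetical helper: encode $utf8 into $encoding, keeping as many whole
// grapheme clusters as fit into $maxBytes of the target encoding.
function truncateGraphemeSafe(string $utf8, string $encoding, int $maxBytes): string
{
    // \X matches one extended grapheme cluster; /u treats the subject as UTF-8.
    if (preg_match_all('/\X/u', $utf8, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $out = '';
    foreach ($matches[0] as $cluster) {
        $encoded = mb_convert_encoding($cluster, $encoding, 'UTF-8');
        if (strlen($out) + strlen($encoded) > $maxBytes) {
            break; // the next cluster would not fit as a whole
        }
        $out .= $encoded;
    }
    return $out;
}

// "e" followed by U+0301 (combining acute) is one cluster of 4 UTF-16 bytes.
$short = truncateGraphemeSafe("ae\u{301}b", 'UTF-16BE', 4);
$longer = truncateGraphemeSafe("ae\u{301}b", 'UTF-16BE', 6);
```

With a 4-byte limit only "a" survives, because the accented cluster would need bytes 3 to 6; a plain byte cut at 4 would have kept the "e" and dropped its accent.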
Thanks again for getting the ball rolling, and I look forward to helping
iterate the design.
Regards,
--
Rowan Tommins
[IMSoP]
I have just published an initial draft of the "Unicode Text Processing"
RFC, a proposal to have performant unicode text processing always
available to PHP users, by introducing a new "Text" class.
You can find it at:
https://wiki.php.net/rfc/unicode_text_processing
I'm looking forwards to hearing your opinions, additions, and
suggestions — the RFC specifically asks for these in places.
As others have said already, thank you for taking a stab at this important
topic. I agree that it would be a really useful feature for the language, but
it's also a really difficult one to get right. Here are my initial thoughts...
Design Process
Rather than designing the whole class "on paper", I think this really needs to
be built as a prototype, where we can build up documentation and tests, plug
variations into some real life scenarios, and have separate discussions about
different details. If we limit ourselves initially to features already exposed
by ext/intl (I think everything proposed so far is?), a prototype doesn't even
need to be an extension, it can be in pure PHP. Then once the design is
finalised, you have a ready-made polyfill for older PHP versions, and a set of
tests for the native version :)
I do not want a polyfill. These already exist for intl and friends. I
had no intention to design everything up front though, and it is likely
that I missed useful methods. This is not going to be right in a single
implementation.
UTF-8 on the outside, UTF-16 on the inside
I know this will be a very common combination, but it feels odd that an
application which actually wanted to work with UTF-16 would need to perform
round-trips through UTF-8 just to use this class. It should at least be
possible to specify the encoding on input and output.
I disagree. Users should not care what is used in the implementation.
It's only UTF-16 because that is what ICU's API uses. I do not want the
complexity of having different in/ex encodings. Perhaps 15 years ago
that was useful to have, but right now, everything should be UTF-8 on
the interface layer, that is, if you care about internationalisation.
Internationalisation
Having locale and collation as state on the object, rather than
parameters on relevant methods, feels like muddling responsibilities.
It makes it hard to reason about what exactly some of the methods will
do: Can I trust that this object will give me a sensible result from
compareWith, or has it been assigned a collation somewhere else? What
exactly will be the definition of "replace" or "contains" for this
pair of objects?
A locale/collator is an inherent property of Text (we're dealing with
Text here, not strings). I do need to tidy up the wording about what
locales and collations are, as I've so far used them somewhat
interchangeably.
How users will work with these also needs careful thought - your first listed
design goal is "keep it simple", but under locales and Internationalisation is
the worrying sentence "This will require extensive documentation".
This phrase is meant to mean that the format of the locale/collator
name needs extensive documentation.
One function that I would really like to see, for instance, is a
grapheme-aware version of mb_strcut, to solve tasks like: "encode this
abstract Unicode string as UTF-16BE, truncated to at most 200 bytes,
without breaking apart any grapheme clusters".
For that to work, you need a method that directly returns UTF-8
strings, and not UTF-16. In the RFC, the current subString() uses int
$length to mean grapheme clusters. Adding another method to do
something else is of course possible. I'll think about it (and have
noted it in "Open Issues").
cheers,
Derick
I do not want a polyfill. These already exist for intl and friends.
I think you misunderstood what I meant by "polyfill"; I meant in the
sense that once the real implementation gets included in, say PHP 8.3,
users needing to support, say, PHP 8.0, will have a drop-in
implementation with exactly the same interface.
Anyway, that was just an aside; my main point is that a single-page
RFC, and a single mailing list thread, are probably not sufficient to
iterate on this design. A prototype, or even just a repo with stubs
for the methods, would give us better ways to track all the different
details and ideas.
I disagree. Users should not care what is used in the implementation.
It's only UTF-16 because that is what ICU's API uses. I do not want the
complexity of having different in/ex encodings. Perhaps 15 years ago
that was useful to have, but right now, everything should be UTF-8 on
the interface layer, that is, if you care about internationalisation.
UTF-8 should definitely be the default, but I disagree that all other
encodings can simply be ignored, and that users should be punished for
using them with extra CPU time spent converting to UTF-8 and back
again. All it would need is an optional argument on a couple of
methods to specify that you want some other encoding.
A locale/collator is an inherent property of Text (we're dealing with
Text here, not strings).
Is it though? It makes some sense to say "this is a Turkish Text, so
treat 'i' specially whenever upper-casing". But is there such a thing
as a "case insensitive piece of text"?
If locale is an "inherent property", does it make sense to discard it
when joining Texts together? At the moment, Text::join([$a,
$b])->toUpper() can give a different result from
Text::join([$a->toUpper(), $b->toUpper()]). An implementation that
truly treated locale as inherent would have to track segments within a
larger Text, subject to separate locales. (Similar to how HTML allows
a lang attribute on individual elements.)
For comparisons, I don't see the value at all - if I'm sorting a list
of Texts, the sort order is a property of the sort operation, not of
the individual items. If I have a French Text, a Spanish Text, and an
English Text, there's no meaningful way to use all three sort orders
at once, and no particular reason to choose one over the others. In
the current proposal, using compareWith in a usort callback without
specifying the collation would result in unstable results, because
it's not symmetrical - $a->compareWith($b) can use a different
collation than $b->compareWith($a).
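A toy illustration of that asymmetry, using only core string functions. SketchText and its boolean flag are hypothetical stand-ins for a Text object carrying its own default collation. The real proposal would delegate to ICU collators:

```php
<?php
// Hypothetical stand-in: the flag plays the role of a per-object collation.
final class SketchText
{
    public function __construct(
        public string $value,
        public bool $caseInsensitive = false,
    ) {}

    // The receiver's own setting decides how the comparison is done.
    public function compareWith(SketchText $other): int
    {
        return $this->caseInsensitive
            ? strcasecmp($this->value, $other->value) <=> 0
            : strcmp($this->value, $other->value) <=> 0;
    }
}

$a = new SketchText('apple', caseInsensitive: true);
$b = new SketchText('Apple');

$ab = $a->compareWith($b); // $a's rule: case-insensitive, so "equal"
$ba = $b->compareWith($a); // $b's rule: byte-wise, so "Apple" sorts first

// Making the rule a property of the sort keeps every comparison consistent.
$list = [$a, $b];
usort($list, fn (SketchText $x, SketchText $y) => strcmp($x->value, $y->value));
```

Here $ab says the two are equal while $ba says they are not, which is exactly the kind of inconsistency a usort callback cannot tolerate.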
the worrying sentence "This will require extensive documentation".
This phrase is meant to mean that the format of the locale/collator
name needs extensive documentation.
I know, and I think that's a bad sign - why are we exposing this
complexity to users in a class that otherwise holds their hand at
every step of the way? I think the parameters should always be a
user-friendly collation/locale object, with the ICU strings an
optional way for experts to create such an object.
Regards,
Rowan Tommins
[IMSoP]
I do not want a polyfill. These already exist for intl and friends.
I think you misunderstood what I meant by "polyfill"; I meant in the
sense that once the real implementation gets included in, say PHP 8.3,
users needing to support, say, PHP 8.0, will have a drop-in
implementation with exactly the same interface.
I know what a polyfill is, and I still don't want to see this.
Anyway, that was just an aside; my main point is that a single-page
RFC, and a single mailing list thread, are probably not sufficient to
iterate on this design. A prototype, or even just a repo with stubs
for the methods, would give us better ways to track all the different
details and ideas.
I will certainly be prototyping some of this, but not before the general
idea has been reasonably accepted.
I disagree. Users should not care what is used in the implementation.
It's only UTF-16 because that is what ICU's API uses. I do not want
the complexity of having different in/ex encodings. Perhaps 15 years
ago that was useful to have, but right now, everything should be
UTF-8 on the interface layer, that is, if you care about
internationalisation.
UTF-8 should definitely be the default, but I disagree that all other
encodings can simply be ignored, and that users should be punished for
using them with extra CPU time spent converting to UTF-8 and back
again. All it would need is an optional argument on a couple of
methods to specify that you want some other encoding.
I know what it would entail, but I am rejecting it regardless. "Just an
optional argument on a couple of methods" increases the complexity.
A locale/collator is an inherent property of Text (we're dealing with
Text here, not strings).
Is it though? It makes some sense to say "this is a Turkish Text, so
treat 'i' specially whenever upper-casing". But is there such a thing
as a "case insensitive piece of text"?
The locale is inherent, the collator not so much. The collator as set on
a Text object is therefore more of a default. The ''replaceText''
method, and the ''Finding Text in Text'' methods all have a way to
override this default collation.
I have updated the language in the RFC to be more precise.
If locale is an "inherent property", does it make sense to discard it
when joining Texts together? At the moment, Text::join([$a,
$b])->toUpper() can give a different result from
Text::join([$a->toUpper(), $b->toUpper()]). An implementation that
truly treated locale as inherent would have to track segments within a
larger Text, subject to separate locales. (Similar to how HTML allows
a lang attribute on individual elements.)
For comparisons, I don't see the value at all - if I'm sorting a list
of Texts, the sort order is a property of the sort operation, not of
the individual items. If I have a French Text, a Spanish Text, and an
English Text, there's no meaningful way to use all three sort orders
at once, and no particular reason to choose one over the others. In
the current proposal, using compareWith in a usort callback without
specifying the collation would result in unstable results, because
it's not symmetrical - $a->compareWith($b) can use a different
collation than $b->compareWith($a).
That sounds like an argument for having a sort() method where you can
override the collator. I would however expect that most people would not
set a default collation other than "standard" on Text objects though.
And if something more clever needs to be done, this can be overridden in
all methods.
the worrying sentence "This will require extensive documentation".
This phrase is meant to mean that the format of the locale/collator
name needs extensive documentation.
I know, and I think that's a bad sign - why are we exposing this
complexity to users in a class that otherwise holds their hand at
every step of the way? I think the parameters should always be a
user-friendly collation/locale object, with the ICU strings an
optional way for experts to create such an object.
Yes, and that is why the RFC includes a ''TextCollator'' object that
does precisely that.
cheers,
Derick
I know what a polyfill is, and I still don't want to see this.
I can 100% guarantee that you will see it - as soon as this RFC is even
close to being accepted, either the Symfony project or someone else will
work on such a polyfill. But if you're not interested in writing it,
that's fair enough.
I know what it would entail, but I am rejecting it regardless. "Just an
optional argument on a couple of methods" increases the complexity.
Complexity can cut both ways: from the point of view of the user, this...
$value = new Text(UConverter::transcode($utf16_value, 'UTF-8', 'UTF-16BE'));
// ...
echo UConverter::transcode((string)$value, 'UTF-16BE', 'UTF-8');
...is significantly more complex than this:
$value = new Text($utf16_value, 'UTF-16BE');
// ...
echo $value->asBytes('UTF-16BE');
An additional problem with the long version is that UConverter is only
available if ext/intl is enabled (mb_convert_encoding is an alternative,
with the same problem) - something which was discussed at length when I
proposed removal of utf8_encode and utf8_decode.
That sounds like an argument for having a sort() method where you can
override the collator. I would however expect that most people would not
set a default collation other than "standard" on Text objects though.
And if something more clever needs to be done, this can be overridden in
all methods.
In this case, I think I'm arguing from the same angle as you are on
encodings - if setting a per-object default collator is a rare action, and
not generally useful, let's eliminate the complexity of supporting it, and
just leave the method arguments.
Yes, and that is why the RFC includes a ''TextCollator'' object that
does precisely that.
Indeed; I think mostly what I'm saying is that users should always look at
the object first, and the string format only if they really need it.
Specifically, the current proposal lists parameters as "string $collation"
implying this:
// strings can be passed directly
$a->compareWith($b, 'en-u-ks-level1');
// object is only a way of creating the string
$a->compareWith($b, (new TextCollator('en'))->setCaseInsensitive()->getCollationString());
I'm proposing instead that they be "TextCollator $collation", implying this:
// object is the normal argument
$a->compareWith($b, (new TextCollator('en'))->setCaseInsensitive());
// strings are only a way of getting an object
$a->compareWith($b, TextCollator::fromCollationString('en-u-ks-level1'));
A third option would be to make the parameters a union "TextCollator|string
$collation", implying this:
// object is supported directly
$a->compareWith($b, (new TextCollator('en'))->setCaseInsensitive());
// so are strings if you already have one for some reason
$a->compareWith($b, 'en-u-ks-level1');
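To make the third option concrete, here is a small sketch of how a method could accept either form and normalise to an object internally. SketchCollator and normaliseCollation() are hypothetical names chosen for illustration. They are not the RFC's TextCollator API, and no validation of the collation string is attempted:

```php
<?php
// Hypothetical value object wrapping a collation string.
final class SketchCollator
{
    private function __construct(private string $collation) {}

    public static function fromCollationString(string $collation): self
    {
        return new self($collation);
    }

    public function getCollationString(): string
    {
        return $this->collation;
    }
}

// A method taking "SketchCollator|string" can convert once at the boundary,
// so the rest of the implementation only ever deals with the object.
function normaliseCollation(SketchCollator|string $collation): SketchCollator
{
    return is_string($collation)
        ? SketchCollator::fromCollationString($collation)
        : $collation;
}

$fromString = normaliseCollation('en-u-ks-level1');
$fromObject = normaliseCollation($fromString);
```

The union keeps the object as the primary argument while still letting callers who already hold a string pass it through unchanged.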
Regards,
Rowan Tommins
[IMSoP]
Hi,
I have just published an initial draft of the "Unicode Text Processing"
RFC, a proposal to have performant unicode text processing always
available to PHP users, by introducing a new "Text" class.
Using "collator" and "locale" interchangeably seems imprecise. If the
input is an ICU locale string, then I think you should just call it
locale. Then the user will be armed with the correct terminology when
they go looking for more information in the ICU manual. In ICU, case
conversion and BreakIterator need a locale, not a collator.
I'm concerned about the time order of using grapheme offsets. For
example, is subString() O(N) in $offset? If the idea is to be easy to
use and performant, you don't want to have subtle algorithmic
complexity traps.
I'm probably not the target audience for this class, since I'm
generally looking for maximum flexibility, not minimum complexity. As
such, I'd like intl to have better documentation and more features.
The RFC has a family of locale-aware case conversion functions which
do not exist in intl. This was raised as an issue during the
discussion on my ASCII case conversion RFC. It would be great if intl
could get those functions too.
I think you should consider making this Text class a part of the intl
extension. You're adding a class which is similar to the classes in
that extension. In terms of data, it's like IntlChar, except it's for
strings not characters. Its constructor takes an ICU locale string,
just like IntlBreakIterator or MessageFormatter.
I can understand if you don't want to follow all the existing
conventions of the intl extension. But if that is the rationale for
the RFC, I'd like to see a discussion of the specific usability
problems with the intl extension.
-- Tim Starling
I'm concerned about the time order of using grapheme offsets. For
example, is subString() O(N) in $offset? If the idea is to be easy to
use and performant, you don't want to have subtle algorithmic
complexity traps.
This is a good point; it's certainly true of existing functions, like
grapheme_strlen(), and indeed mb_strlen(), which has to iterate variable
width code points.
Perhaps we could take advantage of having a stateful object and internally
optimise this in some way, such as caching a partial lookup table of
graphemes to byte offsets.
For instance, the table might look like this:
10: 22
20: 50
30: 70
35: 82; LAST
Then $string->subString(23, 20) would:
- take a pointer to byte 50
- pass it to the ICU grapheme iterator to skip over 3 graphemes; let's say
  that takes us to byte 58
- since 23 + 20 > 35, the rest of the string is included
- the new object could construct an offset table without examining the
  string:
7: 12 (grapheme 30 - 23; byte 70 - 58)
12: 24; LAST (grapheme 35 - 23; byte 82 - 58)
Whether this complexity would pay off in real-world scenarios, I don't
know, but if people started using this for all the text on an application,
I can see longer strings becoming a more common use case.
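The lookup table could be sketched roughly as follows. This is an illustration only: it assumes valid UTF-8 and a non-negative grapheme index, uses PCRE's \X instead of the ICU grapheme iterator, and both function names are hypothetical:

```php
<?php
// Hypothetical helpers, not part of the RFC or ext/intl.

// Record the byte offset of every $step-th grapheme, plus a final LAST entry.
function buildCheckpoints(string $utf8, int $step = 10): array
{
    preg_match_all('/\X/u', $utf8, $m, PREG_OFFSET_CAPTURE);
    $table = [];
    foreach ($m[0] as $index => [, $byteOffset]) {
        if ($index > 0 && $index % $step === 0) {
            $table[$index] = $byteOffset;
        }
    }
    $table[count($m[0])] = strlen($utf8); // LAST: total graphemes => total bytes
    return $table;
}

// Translate a grapheme index to a byte offset, starting from the nearest
// checkpoint at or before it instead of from the start of the string.
function graphemeToByteOffset(string $utf8, array $table, int $grapheme): int
{
    $startGrapheme = 0;
    $startByte = 0;
    foreach ($table as $g => $b) {
        if ($g > $grapheme) {
            break;
        }
        $startGrapheme = $g;
        $startByte = $b;
    }
    $remaining = $grapheme - $startGrapheme;
    if ($remaining === 0) {
        return $startByte;
    }
    // \G anchors the match at the checkpoint; \X{n} skips n whole clusters.
    if (preg_match('/\G\X{' . $remaining . '}/u', $utf8, $hit, 0, $startByte) !== 1) {
        return strlen($utf8); // index lies past the end of the string
    }
    return $startByte + strlen($hit[0]);
}

$text = str_repeat("e\u{301}", 25); // 25 graphemes of 3 bytes each
$table = buildCheckpoints($text);
$offset = graphemeToByteOffset($text, $table, 23);
```

Whether this beats a plain scan in practice would depend on how cheaply the underlying engine can resume from a byte offset, which this sketch does not measure.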
Regards,
Rowan Tommins
[IMSoP]
I have just published an initial draft of the "Unicode Text
Processing" RFC, a proposal to have performant unicode text
processing always available to PHP users, by introducing a new
"Text" class.
Using "collator" and "locale" interchangeably seems imprecise. If the
input is an ICU locale string, then I think you should just call it
locale. Then the user will be armed with the correct terminology when
they go looking for more information in the ICU manual. In ICU, case
conversion and BreakIterator need a locale, not a collator.
Yeah, the terms are currently used interchangeably (sort of). I will
update that. Although I really would not suggest that users look at the
ICU manual, as it's really hard to find things in it :-)
I'm concerned about the time order of using grapheme offsets. For
example, is subString() O(N) in $offset?
Yes. It would have to scan the Text.
I'm probably not the target audience for this class, since I'm
generally looking for maximum flexibility, not minimum complexity. As
such, I'd like intl to have better documentation and more features.
The RFC has a family of locale-aware case conversion functions which
do not exist in intl. This was raised as an issue during the
discussion on my ASCII case conversion RFC. It would be great if intl
could get those functions too.
AFAIK Intl can do all of these things, but yes, its documentation is
"sparse". However, that's not in scope of this RFC.
I think you should consider making this Text class a part of the intl
extension. You're adding a class which is similar to the classes in
that extension. In terms of data, it's like IntlChar, except it's for
strings not characters. Its constructor takes an ICU locale string,
just like IntlBreakIterator or MessageFormatter.
I did consider that, and rejected that idea. Intl, although powerful,
does not have an approachable API. It is also not installed or available
by default, and I am not suggesting we do that. That then means that it
doesn't fit the design goals here (having it always available).
cheers,
Derick
Hi,
I have just published an initial draft of the "Unicode Text Processing"
RFC, a proposal to have performant unicode text processing always
available to PHP users, by introducing a new "Text" class.
You can find it at:
https://wiki.php.net/rfc/unicode_text_processing
I'm looking forwards to hearing your opinions, additions, and
suggestions — the RFC specifically asks for these in places.
cheers,
Derick
I love the idea, but we may want to think about using the terms
trimStart and trimEnd instead of trimLeft and trimRight.
Obviously, when the text in the parameter is in a right-to-left
language (e.g., Hebrew, Arabic), trimLeft and trimRight are going
to do the opposite of what one may expect from the names of the
functions. I know that identifiers like ltrim and rtrim are
longstanding conventions which everyone understands to really
mean trimming from the start and end respectively, but this may be an
opportunity to amend the inaccuracy.
If this ever gets expanded to add padding functions, those should
probably be named padStart and padEnd as well.
Hi,
I have just published an initial draft of the "Unicode Text Processing"
RFC, a proposal to have performant unicode text processing always
available to PHP users, by introducing a new "Text" class.
You can find it at:
https://wiki.php.net/rfc/unicode_text_processing
I'm looking forwards to hearing your opinions, additions, and
suggestions — the RFC specifically asks for these in places.
cheers,
Derick
Derick, thank you for tackling this. It's a decidedly not-simple problem space and I'm glad someone like you is looking into it.
I'm overall in favor of the RFC, though I have some comments/pushback. First, here's my notes just as I'm reading through it:
Re Text::create, which the RFC suggests can be aliased as a function for easier use: Are you sure? Symfony has stand-alone wrapper functions. You cannot alias a static method directly to a function. cf: https://3v4l.org/V4kP2 (It would be really nice if that worked! But it doesn't seem to.)
Text::concat: What happens if different Text objects passed to that have different collations? Which gets used, the first, last, "best fit"...?
Text->wrap: " If $cutLongWords is set, no Text element will be larger than $maxWidth." Can you include an example here? I have to mentally noodle through what this means, which means it could use an example.
Text->getPositionOfFirstOccurrence() et al: These should not return false on not-found. That is an anti-pattern. That PHP's existing libraries do that is a bug, not a feature. It has caused no end of bugs. null is the correct thing to return here, especially with the new null-handling syntax we have now in PHP 8. Do not use "false" as a not-found return, ever. Another option to consider is if some of them should return an empty Text object. (That may not be the best answer, but it's one worth considering.)
Also, all of those names are very long. :-(
Text->returnFromFirstOccurence(): I much prefer startingWith(). It's half the length and just as if not more descriptive. It also implies, to me, that the $search will be included in the result, whereas startingAt(), for whatever reason, doesn't seem like it does.
Text->contains(): The header is missing the defined return type.
Comparing Text Objects: Oh, for being able to overload those operators. This would be a great use case for it. :-(
The examples in case-conversion are hard to follow, because the font of code samples is not that different from normal text. Could you perhaps multi-line them, to make it clearer where the "in" text ends and the "out" text starts?
How do toTitle() and wordsToUpper() differ? They sound like the same thing... (Please note the difference in the RFC.)
Why two methods for length? And why confuse it with "character" when the text has been very consistent about using grapheme to this point.
getCodePointCount(): I... don't understand how this is different from length, so I don't see why we'd use it. If it's kept, please include a better explanation of what it is or why I'd care.
getWordCount(): The example uses getWordIterator as a property, when I think it's supposed to be a method. Also, it's not syntax highlighted.
"The return of the iterators are effected by the text's locale." - affected, not effected.
getCharacterIterator(): Again, dropping in the word character here. Calling it getGraphemeIterator() would be terrible, of course. :-) This feels like an older part of the text that wasn't updated when most of it started using grapheme. Perhaps skip the explanation here and move a central definition of "character" to the start of the RFC? (Which could be "means the same as grapheme in this case, NOT the same as byte.")
getWordIterator(): It's not clear to me if this includes whitespace as its own Text objects. Would the string "Mr. Smith Goes to Washington" be a word iterator of ["Mr.", "Smith", "Goes", "to", "Washington"]? Or ["Mr", ".", "Smith", "Goes", "to", "Washington"]? Or ["Mr", ".", " ", "Smith", " ", "Goes", " ", "to", " ", "Washington"]? I'm not clear which is the intent here. (Feel free to steal this example for the RFC.)
getLineIterator(): I do not understand this description at all. From the name, I'd expect it to break the string at newline characters. The description seems like it's something completely different I do not understand.
getTitleIterator(): What's a title, in this context?
Transliteration section: The formatting here seems wonky and confusing. Please clean up.
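On the not-found return value raised above for getPositionOfFirstOccurrence(), a short sketch of why null composes better than false. positionOf() is a hypothetical helper built on strpos(), so it works in bytes rather than graphemes and is not the RFC's method:

```php
<?php
// Hypothetical helper: same search as strpos(), but null means "not found".
function positionOf(string $haystack, string $needle): ?int
{
    $pos = strpos($haystack, $needle);
    return $pos === false ? null : $pos;
}

$found = positionOf('abc', 'a');          // a real position, which happens to be 0
$missing = positionOf('abc', 'z');        // null
$fallback = positionOf('abc', 'z') ?? -1; // null plays well with ??

// The classic trap with false: a match at offset 0 compares loosely equal to it.
$looksMissing = (strpos('abc', 'a') == false);
```

With a ?int return type the not-found case cannot be confused with position 0 through a loose comparison against false, and the null coalescing operator gives a one-line fallback.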
Second, there is a PHP-FIG Working Group on translation. It's mostly idle at the moment, as we're waiting on the MessageFormat working group at the W3C to stabilize their next version so we can just steal it. I don't know that there's any direct overlap between this RFC and that WG, but I'm mentioning it for transparency, and to encourage people to think about how they could both be developed to play nicely together, whatever that means.
Third, is there some way to say "this string, but in some other collation?" It looks like the only way to do that is via Text::create($txt, 'new-collation') / new Text($txt, 'new-collation'). A ->withCollation('new-collation') method would be very helpful, especially as so many methods rely on the collation for things like case insensitivity. That way, we could do $txt->withCollation('case-insensitive-english')->split(',') (or similar).
Fourth, that brings me to my biggest concern. "The format of this locale/collation name needs extensive documentation." - This line scares the ever-loving crap out of me. :-) We know from experience that complex formatting strings are trivially screwed up, mistyped, or otherwise gotten wrong. Especially when they're not self-evident. ("ks" means "case-insensitive"? I would never have guessed that in a million years.) The links provided to the Unicode sites don't really illuminate anything for me.
This to me sounds like it cries out for either a builder object, an enumeration (or multiple), or some combination of those. It sounds like the TextCollator class is maybe trying to be that, but it's still under-described, especially for anyone who hasn't already used intl. It also looks like it's just producing a string, rather than, what I'd consider probably more ergonomic, using the object directly like DateTimeZone.
I'm not sure what the best answer here is since I don't know the problem space well enough, but I do know that "here's a string, GL" is insufficient. We should noodle on how to make that more ergonomic so that casual developers don't get it horribly subtly wrong. (Because you know they/we/I will on this topic.) That may also include extra utility methods like ->withCaseInsensitiveCollation() (which changes only the case sensitivity marker but leaves the rest of the collation alone), or something like that. Again, I'm not sure what the best design here is.
Everything I just said also applies to the "$transliterationString", which is mentioned in passing but no description is provided. I have no clue what the syntax for that even is.
And of course it would be off brand for me to not note the issues with a fixed set of methods on an object like this, rather than pipe-friendly functions that are innately more extensible. I know, I know, we don't have pipes yet, but I have to mention it anyway or people would be worried about me. :-)
--Larry Garfield