Hey,
I can't find any recent discussion of this topic on this mailing list; I
think the closest one is
http://grokbase.com/t/php/php-internals/143b6aevsp/unicode-strings. I
was also reading articles like this one:
http://www.infoworld.com/article/2618358/application-development/php-5-4-emerges-from-the-collapse-of-php-6-0.html
The latter refers to difficulties like "excess memory usage" and having
to "rewrite the language". I'm developing an open-source Unicode
implementation library (nunicode); it doesn't consume any heap at all,
and it works on native binary strings, as PHP does. Hence I think it
could perhaps help with at least these two problems.
But I can hardly tell whether my work is even applicable here. My
library is a rather pragmatic implementation: it conforms to Unicode 7.0
and ISO/IEC 14651, but it does not implement the whole Unicode
specification.
I would appreciate it if someone could point me to a good read or
explain the collective opinion on this topic. I'm basically interested
in the following questions:
- Is there a need for more Unicode support in PHP?
- What is currently missing in that regard?
- Is this a good place to ask such questions?
Thanks.
> The latter refers to difficulties like "excess memory usage" and
> having to "rewrite the language". I'm developing an open-source
> Unicode implementation library (nunicode); it doesn't consume any heap
> at all, and it works on native binary strings, as PHP does. Hence I
> think it could perhaps help with at least these two problems.
On the face of it, this implies a rather large performance hit and a
tendency to overflow the stack much more readily; do you have any
details on these elements?
> But I can hardly tell whether my work is even applicable here. My
> library is a rather pragmatic implementation: it conforms to Unicode
> 7.0 and ISO/IEC 14651, but it does not implement the whole Unicode
> specification.
>
> I would appreciate it if someone could point me to a good read or
> explain the collective opinion on this topic. I'm basically interested
> in the following questions:
The only additional thing I can find quickly is something Pierre put
together earlier this year, when PHP6 (now 7) discussions were
started:
https://wiki.php.net/ideas/php6/unicode
> - Is there a need for more Unicode support in PHP?
> - What is currently missing in that regard?
> - Is this a good place to ask such questions?
My personal view on questions 1 and 2 is "no" and "nothing"
respectively, but I think this is not a popular opinion (and those
answers are a vast oversimplification of the issues).
This is certainly a good place to ask those questions, though.
Chris,
>> The latter refers to difficulties like "excess memory usage" and
>> having to "rewrite the language". I'm developing an open-source
>> Unicode implementation library (nunicode); it doesn't consume any
>> heap at all, and it works on native binary strings, as PHP does.
>> Hence I think it could perhaps help with at least these two problems.
>
> On the face of it, this implies a rather large performance hit and a
> tendency to overflow the stack much more readily; do you have any
> details on these elements?
I can't really tell whether the hit is going to be large before I
understand, at least approximately, what the final result would be.
I can tell that the internal complexity of nunicode is O(1) everywhere.
I'm comparing its performance with ICU, and nunicode mostly outperforms
it. I've compiled some numbers here:
https://bitbucket.org/alekseyt/nunicode#markdown-header-performance-considerations
Regarding the stack, I'm not sure I get the point. As far as I'm
concerned, the library does not make recursive calls, it does not have
an internal representation, and it does not allocate on the stack
aggressively. Everything works on immutable binary strings; the stack
will be used mostly for function calls.
But honestly, I feel like I'm not answering your question at all. Could
you possibly clarify it?
>> I would appreciate it if someone could point me to a good read or
>> explain the collective opinion on this topic. I'm basically
>> interested in the following questions:
>
> The only additional thing I can find quickly is something Pierre put
> together earlier this year, when PHP6 (now 7) discussions were
> started:
> https://wiki.php.net/ideas/php6/unicode
Thank you, this is exactly what I was looking for.
I would appreciate it if someone could comment on the following:
> Some of the key points we need to take care of are:
>
> - UTF-8 storage
> - UTF-8 support for almost all (if not all) existing string APIs
> - Performance
>
> As of today, I did not find any library covering at least two of
> these key points.
I think I could claim that nunicode covers at least two of these key
points, maybe all of them, but I'm not sure about point 2. The API does
include operations on strings, but it simply follows the standard string
functions (UTF equivalents of strcoll(), strchr(), strstr(), etc.). Does
that sound good or not?
> Chris,
>
> I can tell that the internal complexity of nunicode is O(1)
> everywhere. I'm comparing its performance with ICU, and nunicode
> mostly outperforms it. I've compiled some numbers here:
> https://bitbucket.org/alekseyt/nunicode#markdown-header-performance-considerations
Great, thanks for this.
> Regarding the stack, I'm not sure I get the point. As far as I'm
> concerned, the library does not make recursive calls, it does not have
> an internal representation, and it does not allocate on the stack
> aggressively. Everything works on immutable binary strings; the stack
> will be used mostly for function calls.
>
> But honestly, I feel like I'm not answering your question at all.
> Could you possibly clarify it?
My apologies, this was a case of typing before thinking properly. I was
envisaging very large stack frames due to large char arrays being
allocated on the stack, but when I actually apply my brain to what you
are doing, I realise that this isn't going to be the case.
Carry on.
> I would appreciate it if someone could point me to a good read or
> explain the collective opinion on this topic. I'm basically interested
> in the following questions:
>
> - Is there a need for more Unicode support in PHP?
Yes.
> - What is currently missing in that regard?
Unicode string support.
> - Is this a good place to ask such questions?
Yes.
If you want to see a pragmatic, actually working, work-in-progress attempt at better PHP unicode support, see this: https://github.com/krakjoe/ustring
It would add a UString class to PHP for Unicode strings. This would make
Unicode text manipulation much easier than it is now. Both internal and
userland code which accepts strings would already be compatible, since
UString has a __toString method, but new code could also choose to
accept UStrings directly.
Andrea Faulds
http://ajf.me/
> If you want to see a pragmatic, actually working, work-in-progress
> attempt at better PHP unicode support, see this:
> https://github.com/krakjoe/ustring
>
> It would add a UString class to PHP for Unicode strings. This would
> make Unicode text manipulation much easier than it is now. Both
> internal and userland code which accepts strings would already be
> compatible, since UString has a __toString method, but new code could
> also choose to accept UStrings directly.
Looking at it now. UString and the repo linked in its description are a
very good read indeed. Thank you.
>> - What is currently missing in that regard?
>
> Unicode string support.
I know that was probably deliberately flippant, but I think there is a
genuine question to be asked here. A lot of people talk about "Unicode
support" like they talk about "XPath support"; but XPath is an API you
can adhere to, while Unicode is a whole lot more (and less) than that.
What it probably means to most people is "string functions which do what
I expect with a vast range of obscure Unicode code point sequences".
Those expectations need to be documented before an API is written,
rather than writing a whole load of functions which use a Unicode
library, but don't actually provide the tools that people need.
> If you want to see a pragmatic, actually working, work-in-progress
> attempt at better PHP unicode support, see this:
> https://github.com/krakjoe/ustring
It looks like a good prototype, but glancing at the documentation, I'm
not clear exactly what the assumptions of some of the functions are.
There's a lot of talk of "characters", which is a very slippery notion
in Unicode; charAt() returns a single code point, and $length returns a
number of code points. This makes me wonder if it will pass "the noël
test" [1] - does a combining diacritic move onto a different letter when
you run ->reverse()?
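For what it's worth, the failure mode is easy to reproduce in plain PHP,
without UString at all; here is a minimal sketch that emulates a
code-point-based reverse() with preg_split() ("\xCC\x88" is the UTF-8
encoding of U+0308 COMBINING DIAERESIS):

    <?php
    // "noël" with a decomposed ë: the code points n, o, e, U+0308, l
    $noel = "noe\xCC\x88l";

    // Split into individual code points and reverse them - this is
    // what a code-point-based reverse() would do internally.
    $codePoints = preg_split('//u', $noel, -1, PREG_SPLIT_NO_EMPTY);
    echo implode('', array_reverse($codePoints)), "\n";
    // Prints "l̈eon" - the diaeresis has moved onto the "l"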
As I've mentioned before, a lot of the time what people actually want to
deal with is "grapheme clusters" - the kind of thing that you'd think of
as a character if you were writing by hand. Most people, if asked the
length of the string "noël", would answer 4, but there may be 5 code
points. (That's not just a case of normalisation choices; most
combinations of letter+diacritic have no single code point, that's why
the combining forms exist.)
A good Unicode string API should probably give clear labels and choices
for such things - $string->codePointAt(3) is not the same as
$string->graphemeAt(3), $string->codePointCount is not the same as
$string->graphemeCount, and so forth. A single property $length seems
more user-friendly, until the user finds it means something different to
what they wanted.
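Current PHP can already express that distinction, as a sketch using
ext/mbstring for the code point count and ext/intl for the grapheme
count:

    <?php
    $noel = "noe\xCC\x88l"; // "noël" with a decomposed ë

    var_dump(mb_strlen($noel, 'UTF-8')); // int(5) - code points
    var_dump(grapheme_strlen($noel));    // int(4) - grapheme clusters,
                                         // what most users call "length"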
Similarly, an automatic __toString() function is handy, but what
encoding does it output, and why? UTF-8? The same encoding that the
string was constructed with?
If I know that my database is expecting UTF-8, I probably want to say
$string->getByteString('UTF-8'). I may also want to say
$string->getByteStringWithMaxLength('UTF-8', 20) to fit an exact number
of graphemes into a 20-byte binary space; something that neither
$string->substring(0, 20)->getByteString('UTF-8') nor substr(
$string->getByteString('UTF-8'), 0, 20 ) can do.
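If I'm reading the ext/intl documentation correctly, grapheme_extract()
with the GRAPHEME_EXTR_MAXBYTES flag is one way to get that behaviour
today; a sketch, assuming intl is loaded:

    <?php
    $noel = "noe\xCC\x88l"; // 6 bytes; the decomposed ë alone is 3 bytes

    // Take as many *whole* graphemes as fit into 4 bytes; the e + U+0308
    // cluster would cross the limit, so it is left out entirely.
    echo grapheme_extract($noel, 4, GRAPHEME_EXTR_MAXBYTES), "\n"; // "no"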
In short, we can only abstract so much - supporting Unicode
automatically means supporting its complexity, not just pretending it's
a really big version of ASCII.
[1] http://mortoray.com/2013/11/27/the-string-type-is-broken/
--
Rowan Collins
[IMSoP]
>> If you want to see a pragmatic, actually working, work-in-progress
>> attempt at better PHP unicode support, see this:
>> https://github.com/krakjoe/ustring
>
> It looks like a good prototype, but glancing at the documentation, I'm
> not clear exactly what the assumptions of some of the functions are.
>
> There's a lot of talk of "characters", which is a very slippery notion
> in Unicode; charAt() returns a single code point, and $length returns
> a number of code points. This makes me wonder if it will pass "the
> noël test" [1] - does a combining diacritic move onto a different
> letter when you run ->reverse()?
>
> As I've mentioned before, a lot of the time what people actually want
> to deal with is "grapheme clusters" - the kind of thing that you'd
> think of as a character if you were writing by hand. Most people, if
> asked the length of the string "noël", would answer 4, but there may
> be 5 code points. (That's not just a case of normalisation choices;
> most combinations of letter+diacritic have no single code point,
> that's why the combining forms exist.)
>
> A good Unicode string API should probably give clear labels and
> choices for such things - $string->codePointAt(3) is not the same as
> $string->graphemeAt(3), $string->codePointCount is not the same as
> $string->graphemeCount, and so forth. A single property $length seems
> more user-friendly, until the user finds it means something different
> to what they wanted.
This is true. It ought to talk about code points but doesn't. Length is
primarily needed for iterating through strings and the like. If you want
length in characters, you probably need to implement your own algorithm,
as it really depends on your specific use case.
It will, however, always produce valid UTF-8 strings for output. That's
better than the standard string functions, which can mangle UTF-8.
> Similarly, an automatic __toString() function is handy, but what
> encoding does it output, and why? UTF-8? The same encoding that the
> string was constructed with?
Always UTF-8.
> If I know that my database is expecting UTF-8, I probably want to say
> $string->getByteString('UTF-8').
You can do that.
> I may also want to say $string->getByteStringWithMaxLength('UTF-8',
> 20) to fit an exact number of graphemes into a 20-byte binary space;
> something that neither $string->substring(0, 20)->getByteString('UTF-8')
> nor substr($string->getByteString('UTF-8'), 0, 20) can do.
I’m not sure quite how you’d do that. There might be a function in mbstring for that.
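Actually, mb_strcut() comes close: it cuts by bytes but backs off rather
than split a multi-byte sequence, so it is code-point-safe, though still
not grapheme-safe. A sketch:

    <?php
    $s = "noël"; // precomposed ë, 5 bytes in UTF-8

    // Byte 3 would fall inside the two-byte ë, so mb_strcut() backs
    // off to the previous code point boundary instead of mangling it.
    var_dump(mb_strcut($s, 0, 3, 'UTF-8')); // string(2) "no"
    var_dump(substr($s, 0, 3));             // "no" plus a broken half of ë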
> In short, we can only abstract so much - supporting Unicode
> automatically means supporting its complexity, not just pretending
> it's a really big version of ASCII.
Sure. But just handling code points safely is hard enough as it is. This handles that. It doesn’t handle characters, sure, but it’s a start. And for many applications, you do not need to handle characters.
Andrea Faulds
http://ajf.me/
Rowan,
> As I've mentioned before, a lot of the time what people actually want
> to deal with is "grapheme clusters" - the kind of thing that you'd
> think of as a character if you were writing by hand. Most people, if
> asked the length of the string "noël", would answer 4, but there may
> be 5 code points. (That's not just a case of normalisation choices;
> most combinations of letter+diacritic have no single code point,
> that's why the combining forms exist.)
Very good point. I'll give another example: is there a substring "s" in
the string "Maße"? If it's a case-sensitive search, then there is no
such substring, but if it's a case-insensitive search, then "ß" folds
into "ss" and the substring "s" appears.
This works both ways. For instance, if someone wants to split the string
"MASSE" after "ß" in a case-insensitive manner, one approach might be:
1) find the position of "ß", which is +2; 2) split the string at +3. The
result would be two strings: "MAS" and "SE".
Back to combining characters: I dig the idea of introducing graphemes,
but I think a French person would write the word "noël" using a
precomposed character. I'm using the French keyboard at
https://translate.google.com/#fr/: "ë" is Shift + "^", then "e", and it
produces the precomposed U+00EB.
If a script doesn't have a precomposed equivalent, then the grapheme
will always be in the same decomposed form, and collation will work.
Substring search will also work, because the needle will be decomposed
in the same way as the haystack. Some borderline cases are possible, but
are they really practical in the scope of Unicode support in a
programming language?
Any ideas?
P.S. Point about documentation taken.
> Very good point. I'll give another example: is there a substring "s"
> in the string "Maße"? If it's a case-sensitive search, then there is
> no such substring, but if it's a case-insensitive search, then "ß"
> folds into "ss" and the substring "s" appears.
In Unicode 5.1 there is "ẞ" U+1E9E LATIN CAPITAL LETTER SHARP S.
(The point of this post is mostly to show that there is another
dimension making this even more complicated: different Unicode
versions.)
johannes
>> Very good point. I'll give another example: is there a substring "s"
>> in the string "Maße"? If it's a case-sensitive search, then there is
>> no such substring, but if it's a case-insensitive search, then "ß"
>> folds into "ss" and the substring "s" appears.
>
> In Unicode 5.1 there is "ẞ" U+1E9E LATIN CAPITAL LETTER SHARP S.
>
> (The point of this post is mostly to show that there is another
> dimension making this even more complicated: different Unicode
> versions.)
It's still there in Unicode 7.0. According to the Unicode character
database, the uppercase of "ß" is "SS" and the lowercase of "ẞ" is "ß";
both case-fold to "ss". Thus upper(lower("ẞ")) should produce "SS".
There is another dimension indeed.
> Back to combining characters: I dig the idea of introducing graphemes,
> but I think a French person would write the word "noël" using a
> precomposed character. I'm using the French keyboard at
> https://translate.google.com/#fr/: "ë" is Shift + "^", then "e", and
> it produces the precomposed U+00EB.
You don't even need to rely on the input method using the combined form;
Unicode includes an algorithm for normalisation to this form (where such
composites are encoded), known as NFC.
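PHP's ext/intl already exposes this as the Normalizer class; a minimal
sketch:

    <?php
    $decomposed = "noe\xCC\x88l"; // "noël" with e + U+0308

    var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // false

    $nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump(mb_strlen($nfc, 'UTF-8')); // int(4) - ë is now the
                                        // precomposed U+00EB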
> If a script doesn't have a precomposed equivalent, then the grapheme
> will always be in the same decomposed form, and collation will work.
> Substring search will also work, because the needle will be decomposed
> in the same way as the haystack.
No, it won't. You won't get false negatives as long as both strings are
normalised to the same form (whether that is NFC or NFD), but you will
get false positives. For instance, searching for the substring "e" would
not match a combined ë, but it would match an uncombined sequence with e
at its base (e.g. with two diacritics).
Normalising to NFD (fully decomposed) would at least mean that "e"
consistently matched all graphemes with "e" at their base, but it is not
a lossless operation, so performing it implicitly is probably not a good
idea.
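To make the false positive concrete, a sketch in plain PHP (the
grapheme-aware behaviour is what I would expect from ext/intl's
grapheme_strpos(), since a match that splits a cluster should not
count):

    <?php
    $haystack = "e\xCC\x88"; // "ë" as e + U+0308, a single grapheme

    // A byte-wise search happily matches the base letter alone:
    var_dump(strpos($haystack, 'e'));          // int(0) - false positive

    // A grapheme-aware search should refuse to split the cluster:
    var_dump(grapheme_strpos($haystack, 'e')); // bool(false)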
All of which ignores the questions of length and string reversal, which
I think are much more important in this respect.
> Some borderline cases are possible, but are they really practical in
> the scope of Unicode support in a programming language?
As I understand it, the entirety of the Korean writing system is an
"edge case" in this respect - it uses 3 code points for each grapheme,
and cutting one of those graphemes apart leaves you with gibberish.
It's pretty meaningless to say you support Unicode, but only the easy
bits. You might as well just tag each string with one of the pages of
ISO-8859.
--
Rowan Collins
[IMSoP]
Rowan,
>> Back to combining characters: I dig the idea of introducing
>> graphemes, but I think a French person would write the word "noël"
>> using a precomposed character. I'm using the French keyboard at
>> https://translate.google.com/#fr/: "ë" is Shift + "^", then "e", and
>> it produces the precomposed U+00EB.
>
> You don't even need to rely on the input method using the combined
> form; Unicode includes an algorithm for normalisation to this form
> (where such composites are encoded), known as NFC.
The problem with NFC is that it's not only composition, but
decomposition + reordering + re-composition. I know about the NFC quick
check, but the issue is that if the check fails and the string needs
transformation, this would be very challenging, if not impossible, to do
while keeping the string immutable and without introducing an internal
representation of that string.
An internal representation and string modifications bring overhead which
might eventually render the implementation unusable for a range of
applications.
On the other hand, language-specific characters which can be precomposed
are likely to be precomposed.
>> If a script doesn't have a precomposed equivalent, then the grapheme
>> will always be in the same decomposed form, and collation will work.
>> Substring search will also work, because the needle will be
>> decomposed in the same way as the haystack.
>
> No, it won't. You won't get false negatives as long as both strings
> are normalised to the same form (whether that is NFC or NFD), but you
> will get false positives. For instance, searching for the substring
> "e" would not match a combined ë, but it would match an uncombined
> sequence with e at its base (e.g. with two diacritics).
>
> Normalising to NFD (fully decomposed) would at least mean that "e"
> consistently matched all graphemes with "e" at their base, but it is
> not a lossless operation, so performing it implicitly is probably not
> a good idea.
Good point. That's what I meant by a borderline case. Could you possibly
point me to a specific example of such a false positive? I'm interested
in well-formed UTF-8 strings. I believe the "noël" test is ill-formed
UTF-8 and doesn't conform to the shortest-form requirement.
> It's pretty meaningless to say you support Unicode, but only the easy
> bits. You might as well just tag each string with one of the pages of
> ISO-8859.
As far as I'm concerned, the Unicode specification does not require
implementing all annexes, or even supporting the entire character set,
to be conformant. I think there are always trade-offs involved,
depending on what is more important for you.
> Good point. That's what I meant by a borderline case. Could you
> possibly point me to a specific example of such a false positive? I'm
> interested in well-formed UTF-8 strings. I believe the "noël" test is
> ill-formed UTF-8 and doesn't conform to the shortest-form requirement.
You're confusing two concepts here: well-formed UTF-8 represents any
single code point with the smallest number of bytes, but it makes no
requirements about which code points are represented. Representing "ë"
as two code points is perfectly valid Unicode, and would in fact be
required under NFD.
That "most" input sources would prefer the combined form seems like a weak assumption to base a library on; it only takes one popular third-party to routinely return data in NFD for the problems to start showing up.
>> It's pretty meaningless to say you support Unicode, but only the easy
>> bits. You might as well just tag each string with one of the pages of
>> ISO-8859.
>
> As far as I'm concerned, the Unicode specification does not require
> implementing all annexes, or even supporting the entire character set,
> to be conformant. I think there are always trade-offs involved,
> depending on what is more important for you.
Sure, but there are certain user expectations of what "Unicode support"
means. Handling Korean characters in a meaningful way would definitely
be on that list.
As I said at the top of my first post, the important thing is to capture what those requirements actually are. Just as you'd choose what array functions were needed if you were adding "array support" to a language.
To put it a different way: in what situation would you actively want to
know the number of code points in a string, rather than either the
number of bytes in its UTF-8 representation, or the number of graphemes?
Rowan,
> As I said at the top of my first post, the important thing is to
> capture what those requirements actually are. Just as you'd choose
> what array functions were needed if you were adding "array support" to
> a language.
I'm sorry for not making myself clear. What I'm essentially saying is
that I think the "noël" test is synthetic and impractical; it's also
solvable by requiring NFC strings at input, and this is not an
implementation defect. I also believe that Hangul is most likely to be
precomposed and will work all right. And I have a different opinion on
the UTF-8 shortest form.
This is my personal opinion of course.
That aside.
I think requirements are what I was asking about. I'm assuming that your
standpoint is that string modification routines are at least required to
take entire characters into account, not only code points. Am I correct?
What is confusing me is that I think you're seeing it as a major
implementation defect. To avoid arguable implementations, I've made a
short example in Java:

    System.out.println(new StringBuffer("noël").reverse().toString());

It does produce the string "l̈eon", as I would expect. Precomposed "noël"
also works as I would expect, producing the string "lëon". What do you
think: is this an implementation issue or solely a requirements issue?
Aleksey Tulinov wrote (on 15/10/2014):
> Rowan,
>
>> As I said at the top of my first post, the important thing is to
>> capture what those requirements actually are. Just as you'd choose
>> what array functions were needed if you were adding "array support"
>> to a language.
>
> I'm sorry for not making myself clear. What I'm essentially saying is
> that I think the "noël" test is synthetic and impractical
I remain unconvinced on that, and it's just one example. There are
plenty of sequences which don't have a precomposed form; otherwise there
would be no need for combining diacritics to exist in the first place.
> it's also solvable by requiring NFC strings at input, and this is not
> an implementation defect. I also believe that Hangul is most likely to
> be precomposed and will work all right.
Requiring a particular normal form on input is not something a
programming language can do. The only way you can guarantee NFC form is
by performing the normalisation.
> And I have a different opinion on the UTF-8 shortest form.
There's no need for opinion there, we can consult the standard.
http://www.unicode.org/versions/Unicode6.0.0/
    D76 Unicode scalar value: Any Unicode code point except
    high-surrogate and low-surrogate code points.

    D77 Code unit: The minimal bit combination that can represent a
    unit of encoded text for processing or interchange. [...] The
    Unicode Standard uses 8-bit code units in the UTF-8 encoding form
    [...]

    D79 A Unicode encoding form assigns each Unicode scalar value to a
    unique code unit sequence.

    D85a Minimal well-formed code unit subsequence: A well-formed
    Unicode code unit sequence that maps to a single Unicode scalar
    value.

    D92 UTF-8 encoding form: The Unicode encoding form that assigns
    each Unicode scalar value to an unsigned byte sequence of one to
    four bytes in length, as specified in Table 3-6 and Table 3-7.

    Before the Unicode Standard, Version 3.1, the problematic
    "non-shortest form" byte sequences in UTF-8 were those where BMP
    characters could be represented in more than one way. These
    sequences are ill-formed, because they are not allowed by Table 3-7.
In short: UTF-8 defines a mapping from sequences of 8-bit "code units"
to abstract "Unicode scalar values". Every Unicode scalar value maps to
a single unique sequence of code units, and all Unicode scalar values
can be represented. Since U+0308 COMBINING DIAERESIS is a valid Unicode
scalar value, a UTF-8 string representing that value can be well-formed.
It is only alternative representations of the same Unicode scalar value
which must be in shortest form.
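Both halves of that are easy to check from PHP with ext/mbstring, as a
sketch:

    <?php
    // A lone U+0308 COMBINING DIAERESIS is perfectly well-formed UTF-8:
    var_dump(mb_check_encoding("\xCC\x88", 'UTF-8')); // bool(true)

    // Whereas an over-long ("non-shortest form") encoding of "/" is not:
    var_dump(mb_check_encoding("\xC0\xAF", 'UTF-8')); // bool(false)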
There may be standards for interchange in particular situations which
enforce additional constraints, such as that all strings should be in
NFC, but the applicability or correct implementation of such standards
is not something that you can use to define handling in an entire
programming language.
> That aside.
>
> I think requirements are what I was asking about. I'm assuming that
> your standpoint is that string modification routines are at least
> required to take entire characters into account, not only code points.
> Am I correct?
Yes, I think that at least some functions should be available which work
on "characters" as users would define them, such as length and perhaps
safe truncation.
> What is confusing me is that I think you're seeing it as a major
> implementation defect. To avoid arguable implementations, I've made a
> short example in Java:
>
>     System.out.println(new StringBuffer("noël").reverse().toString());
>
> It does produce the string "l̈eon", as I would expect.
Why do you expect that? Is this a result which would ever be useful?
To be clear, I am suggesting that we aim to be the language which gets
this right, where other languages get it wrong.
Precomposed "noël" also works as i would expect producing string
"lëon". What do you think, is this implementation issue or solely
requirements issue?
Well, you can only define an implementation defect with respect to the
original requirement. If the requirement was to reverse "characters", as
most users would understand that term, then moving the diacritic to a
different letter fails that requirement, because a user would not
consider a diacritic a separate character.
If the requirement was to reverse code points, regardless of their
meaning, then the implementation is fine, but I would argue that the
requirement failed to capture what most users would actually want.
Regards,
Rowan Collins
[IMSoP]
Rowan,
>> What is confusing me is that I think you're seeing it as a major
>> implementation defect. To avoid arguable implementations, I've made a
>> short example in Java:
>>
>>     System.out.println(new StringBuffer("noël").reverse().toString());
>>
>> It does produce the string "l̈eon", as I would expect.
>
> Why do you expect that? Is this a result which would ever be useful?
I think I expect it to work this way because I know that this is a good
trade-off between performance and the produced result. It also leaves
the possibility to do better if I need to.
> To be clear, I am suggesting that we aim to be the language which gets
> this right, where other languages get it wrong.
Thank you for explaining this. I also think it could do better. I think
a Unicode-aware strrev() shouldn't be too complicated to do.
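For instance, a minimal sketch using PCRE's \X escape, which matches one
extended grapheme cluster at a time (assuming a PCRE built with Unicode
support):

    <?php
    function grapheme_strrev($utf8)
    {
        // \X matches an extended grapheme cluster, so combining marks
        // travel together with their base character.
        preg_match_all('/\X/u', $utf8, $matches);
        return implode('', array_reverse($matches[0]));
    }

    echo grapheme_strrev("noe\xCC\x88l"), "\n";
    // "lëon" - the diaeresis stays on the "e", even in decomposed form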
Hello,
I think that Rowan is right: PHP users need to manipulate grapheme
clusters first (and code points in some rare situations). The fact that
most of us live in a world where NFC composes all our characters only
hides this reality.
A typical use case is a template engine: nearly all string manipulations
there need grapheme awareness - cutting strings to get an excerpt,
inserting a space between every "character", changing the case, etc. A
typical use case for a PHP app.
Another use case is implementing text indexing in PHP: you need to
normalize before indexing and handle case folding, and thus think in
terms of graphemes. I'm not sure this is frequent in PHP though.
As already said, alongside grapheme clusters we should also deal with
string matching: collations are out of scope, but normalization and case
folding are in. Please also do not forget the Turkish alphabet:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/TurkishUtf8.php
This is required IMHO to get what users expect from str_replace, strpos,
strcmp, etc.
I wrote a quite successful PHP lib to deal with this in PHP:
https://github.com/nicolas-grekas/Patchwork-UTF8
My experience from this is the following:
- dealing with grapheme clusters in current PHP is OK with the
grapheme_*() functions, but these require intl. It would be great to
have them (or an equivalent) in core,
- NFC normalization of all input is required to deal with string
comparisons, so having Normalizer in core looks required also,
- almost everybody uses mbstring when dealing with UTF-8 strings, but
almost all of those cases should use a grapheme_*() function instead.
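To illustrate with the template-engine excerpt case from above, a sketch
of what intl already gives us today ($userText is just a stand-in for
whatever the template receives):

    <?php
    // Normalize input once, so comparisons and searches behave:
    $text = Normalizer::normalize($userText, Normalizer::FORM_C);

    // Cut an 80-"character" excerpt without ever splitting a cluster:
    $excerpt = grapheme_substr($text, 0, 80);

    // Count what the user would call characters:
    $length = grapheme_strlen($text);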
>> To be clear, I am suggesting that we aim to be the language which
>> gets this right, where other languages get it wrong.
>
> Thank you for explaining this. I also think it could do better. I
> think a Unicode-aware strrev() shouldn't be too complicated to do.
Perl 6 identified the subject very well and invented what they call "NFG",
which is NFC + dynamic internal code points for non-composable grapheme
clusters:
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html
Maybe worth looking at?
Cheers,
Nicolas
> - Is there a need for more Unicode support in PHP?
> - What is currently missing in that regard?
> - Is this a good place to ask such questions?
I need to ask...
Is this discussion only about improving support for UTF-8 content in
PHP? What is the current state of play with regard to function and
variable names?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk