hi,
Unicode remains one of the most requested features in PHP.
However, as Rasmus and others stated earlier, it is not a trivial job.
Some of the key points we need to take care of are:
- UTF-8 storage
- UTF-8 support for almost (if not all) existing string APIs
- Performance
As of today, I have not found any library covering at least two of these
key points.
Please keep in mind that I am by no means a Unicode expert; this
summary is what I gathered by reading the documentation and discussion
archives of ICU and other projects. Experiments still have to be done.
However, I would rather discuss the options before going wild with an
implementation (a huge task, even for basic feature coverage).
If any of the following statements is wrong or inaccurate, please
correct it. I will keep a dedicated wiki page to summarize the
discussions and options about Unicode support.
- ICU:
U_CHARSET_IS_UTF8 allows forcing ICU to use UTF-8 by default. It is an
ICU compile-time setting; it is not possible to set it at PHP
configure time. This means that users would have to create their own
ICU build. Alternatively we could bundle ICU, but this would be awkward
and a maintenance nightmare for both PHP and the distros.
Alternatively, UText can be used to create UTF-8 strings. APIs accepting
UText allow almost everything we need. The downside is that a UTF-8
UText is read-only: any operation altering its content requires
duplication, clones or conversions. That may kill all the gains we get
from using UTF-8 only.
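To make the read-only limitation concrete, here is a minimal, untested
sketch (assuming only the stock ICU C headers) that walks the code
points of a UTF-8 buffer through UText; anything that would mutate the
text has to go through a copy or a conversion instead:

/* Iterate the code points of a UTF-8 buffer via ICU's UText API,
 * without converting to UTF-16 first. The UTF-8 UText is read-only. */
#include <stdio.h>
#include <unicode/utext.h>

int main(void)
{
    const char *s = "na\xC3\xAFve";              /* "naïve" in UTF-8 */
    UErrorCode status = U_ZERO_ERROR;
    UText *ut = utext_openUTF8(NULL, s, -1, &status);

    if (U_FAILURE(status)) {
        return 1;
    }
    for (UChar32 c = utext_next32From(ut, 0); c >= 0; c = utext_next32(ut)) {
        printf("U+%04X\n", (unsigned)c);
    }
    utext_close(ut);
    return 0;
}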
U_CHARSET_IS_UTF8 is very appealing, but bundling ICU is actually a
show stopper, and asking users to custom build ICU is not an option
either. I do not know whether the distros would be ready to provide two
different builds of ICU; it may create a lot of issues for all the
projects using ICU.
- UTF8proc
utf8proc is very attractive, small and relatively fast. I see it as a
good starting point. However, its features cover only a small part of
what PHP needs. It is easy to bundle but would require a fork and a lot
of work to add all the missing features.
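As a rough idea of the level utf8proc currently covers, here is a small,
untested sketch (assuming utf8proc is installed) that decodes a UTF-8
buffer code point by code point and applies its simple case mapping;
collation, locale-aware casing and the rest would be the missing part:

/* Decode UTF-8 with utf8proc and print a lower-cased code point for each
 * input code point. Invalid sequences make utf8proc_iterate return < 0. */
#include <stdio.h>
#include <string.h>
#include <utf8proc.h>

int main(void)
{
    const char *s = "Gr\xC3\xBC\xC3\x9F";        /* "Grüß" in UTF-8 */
    const utf8proc_uint8_t *p = (const utf8proc_uint8_t *)s;
    utf8proc_ssize_t len = (utf8proc_ssize_t)strlen(s);

    while (len > 0) {
        utf8proc_int32_t cp;
        utf8proc_ssize_t n = utf8proc_iterate(p, len, &cp);
        if (n < 0) {
            return 1;                            /* invalid UTF-8 */
        }
        printf("U+%04X -> lower U+%04X\n",
               (unsigned)cp, (unsigned)utf8proc_tolower(cp));
        p += n;
        len -= n;
    }
    return 0;
}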
- librope
Same comments as for utf8proc, with even fewer features.
I would like to start discussing our options now. I am not asking to
get into all the implementation details from a userland point of view
(like u"some text", or adding new APIs or not), but only to see what we
can do internally to work with UTF-8 strings.
Thoughts, comments or ideas?
Links & references
https://github.com/josephg/librope
http://userguide.icu-project.org/strings/utf-8
Cheers,
Pierre
@pierrejoye | http://www.libgd.org
Hi Pierre,
Thoughts, comments or ideas?
it may be crazy to even think about it, but my idea is to mix the
"worst" (C++ and ICU) to get the ultimate unicode foundation.
Boost.Locale:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/rationale.html#why_icu
cryptocompress
On Feb 20, 2014 10:05 PM, "Crypto Compress" cryptocompress@googlemail.com
wrote:
Hi Pierre,
Thoughts, comments or ideas?
it may be crazy to even think about it, but my idea is to mix the "worst"
(C++ and ICU) to get the ultimate unicode foundation.
Boost.Locale:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/rationale.html#why_icu
Mainly because we would like to use UTF-8 storage.
Also, a pure C++ API is not an option yet. Unless we rewrite PHP in C++,
but then we would be looking at a 3-4 year dev phase, not really what I
am looking for.
Cheers
Pierre
Hello :-),
Also pure c++ api is not an option yet. Unless we rewrite php in c++ but
then I would go with a 3-4 years Dev phase, not really what I am looking
for.
Several projects use a mix of C and C++ (the first name that comes to
my mind is Gecko). That's not a bad thing.
--
Ivan Enderlin
Developer of Hoa
http://hoa-project.net/
PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/
Member of HTML and WebApps Working Group of W3C
http://w3.org/
On Feb 20, 2014 10:48 PM, "Ivan Enderlin @ Hoa" <
ivan.enderlin@hoa-project.net> wrote:
Hello :-),
Also pure c++ api is not an option yet. Unless we rewrite php in c++ but
then I would go with a 3-4 years Dev phase, not really what I am looking
for.
Several projects use a mix of C and C++ (the first name that comes in my
minds is Gecko). That's not a bad thing.
Do we really want this part to be C++? It will be used everywhere.
On Feb 20, 2014 10:48 PM, "Ivan Enderlin @ Hoa"
<ivan.enderlin@hoa-project.net> wrote:
Hello :-),
Also pure c++ api is not an option yet. Unless we rewrite php in
c++ but
then I would go with a 3-4 years Dev phase, not really what I am
looking
for.
Several projects use a mix of C and C++ (the first name that comes
in my minds is Gecko). That's not a bad thing.
Really want this part to be c++? It will be used everywhere.
Nope. I was just saying that if it must be C++, it should not be a problem, no?
--
Ivan Enderlin
Developer of Hoa
http://hoa-project.net/
PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/
Member of HTML and WebApps Working Group of W3C
http://w3.org/
Hello :-),
Also pure c++ api is not an option yet. Unless we rewrite php in c++ but
then I would go with a 3-4 years Dev phase, not really what I am looking
for.
Several projects use a mix of C and C++ (the first name that comes in my
minds is Gecko). That's not a bad thing.
and I recall that MySQL is a big C/C++ mix :)
Andrey
Hello :-),
Also pure c++ api is not an option yet. Unless we rewrite php in c++ but
then I would go with a 3-4 years Dev phase, not really what I am looking
for.
Several projects use a mix of C and C++ (the first name that comes in my
minds is Gecko). That's not a bad thing.
Mixing to some degree is fine. The issue here is that essentially all
our source files would become C++ files, since we need string APIs
basically everywhere... OK, we might abstract away the most common
things behind C wrappers, but then we get a fancy mix where we can't use
the good parts of C++ yet are bound by C's limitations.
Not really good for maintainable, clean code. Pure C++11 could bring
nice code (with some C++ weirdness), but that's a rewrite, not an
evolution (-> HHVM). In PHP's context C++ is good to use in individual
libraries (e.g. ext/intl) but not for core foundation things.
Unfortunately.
johannes
Hi Pierre,
Thoughts, comments or ideas?
it may be crazy to even think about it, but my idea is to mix the
"worst" (C++ and ICU) to get the ultimate unicode foundation.Boost.Locale:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/rationale.html#why_icu
Mainly because we like to use UTF-8 storage.
What do you understand by "storage"?
Quote: "U_CHARSET_IS_UTF8 allows to force ICU to use UTF-8 by default.
It is a ICU compile time setting."
Source: Pierre
Quote: "...stateless encodings like UTF-8..."
Source:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/charset_handling.html#codecvt_limitations
Also pure c++ api is not an option yet. Unless we rewrite php in c++
but then I would go with a 3-4 years Dev phase
Yes, a complete rewrite is insane, not crazy. I would prefer small
evolutionary steps and some bigger ones.
...not really what I am looking for.
We know what you are looking for. You asked for ideas.
cryptocompress
On Feb 21, 2014 4:05 AM, "Crypto Compress" cryptocompress@googlemail.com
wrote:
Hi Pierre,
Thoughts, comments or ideas?
it may be crazy to even think about it, but my idea is to mix the
"worst" (C++ and ICU) to get the ultimate unicode foundation.Boost.Locale:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/rationale.html#why_icu
Mainly because we like to use UTF-8 storage.
What do you understand by "storage"?
To have strings stored as UTF-8 only, with no conversion required for
99% of our uses.
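For illustration, an untested sketch (assuming a stock ICU build without
U_CHARSET_IS_UTF8) of the kind of UTF-8 to UTF-16 round trip that UTF-8
storage is meant to avoid for the common cases; with a default ICU build
every UChar-based API call needs something like it:

/* Convert a UTF-8 byte string (how PHP would store it) to UTF-16 code
 * units so a UChar-based ICU API can consume it. */
#include <stdio.h>
#include <unicode/ustring.h>

int main(void)
{
    const char *utf8 = "h\xC3\xA9llo";           /* "héllo" in UTF-8 */
    UChar buf[64];
    int32_t len16 = 0;
    UErrorCode status = U_ZERO_ERROR;

    u_strFromUTF8(buf, 64, &len16, utf8, -1, &status);
    if (U_FAILURE(status)) {
        return 1;
    }
    printf("converted to %d UTF-16 code units\n", (int)len16);
    return 0;
}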
Quote: "U_CHARSET_IS_UTF8 allows to force ICU to use UTF-8 by default. It
is a ICU compile time setting."
Source: Pierre
Quote: "...stateless encodings like UTF-8..."
Source:
http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/charset_handling.html#codecvt_limitations
ICU has limitations in UTF-8 mode. Bundling ICU or asking for a custom
build could be a problem as well. The other point to check is whether
ICU can have two installs on the same system, one with the flag and one
without. That would help, as distros could then provide both.
Also pure c++ api is not an option yet. Unless we rewrite php in c++ but
then I would go with a 3-4 years Dev phase
Yes, a complete rewrite is insane not crazy. Prefer small evolutionary
steps and some bigger ones.
...not really what I am looking for.
We know what you are looking for. You asked for ideas.
Maybe I was not clear here. What I am not looking for is to delay a
possible 6 release by 2-3 years. Ideas, like yours here, are indeed more
than welcome.
Pierre Joye wrote:
What do you understand by "storage"?
To have string stored as UTF-8 only, no conversion required for 99% of our use.
I think that the first thing that needs to be agreed on is whether
there will be support for UTF-8 in the core? As has already been said,
in many places this currently just works, so blocking that may be more
of a problem now? The question surely is "What is the 1% that needs some
extra work?"
A light library would be most appropriate for filling the gaps
currently created by the use of UTF-8 strings in the core? It is not
until one starts adding the mbstring level of string processing that a
more powerful library is required. Something that simply ensures UTF-8
strings are valid and can carry out comparisons as required?
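For the "simply ensures UTF-8 strings are valid" part, a hand-rolled
check is small; an untested sketch of such a helper (rejecting truncated
sequences, overlong forms, surrogates and code points above U+10FFFF)
might look like this:

/* Return 1 if the buffer is well-formed UTF-8, 0 otherwise. */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;
        uint32_t cp, min;

        if (b < 0x80) { i++; continue; }
        if ((b & 0xE0) == 0xC0)      { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                              /* bad lead byte */

        if (i + n >= len) return 0;                 /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                     /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;
        i += n + 1;
    }
    return 1;
}

int main(void)
{
    const unsigned char ok[] = "caf\xC3\xA9";       /* "café" in UTF-8 */
    printf("valid: %d\n", is_valid_utf8(ok, sizeof ok - 1));
    return 0;
}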
The black hole is still 'case sensitivity', and perhaps laying down a
'light' set of rules for this would allow a path forward? As I have
indicated, I'd prefer simply dropping case insensitivity, but a compromise might
be to retain it where a string length does not change, and a clean reverse
transform exists? So a library that provides that comparison as part of the core
package?
I think that moving forward, ICU support is essential, but it is difficult while
the 'wrong' defaults are applied and I am seeing private builds being used in
other projects to get around that hurdle. Hence my question as to
whether people are taking that approach.
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Pierre Joye wrote:
What do you understand by "storage"?
To have string stored as UTF-8 only, no conversion required for 99% of our
use.
I think that the first thing that needs to be agreed on is if there will be
support for UTF-8 in the core? As has already been said, in many places this
currently just works and so blocking that may be more of a problem now? The
question surly is "What is the 1% that needs some extra work?"
I think we pretty much agree already that we need UTF-8 as the base,
meaning strings are stored in UTF-8. Conversions may be needed for
advanced usages provided by ICU (or maybe not, I just do not know for
sure yet).
I light library would be most appropriate for filling the gaps currently
created by use of UTF-8 strings in the core? It is not until one starts
adding the mbstring level of string processing that a more powerful library
is required. Something that simply ensures UTF-8 strings are valid and can
carry out comparisons as required?
It is about more than only comparison. If it were only comparison,
additions and the like, utf8proc would be enough, or librope with some
additions.
The black hole is still 'case sensitivity' and it is perhaps laying down a
'light' set of rules for this which would allow a path forward? As I have
indicated, I'd prefer simply dropping case insensitivity, but a compromise
might be to retain it where a string length does not change, and a clean
reverse transform exists? So a library that provides that comparison as part
of the core package?
I do not care much about language support for UTF-8 names for methods,
functions, variables, etc. My take on it is that we should stick to
ASCII for those and be done with it. But that's only my opinion :)
We may end up writing our own library for the core operations... but I
would prefer to avoid that, as it is really not a trivial task.
Cheers,
Pierre
@pierrejoye | http://www.libgd.org
Pierre Joye wrote:
Pierre Joye wrote:
What do you understand by "storage"?
To have string stored as UTF-8 only, no conversion required for 99% of our
use.
I think that the first thing that needs to be agreed on is if there will be
support for UTF-8 in the core? As has already been said, in many places this
currently just works and so blocking that may be more of a problem now? The
question surly is "What is the 1% that needs some extra work?"
I think we pretty much agree already that we need UTF-8 as the base,
meaning are stored in UTF-8. Conversions may be needed for advanced
usages provided by ICU (or maybe not, I just do not know for sure
now).
I light library would be most appropriate for filling the gaps currently
created by use of UTF-8 strings in the core? It is not until one starts
adding the mbstring level of string processing that a more powerful library
is required. Something that simply ensures UTF-8 strings are valid and can
carry out comparisons as required?
it is more than only comparison. If only comparison, additions and the
likes, utf8proc is enough, or librope with some additions.
The only thing putting me off utf8proc is that it only supports Unicode
5.0.0. librope does not seem to understand any of the fine detail of the
Unicode standards? What I've been looking for is the case-switching
actions, and currently all I can find to handle that is ICU?
The black hole is still 'case sensitivity' and it is perhaps laying down a
'light' set of rules for this which would allow a path forward? As I have
indicated, I'd prefer simply dropping case insensitivity, but a compromise
might be to retain it where a string length does not change, and a clean
reverse transform exists? So a library that provides that comparison as part
of the core package?
I do not care much about languages support for UTF-8 names for
methods, functons, variables etc. My take on it is that we should
stick to ASCII for it and be done with that. But that's only my
opinion :)
While I have no intention of using more than ASCII myself I can see the argument
for supporting use of more user friendly names for functions and the like. I see
the complaints about our current 'English' names and how they need improving
while at the same time I am dealing with customer sites where we provide simple
aliases for all text in a local translation. Easy enough in a relational
database where you simply select the right set of entries from a table, but not
so easy for PHP ...
We may end writing our own library for the core operations... But I
would prefer to avoid that as it is really not a trivial task.
Totally agree ... but I don't see a good path yet?
While ICU creates its own complications when using ready-bundled
versions, it is by far the cleanest code for both UTF-8 and actually
UTF-32, if one simply ditches all the UTF-16 mess. I'd much rather start
from that code than from any of the other libraries identified so far.
In any case I don't see any alternative for the conversion process to
and from UTF-8?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
hi,
Hello :-),
Unicode still remains one of the top requested features in PHP.
However as Rasmus and other stated earlier, it is not a trivial job.
Some of the keys point we need to take care of are:
- UTF-8 storage
- UTF-8 support for almost (if not all) existing string APIs
- Performance
As of today, I did not find any library covering at least two of these
key points.
[snip]
I would like to begin to discuss our option now already. I am not
asking to get in all implementation details from a userland point of
view (like u"some text" or addng new APIs or not) but only to see what
we can do internally to work with UTF-8 string.
Just a little note: using a u"foobar" syntax would help to switch from
one light or heavy implementation to another internally, and thus it
would help to cover at least two of the key points described above.
I would mention the Rust implementation of UTF-8 strings [1, 2]. It's
fast, it's safe and it has a nice, large API. I am not saying I want to
see PHP using Rust; I think it would be hard to do (even if it would
certainly benefit PHP), but the algorithms they used can be a source of
inspiration for us. Maybe we should consider it if we decide to have our
own implementation instead of using a third-party library.
Cheers.
[1] https://github.com/mozilla/rust/blob/master/src/libstd/str.rs
[2] http://static.rust-lang.org/doc/master/std/str/index.html
--
Ivan Enderlin
Developer of Hoa
http://hoa-project.net/
PhD. student at DISC/Femto-ST (Vesontio) and INRIA (Cassis)
http://disc.univ-fcomte.fr/ and http://www.inria.fr/
Member of HTML and WebApps Working Group of W3C
http://w3.org/
hi,
I have been a PHP developer for a long time but have only a little
knowledge of C/C++, so I do not know some of the really internal parts
of the engine.
From my perspective the internal datatype "string" should be a binary
string (byte array), and only in a specific context should this binary
string be interpreted as a more specialized string. In my understanding
this is already the case 90% of the time.
Unicode support (and more) could be provided as a String class, like it
is done in Java, implementing a magic "__toString" method to get the raw
binary string. - We already have "(binary)" as an alias for "(string)".
This should be almost compatible with current behavior and provide a
very clean API as sugar.
Only the places where the current string type is not handled as a
binary string without context would need to be updated
... like var_dump("1e1" == "10"); but var_dump("1e1" == 10); should work
as before, because the integer type would switch the binary string into
the context of a numeric (ASCII) string.
Thoughts?
Marc
- ICU:
U_CHARSET_IS_UTF8 allows forcing ICU to use UTF-8 by default. It is an
ICU compile-time setting; it is not possible to set it at PHP
configure time. This means that users would have to create their own
ICU build. Alternatively we could bundle ICU, but this would be awkward
and a maintenance nightmare for both PHP and the distros.
Alternatively, UText can be used to create UTF-8 strings. APIs accepting
UText allow almost everything we need. The downside is that a UTF-8
UText is read-only: any operation altering its content requires
duplication, clones or conversions. That may kill all the gains we get
from using UTF-8 only.
U_CHARSET_IS_UTF8 is very appealing, but bundling ICU is actually a
show stopper, and asking users to custom build ICU is not an option
either. I do not know whether the distros would be ready to provide two
different builds of ICU; it may create a lot of issues for all the
projects using ICU.
Here is a 1st reply from ICU:
http://sourceforge.net/p/icu/mailman/message/32031609/
It sounds like this flag could be a good option for PHP's Unicode support.
Btw, I created a sub page for Unicode support:
https://wiki.php.net/ideas/php6/unicode
Thoughts, comments or ideas?
I found another C++ library for the basic UTF-8 operations, easl:
https://code.google.com/p/easl/
It could be a nice one to use in combination with ICU; it is small and
fast (first tests).
Cheers,
Pierre
@pierrejoye | http://www.libgd.org
Pierre Joye wrote:
Here is a 1st reply from ICU:
http://sourceforge.net/p/icu/mailman/message/32031609/
It sounds like this flag could be a good option for PHP's Unicode support.
Reading between the lines, it would seem that a switch to a UTF-8 base
is their preferred path, but the core code is too ingrained in UTF-16?
Since there is really no alternative to ICU for the heavy grunt work, I
do see this as the right starting point. Any 'bells and whistles' should
use the ICU UTF-8 style rather than pulling in yet more variations?
The main problem in all of this is how it dovetails with Windows? The
reliance on 'UTF-16'-style WCHAR seems to be the real problem there?
Btw, I created a sub page for Unicode support:
https://wiki.php.net/ideas/php6/unicode
Thoughts, comments or ideas?
Like you, Pierre, I'm no Unicode expert, and digging deeper simply
reinforces the at times irritating compromises that Unicode contains.
Obviously designed by committee? :(
Currently I'm trying to work out just what is required at the core to
support UTF-8, and while it is not a trivial problem, the bulk of the
code is designed to handle strings of variable length, and in its basic
form UTF-8 just creates longer strings? So isn't the next question quite
simply 'case'? And how we handle case insensitivity in the core will
determine what core Unicode functions are required?
I found another C++ library to do the basic UTF-8 operations, easl:
https://code.google.com/p/easl/
It could be a nice one to use in combination with ICU, small and fast
(1st tests).
C++ ?
That whatever is used will need to be both tailored for PHP and
transparent as far as ICU is concerned is, as you have identified, a
given. ICU is still built using 32-bit string lengths (I think?), which
does add to the fun, but I don't see any reason not to use functions
like compareUTF8() and ucasemap_utf8ToLower() from ICU, in which case
the strings need to be standard ICU UTF-8 strings? I can see the
advantage of the 'fast' compare that I have been banging on about
elsewhere, which looks for a simple match between two raw strings of
bytes. UTF-8 only comes into that when you need to add 'rank'? But much
of the core processing CAN simply ignore that, as long as the generic
calls don't have dead tails which activate it?
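As a concrete (untested) sketch of the ucasemap route mentioned above,
assuming the ICU headers and using the root locale only for lack of a
better default, a lowered UTF-8 "shadow" string for comparisons could be
produced without going through UTF-16:

/* Lower-case a UTF-8 string directly with ICU's UCaseMap; note the result
 * can be longer or shorter in bytes than the input. */
#include <stdio.h>
#include <unicode/ucasemap.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UCaseMap *csm = ucasemap_open("", 0, &status);   /* "" = root locale */
    const char *src = "\xC4\xB0stanbul";             /* "İstanbul" */
    char lower[64];

    if (U_FAILURE(status)) {
        return 1;
    }
    int32_t n = ucasemap_utf8ToLower(csm, lower, (int32_t)sizeof lower,
                                     src, -1, &status);
    if (U_SUCCESS(status)) {
        printf("%.*s (%d bytes)\n", (int)n, lower, (int)n);
    }
    ucasemap_close(csm);
    return 0;
}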
Given the complexity of case conversion, I can see the possible need
for a mirror string holding a 'lower case' version, which may be a
different length, and so 'string' could become a more complex object?
But is this the aspect you are looking to the 'small fast library' to
provide? easl would seem only to be trying to smooth the edges between
Windows and other platforms?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Lester Caine wrote:
Reading between the lines, it would seem that a switch to UTF-8 base is
their preferred path, but the core code is too engrained as UTF-16? Since
there is really no alternative to ICU for the heavy grunt, I do see this as
the right starting point. Any 'bells and whistles' should use the ICU UTF-8
style rather than pulling in yet more variations?
There are optimizations when this flag is used. Not all operations are
possible using UTF-8; in those cases a conversion will be done
beforehand.
There is not much to read between the lines here :)
The main problem in all of this is how it dovetails into windows? The
reliance on 'UTF-16' style WCHAR seems to be the real problem there?
wchar is not UTF-16, nor Unicode. It is something we have to deal with
no matter which road we take. Conversions between UTF-* and wchar will
be required on Windows anyway, for any *W API call.
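For the record, an untested Windows-only sketch of that conversion step
(the path name here is made up); every *W call would be preceded by
something like this regardless of the internal string model we pick:

/* Convert a UTF-8 path to wchar_t before handing it to a *W API. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *utf8_path = "C:\\temp\\r\xC3\xA9sum\xC3\xA9.txt";
    wchar_t wpath[MAX_PATH];

    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8_path, -1, wpath, MAX_PATH);
    if (n == 0) {
        return 1;              /* invalid UTF-8 or buffer too small */
    }
    wprintf(L"%ls\n", wpath);  /* wpath is now usable with CreateFileW & co. */
    return 0;
}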
Btw, I created a sub page for Unicode support:
https://wiki.php.net/ideas/php6/unicode
Thoughts, comments or ideas?
Like you Pierre I'm no Unicode expert, and digging deeper simply reinforces
the at times irritating compromises that Unicode contains. Obviously
designed by committee? :(
Currently I'm trying to work out just what is required at the core to
support UTF-8 and while it is not a trivial problem, the bulk of the code is
designed to handle strings of variable length and in it's basic form UTF-8
just creates longer strings? So isn't the next question quite simply 'case'?
And how we handle case insensitivity in the core will determine what core
Unicode functions are required?
I do not care about case insensitivity yet, nor about Unicode
function/method/constant/etc. names. These are secondary issues at this
stage.
I found another C++ library to do the basic UTF-8 operations, easl:
https://code.google.com/p/easl/
It could be a nice one to use in combination with ICU, small and fast
(1st tests).
C++ ?
Yes, with C helpers.
That what ever is used will need to be both tailored for PHP and transparent
as far as ICU is concerned is as you have identified - a given. ICU is still
built using 32bit string lengths ( I think? ) which does add to the fun, but
I don't see any reason not to be using functions like compareUTF8() and
ucasemap_utf8ToLower() from ICU in which case the strings need to be
standard ICU UTF-8 strings? I can see the advantage of the 'fast' compare
that I have been banging on about elsewhere, which looks for a simple match
between two raw strings of bytes. UTF-8 only comes into that when you need
to add 'rank'? But much of the core processing CAN simply ignore that as
long as the generic calls don't have dead tails which activate it?
We may use our own functions (or another lib) to cover operations not
implemented in ICU, or operations that are too slow because of the
conversions. That's why investigating other tools is still a good thing
to do.
Cheers,
Pierre
Pierre Joye wrote:
We may use our own functions (or another lib) to cover operations not
implemented in ICU, or operations that are too slow because of the
conversions. That's why investigating other tools is still a good thing
to do.
The bit I'm still missing here is 'operations not implemented in ICU'?
As soon as conversions are required then speed is always going to be
compromised, but where the platform is already UTF-8 based, which is a growing
situation, then all we are looking for is to handle UTF-8 strings quickly. For
the best performance conversions can simply be avoided. So I'm currently looking
at conversion as a secondary problem - probably less important than case! - and
just trying to identify what is missing from ICU's UTF-8 that needs to be added?
It may well be that Windows is a special case that needs its own
conversion layer, but that should not form part of any core upgrade. It
is not needed for many installations?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
The bit I'm still missing here is 'operations not implemented in ICU'?
As soon as conversions are required then speed is always going to be
compromised, but where the platform is already UTF-8 based, which is a
growing situation, then all we are looking for is to handle UTF-8 strings
quickly. For the best performance conversions can simply be avoided. So I'm
currently looking at conversion as a secondary problem - probably less
important than case! - and just trying to identify what is missing from
ICU's UTF-8 that needs to be added?
Yes. And to see whether what is available is "fast enough".
It may well be that windows is a special case that needs it's own conversion
layer, but that should not form part of any core upgrade. It is not needed
for many installations?
Sorry if I was not clear earlier. No matter what we do, we will have to
convert to or from UTF-8 for all file-related APIs on Windows anyway.
But that will happen before 6, so it is not really a Unicode problem in
this case; it is more a bug fix to bring Windows in line with Linux
regarding UTF-8 path support (and longer paths). But that's a different
topic.
I do not understand what you mean by "not form part of any core upgrade".
Cheers,
Pierre
@pierrejoye | http://www.libgd.org
Pierre Joye wrote:
It may well be that windows is a special case that needs it's own conversion
layer, but that should not form part of any core upgrade. It is not needed
for many installations?
Sorry if I was not clear earlier. No matter what we do, we will have
to convert to or from UTF-8 for all file related APIs on Windows
anyway. But that will happen before 6, so not really a unicode problem
in this case, more a bug fix to bring windows in line with linux,
about UTF-8 paths support (and longer paths), but that's a different
topic.
I do not understand what you mean by "not form part of any core upgrade".
Actually you have already provided a better answer to that than I could have :)
Handling the remaining Windows-related problems only really relates to
Windows builds? I was naively thinking that the Windows conversions were
part of this, as the easl library seemed to be more involved with that
than with simple UTF-8 handling? I'll put my hand up that I've not
worried about Windows for some time now, but I do have a number of PHP
5.2 sites that will need to be upgraded soon, and the discussion on
Firebird is also about ICU and Windows ... but none of those sites will
ever need anything more than simple ASCII anyway.
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Hi,
the Unicode parts of Oniguruma look small, fast and awesome.
cryptocompress
On Thu, Mar 13, 2014 at 8:28 PM, Crypto Compress <
cryptocompress@googlemail.com> wrote:
unicode parts of oniguruma look small, fast and awesome.
I agree, but it's LGPL...
--
Yasuo Ohgaki
yohgaki@ohgaki.net
On 14.03.2014 00:07, Yasuo Ohgaki wrote:
On Thu, Mar 13, 2014 at 8:28 PM, Crypto Compress
<cryptocompress@googlemail.com> wrote:
unicode parts of oniguruma look small, fast and awesome.
I agree, but it's LGPL...
License: BSD
https://github.com/k-takata/Onigmo
Hi Crypto,
On Fri, Mar 14, 2014 at 4:49 PM, Crypto Compress <
cryptocompress@googlemail.com> wrote:
On Thu, Mar 13, 2014 at 8:28 PM, Crypto Compress <
cryptocompress@googlemail.com> wrote:
unicode parts of oniguruma look small, fast and awesome.
I agree, but it's LGPL...
License: BSD
https://github.com/k-takata/Onigmo
I didn't know this fork.
Thank you :)
I shall try to ask the libmbfl copyright holders if it is possible to
change the license to a BSD-like license.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
I shall try to ask the libmbfl copyright holders if it is possible to
change the license to a BSD-like license.
Did they rewrite everything from scratch? If not, I really do not think
it can be BSD all of a sudden :)
Cheers,
Pierre
@pierrejoye | http://www.libgd.org
On 14.03.2014 09:52, Pierre Joye wrote:
Did they rewrite everything from scratch? If not, I really do not think
it can be BSD all of a sudden :)
Quote from wiki: "Oniguruma (鬼車?) by K. Kosako is a BSD licensed
regular expression library that supports a variety of character encodings."
http://en.wikipedia.org/wiki/Oniguruma
Quote from Webpage: "License: BSD license."
http://www.geocities.jp/kosako3/oniguruma/
Quote from Github: "THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND
CONTRIBUTORS ``AS IS'' AND..."
https://raw.github.com/k-takata/Onigmo/master/enc/utf8.c
It is included in PHP anyway?
Hi all,
Did they rewrite everything from scratch? If not I really do not think
it can be BSD all of a sudden :)
My bad :(
I've checked Oniguruma license. It's BSD :)
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
I shall try to ask the libmbfl copyright holders if it is possible to
change the license to a BSD-like license.
I think that by the time Moriyoshi was working on mbstring-ng, Oniguruma
was LGPL.
I've checked the libmbfl AUTHORS in ext/mbstring. There are too many.
Switching the multibyte filter is easier; I'll use ICU for it. Then there
is no obstacle to building mbstring by default.
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Yasuo Ohgaki wrote:
I've checked libmbfl AUTHORS in ext/mbstring. There are too many.
Switching multibyte filter is easier, I'll use ICU for it. Then there is no
obstacle building mbstring by default.
A slight aside, but relevant regarding the regular expression library
... what is wrong with the Unicode mode of preg? I'd just been using it
without even thinking after moving over from ereg.
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Yasuo Ohgaki wrote:
I've checked libmbfl AUTHORS in ext/mbstring. There are too many.
Switching multibyte filter is easier, I'll use ICU for it. Then there is
no
obstacle building mbstring by default.
Slight aside but relevant re. regular expressions library ... What is
wrong with the unicode mode of preg? I'd just been using it without even
thinking after moving over from ereg.
Nothing is wrong with it, PCRE has very good support for UTF-8 (including
character properties and extended grapheme clusters). Can we just deprecate
mb_ereg? It seems totally useless and just confuses people. If you want to
match regular expressions on non-UTF-8 just do a conversion beforehand (or
use a sane encoding right away, you know).
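To make this concrete, a small untested sketch (assuming libpcre was
built with UTF-8 and Unicode property support, which is what the /u
modifier relies on) matching a Unicode property pattern against a UTF-8
subject:

/* Compile a pattern with UTF-8 and Unicode property support and match it
 * against a UTF-8 subject, roughly what preg_match('/^\p{L}+$/u', ...) does. */
#include <stdio.h>
#include <string.h>
#include <pcre.h>

int main(void)
{
    const char *err;
    int erroffset;
    int ovec[6];
    const char *subject = "caf\xC3\xA9";             /* "café" in UTF-8 */

    pcre *re = pcre_compile("^\\p{L}+$", PCRE_UTF8 | PCRE_UCP,
                            &err, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile error: %s\n", err);
        return 1;
    }
    int rc = pcre_exec(re, NULL, subject, (int)strlen(subject),
                       0, 0, ovec, 6);
    printf("match: %s\n", rc >= 0 ? "yes" : "no");
    pcre_free(re);
    return 0;
}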
Nikita
Nothing is wrong with it, PCRE has very good support for UTF-8 (including
character properties and extended grapheme clusters). Can we just
deprecate
mb_ereg? It seems totally useless and just confuses people. If you want to
match regular expressions on non-UTF-8 just do a conversion beforehand (or
use a sane encoding right away, you know).
Several years ago mb_ereg was slightly faster than PCRE. That could have
changed since then.
Hi all,
On Fri, Mar 14, 2014 at 8:33 PM, Alexey Zakhlestin indeyets@gmail.comwrote:
Nothing is wrong with it, PCRE has very good support for UTF-8 (including
character properties and extended grapheme clusters). Can we just
deprecate
mb_ereg? It seems totally useless and just confuses people. If you want
to
match regular expressions on non-UTF-8 just do a conversion beforehand
(or
use a sane encoding right away, you know).
Several years ago mb_ereg was slightly faster than pcre. It could have
changed since then
Besides the fact that unneeded conversions are better avoided, we should
also consider the case where the encoding is somehow broken. Conversion
should then fail or replace the broken bytes, but that changes the
original data.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Hi Nikita,
Nothing is wrong with it, PCRE has very good support for UTF-8 (including
character properties and extended grapheme clusters). Can we just deprecate
mb_ereg? It seems totally useless and just confuses people. If you want to
match regular expressions on non-UTF-8 just do a conversion beforehand (or
use a sane encoding right away, you know).
Encoding conversion would not always work, e.g. there are a number of
vendor-specific extensions. Therefore, a native-encoding regex is
required for those who need to handle such characters.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net