From Andrei Zmievski last year:
"In PHP 6, everything by default will be Unicode," such as default string types, said PHP core developer Andrei Zmievski during a keynote presentation at the 2009 Zend/PHP conference in San Jose, Calif. The PHP 6 platform also will feature the ability to use Unicode characters for identifiers.
I must say that I was still under the impression that USING Unicode in PHP5 was
not possible? I know that with care some language-related constructs can be
used, but thought that these are not readily transferred between installations?
I presume that the 'examples' of internationalization actually only work because
PHP5 sees them as a strange ASCII string rather than a true Unicode string? And
if you see them as 'unicode' it is because 'your email client can handle them'
rather than because PHP5 can?
From my own view point, as I have said, using strange names for identifiers
will probably not happen, but I am now using Unicode to store customer
names and other data which in the past have been messed up when 'code pages'
were being used. I can nowadays simply cut and paste those details from PayPal
or wherever and use them, but if I access that data directly from PHP5 things
become a little hit or miss. Quite often content gets corrupted because it has
not been 'unmangled and remangled', while the database just handles it
transparently.
What I am probably asking is: what was the brick wall PHP6 hit? I was under the
impression that there was no agreement on a 'switchable or Unicode-only' core?
( And those who did write PHP6 books seemed to have their own views on which way
the discussions would go ;) ).
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php
If Unicode were the solution, the PHP project was on the right page with 6.0.
Sure there remained work to do, but...
How long did it take to realize UTF-16 wasn't the end of the story? UCS-4 is
the minimum to solve this properly, and we all agree that the western world
will not spend 32 bits storing a single char, no way, no how.
The UTF-8 solution is probably the right answer... you maintain 95% of char *
behavior, and you gain international character representation. The only Unicode
OS I can think of offhand is NT, and of course they hit the limits of UCS-2
early. They found this out 15+ years ago.
Sure it doesn't appear as atomic, one fixed-size word per char, but the
existing library frameworks contain most of the string processing that is
required. There is no 16-bit network transmission API that I can think of; you
are still devolving to UTF-8 for client results.
To move forward with accepting -and preferring- UTF-8 as the representation of
characters throughout PHP, recognizing UTF-8 for char-length representations,
and so forth, would do wonders. And 8-bit octet data can be set aside in the
same data structures. It is the straightforward answer, which is probably why
Linux did not repeat Windows NT's decision, and adopted UTF-8.
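(To make the "95% of char * behavior" point concrete, here is a minimal C
sketch - purely illustrative, not from any PHP source. Because no UTF-8
multi-byte sequence ever contains a byte below 0x80, byte-oriented routines
such as strchr() keep working whenever the delimiter is ASCII:)

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "naïve,héllo" as UTF-8 bytes; the accented letters are
           multi-byte, but ',' (0x2C) can never appear inside a
           multi-byte sequence, so strchr() is still safe. */
        char buf[] = "na\xC3\xAFve,h\xC3\xA9llo";

        char *comma = strchr(buf, ',');
        *comma = '\0';
        printf("%s | %s\n", buf, comma + 1); /* naïve | héllo */
        return 0;
    }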
Hi,
UTF8 is good for text that contains mostly ASCII chars and the occasional
Unicode international chars. It's also generally ok for storing and passing
strings between apps.
However, it's a really poor representation of a string in memory, as a code
point can vary between 1 and 4 bytes. Doing simple operations like
$string[$x] means you need to walk and interpret the string from the start
until you count up to the code point you need.
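(A minimal C sketch of that walk, with a hypothetical helper name: indexing
code point $x costs O(x) in UTF-8 because every earlier sequence has to be
skipped:)

    #include <stddef.h>

    /* Return a pointer to the x-th code point of a NUL-terminated,
       valid UTF-8 string, or NULL if the string is too short.
       Continuation bytes look like 10xxxxxx (0x80..0xBF). */
    static const char *utf8_index(const char *s, size_t x) {
        for (; *s; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80) { /* lead byte */
                if (x-- == 0)
                    return s;
            }
        }
        return NULL;
    }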
UTF8 also takes 4 bytes for representing characters in the higher planes,
as quite a lot of bits are lost for every char in order to describe how long
the code point is, where it ends and so on. This means memory-wise it may not
be of big benefit to Asian countries.
Since the western world, as you put it, wouldn't want to waste 4 bytes for
characters they use that fit in 1 byte, we could opt to store the encoding
of a string as a byte enumerating all possible encodings supported by PHP (I
believe they're fewer than 255...), so the string functions know how to
operate on and convert between them.
This means you can use Unicode only when you need it, which reduces the
impact of using full 4 bytes per code point, as you can still use Latin-1
1-byte encoding and freely mix it with Unicode, and still produce UTF8
output in the end, for the web (the final output encoding to UTF8 from
anything is cheap).
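(A sketch of what such a tagged string could look like; the names here are
invented for illustration and bear no relation to the actual zval layout:)

    #include <stddef.h>
    #include <stdint.h>

    /* One byte identifies the encoding; string functions dispatch
       on it and only convert when two operands disagree. */
    enum str_enc { ENC_LATIN1, ENC_UTF8, ENC_UTF16 /* , ... */ };

    struct tagged_str {
        uint8_t enc;   /* one of str_enc, < 255 encodings */
        size_t  len;   /* length in bytes */
        char   *data;
    };

    size_t str_codepoints(const struct tagged_str *s) {
        if (s->enc == ENC_LATIN1)
            return s->len;             /* one byte per character */
        if (s->enc == ENC_UTF8) {      /* count lead bytes */
            size_t i, n = 0;
            for (i = 0; i < s->len; i++)
                if (((unsigned char)s->data[i] & 0xC0) != 0x80)
                    n++;
            return n;
        }
        return 0;                      /* others omitted in this sketch */
    }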
Another alternative is doing what JavaScript does. JavaScript uses a 2-byte
encoding for Unicode, and when a code point needs more than 2 bytes, it's
encoded in 4 bytes. JavaScript will count that code point as 2 chars,
although it's technically one code point. It's awkward, but since PHP is a
web language, consistency with JavaScript may even be beneficial. It also
solves the $string[$x] problem as you no longer need to walk the string: you
just blindly read the 2 bytes at string start + 2 * $x.
With this approach, all characters in the BMP will report correct offsets
with char index and substr functions as they fit in 2 bytes. Workarounds and
helper functions can be introduced for handling 4 byte codepoints for the
other planes.
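(A C sketch of those JavaScript-style semantics - illustrative only: the
index is a constant-time load of code unit $x, and a code point outside the
BMP simply shows up as two units:)

    #include <stddef.h>
    #include <stdint.h>

    /* O(1) "charAt": returns UTF-16 code unit x. A code point
       outside the BMP occupies two units (a surrogate pair), and
       this happily returns half of one, just as JavaScript does. */
    uint16_t u16_unit_at(const uint16_t *s, size_t x) {
        return s[x];
    }

    /* Helper for the workaround functions mentioned above: is unit
       x the first half of a surrogate pair? */
    int u16_starts_pair(const uint16_t *s, size_t x) {
        return s[x] >= 0xD800 && s[x] <= 0xDBFF;
    }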
It of course makes certain operations harder, such as character ranges
between two 4-byte codepoints in regex will produce unexpected results, and
regex will see these chars:
[2bytes2bytes-2bytes2bytes] i.e.: [a b-c d]
and not this:
[4bytes-4bytes]
Still, having a variable-width encoding, UTF8 or UTF16, doesn't cut it for
general use to me, as in tests it shows drastic slowdown when the script
needs to do heavy string processing. I'd rather have it take more RAM for
Unicode strings while being fast, and use Latin-1 when what I need is
Latin-1.
Regards,
Stan Vassilev
Hi,
UTF8 is good for text that contains mostly ASCII chars and the occasional
Unicode international chars. It's also generally ok for storing and passing
strings between apps.
That's not completely correct. UTF-8 is used out there for almost-Unicode-only
applications as well. I'd say it is a matter of what the projects are written
for. See below.
Still, having variable-width encoding UTF8 or UTF16 doesn't cut it for
general use to me as in tests it shows drastic slowdown when the script
needs to do heavy string processing. I'd rather have it take more RAM for
Unicode strings while being fast, and use Latin-1 when what I need is
Latin-1.
The problem I have with UTF-16 is that it does not fit well with PHP
usage. While you are right about the performance vs memory usage trade-off,
that is sadly a small part of the problem. If you take a look at the
current implementation (trunk, which uses UTF-16), we have to convert
to UTF-8 almost everywhere as long as we deal with external APIs (file
systems or other libs). The win we may have from using UTF-16 is
almost completely lost to the conversion costs.
That obviously does not apply to scripts using only core PHP features
(no file access, no extension usage, etc.), but such scripts are
hardly real-world use cases.
Please note that I'm not voting against UTF-16 or for UTF-8, but I
would like to have a real evaluation this time, unlike what was done
for trunk a couple of years ago.
Cheers,
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
UTF8 also takes 4 bytes for representing characters in the higher bit
planes, as quite a lot of bits are lost for every char in order to describe
how long the code point is, and when it ends and so on. This means
memory-wise it may not be of big benefit to asian countries.
I remember Brian Aker saying that they chose to work internally with
UTF-8 for Drizzle. His explanation was that Asian countries have
so much English content mixed in that on average, even for them, UTF-8
still had a lower footprint than UTF-16/32. I do not know where the
stats came from, but if it holds any truth it is worth considering.
Cheers,
Jordi
I remember Brian Aker saying that they chose to work internally with
UTF-8 for Drizzle. [...]
The idea behind his reasoning was about optimizing the 90% of the
cases while being "fast enough" for the last 10% (could have been
other numbers, but that's the idea). From what I remember of our
discussions, he also mentioned a fast UTF-8-capable string processing
implementation (as fast as what UTF-16 could be). I like this
90/10 approach, especially as it actually matches what we have in PHP.
Cheers,
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
I remember Brian Aker saying that they chose to work internally with
UTF-8 for Drizzle. His explanation was that Asian countries have
so much English content mixed in that on average, even for them, UTF-8
still had a lower footprint than UTF-16/32.
This is true, as most of the text data interchanged on the Internet is
represented as HTML, in which such characters and ASCII markup tags
always alternate.
Moriyoshi
Moriyoshi
Hi,
I used to work a job where we used UTF-16 for embedded applications.
Our company chose UTF-16 over UTF-8 because it was byte-aligned and
therefore faster / more efficient to process than UTF-8. However
there's no reason why UTF-8 has to be drastically slower. The truth is,
even we could have used UTF-8 there. And I don't buy the whole byte
size / memory thing either. Even in our restricted embedded
environments, that was never a consideration anyway. Because a
well-written program won't bloat memory by holding too many strings.
That's what MySQL is for.
Apple uses UTF-16 for CFString / NSString data. But elsewhere (and on
the web!) most people use UTF-8. Pretty much.
You should implement UTF-8, with a view to still allowing UTF-16
support to be added later on. That is to say, the encoding should be
wrapped, and switchable underneath. Of course all that is easier said
than done with PHP. But that's the right way to do it.
You should implement UTF-8, with a view to still allowing UTF-16
support to be added later on. That is to say, the encoding should be
wrapped, and switchable underneath. Of course all that is easier said
than done with PHP. But that's the right way to do it.
I think you misunderstand, and probably there are others too…
The discussion is not about which encodings should be supported and which should not. PHP6 in its current form supports pretty much all encodings there are. The discussion is about which encoding should be taken as the "internal representation". Currently, PHP6 uses UTF-16 for this purpose.
Hi!
What I am probably asking is what was the brick wall PHP6 hit. I was
under the impression that there was no agreement on 'switchable or only'
to unicode core? ( And those who did write PHP6 books seemed to have
their own views on which way the discussions would go ;) ).
From what I can see, the biggest issues are these:
- Performance - Unicode-based PHP right now requires tons of
conversions when talking to the outside world (like MySQL), which slows down
the app significantly. Many extensions frequently used by PHP app
writers (such as mysql, pcre, etc.) do not support UTF-16 properly.
Also, inflated memory usage hurts scalability a lot.
- Compatibility - it's hard to make an existing app work with Unicode
without losing performance or hitting weird scenarios where
your passwords suddenly stop working because there's an extra recoding
step in some md5() call.
--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
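(The md5() point is easy to make concrete: a byte-level hash sees completely
different input once a recode step slips in. Illustrative C, showing only the
bytes - 'é' recoded from UTF-8 to UTF-16LE:)

    #include <stdio.h>

    int main(void) {
        /* The character 'é' as raw bytes in two encodings: */
        unsigned char u8[]  = { 0xC3, 0xA9 };  /* UTF-8 */
        unsigned char u16[] = { 0xE9, 0x00 };  /* UTF-16LE */

        /* md5(u8) != md5(u16): same character, different bytes,
           so a password hash stored before the recode no longer
           matches the one computed after it. */
        printf("utf-8: %02X %02X / utf-16le: %02X %02X\n",
               u8[0], u8[1], u16[0], u16[1]);
        return 0;
    }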
Stanislav Malyshev wrote:
From what I can see, the biggest issues are these:
- Performance - Unicode-based PHP right now requires tons of conversions
when talking to the outside world (like MySQL) [...]
- Compatibility - it's hard to make an existing app work with Unicode [...]
I think that there does need to be a proper review of just what the target is.
There are a number of 'unknowns', such as how does one identify the version of
Unicode being used? Differences seem to exist between OSs, which doesn't help
with that problem.
On-disk storage should probably be UTF-8 without any question? Windows' use of
wide strings for some files simply doubles the on-disk storage requirements
for very little gain. And remembering to convert '.reg' files back to normal
raw text so I can read them on the Linux machines adds to the fun.
In-memory handling of character strings is, I think, where some alternative
methods may be appropriate. Firebird's original UNICODE_FSS collation was 3
bytes per character ( that IS the limit for the BMP ;) ) and so all of the
character counting stuff works transparently. Firebird records are
automatically compressed before storage, so white space in character strings
is not wasting space on disk, and the Unicode collations get compressed in
the same way.
'3' is not a very processor-friendly number, so working with 4, even though
wasteful on memory, does make perfect sense. How long is it since we had a 640k
limit on working memory? SERVERS should have a good amount of memory for
caching information anyway. So is UTF-16 the right approach for processing wide
strings? It needs special code to handle everything wider than 16 bits, but at
what gain really? If all core functionality is handled as 32-bit characters,
is there that much of an overhead compared with the additional processing
needed to get around strings of dissimilar sizes in UTF-16?
Most of my own data handling is done via the database anyway, so queries return
data already sorted and filtered. There is no point pulling unprocessed data
and then throwing much of it away, hence the rest of the infrastructure being
used is important to get the best performance.
Probably 90% of the time a string will come in and go out without requiring any
processing at all, so leave it as UTF-8? The only time we need to accurately
know the number and position of characters is when we need to do some string
processing, and then only if the strings use multi-byte characters. So how
about an additional couple of flags on a string variable? When a UTF-8 string
is loaded, it is counted for bytes, characters, and bytes per character. If
bytes and characters are the same ... no problems. If the number of bytes per
character is greater than 1, then string handling needs to 'open them up'
before processing: '2' just uses an efficient UTF-16 processing, while '3+'
goes to 32-bit processing?
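(A sketch of that classification pass - illustrative, assuming already-valid
UTF-8. One scan yields the byte count, the character count and the widest
sequence seen. One caveat: a 3-byte UTF-8 sequence is still a single UTF-16
unit, since it is inside the BMP; only 4-byte sequences need surrogates.)

    #include <stddef.h>

    /* One pass over valid UTF-8: bytes, code points, and the widest
       sequence seen (1 = plain single-byte handling is enough). */
    void utf8_classify(const char *s, size_t *bytes,
                       size_t *chars, int *max_seq) {
        size_t b = 0, c = 0;
        int m = 1;
        for (; s[b]; b++) {
            unsigned char ch = (unsigned char)s[b];
            if ((ch & 0xC0) != 0x80) {            /* lead byte */
                int seq = ch < 0x80 ? 1 : ch < 0xE0 ? 2
                        : ch < 0xF0 ? 3 : 4;
                if (seq > m)
                    m = seq;
                c++;
            }
        }
        *bytes = b;
        *chars = c;
        *max_seq = m;
    }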
Am I missing something? Why does Unicode have to complicate things when in
reality they are quite simple? Legacy stuff gets converted to UTF-8 and in many
cases the user will not even see a difference, and the 'unicode on/off' switch
just allows 127 single-byte characters rather than 255? Currently all the
multilingual stuff IS passing through PHP transparently, and it would seem we
can use Unicode for variable names? So what IS missing?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php
'3' is not a very processor-friendly number, so working with 4, even though
wasteful on memory, does make perfect sense. [...] So is UTF-16 the right
approach for processing wide strings?
Just to reinforce some of Lester's points above here.
4 bytes per character is never slower than 2 bytes per character... it's
faster if anything. Bear in mind that 4 bytes has been the de facto size
of all modern CPU registers / 32-bit microarchitectures since....
like... forever. Give a C compiler 4 bytes of data... it'll say: thank
you very much, and more of the same please! It keeps 'em happy ;)
Sure UTF-16 can make sense. But only if your external representations
are also in UTF-16. So what's the default Unicode setting for MySQL,
Postgres, etc? Well, are they always set to UTF-8, or UTF-16?
Just do the same as them.
dreamcat four wrote:
Sure UTF-16 can make sense. But only if your external representations
are also in UTF-16. So what's the default Unicode setting for MySQL,
Postgres, etc? Well, are they always set to UTF-8, or UTF-16?
Just do the same as them.
All MySQL GA versions (not including the upcoming 5.5, which is not GA)
can't eat UTF-16 queries but can receive UTF-16 results (although all
MySQL GA releases that know character sets - 4.1, 5.0, 5.1 - don't know
anything about UTF-16, only UCS-2, which covers the characters in the
BMP). It is probable (I can't say definitely due to Oracle's recognition
rules) that 5.5 will have proper UTF-16. UTF-16 has its advantages.
If your Unicode data includes mostly ASCII characters and here and there
some non-ASCII ones, then UTF-8 should be the choice - less disk space
used, which means the HDD can read more data, which in turn means more
table rows served per second.
Converting in the client (PHP) is OK, as it scales: just throw in some more
web servers. Scaling an RDBMS is a completely different story.
Best,
Andrey
So what's the default Unicode setting for MySQL, Postgres, etc? Well, are
they always set to UTF-8, or UTF-16?
To answer my own question, I have done some further research.
It seems that MySQL and Postgres recommend / default to latin1
(an 8-bit ASCII superset) and 'C' (7-bit ASCII) respectively. So that is to
say neither sets itself to any Unicode standard by default.
In the case of Postgres, the ASCII default is often overridden to UTF-8
by the distro / OS / package managers, from the $LOCALE environment
variable. So then it's UTF-8.
In the case of MySQL, it may be left as latin1. But most competent web
developers decide to set it to utf-8. Again, it's not generally
believed that very many people (by comparison) actively choose
utf-16. The most common encoding issue people run into is that their
web application has sent their database utf-8 encoded data, but their
(usually MySQL) database still has the factory default encoding,
latin1. People who discover this almost always solve
the problem by converting their databases to utf-8.
As for text files on disk, if they are unicode, they are most commonly
utf-8 too. So then, why use utf-16 as the internal Unicode representation
in PHP? It doesn't really make a lot of sense for most regular people
who want to use PHP for their web application. Unless they don't
really care how slow it's going to be converting everything, constantly...
dreamcat four wrote:
In the case of MySQL, it may be left as latin1. But most competent web
developers decide to set it to utf-8. [...] People who discover this almost
always solve the problem by converting their databases to utf-8.
MySQL doesn't support UTF-16 in any GA release. UCS-2 can be used though.
Andrey
So then, why use utf-16 as the internal Unicode representation in PHP?
Well, the obvious original reason is that ICU uses UTF-16 internally and
the logic was that we would be going in and out of ICU to do all the
various Unicode operations many more times than we would be interfacing
with external things like MySQL or files on disk. You generally only
read or write a string once from an external source, but you may perform
multiple Unicode operations on that same string so avoiding a conversion
for each operation seems logical.
-Rasmus
Rasmus Lerdorf wrote:
Well, the obvious original reason is that ICU uses UTF-16 internally and
the logic was that we would be going in and out of ICU to do all the
various Unicode operations many more times than we would be interfacing
with external things like MySQL or files on disk.
Which begs the question - is ICU actually the right base?
But I'd still like some feedback on my idea: until an operation needs to be
able to handle multi-byte character string processing, why not simply stay in
UTF-8? No reason why a string variable can't be converted only when needed,
and then dropped back to UTF-8 if needed later? And if the user is only using
single-byte characters then the multi-byte stuff never kicks in anyway. If you
NEED raw speed, use the basic character set.
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php
You generally only read or write a string once from an external source,
but you may perform multiple Unicode operations on that same string so
avoiding a conversion for each operation seems logical.
It's only logical if you've bothered to profile the conversion calls to
ICU against the non-ICU conversion calls. I'm guessing the way to do
that is to have 2 versions of each conversion method: one used by
ICU, and another used everywhere else. The harder part is to find some
suitable, real-life PHP programs to test with.
It's only logical if you've bothered to profile the conversion calls to
ICU against the non-ICU conversion calls. [...] The harder part is to find
some suitable, real-life PHP programs to test with.
You mean check to see how many actual Unicode operations a standard app
makes? We did talk about that, but there is a bit of a chicken-and-egg
problem here. Because PHP doesn't natively support Unicode, people
write apps in a way that lets them just pass Unicode through PHP and
deal with it elsewhere. I would expect the profile to change once PHP
gets better support for Unicode.
But yes, some ideas around lazy conversions and other tricks would be
interesting. If your input and output encoding are both utf-8 and all
your data sources are utf-8 and you never do any sort of string
manipulation on a particular string, why bother doing the utf-8 to
utf-16 conversion on that string.
-Rasmus
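(A sketch of such a lazy scheme - hypothetical names, nothing like the real
engine structures; u8_to_u16() stands in for whatever converter, e.g. ICU,
would actually be used:)

    #include <stddef.h>
    #include <stdint.h>

    struct lazy_str {
        char     *utf8;    /* authoritative form, as received */
        size_t    u8len;
        uint16_t *utf16;   /* NULL until some string op needs it */
        size_t    u16len;
    };

    /* Hypothetical converter (in PHP 6 this role fell to ICU). */
    uint16_t *u8_to_u16(const char *s, size_t n, size_t *outlen);

    const uint16_t *lazy_utf16(struct lazy_str *s) {
        if (s->utf16 == NULL)  /* convert once, on first use */
            s->utf16 = u8_to_u16(s->utf8, s->u8len, &s->u16len);
        return s->utf16;
    }
    /* Echoing an untouched string uses s->utf8 directly and never
       pays for a conversion. */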
Rasmus Lerdorf wrote:
But yes, some ideas around lazy conversions and other tricks would be
interesting. If your input and output encoding are both utf-8 and all
your data sources are utf-8 and you never do any sort of string
manipulation on a particular string, why bother doing the utf-8 to
utf-16 conversion on that string.
I think that is what I said originally ;)
When a string is read in, you set an extra flag if it needs special handling,
otherwise you just handle it as a single-byte-per-character string ... and for
the diehards you add a switch to treat everything as it is now :)
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php
Well, the obvious original reason is that ICU uses UTF-16 internally [...]
you may perform multiple Unicode operations on that same string so avoiding
a conversion for each operation seems logical.
Exactly, that's why I was not so affirmative about using UTF-8 over
UTF-16. I would like to evaluate both solutions with a small set of
PHP features (say some file ops, 1-2 DBs and part of the core string
functions) and see the impact of using UTF-8 or UTF-16. But it is
definitely not a small decision.
--
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
Sure UTF-16 can make sense. But only if your external representations
are also in UTF-16. So whats the default Unicode settings for MYSQL,
POSTGRE, etc? Well, are they always set to UTF-8, or UTF-16?
This is a very good point. The PHP project consumes some 30-odd libraries
through its extensions. How many do UTF-8? How many do UCS-2? UTF-16?
Hi!
On disk storage should probably be UTF-8 without any question? Windows
use of widestrings for some files simple doubles up the on disk storage
As file content, it's OK (and it'd be easy to add an option to specify
content transformation if we wanted), but prescribing filenames as UTF-8
would probably not be workable, since different systems (and maybe even
different filesystems inside the same OS?) can have different opinions on
that.
'3' is not a very processor friendly number, so working with 4 even
though wasteful on memory, does make perfect sense. How long is it since
I'm not sure it does. Most PHP strings are short, so the memory loss
would be very significant. Also, take into account that CPU caches
aren't as big as the main memory, and not fitting your data into the
cache is expensive.
we had a 640k limit on working memory? SERVERS should have a good amount
It doesn't matter how much memory you have, in numbers. Until we find an
unlimited source of computer memory left by the aliens in the Himalayas,
memory costs money. However many gigs you have, you'll be able to run
three times fewer PHP processes on the same hardware with the new version
than with the old one, which means the new PHP would cost you more to run.
"Memory is cheap" is a very misunderstood expression - it's only cheap if
you always have much more than you need.
Probably 90% of the time a string will come in and go out without
requiring any processing at all, so leave it as UTF-8 ? The only time we
It might be great if we could do that. The problem might be that right now,
AFAIK, we don't have a good library to work with utf-8 strings (please
correct me if I'm wrong here).
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
It might be great if we could do that. The problem might be that right now,
AFAIK, we don't have a good library to work with utf-8 strings (please
correct me if I'm wrong here).
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/strings/icu_utf8.html
From the ICU 3.6 changelog: "The UTF-8 transformation functions and
macros are faster."
From 4.2: "UTF-8 friendly internal data structure for Unicode data lookup."
So it seems that the guys at ICU are trying to close the gap between
UTF-16 and UTF-8 performance, so maybe it would be a good idea to
check out the current situation.
Tyrael
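(For anyone checking that out: iterating a UTF-8 buffer with ICU4C's utf8.h
macros looks roughly like this - error handling trimmed:)

    #include <unicode/utf8.h>  /* ICU4C: U8_NEXT */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *s = "h\xC3\xA9llo";  /* "héllo" in UTF-8 */
        int32_t i = 0, len = (int32_t)strlen(s);
        while (i < len) {
            UChar32 c;
            U8_NEXT(s, i, len, c);  /* advances i one code point */
            if (c < 0)
                break;              /* negative c = malformed UTF-8 */
            printf("U+%04X\n", (unsigned)c);
        }
        return 0;
    }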
And remember,
it's not just the number of times it's sent to ICU for conversion. It's
also the number of times your UTF-16 string has to be converted back
into utf-8 afterwards. This is why Apple makes its utf-16 strings
immutable. So they are read-only, and the utf-8 representation can be
cached afterward.
Think of it this way:
- Load a utf-8 string from DB or file
- Convert it to utf-16
- Perform ICU conversions 3-5 times
- Page gets hit by memcache
- utf-16 is converted back to utf-8
- Something changes (string was cached?) - need to spit out another
utf-8 version of the string again
And a persistent web application can be held for many hours in memory.
Are we converting back to utf-8 every time? Then it might be better to
wrap the string conversions just around ICU.
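(A sketch of that caching idea - illustrative only, not CFString's actual
internals; u16_to_u8() is a hypothetical converter. The cache is only safe
while the UTF-16 payload cannot change, which is exactly why immutability
helps:)

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct cached_str {
        uint16_t *utf16;  /* the string itself */
        size_t    u16len;
        char     *utf8;   /* NULL until first requested */
        size_t    u8len;
    };

    char *u16_to_u8(const uint16_t *s, size_t n, size_t *outlen);

    const char *cached_utf8(struct cached_str *s) {
        if (s->utf8 == NULL)  /* convert once, then reuse */
            s->utf8 = u16_to_u8(s->utf16, s->u16len, &s->u8len);
        return s->utf8;
    }

    /* A mutable string would have to invalidate the cache on every
       write, and the next output pays for a fresh conversion: */
    void cached_set_unit(struct cached_str *s, size_t x, uint16_t v) {
        s->utf16[x] = v;
        free(s->utf8);
        s->utf8 = NULL;
    }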
I'd suggest selecting a real (but still as easy-to-work-with as can be
found) Unicode PHP app, one that has been written to use a Unicode PHP
module, then getting a single, representative page from it. By that I
mean the kind of page that gets accessed the most. So for IMDb that
would be a movie's page, etc. The smallest 'slice' of the app, not the
whole thing. Dummy-out the other stuff.
Then convert that part (for rendering one page) to the current php6
unicode scheme, and we can see what's what.
I'd suggest selecting a real (but still as easy-to-work-with as can be
found) Unicode PHP app, one that has been written to use a Unicode PHP
module, then getting a single, representative page from it.
I would choose the MediaWiki software for this kind of test: it works in a
really internationalized environment, plus I have seen the main developer
of the MediaWiki/Wikipedia application posting/contributing on this
mailing list.
But that's just my two cents.
Tyrael
Hi,
I do not claim to be able to add anything to the discussion content-wise, but organization-wise (just invented this word for no good reason): while the brainstorming is useful, I urge people to start structuring their ideas and proposals so that they can be referenced later and do not get lost in the archives. Again, the wiki is the perfect place for this. This will also help others collaborate in moving an idea to a full-fledged proposal .. ideally with some numbers to back up any performance claims.
regards,
Lukas Kahwe Smith
mls@pooteeweet.org