Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:47298
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain lsces.co.uk from 213.123.20.124 cause and error)
Message-ID: <4B9F4196.9030404@lsces.co.uk>
Date: Tue, 16 Mar 2010 08:30:14 +0000
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100217 Fedora/2.0.3-1.fc12 SeaMonkey/2.0.3
MIME-Version: 1.0
To: PHP internals <internals@lists.php.net>
References: <4B9C9007.1080802@lsces.co.uk> <4B9EC3B2.7070901@zend.com>
In-Reply-To: <4B9EC3B2.7070901@zend.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: lester@lsces.co.uk (Lester Caine)

Stanislav Malyshev wrote:
> Hi!
>
>> What I am probably asking is what was the brick wall PHP6 hit. I was
>> under the impression that there was no agreement on 'switchable or only'
>> to unicode core? ( And those who did write PHP6 books seemed to have
>> their own views on which way the discussions would go ;) ).
>
>  From what I can see, the biggest issues are these:
> 1. Performance - Unicode-based PHP right now requires tons of
> conversions when talking to outside world (like MySQL) which slows down
> the app significantly. Many extensions frequently used by PHP app
> writers (such as mysql, pcre, etc.) do not support UTF-16 properly.
> Also, inflated memory usage hurts scalability a lot.
> 2. Compatibility - it's hard to make existing app works with Unicode and
> doesn't lose in performance or doesn't have any weird scenarios where
> your passwords suddenly stop working because there's an extra recoding
> step in some md5() call.
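
The md5() scenario is easy to reproduce, because a hash works on bytes rather
than characters. A minimal sketch in Python (not PHP), with a made-up password
and ISO-8859-1 assumed as the legacy encoding, showing that an extra recoding
step changes the digest:

```python
import hashlib

# Illustrative value only: eight characters, one of them non-ASCII (a-umlaut).
password = "p\u00e4ssword"

# The same characters become different byte sequences under each encoding.
latin1_bytes = password.encode("iso-8859-1")  # 8 bytes
utf8_bytes = password.encode("utf-8")         # 9 bytes

# md5 hashes the bytes it is given, so the two digests do not match.
latin1_digest = hashlib.md5(latin1_bytes).hexdigest()
utf8_digest = hashlib.md5(utf8_bytes).hexdigest()

print(latin1_digest == utf8_digest)  # False
```

A digest stored from the legacy bytes would therefore stop matching once the
input is recoded before hashing.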

I think there does need to be a proper review of just what the target is.

There are a number of 'unknowns', such as how one identifies the version of
Unicode being used. Differences seem to exist between OSes, which doesn't help
with that problem.

On-disk storage should probably be UTF-8 without any question. Windows' use of
wide strings for some files simply doubles the on-disk storage requirements
for very little gain. And remembering to convert '.reg' files back to normal raw
text so I can read them on the Linux machines adds to the fun.
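
A rough illustration of the doubling, as a Python sketch. The sample text is
made up, and the "utf-16" codec (which writes a 2-byte BOM) stands in for the
Windows wide-string file format:

```python
# 1000 ASCII characters of made-up, mostly-ASCII config-style text.
text = "key=value\n" * 100

as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")  # 2 bytes per character plus a 2-byte BOM

print(len(as_utf8), len(as_utf16))  # 1000 2002

# Converting the wide bytes back to UTF-8 for reading elsewhere.
back_to_utf8 = as_utf16.decode("utf-16").encode("utf-8")
```

For text that is mostly ASCII the wide form is about twice the size; for text
dominated by 3-byte UTF-8 characters the comparison would come out differently.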

In-memory handling of character strings is, I think, where some alternative
methods may be appropriate. Firebird's original UNICODE_FSS collation was 3
bytes per character ( enough for the whole Basic Multilingual Plane; only code
points beyond it need a 4th byte in UTF-8 ;) ) and so all of the character
counting stuff works transparently. Firebird records are automatically
compressed before storage, so white space in character strings is not wasting
space on disk, and the unicode collations get compressed in the same way.
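
The bytes-per-character figure can be checked directly. A small Python sketch
(the sample characters are my own choice) that reports how many bytes UTF-8
uses for code points in different ranges; three bytes cover the Basic
Multilingual Plane, and code points beyond it take four:

```python
# Sample code points, one from each UTF-8 width class (illustrative choices).
samples = {
    "A": "U+0041, ASCII",
    "\u00e9": "U+00E9, Latin-1 range",
    "\u20ac": "U+20AC, euro sign, inside the BMP",
    "\U0001F600": "U+1F600, outside the BMP",
}

# Number of bytes each character occupies once encoded as UTF-8.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, note in samples.items():
    print(widths[ch], "byte(s):", note)
```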

'3' is not a very processor-friendly number, so working with 4, even though
wasteful on memory, does make perfect sense. How long is it since we had a 640k
limit on working memory? SERVERS should have a good amount of memory for caching
information anyway. SO is UTF-16 the right approach for processing wide strings?
It needs special code to handle everything wider than 16 bits, but at what gain
really? If all core functionality is handled as 32 bit characters, is there that
much of an overhead compared with the additional processing needed to get
around characters of dissimilar sizes in UTF-16?
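
To make the trade-off concrete, here is a hedged Python sketch. The helper
names nth_char_utf32 and nth_char_utf16 are hypothetical, and it is an
illustration rather than how any PHP build does it. It finds the n-th
character in both forms: with 32 bit units it is a fixed offset, with UTF-16
the code has to walk the string and step over surrogate pairs.

```python
# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
text = "a\U0001D11Eb"

utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

utf16_units = len(utf16) // 2  # 4: the clef needs a surrogate pair
utf32_units = len(utf32) // 4  # 3: one unit per code point


def nth_char_utf32(data: bytes, n: int) -> str:
    """Fixed width: the n-th character starts at byte offset 4 * n."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")


def nth_char_utf16(data: bytes, n: int) -> str:
    """Variable width: walk from the start, stepping over surrogate pairs."""
    pos = 0
    for _ in range(n):
        unit = int.from_bytes(data[pos:pos + 2], "little")
        # A high surrogate (0xD800-0xDBFF) means this character is 4 bytes.
        pos += 4 if 0xD800 <= unit <= 0xDBFF else 2
    unit = int.from_bytes(data[pos:pos + 2], "little")
    width = 4 if 0xD800 <= unit <= 0xDBFF else 2
    return data[pos:pos + width].decode("utf-16-le")


print(nth_char_utf32(utf32, 2), nth_char_utf16(utf16, 2))  # b b
```

The 32 bit form costs 12 bytes here against 8 for UTF-16, which is the memory
side of the same trade-off.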

Most of my own data handling is done via the database anyway, so queries return
data already sorted and filtered. There is no point pulling unprocessed data
and then throwing much of it away, so the rest of the infrastructure being
used matters for getting the best performance.

Probably 90% of the time a string will come in and go out without requiring any
processing at all, so leave it as UTF-8? The only time we need to know accurately
the number and position of characters is when we need to do some string
processing, and then only if the strings use multibyte characters. SO how about
an additional couple of flags on a string variable? When a UTF-8 string is
loaded, it is counted for bytes, for characters, and for the maximum number of
bytes per character. If bytes and characters are the same ... no problems. If
the number of bytes per character is greater than 1, then string handling needs
to 'open them up' before processing: '2' just uses efficient UTF-16 processing,
while '3+' goes to 32 bit processing?
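
A minimal sketch of that flag idea in Python, purely to show the bookkeeping.
TaggedString, utf8_width, tag and processing_form are hypothetical names, and
the thresholds simply follow the description above:

```python
from dataclasses import dataclass


@dataclass
class TaggedString:
    data: bytes      # the raw UTF-8, left untouched
    byte_count: int
    char_count: int
    max_width: int   # bytes used by the widest character (1 to 4)


def utf8_width(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def tag(data: bytes) -> TaggedString:
    """Count bytes, characters and the widest character in one pass."""
    data.decode("utf-8")  # validate first; raises UnicodeDecodeError if bad
    chars, widest, pos = 0, 1, 0
    while pos < len(data):
        width = utf8_width(data[pos])
        widest = max(widest, width)
        chars += 1
        pos += width
    return TaggedString(data, len(data), chars, widest)


def processing_form(s: TaggedString) -> str:
    """Pick the in-memory form to 'open up' into, per the proposal."""
    if s.max_width == 1:
        return "bytes"    # bytes == characters, nothing to do
    if s.max_width == 2:
        return "utf-16"
    return "utf-32"       # the '3+' case


print(processing_form(tag("caf\u00e9".encode("utf-8"))))  # utf-16
```

One thing the sketch makes visible: 3-byte UTF-8 sequences are all BMP code
points, so they would also fit in a 16 bit unit, and the '3+' threshold could
arguably be '4'. The code keeps the thresholds as proposed.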

Am I missing something? Why does unicode have to complicate things when in
reality they are quite simple? Legacy stuff gets converted to UTF-8 and in many
cases the user will not even see a difference, but the 'unicode on/off' switch
just allows 128 single-byte characters rather than 256? Currently all the
multilingual stuff IS passing through PHP transparently, and it would seem we
can use unicode for variable names. So what IS missing?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php