Newsgroups: php.internals
Date: Tue, 16 Mar 2010 08:30:14 +0000
To: PHP internals
Subject: Re: [PHP-DEV] Where are we ACTUALLY on
Unicode?
From: lester@lsces.co.uk (Lester Caine)

Stanislav Malyshev wrote:
> Hi!
>
>> What I am probably asking is what was the brick wall PHP6 hit. I was
>> under the impression that there was no agreement on 'switchable or only'
>> to unicode core? ( And those who did write PHP6 books seemed to have
>> their own views on which way the discussions would go ;) ).
>
> From what I can see, the biggest issues are these:
> 1. Performance - Unicode-based PHP right now requires tons of
> conversions when talking to outside world (like MySQL) which slows down
> the app significantly. Many extensions frequently used by PHP app
> writers (such as mysql, pcre, etc.) do not support UTF-16 properly.
> Also, inflated memory usage hurts scalability a lot.
> 2. Compatibility - it's hard to make existing app works with Unicode and
> doesn't lose in performance or doesn't have any weird scenarios where
> your passwords suddenly stop working because there's an extra recoding
> step in some md5() call.

I think there does need to be a proper review of just what the target is. There are a number of 'unknowns', such as how one identifies the version of Unicode being used. Differences between OSs do not help with that problem.

On-disk storage should probably be UTF-8 without any question. Windows' use of wide strings for some files simply doubles the on-disk storage requirement for mostly-ASCII text, for very little gain. And remembering to convert '.reg' files back to plain text so I can read them on the Linux machines adds to the fun.

In-memory handling of character strings is, I think, where some alternative methods may be appropriate. Firebird's original UNICODE_FSS collation was 3 bytes per character ( enough for the Basic Multilingual Plane, although full Unicode needs up to 4 bytes in UTF-8 ;) ), so all of the character counting stuff works transparently.
Firebird records are automatically compressed before storage, so white space in character strings does not waste space on disk, and the Unicode collations are compressed in the same way. '3' is not a very processor-friendly number, so working with 4, even though wasteful of memory, makes perfect sense. How long is it since we had a 640k limit on working memory? SERVERS should have a good amount of memory for caching information anyway.

SO is UTF-16 the right approach for processing wide strings? It needs special code to handle everything wider than 16 bits, but for what gain really? If all core functionality is handled as 32-bit characters, is there that much overhead compared with the additional processing needed to get around characters of dissimilar sizes in UTF-16?

Most of my own data handling is done via the database anyway, so queries return data already sorted and filtered. There is no point pulling unprocessed data and then throwing much of it away, so the rest of the infrastructure being used matters for getting the best performance.

Probably 90% of the time a string will come in and go out without requiring any processing at all, so leave it as UTF-8? The only time we need to know the number and position of characters accurately is when we need to do some string processing, and then only if the strings use multibyte characters. SO how about a couple of additional flags on a string variable? When a UTF-8 string is loaded, it is counted for bytes, for characters, and for the maximum number of bytes per character. If bytes and characters are the same ... no problem. If the maximum number of bytes per character is greater than 1, then string handling needs to 'open them up' before processing: '2' just uses efficient UTF-16 processing, while '3+' goes to 32-bit processing?

Am I missing something? Why does unicode have to complicate things when in reality they are quite simple?
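To make the flags idea above concrete, here is a rough sketch in Python rather than in the engine's C. It is an illustration only, not how PHP stores strings. The names Utf8Stats, scan_utf8, pick_representation and open_up are hypothetical, and the little-endian codecs are an arbitrary choice for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Utf8Stats:
    """Hypothetical per-string flags recorded when a UTF-8 string is loaded."""
    byte_count: int      # length of the UTF-8 encoded data
    char_count: int      # number of code points
    max_char_bytes: int  # widest encoded character, 0 for an empty string


def scan_utf8(data: bytes) -> Utf8Stats:
    """Count bytes, characters and the widest character of a UTF-8 string."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    widest = max((len(ch.encode("utf-8")) for ch in text), default=0)
    return Utf8Stats(len(data), len(text), widest)


def pick_representation(stats: Utf8Stats) -> str:
    """Choose the working form using the thresholds proposed in the post."""
    if stats.max_char_bytes <= 1:
        return "bytes"   # bytes == characters, so nothing needs converting
    if stats.max_char_bytes == 2:
        return "utf-16"  # every character fits in one 16-bit unit
    return "utf-32"      # 3+ bytes per character: go to 32-bit units


def open_up(data: bytes) -> Tuple[str, bytes]:
    """Convert only when string processing is requested and it is needed."""
    kind = pick_representation(scan_utf8(data))
    if kind == "bytes":
        return kind, data
    codec = "utf-16-le" if kind == "utf-16" else "utf-32-le"
    return kind, data.decode("utf-8").encode(codec)


print(scan_utf8("café".encode("utf-8")))
print(open_up("€5".encode("utf-8"))[0])
```

A string that is only passed through never calls open_up, so it stays as UTF-8 bytes. One thing the sketch brings out: 3-byte UTF-8 sequences are all in the Basic Multilingual Plane and would also fit a single 16-bit unit, so the '3+' threshold could arguably be '4' instead.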
Legacy stuff gets converted to UTF-8, and in many cases the user will not even see a difference, but the 'unicode on/off' switch just allows 127 single-byte characters rather than 255? Currently all the multilingual stuff IS passing through PHP transparently, and it would seem we can use unicode for variable names. So what IS missing?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php