Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:72614
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain lsces.co.uk from 217.147.176.204 cause and error)
Message-ID: <52FE856A.8000003@lsces.co.uk>
Date: Fri, 14 Feb 2014 21:06:50 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0 SeaMonkey/2.24
MIME-Version: 1.0
To: internals@lists.php.net
References: <50100EC8.3040102@ajf.me> <CAH-PCH5BNg6i2TcZbiWbi0bYKkSd9i8q5e3nfniS=aFL9V+fXA@mail.gmail.com> <52FDF7BC.8050408@lsces.co.uk> <52FE46D2.4060903@gmail.com> <52FE6FEA.5050204@lerdorf.com>
In-Reply-To: <52FE6FEA.5050204@lerdorf.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] PHP6 wiki page
From: lester@lsces.co.uk (Lester Caine)

Rasmus Lerdorf wrote:
> What we really need is an awesome small and fast Unicode library that
> does everything ICU does but faster and in less code while using UTF-8
> as its internal storage so we don't have to convert on each and every
> operation. There are a ton of non-obvious things beyond simple string
> manipulation. String collation alone is massively complicated, for example.

Surely the bottom line is that ICU has to be used to cover every fine detail, as 
the smaller libraries tend to make a few assumptions to make life easy? But my 
point was that most of the time you only need the simple stuff: simply using 
UTF-8 strings in place of the byte-based ones in all of the relevant string 
functions?

Remove the need to 'lowercase' by dropping case-insensitivity and things are 
simplified somewhat? I've finally found the comment I was looking for while 
searching around ... "UTF-8 is specially designed so that many byte-oriented 
string functions continue to work or only need minor modifications."
This is why people can put Unicode characters in many places in PHP now without 
it actually breaking?
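
To make that concrete, here is a rough sketch in Python (not PHP, and not 
anything from the PHP source) of which byte-oriented operations survive on 
UTF-8 and which need the "minor modification". The helper name clip_utf8 is 
made up for illustration, and it assumes the input is already valid UTF-8.

```python
def clip_utf8(data: bytes, max_bytes: int) -> bytes:
    """Clip to at most max_bytes without splitting a multi-byte sequence.

    Hypothetical helper; assumes `data` is already valid UTF-8.
    """
    if max_bytes >= len(data):
        return data
    end = max(max_bytes, 0)
    # Continuation bytes have the bit pattern 10xxxxxx. Back up until the
    # byte at `end` is the start of a code point, so the cut lands on a
    # boundary.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


text = "na\u00efve caf\u00e9"      # 10 code points
raw = text.encode("utf-8")          # 12 bytes: two characters take 2 bytes each

# A plain byte search still finds the right place, because a lead byte can
# never be mistaken for a continuation byte.
needle = "caf\u00e9".encode("utf-8")
byte_pos = raw.find(needle)
prefix = raw[:byte_pos].decode("utf-8")

# A naive byte cut at offset 3 would split the 2-byte sequence for U+00EF;
# the boundary-aware clip backs up to offset 2 instead.
clipped = clip_utf8(raw, 3)
```

So searching and concatenation carry over unchanged, while length and 
clipping are the places where the byte count and the character count part 
company.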

I've seen a few comments about switching to C++, and 
http://utfcpp.sourceforge.net/ caught my eye. 
http://www.public-software-group.org/utf8proc-documentation came to light when I 
started looking at NFD/NFC. I've been looking for a suitable Unicode string 
handler for doing substring clipping and all of that. Am I right in thinking 
that mbstring is basically overkill if everything being worked with has already 
been converted to UTF-8? While I was aware of accent code points, I'd not quite 
appreciated how complicated they can get. Up until now I've just been looking at 
text cut and pasted from UTF-8 messages.
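
As a small illustration of how the accent code points complicate things, here 
is a sketch using Python's standard unicodedata module (again Python rather 
than PHP, purely as an example). It shows the same visible character in its 
composed and decomposed normalization forms (NFC and NFD), and what a clip by 
code point does to the decomposed one.

```python
import unicodedata

composed = "\u00e9"        # e with acute accent as a single code point
decomposed = "e\u0301"     # plain 'e' followed by a combining acute accent

# The two render the same but do not compare equal as raw strings.
same_raw = composed == decomposed

# Normalizing both to one form makes them comparable.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Lengths differ in code points and in UTF-8 bytes.
lengths = (len(composed), len(decomposed))
byte_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Clipping the decomposed form after one code point silently drops the accent.
clipped = decomposed[:1]
```

So even with clean UTF-8 everywhere, comparison and clipping still depend on 
which normalization form the text arrived in.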

If one simply ignores the transcoding in and out, leaving the core only to 
handle clean UTF-8 strings, what non-trivial things are left? Could this be a 
candidate for a SOC project?

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk