Newsgroups: php.internals
Date: Fri, 14 Feb 2014 21:06:50 +0000
From: lester@lsces.co.uk (Lester Caine)
To: internals@lists.php.net
Subject: Re: [PHP-DEV] PHP6 wiki page

Rasmus Lerdorf wrote:
> What we really need is an awesome small and fast Unicode library that does
> everything ICU does but faster and in less code while using UTF-8 as its
> internal storage so we don't have to convert on each and every operation.
> There are a ton of non-obvious things beyond simple string manipulation.
> String collation alone is massively complicated, for example.

Surely the bottom line is that to cover every fine detail, ICU has to be
used, as the smaller libraries tend to make a few assumptions to make life
easy? But my point was that most of the time you only need the simple stuff?
Simply using UTF8 strings in place of the byte based ones in all of the
relevant string functions? Remove the need to 'lowercase' by dropping
case-insensitivity and things are simplified somewhat?

I've found the comment I was looking for finally while searching around ...
"UTF-8 is specially designed so that many byte-oriented string functions
continue to work or only need minor modifications."
This is why people can put unicode characters in many places in PHP now
without it actually breaking?

I've seen a few comments about switching to C++, and
http://utfcpp.sourceforge.net/ caught my eye, while
http://www.public-software-group.org/utf8proc-documentation came to light
when I started looking at NFD/NFC. I've been looking for a suitable unicode
string handler for doing substring clipping and all of that. I AM right in
thinking that mbstring is basically overkill if everything being worked with
has already been converted to UTF8?

While I was aware of accent code points, I'd not quite appreciated how
complicated they can get. Up until now I've just been looking at text cut
and pasted from UTF8 messages.

If one simply ignores the transcoding in and out, leaving the core to handle
only clean UTF8 strings, what non-trivial things are left? Could this be a
candidate for a SOC project?
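
To make the "simple stuff" versus "non-trivial things" question a bit more
concrete, here is a rough sketch. It is in Python purely for brevity, it
assumes the input is already valid UTF8, and clip_utf8 is a made-up helper
name, not anything from PHP, mbstring, utfcpp or utf8proc. It shows a plain
byte search working unchanged on UTF8 data, substring clipping that backs up
to a code point boundary, and the accent case where a boundary-safe clip
still loses a combining mark unless the text is normalised first.

```python
import unicodedata


def clip_utf8(data: bytes, max_bytes: int) -> bytes:
    """Clip valid UTF-8 to at most max_bytes without splitting a code point.

    Hypothetical helper for illustration only; it does not validate input.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the bit pattern 10xxxxxx. If the first byte we
    # are about to drop is a continuation byte, the cut falls inside a
    # multi-byte sequence, so back up to the lead byte of that sequence.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# "naive cafe" with a precomposed i-diaeresis (U+00EF) and e-acute (U+00E9).
text = "na\u00efve caf\u00e9".encode("utf-8")

# Byte-oriented search works as-is: a valid UTF-8 needle cannot match
# starting in the middle of another character's byte sequence.
needle = "caf\u00e9".encode("utf-8")
offset = text.find(needle)

# A cut at byte 3 would land inside the two-byte U+00EF, so it backs up.
clipped = clip_utf8(text, 3)

# The non-trivial part: the same visible "e-acute" can also be written as
# "e" plus a combining acute accent (U+0301). The bytes differ, and a
# code-point-safe clip can still separate the accent from its base letter.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
accent_lost = clip_utf8(decomposed.encode("utf-8"), 1)
```

So searching and clipping look like small changes to byte-based code, while
comparing or cutting accented text correctly seems to need at least NFC/NFD
handling on top.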
-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk