Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:84440
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain lsces.co.uk from 217.147.176.214 cause and error)
Message-ID: <54FC708E.90007@lsces.co.uk>
Date: Sun, 08 Mar 2015 15:53:50 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: internals@lists.php.net
References: <CAGa2bXYa5Lz0JqySVSQ+hGfCW-WxxKWvqFtWaZo=OVf+LLsV8A@mail.gmail.com> <CAGa2bXa+RyQLuYe72f2m+j5c+YObfPtaOTeknMRU7p3s+96Seg@mail.gmail.com> <CEF2AD42-3CE1-489F-8192-D1DC3D8D8698@gmail.com> <CAGa2bXZDnuE3mKQYD+Sq2=kH=YDXuRQA3o+9wg4v81ZS+h3rLw@mail.gmail.com> <54F83C4D.1020206@gmail.com> <CAEZPtU6ni038E+b0ziAC1b0w=t3gsmBwMjAui4e0sXd8EgbyXQ@mail.gmail.com> <CAL0xaBF7u2h9A5UnVB+-z6SwDtLOVY_qL7B9UGj7w_Lecwct6A@mail.gmail.com> <CAGa2bXa5zER03VrMrtD9aUQ38LK9C_UWU-jbGjzEZUoxbUsSQQ@mail.gmail.com> <CAL0xaBFJtxd3gf9H3ToD0-6mugOBFWR50wB_MRnQ0UZsPWF0Fw@mail.gmail.com> <CAGa2bXaO=Spn5f6qTY8ZrPE8eJ-qwPMS-+2-FHKAVqAnSKsp+Q@mail.gmail.com> <CAEZPtU41SqAf3gV=BY8+g3UNO=k=SyuCER2ch4SNBZ0P4bTbuQ@mail.gmail.com> <CAGa2bXaotozdH6mHcXVDPrYDPjw4dxfrFxKkVWrqwPdQBD4tmA@mail.gmail.com> <54FB3175.3000308@luni.fr> <CAGa2bXZ5ez1Lu2_HRwm_PUQRrTYzO_0bsk3oGbQQ++b_wLzbww@mail.gmail.com> <54FC1E67.3070504@luni.fr> <54FC2FC1.9070008@lsces.co.uk> <54FC5465.10208@luni.fr>
In-Reply-To: <54FC5465.10208@luni.fr>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] Consistent function names
From: lester@lsces.co.uk (Lester Caine)

On 08/03/15 13:53, Grégory Planchat wrote:
>> On 08/03/15 10:03, Grégory Planchat wrote:
>>> Then using multiple encodings in the same script, or using the same
>>> script for multiple encodings, becomes straightforward and standard.
>>> Most PHP developers don't even know what Unicode or a character
>>> encoding is; they just see "odd characters that are removed with a
>>> header() call or utf8_decode()". No teasing intended, they just don't
>>> want to have to handle this. PHP should not leave this sort of
>>> consideration solely to the awareness of user-space developers.
>>
>> Not part of THIS discussion exactly, but I have to take that in
>> isolation. 'Most PHP developers' need to be very aware of Unicode these
>> days. Simply pretending it does not exist is a dangerous exercise and
>> my own code base has been UTF8 for several years now. Even though I
>> don't speak anything but English, a large section of the material one
>> has to handle has characters which get lost if one does not maintain
>> UTF8 throughout the process. People are going on about 'data loss' when
>> converting, and that applies as much to strings as to numbers.
>>
>> The default encoding these days is UTF8 ...
> 
> This is not exactly what I meant, and your point is the way things
> should be, of course.
> 
> What I meant is that a text search or fetching the size of a string
> *MUST* behave the same way whichever encoding you use, without
> having to know the actual encoding of the string at any time.
> 
> Currently a strlen on a UTF-8 string behaves more like a C
> "sizeof(str) - 1" when you are using characters outside the ASCII range.
> 
> The idea is really to make these statements work, whatever encoding
> you are using:
> 
> "Lorem ipsum dolor sit amet"->length();
> "Lorem ipsum dolor sit amet"->search('lorem');
> "Lorem ipsum dolor sit amet"->replace('lorem', 'Lorem'); 

This is actually the problem: trying to ignore Unicode creates a black
hole. The amount of space needed to store a string becomes variable
once one moves outside the single-byte encodings, and where legacy
systems only allow buffer space for the single-byte version, one gets a
number of problems when the data returned has multi-byte characters.
The first example has several answers depending on what one is doing
with the result: the size of buffer needed (sizeof in my crib sheet), or
one of the methods of counting the number of symbols used (count, but
with an agreed decoding). The other two actually work with multi-byte
strings until one adds 'adornments' such as accents to the characters,
at which point a search may need to look for a set of similar words all
with the same meaning, just encoded differently.
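
To make the 'several answers' concrete, here is a rough sketch of three
different lengths for one short UTF8 string. It assumes the mbstring and
intl extensions are loaded, and the sample name is my own invention, not
anything from the thread.

```php
<?php
// Sketch only: assumes the mbstring and intl extensions are loaded.
// "Zoe" followed by a combining diaeresis (U+0308, bytes CC 88),
// i.e. four code points that display as three symbols.
$name = "Zoe\xCC\x88";

$bytes     = strlen($name);              // bytes of storage (buffer size)
$codes     = mb_strlen($name, 'UTF-8');  // Unicode code points
$graphemes = grapheme_strlen($name);     // user-perceived symbols

printf("%d bytes, %d code points, %d graphemes\n",
    $bytes, $codes, $graphemes);
// prints "5 bytes, 4 code points, 3 graphemes"
```

Which of the three is 'the' length depends entirely on whether one is
sizing a buffer, indexing into the string, or counting what the reader
sees.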

My point is perhaps that it is all too easy nowadays for POST/GET data
to contain multi-byte strings from different languages, so trying to
map them to a single-byte solution is no longer appropriate. I've just
been downloading a set of documents which are essentially all English,
but the file names include words from a number of other languages, with
the result that UTF8 is the only way to store them, and ideally the
search engine should be able to find them again in the future.
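
As an illustration of why finding such names again is not trivial, here
is a minimal sketch of a search that copes with the same word being
encoded two ways. find_normalized() is a hypothetical helper of my own,
not an existing PHP function, and the sketch assumes the intl extension
(for Normalizer) and mbstring are loaded.

```php
<?php
// Sketch only: assumes the intl and mbstring extensions are loaded.
// find_normalized() is a hypothetical helper, not a PHP built-in.
function find_normalized($haystack, $needle)
{
    // Bring both strings to the same composed form (NFC) first.
    $h = Normalizer::normalize($haystack, Normalizer::FORM_C);
    $n = Normalizer::normalize($needle, Normalizer::FORM_C);
    // Case-insensitive search, position counted in code points.
    return mb_stripos($h, $n, 0, 'UTF-8');
}

$composed   = "caf\xC3\xA9";   // e-acute as one code point, U+00E9
$decomposed = "cafe\xCC\x81";  // e followed by combining acute, U+0301

// A plain byte search misses the match because the bytes differ.
var_dump(strpos("menu $decomposed", $composed));          // bool(false)
// After normalisation the two spellings compare equal.
var_dump(find_normalized("menu $decomposed", $composed)); // int(5)
```

Normalising to NFC is only one possible agreed form; the point is simply
that both sides of the comparison have to use the same one.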

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk