Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:84442
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
To: internals@lists.php.net,Lester Caine <lester@lsces.co.uk>
Message-ID: <54FC8C41.40104@luni.fr>
Date: Sun, 08 Mar 2015 18:52:01 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
References: <CAGa2bXYa5Lz0JqySVSQ+hGfCW-WxxKWvqFtWaZo=OVf+LLsV8A@mail.gmail.com> <CAGa2bXa+RyQLuYe72f2m+j5c+YObfPtaOTeknMRU7p3s+96Seg@mail.gmail.com> <CEF2AD42-3CE1-489F-8192-D1DC3D8D8698@gmail.com> <CAGa2bXZDnuE3mKQYD+Sq2=kH=YDXuRQA3o+9wg4v81ZS+h3rLw@mail.gmail.com> <54F83C4D.1020206@gmail.com> <CAEZPtU6ni038E+b0ziAC1b0w=t3gsmBwMjAui4e0sXd8EgbyXQ@mail.gmail.com> <CAL0xaBF7u2h9A5UnVB+-z6SwDtLOVY_qL7B9UGj7w_Lecwct6A@mail.gmail.com> <CAGa2bXa5zER03VrMrtD9aUQ38LK9C_UWU-jbGjzEZUoxbUsSQQ@mail.gmail.com> <CAL0xaBFJtxd3gf9H3ToD0-6mugOBFWR50wB_MRnQ0UZsPWF0Fw@mail.gmail.com> <CAGa2bXaO=Spn5f6qTY8ZrPE8eJ-qwPMS-+2-FHKAVqAnSKsp+Q@mail.gmail.com> <CAEZPtU41SqAf3gV=BY8+g3UNO=k=SyuCER2ch4SNBZ0P4bTbuQ@mail.gmail.com> <CAGa2bXaotozdH6mHcXVDPrYDPjw4dxfrFxKkVWrqwPdQBD4tmA@mail.gmail.com> <54FB3175.3000308@luni.fr> <CAGa2bXZ5ez1Lu2_HRwm_PUQRrTYzO_0bsk3oGbQQ++b_wLzbww@mail.gmail.com> <54FC1E67.3070504@luni.fr> <54FC2FC1.9070008@lsces.co.uk> <54FC5465.10208@luni.fr> <54FC708E.90007@lsces.co.uk>
In-Reply-To: <54FC708E.90007@lsces.co.uk>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] Consistent function names
From: gregory@luni.fr (=?UTF-8?B?R3LDqWdvcnkgUGxhbmNoYXQ=?=)

On 08/03/2015 16:53, Lester Caine wrote:
>> "Lorem ipsum dolor sit amet"->length();
>> "Lorem ipsum dolor sit amet"->search('lorem');
>> "Lorem ipsum dolor sit amet"->replace('lorem', 'Lorem');
>
> This is actually the problem that trying to ignore unicode then creates
> a black hole. The amount of space needed to store the string is a
> variable once one moves outside the single byte encodings, but where
> legacy systems only allow buffering for the single byte version, one
> gets a number of problems where the data returned has multi-byte
> characters. The first example has several answers depending on what one
> is doing with the return. Size of buffer needed (sizeof in my crib
> sheet), or one of the methods of counting the number of symbols used
> (count but with an agreed decoding). The other two actually work with
> multi-byte strings until one adds 'adornments' to the characters which
> may need a search to look for a set of similar words all with the same
> meaning, just encoded differently.
>
> My point is perhaps that it is all too easy nowadays for post/get data to
> have multi-byte strings from different languages, for which trying to map
> to a single byte solution is no longer appropriate. I've just been
> downloading a set of documents which are essentially all English, but
> the file names include words from a number of other languages, resulting
> in UTF8 being the only way to store them, and ideally the search engine
> should be able to find them again in the future.
>

I understand your point, but what I mean is not to make the user totally
unaware of the encoding, but to build a common API that spares everyone
from having to be aware of the encoding permanently, everywhere, all the
time.

Instead, let the string API do the job of choosing whichever backend
extension to use (default or mbstring); then, if the user feels the need
to mix encodings, he *SHOULD* have to be aware of it *AND* have the
right transformation API available.
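
To make that concrete, here is a minimal sketch, assuming a hypothetical
Text class (the class name and its methods are invented for this mail,
nothing like it exists in core). The object remembers its encoding and
picks the backend itself, so the caller does not have to. The list of
encodings treated as single-byte is also just an assumption.

```php
<?php
// Hypothetical wrapper, for illustration only; not an existing PHP API.
final class Text
{
    private $bytes;
    private $encoding;

    public function __construct($bytes, $encoding = 'UTF-8')
    {
        $this->bytes = $bytes;
        $this->encoding = $encoding;
    }

    // True when the byte-oriented default functions are safe to use.
    // The list of encodings here is an assumption, not exhaustive.
    private function isSingleByte()
    {
        return in_array(strtoupper($this->encoding), ['ASCII', 'ISO-8859-1'], true);
    }

    // Number of characters (not bytes) in the string.
    public function length()
    {
        return $this->isSingleByte()
            ? strlen($this->bytes)
            : mb_strlen($this->bytes, $this->encoding);
    }

    // Case-sensitive position of $needle, counted in characters, or false.
    public function search($needle)
    {
        return $this->isSingleByte()
            ? strpos($this->bytes, $needle)
            : mb_strpos($this->bytes, $needle, 0, $this->encoding);
    }
}

$ascii = new Text('Lorem ipsum dolor sit amet', 'ASCII');
$utf8  = new Text("Gr\xC3\xA9gory", 'UTF-8'); // "Grégory", 8 bytes in UTF-8

echo $ascii->length(), "\n";        // 26, via strlen()
echo $utf8->length(), "\n";         // 7, via mb_strlen()
echo $ascii->search('ipsum'), "\n"; // 6, via strpos()
```

The caller writes the same length() and search() calls in both cases;
only the object knows which extension did the work.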

This makes it possible to simply build methods that do not need to be
aware of the encoding (e.g. pure frontend text processing) and that
automatically map to the right encoding (maybe with an
E_RECOVERABLE_ERROR on incompatible charsets).
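
A rough sketch of the incompatible-charset case, again with hypothetical
helpers (text_convert and text_replace are names I made up, and each
string is passed as a [bytes, encoding] pair only to keep the example
short). An exception stands in here for the E_RECOVERABLE_ERROR idea;
mb_convert_encoding() is the only real transformation call used.

```php
<?php
// Hypothetical helpers, for illustration only; not an existing PHP API.

// Explicit transformation: the caller opts in to changing encodings.
function text_convert($bytes, $from, $to)
{
    return mb_convert_encoding($bytes, $to, $from);
}

// Replace $search by $replace in $subject.
// Every argument is a pair: [bytes, encoding].
function text_replace(array $subject, array $search, array $replace)
{
    list($bytes, $encoding) = $subject;

    foreach ([$search, $replace] as $operand) {
        if (strcasecmp($operand[1], $encoding) !== 0) {
            // Stand-in for the "error on incompatible charsets" idea.
            throw new InvalidArgumentException(
                "Incompatible charsets: {$operand[1]} vs {$encoding}"
            );
        }
    }

    // Byte-wise replace; fine for UTF-8 and single-byte encodings,
    // an assumption that does not hold for every multi-byte charset.
    return [str_replace($search[0], $replace[0], $bytes), $encoding];
}

$latin1 = ["Gr\xE9gory", 'ISO-8859-1'];
$utf8   = [text_convert($latin1[0], 'ISO-8859-1', 'UTF-8'), 'UTF-8'];

// Same encoding on every operand: works without further thought.
$result = text_replace($utf8, ["\xC3\xA9", 'UTF-8'], ['e', 'UTF-8']);
echo $result[0], "\n"; // Gregory

// Mixed encodings: refused until the caller converts explicitly.
$refused = false;
try {
    text_replace($latin1, ["\xC3\xA9", 'UTF-8'], ['e', 'UTF-8']);
} catch (InvalidArgumentException $e) {
    $refused = true;
}
```

Code that never mixes encodings never sees the check; code that does
mix them has to call the conversion on purpose.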

String notations could maybe also evolve, similar to what exists in C
for wchar_t strings (an L prefix, or something like it). I have no fixed
opinion about that, and it is not the main subject anyway.

The main question is about a uniform API for strings: do you feel it is
important/useful or not?

Grégory Planchat