Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:47345
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
MIME-Version: 1.0
In-Reply-To: <99cf22521003161343o21262736s801bd2e99ac2b6a8@mail.gmail.com>
References: <4B9C9007.1080802@lsces.co.uk> <4B9EC3B2.7070901@zend.com>
	 <4B9F4196.9030404@lsces.co.uk> <4B9FD68B.5020900@zend.com>
	 <b54d0abe1003161304u35958c0scefb87a5607853b5@mail.gmail.com>
	 <99cf22521003161343o21262736s801bd2e99ac2b6a8@mail.gmail.com>
Date: Tue, 16 Mar 2010 22:50:51 +0100
Message-ID: <b54d0abe1003161450o4101d7bdj2e69a2cbf9352d6f@mail.gmail.com>
To: dreamcat four <dreamcat4@gmail.com>
Cc: Stanislav Malyshev <stas@zend.com>, Lester Caine <lester@lsces.co.uk>, 
	PHP internals <internals@lists.php.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: tyra3l@gmail.com (Ferenc Kovacs)

On Tue, Mar 16, 2010 at 9:43 PM, dreamcat four <dreamcat4@gmail.com> wrote:
> And remember,
>
> It's not just the number of times it's sent to ICU for conversion. It's
> also the number of times your UTF-16 string has to be converted back
> into UTF-8 afterwards. This is why Apple makes its UTF-16 strings
> immutable. So they are read-only, and the UTF-8 representation can be
> cached afterward.
>
> Think of it this way:
>
> 1. Load a utf-8 string from DB or file
> 2. Convert it to utf-16
> 3. Perform ICU conv 3-5 times
> 4. Page gets hit by memcache
> 5. utf-16 is converted back to utf-8
> 6. Something changes
>    ? String was cached ?
> 7. need to spit out another utf-8 version of the string again
>
> And a persistent web application can be held for many hours in memory.
> Are we converting back to utf-8 every time? Then it might be better to
> wrap the string conversions just around ICU.
>
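
To make the caching point above concrete, here is a minimal sketch in Python (not PHP, and not how Apple or the php6 branch actually implement it). The class name CachedText and its methods are hypothetical, made up for illustration. The assumption is that the string is treated as read-only once created, so the UTF-8 form only has to be produced once:

```python
class CachedText:
    """Text held as UTF-16 with a lazily cached UTF-8 form.

    Hypothetical illustration only: immutability is by convention here,
    nothing in this class enforces it.
    """

    __slots__ = ("_utf16", "_utf8", "encode_count")

    def __init__(self, utf8_bytes: bytes) -> None:
        # Steps 1-2: take UTF-8 input and keep a UTF-16 working copy.
        self._utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")
        self._utf8 = None        # filled on first output
        self.encode_count = 0    # number of real UTF-16 -> UTF-8 conversions

    def to_utf8(self) -> bytes:
        # Steps 5 and 7: convert back once; later calls reuse the cache.
        if self._utf8 is None:
            self._utf8 = self._utf16.decode("utf-16-le").encode("utf-8")
            self.encode_count += 1
        return self._utf8

    def upper(self) -> "CachedText":
        # Step 6: a "change" returns a new object, so the old cache
        # never goes stale.
        text = self._utf16.decode("utf-16-le").upper()
        return CachedText(text.encode("utf-8"))
```

Calling to_utf8() several times on the same object runs only one conversion. Whether that saving matters in practice is exactly what a benchmark on a real page would have to show.
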
> I'd suggest selecting a real (but still as easy-to-work with as can be
> found) unicode php app. One that has been written to use a unicode php
> module. Then getting a single, representative page from it. By that I
> mean the kind of page that gets accessed the most. So for imdb that
> would be a movie's page, etc. The smallest 'slice' of the app, not the
> whole thing. Dummy-out the other stuff.
>
> Then convert that part (for rendering one page) into the current php6
> unicode scheme. Then we can see what's what.
>
I would choose MediaWiki for this kind of test: it runs in a truly
internationalized environment, and I have seen the lead developer of
MediaWiki/Wikipedia posting and contributing on this mailing list.

But that's just my two cents.

Tyrael
>
>
> On Tue, Mar 16, 2010 at 8:04 PM, Ferenc Kovacs <tyra3l@gmail.com> wrote:
>> On Tue, Mar 16, 2010 at 8:05 PM, Stanislav Malyshev <stas@zend.com> wrote:
>>> Hi!
>>>
>>>> On disk storage should probably be UTF-8 without any question? Windows
>>>> use of widestrings for some files simply doubles up the on disk storage
>>>
>>> As file content, it's OK (and it'd be easy to add an option to specify
>>> content transformation if we wanted), but prescribing filenames as UTF-8
>>> would probably not be workable, since different systems (and maybe even
>>> different filesystems inside the same OS?) can have different opinions on
>>> that.
>>>
>>>> '3' is not a very processor friendly number, so working with 4 even
>>>> though wasteful on memory, does make perfect sense. How long is it since
>>>
>>> I'm not sure it does. Most PHP strings are short, so the memory loss would
>>> be very significant. Also, take into account that CPU caches aren't as big
>>> as main memory, and not fitting your data into the cache is expensive.
>>>
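
As a rough illustration of the size argument above, the following Python sketch compares the payload bytes of a few short strings in UTF-8, UTF-16 and UTF-32. The sample strings and the helper name encoded_sizes are made up for this example. It counts only the encoded characters, not any per-string overhead a real PHP engine would add:

```python
def encoded_sizes(text: str) -> dict:
    """Return payload bytes for text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # Hypothetical samples: short identifiers plus some non-ASCII text.
    for sample in ["id", "user_name", "Content-Type", "naïve", "日本語"]:
        print(sample, encoded_sizes(sample))
```

For ASCII-only strings the UTF-32 payload is four times the UTF-8 payload, while for CJK text UTF-16 is the smallest of the three, so the outcome depends on what the strings in a given application actually contain.
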
>>>> we had a 640k limit on working memory? SERVERS should have a good amount
>>>
>>> It doesn't matter how much memory you have, in numbers. Until we find an
>>> unlimited source of computer memory left by the aliens in the Himalayas,
>>> memory costs money. However many gigs you have, you'll be able to run a
>>> third as many PHP processes in the new version on the same hardware as in
>>> the old version, which means the new PHP would cost you more to run.
>>> "Memory is cheap" is a very misunderstood expression - it's only cheap if
>>> you always have much more than you need.
>>>
>>>> Probably 90% of the time a string will come in and go out without
>>>> requiring any processing at all, so leave it as UTF-8? The only time we
>>>
>>> It might be great if we could do that. The problem might be that right now
>>> AFAIK we don't have a good library to work with utf-8 strings (please
>>> correct me if I'm wrong here).
>> http://source.icu-project.org/repos/icu/icuhtml/trunk/design/strings/icu_utf8.html
>> From the ICU 3.6 changelog => The UTF-8 transformation functions and
>> macros are faster.
>> From 4.2 => UTF-8 friendly internal data structure for Unicode data lookup
>> So it seems that the ICU developers are trying to close the gap between
>> UTF-16 and UTF-8 performance, so it may be a good idea to check the
>> current situation.
>>
>> Tyrael
>>> --
>>> Stanislav Malyshev, Zend Software Architect
>>> stas@zend.com   http://www.zend.com/
>>> (408)253-8829   MSN: stas@zend.com
>>>
>>> --
>>> PHP Internals - PHP Runtime Development Mailing List
>>> To unsubscribe, visit: http://www.php.net/unsub.php
>>>
>>>
>>
>