Newsgroups: php.internals
Xref: news.php.net php.internals:47345
Subject: Re: [PHP-DEV] Where are we ACTUALLY on Unicode?
From: tyra3l@gmail.com (Ferenc Kovacs)
Date: Tue, 16 Mar 2010 22:50:51 +0100
To: dreamcat four
Cc: Stanislav Malyshev, Lester Caine, PHP internals
In-Reply-To: <99cf22521003161343o21262736s801bd2e99ac2b6a8@mail.gmail.com>
References: <4B9C9007.1080802@lsces.co.uk> <4B9EC3B2.7070901@zend.com> <4B9F4196.9030404@lsces.co.uk> <4B9FD68B.5020900@zend.com> <99cf22521003161343o21262736s801bd2e99ac2b6a8@mail.gmail.com>

On Tue, Mar 16, 2010 at 9:43 PM, dreamcat four wrote:
> And remember,
>
> It's not just the number of times it's sent to ICU for conversion. It's
> also the number of times your UTF-16 string has to be converted back
> into utf-8 afterwards. This is why Apple makes its utf-16 strings
> immutable. So they are read-only, and the utf-8 representation can be
> cached afterward.
>
> Think of it this way:
>
> 1. Load a utf-8 string from DB or file
> 2. Convert it to utf-16
> 3. Perform ICU conv 3-5 times
> 4. Page gets hit by memcache
> 5. utf-16 is converted back to utf-8
> 6. Something changes
>    ? String was cached ?
> 7. Need to spit out another utf-8 version of the string again
>
> And a persistent web application can be held for many hours in memory.
> Are we converting back to utf-8 every time? Then it might be better to
> wrap the string conversions just around ICU.
>
> I'd suggest selecting a real (but still as easy-to-work-with as can be
> found) unicode php app. One that has been written to use a unicode php
> module. Then getting a single, representative page from it. By that I
> mean the kind of page that gets accessed the most. So for imdb that
> would be a movie's page, etc. The smallest 'slice' of the app, not the
> whole thing. Dummy-out the other stuff.
>
> Then convert that part (for rendering one page) into the current php6
> unicode scheme. Then we can see what's what.
>

I would choose the MediaWiki software for this kind of test. It works in a
truly internationalized environment, and I have seen the main developer of
the MediaWiki/Wikipedia application posting and contributing on this
mailing list.
But that's just my two cents.

Tyrael

>
>
> On Tue, Mar 16, 2010 at 8:04 PM, Ferenc Kovacs wrote:
>> On Tue, Mar 16, 2010 at 8:05 PM, Stanislav Malyshev wrote:
>>> Hi!
>>>
>>>> On disk storage should probably be UTF-8 without any question? Windows
>>>> use of widestrings for some files simply doubles up the on disk storage
>>>
>>> As file content, it's OK (and it'd be easy to add an option to specify
>>> content transformation if we wanted), but prescribing filenames as UTF-8
>>> would probably be not workable, since different systems (and maybe even
>>> different filesystems inside the same OS?) can have different opinions
>>> on that.
>>>
>>>> '3' is not a very processor friendly number, so working with 4 even
>>>> though wasteful on memory, does make perfect sense. How long is it since
>>>
>>> I'm not sure it does. Most PHP strings are short, so the memory loss
>>> would be very significant. Also, take into account that CPU caches
>>> aren't as big as the main memory, and not fitting your data into the
>>> cache is expensive.
>>>
>>>> we had a 640k limit on working memory? SERVERS should have a good amount
>>>
>>> It doesn't matter how much memory you have, in numbers. Until we find an
>>> unlimited source of computer memory left by the aliens in the Himalayas,
>>> memory costs money. However many gigs you have, you'll be able to run
>>> 3 times fewer PHP processes in the new version on the same hardware than
>>> in the old version, which means the new PHP would cost you more to run.
>>> "Memory is cheap" is a very misunderstood expression - it's only cheap
>>> if you always have much more than you need.
>>>
>>>> Probably 90% of the time a string will come in and go out without
>>>> requiring any processing at all, so leave it as UTF-8 ? The only time we
>>>
>>> It might be great if we could do that. The problem might be that right
>>> now AFAIK we don't have a good library to work with utf-8 strings
>>> (please correct me if I'm wrong here).
>> http://source.icu-project.org/repos/icu/icuhtml/trunk/design/strings/icu_utf8.html
>> from ICU 3.6 changelog => The UTF-8 transformation functions and
>> macros are faster.
>> from 4.2 => UTF-8 friendly internal data structure for Unicode data lookup
>> so it seems that the guys at ICU are trying to close the gap between
>> UTF-16 and UTF-8 performance, so maybe it would be a good idea to
>> check out the current situation.
>>
>> Tyrael
>>> --
>>> Stanislav Malyshev, Zend Software Architect
>>> stas@zend.com   http://www.zend.com/
>>> (408)253-8829   MSN: stas@zend.com
>>>
>>> --
>>> PHP Internals - PHP Runtime Development Mailing List
>>> To unsubscribe, visit: http://www.php.net/unsub.php
>>>