From: andrei@gravitonic.com (Andrei Zmievski)
To: Derick Rethans
Cc: PHP Developers Mailing List
Date: Thu, 6 Oct 2005 12:55:29 -0700
Subject: Re: [PHP-DEV] Unicode Implementation
Newsgroups: php.internals
Message-ID: <99dd4f75f4ceebfe1c980cf439e97416@gravitonic.com>

On Oct 6, 2005, at 10:56 AM, Derick Rethans wrote:

> I am thinking that we're doing something with the unicode
> implementation and that's that we're now getting duplicate
> implementations of quite some things: functions, internal functions,
> hash implementations, two ways for storing identifiers... only because
> we need to support both IS_STRING and IS_UNICODE and unicode=off mode.
>
> I think I would prefer an IS_UNICODE/unicode=on only PHP.
>
> This would mean that:
> - no duplicate functionality for tons of functions that will make
>   maintaining the thing very hard

This is true.

> - a cleaner (and a bit faster) Unicode implementation

This is true too.

> - we have a bit less BC.

"A bit less"? I'd say it would break BC in a major way. People who want
to upgrade to PHP 6 would need to rewrite a lot of their scripts.

> Internally we would only see IS_UNICODE and IS_BINARY, where we can
> have a small layer around extensions which return IS_STRING where we
> automatically convert it to and from unicode for those extensions.
> IS_STRING strings will still exist, but should not be there for the
> "user level".
>
> For things like:
> $str = unicode_convert($unicode, 'iso-2022');
> and $unicode being "IS_UNICODE". $str will now be an IS_BINARY string,
> with all the restrictions that we already have on those strings (like
> no automatic conversions).
>
> Functions that work on binary strings can be quite limited (we
> wouldn't need a strtolower for that f.e.), so we are cutting down in a
> lot of duplicated code. The same goes for not having to support both
> unicode=off and unicode=on mode, as that can make things a bit
> complicated too. This will limit functionality on binary strings a bit
> though, but I think this is 10 times better than an unmaintainable PHP
> with Unicode support.

Sure, if you remove the requirement for BC and merge the string/binary
semantics, you can use IS_BINARY for all that stuff.

> Besides this, I ran some micro benchmarks on about 600 characters of
> text with a few functions and benchmarked their behavior between
> unicode=1 and unicode=0 mode.
> Results:
>
> strrev (100.000 iterations over 600 characters of normalized latin
> text):
> unicode off: 1.8secs
> unicode on: 5.0secs
>
> strtoupper (100.000 iterations over the same text):
> unicode off: 2.2secs
> unicode on: 7.9secs
>
> substr(50, 100) (1.000.000 over the same text):
> unicode off: 3.9secs
> unicode on: 11.9secs
>
> This is something I find quite not acceptable, and we need to figure
> out a way on how to optimize this - for substr the penalty is probably
> what we are using an iterator and not a direct memcpy (because of
> surrogates), I am not so sure about the others.

We can try switching to _UNSAFE versions of the iterator macros - they
assume well-formed UTF-16, so they will be somewhat faster.

-Andrei
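
For readers unfamiliar with the safe/unsafe distinction mentioned above,
here is a minimal sketch in plain C. The helpers next_cp_safe,
next_cp_unsafe and cp_to_unit_offset are hypothetical names made up for
this illustration. They are not ICU's macros and not PHP's code, only
stand-ins showing why a checked step costs more than one that assumes
well-formed UTF-16, and why substr() has to walk the string before it
can copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a checked and an unchecked UTF-16 iterator
   step. Illustration only; not the ICU macros themselves. */

#define IS_LEAD(u)  (((u) & 0xFC00u) == 0xD800u)
#define IS_TRAIL(u) (((u) & 0xFC00u) == 0xDC00u)

/* Checked step: verifies the bound and the pairing before combining.
   An unpaired surrogate is returned as a single code point. */
static uint32_t next_cp_safe(const uint16_t *s, size_t *i, size_t len)
{
    assert(*i < len);
    uint32_t c = s[(*i)++];
    if (IS_LEAD(c) && *i < len && IS_TRAIL(s[*i])) {
        c = 0x10000u + ((c - 0xD800u) << 10) + (s[(*i)++] - 0xDC00u);
    }
    return c;
}

/* Unchecked step: assumes well-formed UTF-16, so every lead surrogate
   is followed by a trail surrogate. Fewer branches, no bound check. */
static uint32_t next_cp_unsafe(const uint16_t *s, size_t *i)
{
    uint32_t c = s[(*i)++];
    if (IS_LEAD(c)) {
        c = 0x10000u + ((c - 0xD800u) << 10) + (s[(*i)++] - 0xDC00u);
    }
    return c;
}

/* Code-unit offset of the n-th code point. A code-point based substr()
   needs this walk before it can memcpy, because surrogate pairs make
   code points variable-width. */
static size_t cp_to_unit_offset(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;
    while (n-- > 0 && i < len) {
        (void)next_cp_safe(s, &i, len);
    }
    return i;
}
```

The unchecked variant is only valid if the engine can guarantee that
every string it holds is well-formed; on malformed input it may read
past the end of the buffer.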