Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:19459
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
In-Reply-To: <Pine.LNX.4.62.0510061954140.13434@localhost>
References: <Pine.LNX.4.62.0510061954140.13434@localhost>
Mime-Version: 1.0 (Apple Message framework v623)
Content-Type: text/plain; charset=US-ASCII; format=flowed
Message-ID: <99dd4f75f4ceebfe1c980cf439e97416@gravitonic.com>
Content-Transfer-Encoding: 7bit
Cc: PHP Developers Mailing List <internals@lists.php.net>
Date: Thu, 6 Oct 2005 12:55:29 -0700
To: Derick Rethans <derick@php.net>
Subject: Re: [PHP-DEV] Unicode Implementation 
From: andrei@gravitonic.com (Andrei Zmievski)

On Oct 6, 2005, at 10:56 AM, Derick Rethans wrote:

> I am thinking that we're doing something with the unicode implementation
> and that's that we're now getting duplicate implementations of quite some
> things: functions, internal functions, hash implementations, two ways for
> storing identifiers... only because we need to support both IS_STRING and
> IS_UNICODE and unicode=off mode.
>
> I think I would prefer an IS_UNICODE/unicode=on only PHP.
>
> This would mean that:
> - no duplicate functionality for tons of functions that will make
>   maintaining the thing very hard

This is true.

> - a cleaner (and a bit faster) Unicode implementation

This is true too.

> - we have a bit less BC.

"A bit less"? I'd say it would break BC in a major way. People who want 
to upgrade to PHP 6 would need to rewrite a lot of their scripts.

> Internally we would only see IS_UNICODE and IS_BINARY, where we can have a
> small layer around extensions which return IS_STRING where we automatically
> convert it to and from unicode for those extensions. IS_STRING strings will
> still exist, but should not be there for the "user level".
>
> For things like:
> 	$str = unicode_convert($unicode, 'iso-2022');
> and $unicode being "IS_UNICODE". $str will now be an IS_BINARY string,
> with all the restrictions that we already have on those strings (like no
> automatic conversions).
>
> Functions that work on binary strings can be quite limited (we wouldn't
> need a strtolower for that f.e.), so we are cutting down in a lot of
> duplicated code. The same goes for not having to support both unicode=off
> and unicode=on mode, as that can make things a bit complicated too. This
> will limit functionality on binary strings a bit though, but I think this
> is 10 times better than an unmaintainable PHP with Unicode support.

Sure, if you remove the requirement for BC and merge the string/binary 
semantics, you can use IS_BINARY for all that stuff.

> Besides this, I ran some micro benchmarks on about 600 characters of text
> with a few functions and benchmarked their behavior between unicode=1 and
> unicode=0 mode. Results:
>
> strrev (100.000 iterations over 600 characters of normalized latin text):
> 	unicode off: 1.8secs
> 	unicode on:  5.0secs
>
> strtoupper (100.000 iterations over the same text):
> 	unicode off: 2.2secs
> 	unicode on:  7.9secs
>
> substr(50, 100) (1.000.000 over the same text):
> 	unicode off: 3.9secs
> 	unicode on: 11.9secs
>
> This is something I find quite not acceptable, and we need to figure out
> a way on how to optimize this - for substr the penalty is probably that we
> are using an iterator and not a direct memcpy (because of surrogates), I
> am not so sure about the others.

We can try switching to the _UNSAFE versions of the ICU iterator macros 
(U16_NEXT_UNSAFE and friends) - they assume well-formed UTF-16 and skip 
the validity checks, so they should be somewhat faster.
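
To make the difference concrete, here is a rough sketch of the two ways 
to turn a code point offset (what substr gets) into a UTF-16 code unit 
offset. The function names are made up for illustration, and the logic is 
only modeled on what ICU's U16_FWD_1 and U16_FWD_1_UNSAFE do; it is not 
the actual code in the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, loosely modeled on ICU's U16_FWD_1 and
 * U16_FWD_1_UNSAFE. These are not the real PHP or ICU definitions. */

static int is_lead(uint16_t c)  { return (c & 0xFC00) == 0xD800; }
static int is_trail(uint16_t c) { return (c & 0xFC00) == 0xDC00; }

/* Safe variant: checks the string bounds and verifies that a lead
 * surrogate is really followed by a trail surrogate. An unpaired
 * surrogate is stepped over as a single code unit. */
size_t cp_to_cu_safe(const uint16_t *s, size_t len, size_t cp_offset)
{
    size_t i = 0;

    assert(s != NULL);
    while (cp_offset > 0 && i < len) {
        if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1])) {
            i += 2;   /* well-formed surrogate pair */
        } else {
            i += 1;   /* BMP character or unpaired surrogate */
        }
        cp_offset--;
    }
    return i;
}

/* Unsafe variant: assumes well-formed UTF-16 and that cp_offset does
 * not run past the end of the string. One test per code point. */
size_t cp_to_cu_unsafe(const uint16_t *s, size_t cp_offset)
{
    size_t i = 0;

    assert(s != NULL);
    while (cp_offset-- > 0) {
        i += is_lead(s[i]) ? 2 : 1;
    }
    return i;
}
```

Either way we still have to walk the string, so this will not get us to 
memcpy speed; the unsafe variant just does less work per character, and 
it is only valid if we can guarantee the string is well-formed.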

-Andrei