Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:19448
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Date: Thu, 6 Oct 2005 19:56:34 +0200 (CEST)
To: PHP Developers Mailing List <internals@lists.php.net>
Message-ID: <Pine.LNX.4.62.0510061954140.13434@localhost>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: Unicode Implementation 
From: derick@php.net (Derick Rethans)

Hello!

I think we're doing something wrong with the Unicode implementation: we are
now getting duplicate implementations of quite a few things (functions,
internal functions, hash implementations, two ways of storing identifiers...),
only because we need to support both IS_STRING and IS_UNICODE as well as
unicode=off mode.

I think I would prefer an IS_UNICODE/unicode=on only PHP.

This would mean that:
- no duplicated functionality for tons of functions, which would otherwise
  make the thing very hard to maintain
- a cleaner (and a bit faster) Unicode implementation
- we lose a bit of backward compatibility (BC).

Internally we would only see IS_UNICODE and IS_BINARY. For extensions that
still return IS_STRING we can have a small layer around them that
automatically converts to and from Unicode. IS_STRING strings will still
exist, but should not be visible at the "user level".

For things like:
	$str = unicode_convert($unicode, 'iso-2022');
with $unicode being IS_UNICODE, $str will now be an IS_BINARY string, with all
the restrictions that we already have on those strings (like no automatic
conversions).
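
To make the idea a bit more concrete, here is a rough sketch in C of what such
a two-type world with a small conversion layer could look like. This is not
engine code: str_kind, str_val, wrap_latin1 and same_kind are names made up for
this mail, and the sketch assumes that the legacy extension hands back
ISO-8859-1 bytes (where every byte maps to the code point with the same value)
and that Unicode strings are stored as UTF-16 code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the two remaining type tags. */
typedef enum { SK_BINARY, SK_UNICODE } str_kind;

typedef struct {
    str_kind kind;
    size_t   len;              /* bytes for SK_BINARY, code units for SK_UNICODE */
    union {
        unsigned char *bin;    /* raw bytes, no encoding implied */
        uint16_t      *uni;    /* UTF-16 code units (assumption) */
    } v;
} str_val;

/*
 * Boundary layer: wrap a byte string coming out of a legacy extension as a
 * Unicode value. Assumes the bytes are ISO-8859-1, so each byte widens to
 * the code unit with the same numeric value.
 * Returns 0 on success, -1 if allocation fails.
 */
int wrap_latin1(const unsigned char *bytes, size_t n, str_val *out)
{
    assert(bytes != NULL && out != NULL);

    uint16_t *u = malloc((n ? n : 1) * sizeof *u);
    if (u == NULL) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        u[i] = bytes[i];
    }
    out->kind  = SK_UNICODE;
    out->len   = n;
    out->v.uni = u;
    return 0;
}

/*
 * "No automatic conversions": operations that combine two strings would
 * first check that both operands are of the same kind and fail otherwise,
 * leaving any conversion to an explicit call.
 */
int same_kind(const str_val *a, const str_val *b)
{
    return a->kind == b->kind;
}
```

The point of the sketch is only that the conversion happens once, at the
extension boundary, so that everything behind it deals with exactly two kinds
of strings.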

The set of functions that work on binary strings can be quite limited (we
wouldn't need a strtolower for them, for example), so we cut down on a lot of
duplicated code. The same goes for not having to support both unicode=off and
unicode=on mode, as that complicates things too. This will limit functionality
on binary strings a bit, but I think that is ten times better than an
unmaintainable PHP with Unicode support.

Besides this, I ran some micro benchmarks with a few functions on about 600
characters of text and compared their run time in unicode=1 and unicode=0
mode. Results:

strrev (100,000 iterations over 600 characters of normalized Latin text):
	unicode off:  1.8 secs
	unicode on:   5.0 secs

strtoupper (100,000 iterations over the same text):
	unicode off:  2.2 secs
	unicode on:   7.9 secs

substr(50, 100) (1,000,000 iterations over the same text):
	unicode off:  3.9 secs
	unicode on:  11.9 secs

That is a slowdown of roughly 2.8x to 3.6x, which I find quite unacceptable,
and we need to figure out how to optimize it. For substr the penalty probably
comes from using an iterator instead of a direct memcpy (because of
surrogates); I am not so sure about the others.
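
To illustrate the substr case, here is a minimal sketch in C, assuming the
strings are stored as UTF-16 code units (which is what the surrogates imply).
The function names and the has_surrogates flag are made up for this mail; the
flag stands for some hypothetical bit that the caller keeps up to date when the
string is created. With surrogates present, offsets in code points have to be
found by walking the string; without them, a code point is exactly one code
unit and the offsets can be used directly for a single memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* High (lead) surrogates are 0xD800..0xDBFF, low (trail) are 0xDC00..0xDFFF. */
static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/*
 * Slow path helper: starting at code unit index i, step over up to n code
 * points and return the resulting code unit index. A well-formed surrogate
 * pair counts as one code point; an unpaired surrogate counts as one too.
 */
size_t u16_skip(const uint16_t *s, size_t len, size_t i, size_t n)
{
    while (n > 0 && i < len) {
        if (is_lead(s[i]) && i + 1 < len && is_trail(s[i + 1])) {
            i += 2;
        } else {
            i += 1;
        }
        n--;
    }
    return i;
}

/*
 * Copy `count` code points starting at code point `start` into dst, which
 * must have room for at least `len` code units. Returns the number of code
 * units written. has_surrogates is a hypothetical caller-maintained flag.
 */
size_t u16_substr(const uint16_t *s, size_t len, int has_surrogates,
                  size_t start, size_t count, uint16_t *dst)
{
    size_t from, to;

    assert(s != NULL && dst != NULL);

    if (!has_surrogates) {
        /* Fast path: code point offsets are code unit offsets. */
        from = start < len ? start : len;
        to   = count < len - from ? from + count : len;
    } else {
        /* Slow path: walk the string to locate both boundaries. */
        from = u16_skip(s, len, 0, start);
        to   = u16_skip(s, len, from, count);
    }
    memcpy(dst, s + from, (to - from) * sizeof *s);
    return to - from;
}
```

Whether a flag like this would win back most of the difference is something I
have not measured; it only shows where the iterator cost comes from and one
possible way around it for text that has no surrogates at all.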

regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org