Newsgroups: php.internals
Date: Thu, 6 Oct 2005 19:56:34 +0200 (CEST)
From: derick@php.net (Derick Rethans)
To: PHP Developers Mailing List
Subject: Unicode Implementation

Hello!

I think we have a problem with the Unicode implementation: we are now getting duplicate implementations of quite a few things (functions, internal functions, hash implementations, two ways of storing identifiers...) only because we need to support both IS_STRING and IS_UNICODE, as well as unicode=off mode.

I think I would prefer an IS_UNICODE/unicode=on only PHP. This would mean:

- no duplicate functionality for tons of functions, which would otherwise make maintaining the thing very hard
- a cleaner (and a bit faster) Unicode implementation
- a bit less BC.

Internally we would only see IS_UNICODE and IS_BINARY, with a small layer around extensions that return IS_STRING, where we automatically convert to and from Unicode for those extensions. IS_STRING strings will still exist, but should not be visible at the "user level". For things like:

    $str = unicode_convert($unicode, 'iso-2022');

with $unicode being IS_UNICODE, $str will now be an IS_BINARY string, with all the restrictions that we already have on those strings (like no automatic conversions).

Functions that work on binary strings can be quite limited (we wouldn't need a strtolower for them, for example), so we are cutting down on a lot of duplicated code. The same goes for not having to support both unicode=off and unicode=on mode, as that can make things a bit complicated too. This will limit functionality on binary strings a bit, but I think this is 10 times better than an unmaintainable PHP with Unicode support.

Besides this, I ran some micro benchmarks on about 600 characters of text with a few functions and compared their behavior between unicode=1 and unicode=0 mode. Results:

strrev (100,000 iterations over 600 characters of normalized Latin text):
    unicode off:   1.8 secs
    unicode on:    5.0 secs

strtoupper (100,000 iterations over the same text):
    unicode off:   2.2 secs
    unicode on:    7.9 secs

substr(50, 100) (1,000,000 iterations over the same text):
    unicode off:   3.9 secs
    unicode on:   11.9 secs

I find this quite unacceptable, and we need to figure out a way to optimize it. For substr the penalty is probably that we are using an iterator and not a direct memcpy (because of surrogates); I am not so sure about the others.

regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org
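
To illustrate the substr point above: the following is a minimal sketch in C, assuming strings are stored as UTF-16 code units (which the mention of surrogates suggests). It is not the actual PHP code; the names utf16_advance and utf16_substr are hypothetical. It shows why a code point offset cannot be used directly as a memcpy offset: the start and end positions have to be found by walking the string, because a code point outside the BMP takes two code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* UTF-16 surrogate ranges: a high surrogate followed by a low surrogate
 * encodes a single code point outside the BMP. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Hypothetical helper: starting at code unit index `start`, step forward
 * over `cp_count` code points and return the resulting code unit index
 * (clamped to `len`). This linear walk is the "iterator" cost. */
size_t utf16_advance(const uint16_t *s, size_t len, size_t start, size_t cp_count)
{
    size_t i = start;
    assert(s != NULL);
    while (cp_count > 0 && i < len) {
        if (is_high_surrogate(s[i]) && i + 1 < len && is_low_surrogate(s[i + 1])) {
            i += 2;   /* surrogate pair: one code point, two code units */
        } else {
            i += 1;   /* BMP code point (or an unpaired surrogate) */
        }
        cp_count--;
    }
    return i;
}

/* Hypothetical substr: copy up to `cp_len` code points, beginning at code
 * point offset `cp_start`, into `out` (which must hold at least `len`
 * code units). Returns the number of code units copied. */
size_t utf16_substr(const uint16_t *s, size_t len,
                    size_t cp_start, size_t cp_len, uint16_t *out)
{
    size_t from = utf16_advance(s, len, 0, cp_start);
    size_t to   = utf16_advance(s, len, from, cp_len);
    memcpy(out, s + from, (to - from) * sizeof(uint16_t));
    return to - from;
}
```

In this sketch the copy itself is still a single memcpy; the extra cost is the two walks that locate its bounds. A byte string needs neither, which would be consistent with the gap measured for substr, though the sketch does not establish that this is the whole explanation.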