From: helly@php.net (Marcus Boerger)
Date: Mon, 3 Mar 2008 11:34:29 +0100
To: Alan Knowles
CC: internals@lists.php.net
Message-ID: <1789182684.20080303113429@marcus-boerger.de>
In-Reply-To: <47CB3CDC.8050006@akbkhome.com>
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

Hello Alan,

  be my hero then :-) Could you generate a few tests for the multibyte
support so that we know how it is used right now and what we need to take
care of?

marcus

Monday, March 3, 2008, 12:48:44 AM, you wrote:

> Can you clarify the Multibyte issues:
> - I presume this means that it can handle ASCII/UTF8/16 etc. but will
> not handle things like BIG5/GB encoding in source code - this may be a
> bit of an issue around here..

> Regards
> Alan

> Marcus Boerger wrote:
>> RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER
>>
>> Situation:
>> The current flex-based lexer depends on an outdated and unsupported flex
>> version. Alternatives include either updating to a newer version of flex or
>> using re2c, which we already use for a variety of things (serializing, pdo sql
>> scanning, date/time parsing). While moving towards a newer flex version would
>> be much easier, switching to re2c promises a much faster lexer. Actually,
>> without any specific re2c optimizations we already get around a 20% scanner
>> performance increase. Running the tests gets an overall speedup of 2%. It is
>> arguable whether this is enough, but re2c has more advantages. First of all,
>> re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
>> Secondly, it allows for better integration with Lemon [2], which would be the
>> next step. And thirdly we can switch to a reentrant scanner.
>>
>> Current state:
>> Flex has been fully replaced by re2c in Zend. We have also switched to an
>> mmap-based lexer approach for now. However, we had to drop multibyte support
>> as well as the encoding declare. The current state can be checked out from
>> Scott's subversion repository [3] and you can follow the development on his
>> Trac setup [4]. When you want to build php with re2c, then you need to grab
>> re2c from its sourceforge subversion repository [5]. You can also check out
>> the changes in a patch created Sunday 2nd March against a PHP checkout from
>> 14th February [6].
>>
>> Further steps:
>> Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
>> multibyte support with libintl.
>>
>> Future steps:
>> Replace bison with lemon in PHP 5.4 or HEAD.
>>
>> Time Frame:
>> Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
>> of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
>> After that is done, decide about multibyte support. Along with the commit to
>> the 5.3 branch there will be a new re2c version available.
>>
>>
>> Marcus Boerger
>> Nuno Lopes
>> Scott MacVicar
>>
>>
>> [1] http://re2c.org/
>> [2] http://www.hwaci.com/sw/lemon/
>> [3] svn://whisky.macvicar.net/php-re2c
>> [4] http://trac.macvicar.net/php-re2c/
>> [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
>> [6] http://php.net/~helly/php-re2c-20080302.diff.txt
>>

Best regards,
 Marcus