Newsgroups: php.internals
Message-ID: <47CC16FC.7080802@akbkhome.com>
Date: Mon, 03 Mar 2008 23:19:24 +0800
From: alan@akbkhome.com (Alan Knowles)
To: Marcus Boerger
CC: internals@lists.php.net
References: <1706278209.20080302232134@marcus-boerger.de> <47CB3CDC.8050006@akbkhome.com> <1789182684.20080303113429@marcus-boerger.de>
In-Reply-To: <1789182684.20080303113429@marcus-boerger.de>
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

A few replaces with this file should be a good testcase. It is probably worth testing:
* comments with these characters in them, both /* and //
* strings with these characters in them

lynx -source 'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windows&codepage=950' | grep test | grep -v testcase

I have definitely seen code with Chinese characters in comments and strings, and a few times function names and variable names with Chinese characters...

Regards
Alan

Marcus Boerger wrote:
> Hello Alan,
>
> be my hero then :-) Could you generate a few tests for the multibyte
> support so that we know how it is used right now and what we need to take
> care of?
>
> marcus
>
> Monday, March 3, 2008, 12:48:44 AM, you wrote:
>
>> Can you clarify the Multibyte issues:
>> - I presume this means that it can handle ASCII/UTF8/16 etc. but will
>> not handle things like BIG5/GB encoding in source code - this may be a
>> bit of an issue around here..
>>
>> Regards
>> Alan
>>
>> Marcus Boerger wrote:
>>> RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER
>>>
>>> Situation:
>>> The current flex-based lexer depends on an outdated and unsupported flex
>>> version. Alternatives include either updating to a newer version of flex or
>>> using re2c, which we already use for a variety of things (serializing, pdo sql
>>> scanning, date/time parsing). While moving towards a newer flex version would
>>> be much easier, switching to re2c promises a much faster lexer. Actually,
>>> without any specific re2c optimizations we already get around a 20% scanner
>>> performance increase. Running the tests gets an overall speedup of 2%. It is
>>> arguable whether this is enough, but re2c has more advantages. First of all,
>>> re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
>>> Secondly, it allows for better integration with Lemon [2], which would be the
>>> next step. And thirdly we can switch to a reentrant scanner.
>>>
>>> Current state:
>>> Flex has been fully replaced by re2c in Zend. We have also switched to an
>>> mmap-based lexer approach for now.
>>> However, we had to drop multibyte support
>>> as well as the encoding declare. The current state can be checked out from
>>> Scott's subversion repository [3] and you can follow the development on his
>>> Trac setup [4]. When you want to build php with re2c, then you need to grab
>>> re2c from its sourceforge subversion repository [5]. You can also check out
>>> the changes in a patch created Sunday 2nd March against a PHP checkout from
>>> 14th February [6].
>>>
>>> Further steps:
>>> Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
>>> multibyte support with libintl.
>>>
>>> Future steps:
>>> Replace bison with lemon in PHP 5.4 or HEAD.
>>>
>>> Time Frame:
>>> Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
>>> of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
>>> After that is done, decide about multibyte support. Along with the commit to
>>> the 5.3 branch there will be a new re2c version available.
>>>
>>>
>>> Marcus Boerger
>>> Nuno Lopes
>>> Scott MacVicar
>>>
>>>
>>> [1] http://re2c.org/
>>> [2] http://www.hwaci.com/sw/lemon/
>>> [3] svn://whisky.macvicar.net/php-re2c
>>> [4] http://trac.macvicar.net/php-re2c/
>>> [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
>>> [6] http://php.net/~helly/php-re2c-20080302.diff.txt
>
> Best regards,
> Marcus
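
PS: the test cases suggested at the top of this mail (Big5 characters in /* */ comments, // comments, strings, and function/variable names) could be generated roughly as below. This is only a sketch, not part of any patch: it uses Python's built-in "big5" codec rather than the genEncodingTest.cgi output, and SAMPLE, has_backslash_trail and build_php_source are names made up for illustration. The sample characters are ones often cited as having 0x5C (backslash) as their second Big5 byte, which is what tends to confuse a byte-oriented scanner; the helper checks this rather than assuming it.

```python
import sys

# Hypothetical sample text. These characters are often cited as having
# 0x5C as their Big5 trail byte; has_backslash_trail() verifies it.
SAMPLE = "許功蓋"


def has_backslash_trail(ch: str) -> bool:
    """Return True if the Big5 encoding of ch ends with byte 0x5C."""
    return ch.encode("big5")[-1:] == b"\x5c"


def build_php_source(text: str) -> bytes:
    """Return Big5-encoded PHP source covering the cases from the mail."""
    lines = [
        "<?php",
        f"/* block comment: {text} */",        # multibyte text in /* */
        f"// line comment: {text}",            # multibyte text in //
        f"$s = '{text}';",                     # multibyte text in a string
        f"function f_{text}() {{ return 1; }}",  # in a function name
        f"${text} = f_{text}();",              # in a variable name
        f"echo $s, ${text}, \"\\n\";",
    ]
    return ("\n".join(lines) + "\n").encode("big5")


if __name__ == "__main__":
    # Report which sample characters carry a 0x5C trail byte, then emit
    # the encoded PHP file on stdout as raw bytes.
    for ch in SAMPLE:
        print(ch.encode("big5").hex(), has_backslash_trail(ch), file=sys.stderr)
    sys.stdout.buffer.write(build_php_source(SAMPLE))
```

Saved under any name (say gen_big5_test.py, also hypothetical), the output can be redirected to a .php file and fed to both the flex and the re2c builds to compare how each scanner handles it.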