From: helly@php.net (Marcus Boerger)
Date: Sun, 2 Mar 2008 23:21:34 +0100
To: internals@lists.php.net
Newsgroups: php.internals
Subject: [RFC] Replace the flex-based scanner with an re2c [1] based lexer

RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

Situation:

The current flex-based lexer depends on an outdated and unsupported
flex version. Alternatives include either updating to a newer version
of flex or using re2c, which we already use for a variety of things
(serializing, pdo sql scanning, date/time parsing).
While moving towards a newer flex version would be much easier,
switching to re2c promises a much faster lexer. Even without any
re2c-specific optimizations we already get around a 20% scanner
performance increase. Running the tests shows an overall speedup of 2%.
It is arguable whether this is enough, but re2c has more advantages.
First of all, re2c allows one to scan any type of input (ASCII, UTF-8,
UTF-16, UTF-32). Secondly, it allows for better integration with
Lemon [2], which would be the next step. And thirdly, we can switch to
a reentrant scanner.

Current state:

Flex has been fully replaced by re2c in Zend. We have also switched to
an mmap-based lexer approach for now. However, we had to drop multibyte
support as well as the encoding declare.

The current state can be checked out from Scott's subversion
repository [3] and you can follow the development on his Trac
setup [4]. To build PHP with re2c, you need to grab re2c from its
SourceForge subversion repository [5]. You can also check out the
changes in a patch created Sunday 2nd March against a PHP checkout from
14th February [6].

Further steps:

- Commit this to PHP 5.3.
- Synch to HEAD.
- Add pecl/intl to 5.3.
- Discuss/recreate multibyte support with libintl.

Future steps:

- Replace bison with lemon in PHP 5.4 or HEAD.

Time Frame:

- Commit to 5.3 between the 5th and the 15th of March.
- Synch to HEAD a couple of days later.
- Moving pecl/libintl to ext (depends on the 5.3 RMs' decision).
- After that is done, decide about multibyte support.

Along with the commit to the 5.3 branch there will be a new re2c
version available.

Marcus Boerger
Nuno Lopes
Scott MacVicar

[1] http://re2c.org/
[2] http://www.hwaci.com/sw/lemon/
[3] svn://whisky.macvicar.net/php-re2c
[4] http://trac.macvicar.net/php-re2c/
[5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
[6] http://php.net/~helly/php-re2c-20080302.diff.txt
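
For readers unfamiliar with the scanning style referred to under
"Current state", here is a minimal hand-written sketch in C. It is not
the Zend scanner and not re2c output. It only illustrates scanning an
in-memory buffer (such as an mmap'ed file would provide) through a
cursor and a limit pointer, which re2c-generated code refers to as
YYCURSOR and YYLIMIT. The token kinds, the struct names and the
functions scanner_init() and scanner_next() are hypothetical names made
up for this example.

```c
#include <stddef.h>

/* Hypothetical token kinds for this sketch; not PHP's real token set. */
typedef enum { TOK_EOF, TOK_NUMBER, TOK_IDENT, TOK_OTHER } token_kind;

/* Scanner state: two pointers into a buffer that is already in memory. */
typedef struct {
    const unsigned char *cursor; /* next unread byte (cf. YYCURSOR) */
    const unsigned char *limit;  /* one past the last byte (cf. YYLIMIT) */
} scanner;

/* A token is a slice of the input buffer; nothing is copied. */
typedef struct {
    token_kind kind;
    const unsigned char *start;
    size_t length;
} token;

static int is_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

static int is_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

static int is_ident_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* Point the scanner at a buffer of len bytes. */
void scanner_init(scanner *s, const void *buf, size_t len)
{
    s->cursor = (const unsigned char *)buf;
    s->limit = s->cursor + len;
}

/* Return the next token, or TOK_EOF once the buffer is exhausted. */
token scanner_next(scanner *s)
{
    token t;
    unsigned char c;

    /* Skip whitespace between tokens. */
    while (s->cursor < s->limit && is_space(*s->cursor)) {
        s->cursor++;
    }

    t.kind = TOK_EOF;
    t.start = s->cursor;
    t.length = 0;
    if (s->cursor == s->limit) {
        return t;
    }

    c = *s->cursor++;
    if (is_digit(c)) {
        /* One or more decimal digits. */
        while (s->cursor < s->limit && is_digit(*s->cursor)) {
            s->cursor++;
        }
        t.kind = TOK_NUMBER;
    } else if (is_ident_start(c)) {
        /* Letter or underscore, followed by letters, digits, underscores. */
        while (s->cursor < s->limit &&
               (is_ident_start(*s->cursor) || is_digit(*s->cursor))) {
            s->cursor++;
        }
        t.kind = TOK_IDENT;
    } else {
        /* Any other single byte becomes its own token. */
        t.kind = TOK_OTHER;
    }

    t.length = (size_t)(s->cursor - t.start);
    return t;
}
```

A caller would initialise the scanner once with the buffer and its
length and then call scanner_next() until it returns TOK_EOF. Because
the state is held in a struct passed by the caller rather than in
globals, several scanners can run side by side, which is the property
the RFC refers to as reentrancy.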