Newsgroups: php.internals
From: helly@php.net (Marcus Boerger)
Date: Sat, 22 Mar 2008 15:23:43 +0100
To: internals@lists.php.net
CC: Alan Knowles, Andi Gutmans, Rui Hirokawa, Johannes Schlueter
Message-ID: <497326103.20080322152343@marcus-boerger.de>
In-Reply-To: <47CC16FC.7080802@akbkhome.com>
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

Hello Alan, Andi, Rui,

my impression still is that not a single
person uses this crap. I only hear of people claiming they have heard that
people use it. But what I see is broken code and not a single test. If this
is not going to change, as in we are not getting any .phpt files for this
feature, then there are two ways. First, I implement something that I
personally would expect, and I wouldn't care about anything that is there
right now; or second, we simply get rid of it completely.

So far I have extended re2c to make it easier to deal with other encodings
and even allow multiple char widths at the same time. So I did my homework.
Now I expect that somebody writes tests! Then we could provide a scanner
that works on UCS-2 or on UTF-32 and then try to identify the script
encoding. Then work on the extended charset and do a reverse encoding if
necessary for output. Though even thinking about this approach (still like
what we seem to have right now) really hurts me very badly because it is the
wrong approach. What we want is a working HEAD.

marcus

Monday, March 3, 2008, 4:19:24 PM, you wrote:

> a few replaces with this file should be a good testcase
> - probably worth testing
>   * comments with these characters in them. both /* and //
>   * strings with these characters in them.

> lynx -source
> 'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windows&codepage=950'
> | grep test | grep -v testcase

> I have definitely seen code with Chinese characters in comments and
> strings and a few times function names and variable names with Chinese
> characters...

> Regards
> Alan

> Marcus Boerger wrote:
>> Hello Alan,
>>
>> be my hero then :-) Could you generate a few tests for the multibyte
>> support so that we know how it is used right now and what we need to take
>> care of?
>>
>> marcus
>>
>> Monday, March 3, 2008, 12:48:44 AM, you wrote:
>>
>>> Can you clarify the Multibyte issues:
>>> - I presume this means that it can handle ASCII/UTF8/16 etc.
>>>   but will not handle things like BIG5/GB encoding in source code - this
>>>   may be a bit of an issue around here..
>>>
>>> Regards
>>> Alan
>>>
>>> Marcus Boerger wrote:
>>>
>>>> RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER
>>>>
>>>> Situation:
>>>> The current flex-based lexer depends on an outdated and unsupported flex
>>>> version. Alternatives include either updating to a newer version of flex or
>>>> using re2c, which we already use for a variety of things (serializing, pdo sql
>>>> scanning, date/time parsing). While moving towards a newer flex version would
>>>> be much easier, switching to re2c promises a much faster lexer. Actually,
>>>> without any specific re2c optimizations we already get around a 20% scanner
>>>> performance increase. Running the tests gets an overall speedup of 2%. It is
>>>> arguable whether this is enough, but re2c has more advantages. First of all,
>>>> re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
>>>> Secondly, it allows for better integration with Lemon [2], which would be the
>>>> next step. And thirdly we can switch to a reentrant scanner.
>>>>
>>>> Current state:
>>>> Flex has been fully replaced by re2c in Zend. We have also switched to an
>>>> mmap-based lexer approach for now. However, we had to drop multibyte support
>>>> as well as the encoding declare. The current state can be checked out from
>>>> Scott's subversion repository [3] and you can follow the development on his
>>>> Trac setup [4]. When you want to build php with re2c, then you need to grab
>>>> re2c from its sourceforge subversion repository [5]. You can also check out
>>>> the changes in a patch created Sunday 2nd March against a PHP checkout from
>>>> 14th February [6].
>>>>
>>>> Further steps:
>>>> Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
>>>> multibyte support with libintl.
>>>>
>>>> Future steps:
>>>> Replace bison with lemon in PHP 5.4 or HEAD.
>>>>
>>>> Time Frame:
>>>> Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
>>>> of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
>>>> After that is done, decide about multibyte support. Along with the commit to
>>>> the 5.3 branch there will be a new re2c version available.
>>>>
>>>>
>>>> Marcus Boerger
>>>> Nuno Lopes
>>>> Scott MacVicar
>>>>
>>>>
>>>> [1] http://re2c.org/
>>>> [2] http://www.hwaci.com/sw/lemon/
>>>> [3] svn://whisky.macvicar.net/php-re2c
>>>> [4] http://trac.macvicar.net/php-re2c/
>>>> [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
>>>> [6] http://php.net/~helly/php-re2c-20080302.diff.txt
>>>>
>>
>> Best regards,
>>  Marcus
>>

Best regards,
 Marcus