Newsgroups: php.internals
Date: Mon, 3 Mar 2008 11:33:13 +0100
From: helly@php.net (Marcus Boerger)
Reply-To: Marcus Boerger
To: Stanislav Malyshev
CC: internals@lists.php.net
Message-ID: <1207450994.20080303113313@marcus-boerger.de>
In-Reply-To: <47CB8107.1090802@zend.com>
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

Hello Stanislav,

Monday, March 3, 2008, 5:39:35 AM, you wrote:

> Hi!

>>> Were the stream support issues solved?
>>
>> We completely dropped multibyte support. The reason is that the way we were

> I wasn't asking about multibyte (that we discuss below), but about other
> streams - I think I mentioned it on IRC last time re2c parser was
> discussed. I remember re2c used mmap, and not all files PHP can run can
> be mmap'ed. Was it fixed?

Ah, you didn't write that, so I got confused. Anyway, what we do is try
the following, in order:
1) If mmap is supported, then use it
2) If mmap is not supported or does not work, then read the whole stream
3) If that is not possible, read char by char

The flex-based scanner reads in smaller chunks or char by char, so it is
more or less always like case 3.

>> Once we have finished the move to re2c, we can support all of those
>> correctly. The multibyte support also duplicated the encoding tables
>> otherwise available in ext/mbstring or ext/iconv or pecl/intl.

> pecl/intl per se doesn't have any encoding tables. ICU does, but that
> would mean you have to have ICU to run PHP. That might not be a big
> problem since ICU is supported by IBM (read: good chance more "exotic"
> systems would have support) it is still dependency on non-bundled 3rd
> party library in PHP 5 core. Of course, PHP 6 has this dependency, but
> we might want to not have such things in 5.x so that you won't have to
> change your system too much while staying on 5.x.

Are you saying we cannot depend on ICU in PHP 6 and have to redo it
completely, or what?

>> Rely on a not supported undocumented feature? I am rather able to build php
>> and rewrite that support.

> Being undocumented is nothing to be proud of, however as poorly
> documented as it is, it is used. I'm all for implementing it in a better
> way - and having new parser is a good time to do it. That's exactly the
> reason we shouldn't rush with it but do it right this time. There's no
> burning need to have a new parser right now, so we can have some moment
> to think - ok, how we want multibyte support there to work?
> And if we
> might need some modifications, we'd have time and flexibility to do it,
> not having the code in 5.3 which was supposed to go in RC in Q1 (ending
> 1 month from now).

>> You are free to contribute and make MB support working upfront.

> I know I'm free :) However, as much as I understand the eagerness of
> having it in the source tree, I repeat that I do not think dropping
> multibyte support in 5.3 is acceptable. Thus, if it is committed right
> now, 5.3 would have to be deferred until this is resolved. If this is
> resolved timely for 5.3 - great. If not, we better get it in 5.4 right
> than in 5.3 wrong.

I don't see a problem with redoing multibyte support in a usable way.
Actually we had better redo it anyway, because the current solution is
very bad: it duplicates the input and uses a flattening filter so that
the scanner always sees an eight-bit input stream. Then, when something
needs to be pushed to the output, we recalculate the position in the
original input and use that part.

With re2c we can do something much simpler. When requested, or when
detected via a BOM, we switch to a second version of the scanner that
works on unsigned int and supports the full Unicode character set (the
only thing to do for re2c is to switch the input type and, guess what,
this is already in production on a lot of systems). Other approaches are
to natively support UTF-8 and UTF-16 besides 8 bit and UTF-32.
Furthermore, we can apply any kind of filtering correctly on top of the
UTF-* scanner.

I know there is some work left, but if we do not do that work now then we
basically have two engines. In that case I'll just rewrite the engine
completely and replace it.

Best regards,
Marcus
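
The BOM-based scanner switch described above could be sketched roughly as
follows. This is only an illustration in Python, not the re2c or PHP
code: the names detect_bom and scanner_input are hypothetical, and the
list of code points merely stands in for an unsigned int input buffer.

```python
import codecs
from typing import List, Optional, Tuple

# The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
# because the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def scanner_input(data: bytes,
                  requested: Optional[str] = None) -> Tuple[str, List[int]]:
    """Choose a scanner variant and build its input (hypothetical helper).

    Returns ("8bit", byte values) when no encoding is requested or
    detected, otherwise ("unicode", code points) for the wide scanner.
    """
    encoding, skip = detect_bom(data)
    if encoding is None:
        encoding = requested          # fall back to an explicit request
    if encoding is None:
        return "8bit", list(data)     # plain eight-bit scanner
    text = data[skip:].decode(encoding)
    return "unicode", [ord(ch) for ch in text]
```

With this shape the eight-bit path never pays for decoding, and any
further filtering would sit on top of the code-point input rather than on
a flattened copy of the source.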