Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:35907
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: unknown (pb1.pair.com: domain php.net does not designate 85.214.94.56 as permitted sender)
Date: Mon, 3 Mar 2008 11:33:13 +0100
Reply-To: Marcus Boerger <helly@php.net>
Message-ID: <1207450994.20080303113313@marcus-boerger.de>
To: Stanislav Malyshev <stas@zend.com>
CC: internals@lists.php.net
In-Reply-To: <47CB8107.1090802@zend.com>
References: <1706278209.20080302232134@marcus-boerger.de> <47CB2E9D.6010102@zend.com> <1642796941.20080303002651@marcus-boerger.de> <47CB8107.1090802@zend.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
From: helly@php.net (Marcus Boerger)

Hello Stanislav,

Monday, March 3, 2008, 5:39:35 AM, you wrote:

> Hi!

>>> Were the stream support issues solved?
>> 
>> We completely dropped multibyte support. The reason is that the way we were

> I wasn't asking about multibyte (that we discuss below), but about other 
> streams - I think I mentioned it on IRC last time re2c parser was 
> discussed. I remember re2c used mmap, and not all files PHP can run can 
> be mmap'ed. Was it fixed?

Ah, you didn't write that, so I got confused. Anyway, we try the following,
in order:
1) If mmap is supported, use it.
2) If mmap is not supported or does not work, read the whole stream.
3) If that is not possible, read char by char.

The flex-based scanner reads in smaller chunks or char by char, so it is
more or less always like case 3.
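
To make the order above concrete, here is a minimal sketch of the fallback
chain. It is written in Python for brevity and is not the engine code; the
name load_source and the allow_mmap switch are hypothetical and exist only
for illustration.

```python
import mmap


def load_source(path, allow_mmap=True):
    """Return (strategy, data) for the script at `path`.

    Hypothetical helper: tries mmap first, then one bulk read,
    then a byte-by-byte read as the last resort.
    """
    with open(path, "rb", buffering=0) as f:
        # 1) Use mmap when it is available and works for this file.
        if allow_mmap:
            try:
                with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                    # Copying out of the mapping keeps the sketch simple;
                    # a real scanner would scan the mapping in place.
                    return "mmap", m[:]
            except (ValueError, OSError):
                pass  # e.g. empty file, pipe, or no mmap support

        # 2) Otherwise read the whole stream in one go.
        try:
            return "read_all", f.read()
        except OSError:
            pass

        # 3) Last resort: read one byte at a time.
        chunks = []
        while True:
            c = f.read(1)
            if not c:
                break
            chunks.append(c)
        return "char_by_char", b"".join(chunks)
```

Whichever case applies, the scanner ends up with the same bytes; only the
cost of getting them differs.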

>> Once we have finished the move to re2c, we can support all of those
>> correctly. The multibyte support also duplicated the encoding tables
>> otherwise available in ext/mbstring or ext/iconv or pecl/intl.

> pecl/intl per se doesn't have any encoding tables. ICU does, but that 
> would mean you have to have ICU to run PHP. That might not be a big 
> problem since ICU is supported by IBM (read: good chance more "exotic" 
> systems would have support) it is still dependency on non-bundled 3rd 
> party library in PHP 5 core. Of course, PHP 6 has this dependency, but 
> we might want to not have such things in 5.x so that you won't have to 
> change your system too much while staying on 5.x.

Are you saying we cannot depend on ICU in PHP 6 and have to redo it
completely or what?

>> Rely on a not supported undocumented feature? I am rather able to build php
>> and rewrite that support.

> Being undocumented is nothing to be proud of, however as poorly 
> documented as it is, it is used. I'm all for implementing it in a better 
> way - and having new parser is a good time to do it. That's exactly the 
> reason we shouldn't rush with it but do it right this time. There's no 
> burning need to have a new parser right now, so we can have some moment 
> to think - ok, how we want multibyte support there to work? And if we 
> might need some modifications, we'd have time and flexibility to do it, 
> not having the code in 5.3 which was supposed to go in RC in Q1 (ending 
> 1 month from now).

>> You are free to contribute and make MB support working upfront.

> I know I'm free :) However, as much as I understand the eagerness of 
> having it in the source tree, I repeat that I do not think dropping 
> multibyte support in 5.3 is acceptable. Thus, if it is committed right 
> now, 5.3 would have to be deferred until this is resolved. If this is 
> resolved timely for 5.3 - great. If not, we better get it in 5.4 right 
> than in 5.3 wrong.

I don't see a problem with redoing multibyte support in a usable way.
Actually, we had better redo it anyway, because the current solution is
very bad. That is, it duplicates the input and uses a flattening filter so
that the scanner always sees an eight-bit input stream. Then, when
something needs to get pushed to the output, we recalculate the position in
the original input and use that part. With re2c we can do something much
simpler. When requested, or when detected via a BOM, we switch to a second
version of the scanner that works on unsigned int and supports the full
Unicode character set (the only thing to do for re2c is to switch the input
type and, guess what, this is already in production on a lot of systems).
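
As a rough illustration of the switch, the sketch below picks the scanner
input type from the BOM. It is Python rather than re2c or C, and the names
detect_bom and to_scanner_input are hypothetical; the "wide" list of code
points stands in for an unsigned int input buffer.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def to_scanner_input(data):
    """Return ("byte", [byte values]) or ("wide", [code points])."""
    encoding, skip = detect_bom(data)
    if encoding is None:
        # No BOM: keep feeding the plain eight-bit scanner.
        return "byte", list(data)
    # BOM found: decode once and scan full code points instead of bytes.
    text = data[skip:].decode(encoding)
    return "wide", [ord(ch) for ch in text]
```

For example, to_scanner_input(b"\xef\xbb\xbf\xc3\xa9") yields
("wide", [233]), while input without a BOM stays on the byte path.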

Other approaches are to natively support UTF-8 and UTF-16 besides 8-bit
and UTF-32. Furthermore, we can apply any kind of filtering correctly on
top of the UTF-* scanner.

I know there is some work left, but if we do not do the work now, then
we basically have two engines. In that case I'll just rewrite the engine
completely and replace it.

Best regards,
 Marcus