Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:43842
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: unknown (pb1.pair.com: domain php.net does not designate 208.43.138.18 as permitted sender)
Message-ID: <49FF64CC.3000208@php.net>
Date: Mon, 04 May 2009 14:57:32 -0700
Reply-To: shire@php.net
User-Agent: Postbox 1.0b11 (Macintosh/2009041623)
MIME-Version: 1.0
To: Matt Wilmas <php_lists@realplain.com>
CC: internals@lists.php.net, Nuno Lopes <nlopess@php.net>, 
 Lukas Kahwe Smith <mls@pooteeweet.org>,
 Dmitry Stogov <dmitry@zend.com>
References: <6604D94D40FD465F992144110B075BB5@pc1> <E17F0EFAC0F847C994EBDA6D020E52BF@PC3EE1F19287> <9D5D4CBF-5CB1-47EC-81F4-59E3C48EEEEF@pooteeweet.org> <C8E7CB69316B42D1B8BB5ED78408088D@PC3EE1F19287> <B06CC8E0A66F4959B3BFC4AA0698223D@pc1> <49FE9AE7.4000008@php.net> <6886527DD6D44DFAAC29221BAE374B07@pc1> <49FF410B.2030807@php.net> <3BCF6C07D9B2434D91A32A8E28C69A41@pc1>
In-Reply-To: <3BCF6C07D9B2434D91A32A8E28C69A41@pc1>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] [PATCH] Scanner "diet" with fixes, etc.
From: shire@php.net (shire)

Matt Wilmas wrote:
> Hi Brian,
>
> ----- Original Message -----
> From: "shire"
> Sent: Monday, May 04, 2009
>
>>
>> Hey Matt,
>>
>> Matt Wilmas wrote:
>>
>>>>>> +/* To save initial string length after scanning to first variable,
>>>>>> CG(doc_comment_len) can be reused */
>>>>>> +#define double_quotes_scanned_len CG(doc_comment_len)
>>>>>> +
>>>>
>>>> (minor) Maybe we should rename this var if we're going to use it for
>>>> other
>>>> purposes, this doesn't really save any typing. Also if we do want the
>>>> define maybe we should upper case it so it's more obvious?
>>>
>>> Yeah, I tried to think of other ways to do it, but just left it trying
>>> to look like another variable (not to save typing). Well, it can easily
>>> be changed later if a "cleaner" way is decided...
>>>
>>
>> Yeah I would just prefer if it was more obvious that it is *not* a
>> variable ;-)
>
> How about this?
>
> #define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len)
> #define GET_DOUBLE_QUOTES_SCANNED_LENGTH() CG(doc_comment_len)
>

Sure, works for me ;-)


>>>>>> + while (YYCURSOR < YYLIMIT) {
>>>>>> + switch (*YYCURSOR++) {
>>>>
>>>> In the example above, which we have a couple examples of here, we don't
>>>> obey the YYFILL macro to detect if we have exceeded our EOF. This
>>>> *might* be a problem, but only really depends on if we intend to use
>>>> the
>>>> YYFILL as a solution for exceeding our mmap bounds.
>>>
>>> I don't understand what the problem might be? The YYCURSOR < YYLIMIT
>>> check is what the YYFILL has been doing. If you mean after changes
>>> later, as long as the the whole thing is mmap()'d (which I'm assuming
>>> would be the case?), it just "looks" like a standard string, with
>>> terminating '\0', right? And there's no reading past YYLIMIT.
>>
>> Sorry yeah this wouldn't be a problem currently, but only if we try to
>> fix the mmap issue by using YYFILL to realloc more space into the buffer.
>> Then that macro would change to something more complicated. (per my
>> previous replies with Arnaud)
>
> Gotcha. If something changes, YYFILL -- or something to handle what
> needs to be done -- could just be added to the manual parts as
> necessary, right?
>
>> Have you considered using the lexer STATES and regex's instead of the
>> manual C code for scanning the rest. It seems like if we have a one-char
>> regex match for what the C code is doing we could handle this in the
>> lexer without a lot of manual intervention (need to look at it more, just
>> a thought I had earlier, the expressions are clearer now with your patch
>> applied) ;-)
>
> It seems that matching one-char-at-a-time with re2c would be more
> complicated than the manual way, not to mention slower than the old
> (current) way.
>
> Do you have any objection (well, you've kinda mentioned some :-)) if I'd
> commit the changes in a little while like Dmitry thought could be done?

Well I'm wondering if something more along these lines (just did this on-top of your patch as you cleaned up a lot) might be more appealing.  (I'm not sure how much slower this would be than the current implementation, obviously it'll be somewhat slower, I'm basically just doing what you did in C but in the scanner instead of course).

<ST_IN_SCRIPTING>"#"|"//" {
     BEGIN(ST_EOL_COMMENT);
     yymore();
}

<ST_EOL_COMMENT>({NEWLINE}|"%>"|"?>") {
     char tmp = *(YYCURSOR-1);
     if ((tmp == '%' && CG(asp_tags)) | tmp == '?') {
       YYCURSOR -= 2;
     }
     CG(zend_lineno)++;
     BEGIN(ST_IN_SCRIPTING);
     return T_COMMENT;
}

<ST_EOL_COMMENT>{ANY_CHAR} {
     if (YYCURSOR >= YYLIMIT) {
       BEGIN(ST_IN_SCRIPTING);
       return T_COMMENT;
     }
     yymore();
}



Let me know what the thoughts are on the above, if we don't want that then I say yeah, commit away!

-shire