Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:43842 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 48834 invoked from network); 4 May 2009 21:57:39 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 4 May 2009 21:57:39 -0000 Authentication-Results: pb1.pair.com smtp.mail=shire@php.net; spf=unknown; sender-id=unknown Authentication-Results: pb1.pair.com header.from=shire@php.net; sender-id=unknown Received-SPF: unknown (pb1.pair.com: domain php.net does not designate 208.43.138.18 as permitted sender) X-PHP-List-Original-Sender: shire@php.net X-Host-Fingerprint: 208.43.138.18 sizzo.org Linux 2.6 Received: from [208.43.138.18] ([208.43.138.18:50058] helo=sizzo.org) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id EE/E0-57065-2D46FF94 for ; Mon, 04 May 2009 17:57:39 -0400 Received: from shirebook.local (outbound500a.pasd.tfbnw.net [204.15.21.171]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by sizzo.org (Postfix) with ESMTPSA id A6DDCCBE653; Mon, 4 May 2009 14:57:35 -0700 (PDT) Message-ID: <49FF64CC.3000208@php.net> Date: Mon, 04 May 2009 14:57:32 -0700 Reply-To: shire@php.net User-Agent: Postbox 1.0b11 (Macintosh/2009041623) MIME-Version: 1.0 To: Matt Wilmas CC: internals@lists.php.net, Nuno Lopes , Lukas Kahwe Smith , Dmitry Stogov References: <6604D94D40FD465F992144110B075BB5@pc1> <9D5D4CBF-5CB1-47EC-81F4-59E3C48EEEEF@pooteeweet.org> <49FE9AE7.4000008@php.net> <6886527DD6D44DFAAC29221BAE374B07@pc1> <49FF410B.2030807@php.net> <3BCF6C07D9B2434D91A32A8E28C69A41@pc1> In-Reply-To: <3BCF6C07D9B2434D91A32A8E28C69A41@pc1> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [PHP-DEV] [PATCH] Scanner "diet" with fixes, etc. From: shire@php.net (shire) Matt Wilmas wrote: > Hi Brian, > > ----- Original Message ----- > From: "shire" > Sent: Monday, May 04, 2009 > >> >> Hey Matt, >> >> Matt Wilmas wrote: >> >>>>>> +/* To save initial string length after scanning to first variable, >>>>>> CG(doc_comment_len) can be reused */ >>>>>> +#define double_quotes_scanned_len CG(doc_comment_len) >>>>>> + >>>> >>>> (minor) Maybe we should rename this var if we're going to use it for >>>> other >>>> purposes, this doesn't really save any typing. Also if we do want the >>>> define maybe we should upper case it so it's more obvious? >>> >>> Yeah, I tried to think of other ways to do it, but just left it trying >>> to look like another variable (not to save typing). Well, it can easily >>> be changed later if a "cleaner" way is decided... >>> >> >> Yeah I would just prefer if it was more obvious that it is *not* a >> variable ;-) > > How about this? > > #define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len) > #define GET_DOUBLE_QUOTES_SCANNED_LENGTH() CG(doc_comment_len) > Sure, works for me ;-) >>>>>> + while (YYCURSOR < YYLIMIT) { >>>>>> + switch (*YYCURSOR++) { >>>> >>>> In the example above, which we have a couple examples of here, we don't >>>> obey the YYFILL macro to detect if we have exceeded our EOF. This >>>> *might* be a problem, but only really depends on if we intend to use >>>> the >>>> YYFILL as a solution for exceeding our mmap bounds. >>> >>> I don't understand what the problem might be? The YYCURSOR < YYLIMIT >>> check is what the YYFILL has been doing. If you mean after changes >>> later, as long as the the whole thing is mmap()'d (which I'm assuming >>> would be the case?), it just "looks" like a standard string, with >>> terminating '\0', right? And there's no reading past YYLIMIT. >> >> Sorry yeah this wouldn't be a problem currently, but only if we try to >> fix the mmap issue by using YYFILL to realloc more space into the buffer. >> Then that macro would change to something more complicated. (per my >> previous replies with Arnaud) > > Gotcha. If something changes, YYFILL -- or something to handle what > needs to be done -- could just be added to the manual parts as > necessary, right? > >> Have you considered using the lexer STATES and regex's instead of the >> manual C code for scanning the rest. It seems like if we have a one-char >> regex match for what the C code is doing we could handle this in the >> lexer without a lot of manual intervention (need to look at it more, just >> a thought I had earlier, the expressions are clearer now with your patch >> applied) ;-) > > It seems that matching one-char-at-a-time with re2c would be more > complicated than the manual way, not to mention slower than the old > (current) way. > > Do you have any objection (well, you've kinda mentioned some :-)) if I'd > commit the changes in a little while like Dmitry thought could be done? Well I'm wondering if something more along these lines (just did this on-top of your patch as you cleaned up a lot) might be more appealing. (I'm not sure how much slower this would be than the current implementation, obviously it'll be somewhat slower, I'm basically just doing what you did in C but in the scanner instead of course). "#"|"//" { BEGIN(ST_EOL_COMMENT); yymore(); } ({NEWLINE}|"%>"|"?>") { char tmp = *(YYCURSOR-1); if ((tmp == '%' && CG(asp_tags)) | tmp == '?') { YYCURSOR -= 2; } CG(zend_lineno)++; BEGIN(ST_IN_SCRIPTING); return T_COMMENT; } {ANY_CHAR} { if (YYCURSOR >= YYLIMIT) { BEGIN(ST_IN_SCRIPTING); return T_COMMENT; } yymore(); } Let me know what the thoughts are on the above, if we don't want that then I say yeah, commit away! -shire