[PATCH] Scanner "diet" with fixes, etc.

16 years ago by Matt Wilmas — view source

unread

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible way to
work around the re2c EOF handling issues.

The primary change is to do a "manual scan" like I talked about in areas
that match large amounts and can contain NULL bytes (strings/comments, which
are now scanned faster too), as is done for inline HTML. I called it a
"diet" :-) because it removes my complicated string regex patterns from a
couple years ago, which doesn't make the .l file much smaller after adding
the manual scan code (easier to understand...?), but it does result in a
~34k reduction of 5.3's generated .c file...

This fixes Bug #46817, as well as a better, more proper fix for the older
Bug #42767, both related to ending comments.

Now inline HTML chunks aren't broken up when a tag starting with "s" is
encountered (<script> for JS, <span>, etc.), since it's unlikely to be a
long PHP <script> tag.

If an opening PHP <SCRIPT> tag was used with a capital "S", it was missed if
it wasn't the first thing scanned:

var_dump(token_get_all("HTML... <SCRIPT language=php>"));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all("<?php // Comment\r\n?>"));

Finally, part of the optimized scanning is that, for double quoted strings,
when the first variable is encountered (making it non-constant), the amount
that's been scanned up to that point is remembered, which can then be
skipped over (up to the variable) after returning the quote token.
Previously that initial part of the string was rescanned -- the cost
dependent on how far "into" the string the first var is.

I think that's about all -- I'll send another message if I forgot to
mention anything... Just wanted to send this along quick for to you guys to
look at or whatever. It was basically done last week, I just had to do a
couple finishing touches and verify that everything was OK.

http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't test
yet.)
http://realplain.com/php/scanner_diet_5_3.diff

Thanks,
Matt

16 years ago by Dmitry Stogov — view source

unread

Hi Matt,

Does this patch fix EOF handling issues related to mmap()? (e.g. parsing
of files with size 4096, 8192, ...). Now we have two dirty fixes to
handle them correctly.

The patch is quite big to understand it quickly. I'll probably take a
look on weekend.

-ANY_CHAR [^\x00]
+ANY_CHAR [^]

Is [^] a correct regular expression?

Thanks. Dmitry.

Matt Wilmas wrote:

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible way
to work around the re2c EOF handling issues.

The primary change is to do a "manual scan" like I talked about in areas
that match large amounts and can contain NULL bytes (strings/comments,
which are now scanned faster too), as is done for inline HTML. I called
it a "diet" :-) because it removes my complicated string regex patterns
from a couple years ago, which doesn't make the .l file much smaller
after adding the manual scan code (easier to understand...?), but it
does result in a ~34k reduction of 5.3's generated .c file...

This fixes Bug #46817, as well as a better, more proper fix for the
older Bug #42767, both related to ending comments.

Now inline HTML chunks aren't broken up when a tag starting with "s" is
encountered (<script> for JS, <span>, etc.), since it's unlikely to be a
long PHP <script> tag.

If an opening PHP <SCRIPT> tag was used with a capital "S", it was
missed if it wasn't the first thing scanned:

var_dump(token_get_all("HTML... <SCRIPT language=php>"));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all("<?php // Comment\r\n?>"));

Finally, part of the optimized scanning is that, for double quoted
strings, when the first variable is encountered (making it
non-constant), the amount that's been scanned up to that point is
remembered, which can then be skipped over (up to the variable) after
returning the quote token. Previously that initial part of the string
was rescanned -- the cost dependent on how far "into" the string the
first var is.

I think that's about all -- I'll send another message if I forgot to
mention anything... Just wanted to send this along quick for to you
guys to look at or whatever. It was basically done last week, I just
had to do a couple finishing touches and verify that everything was OK.

http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't
test yet.)
http://realplain.com/php/scanner_diet_5_3.diff

Thanks,
Matt

16 years ago by Richard Quadling — view source

unread

2009/4/30 Dmitry Stogov dmitry@zend.com:

Hi Matt,

Does this patch fix EOF handling issues related to mmap()? (e.g. parsing of
files with size 4096, 8192, ...). Now we have two dirty fixes to handle them
correctly.

The patch is quite big to understand it quickly. I'll probably take a look
on weekend.

-ANY_CHAR [^\x00]
+ANY_CHAR [^]

Is [^] a correct regular expression?

Thanks. Dmitry.

Matt Wilmas wrote:

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible way to
work around the re2c EOF handling issues.

The primary change is to do a "manual scan" like I talked about in areas
that match large amounts and can contain NULL bytes (strings/comments, which
are now scanned faster too), as is done for inline HTML. I called it a
"diet" :-) because it removes my complicated string regex patterns from a
couple years ago, which doesn't make the .l file much smaller after adding
the manual scan code (easier to understand...?), but it does result in a
~34k reduction of 5.3's generated .c file...

This fixes Bug #46817, as well as a better, more proper fix for the older
Bug #42767, both related to ending comments.

Now inline HTML chunks aren't broken up when a tag starting with "s" is
encountered (<script> for JS, <span>, etc.), since it's unlikely to be a
long PHP <script> tag.

If an opening PHP <SCRIPT> tag was used with a capital "S", it was missed
if it wasn't the first thing scanned:

var_dump(token_get_all("HTML... <SCRIPT language=php>"));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all("<?php // Comment\r\n?>"));

Finally, part of the optimized scanning is that, for double quoted
strings, when the first variable is encountered (making it non-constant),
the amount that's been scanned up to that point is remembered, which can
then be skipped over (up to the variable) after returning the quote token.
Previously that initial part of the string was rescanned -- the cost
dependent on how far "into" the string the first var is.

I think that's about all -- I'll send another message if I forgot to
mention anything... Just wanted to send this along quick for to you guys to
look at or whatever. It was basically done last week, I just had to do a
couple finishing touches and verify that everything was OK.

http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't
test yet.)
http://realplain.com/php/scanner_diet_5_3.diff

Thanks,
Matt

--

Hmm. RegexBuddy explains that as ...

Match any character that is NOT a “]”

But it sure doesn't look valid without a closing ].

--

Richard Quadling
Zend Certified Engineer : http://zend.com/zce.php?c=ZEND002498&r=213474731
"Standing on the shoulders of some very clever giants!"

16 years ago by Scott MacVicar — view source

unread

[^] is a special case to write a portable match any character in re2c.

Scott

Dmitry Stogov wrote:

Hi Matt,

Does this patch fix EOF handling issues related to mmap()? (e.g. parsing
of files with size 4096, 8192, ...). Now we have two dirty fixes to
handle them correctly.

The patch is quite big to understand it quickly. I'll probably take a
look on weekend.

-ANY_CHAR [^\x00]
+ANY_CHAR [^]

Is [^] a correct regular expression?

Thanks. Dmitry.

Matt Wilmas wrote:

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible
way to work around the re2c EOF handling issues.

The primary change is to do a "manual scan" like I talked about in
areas that match large amounts and can contain NULL bytes
(strings/comments, which are now scanned faster too), as is done for
inline HTML. I called it a "diet" :-) because it removes my
complicated string regex patterns from a couple years ago, which
doesn't make the .l file much smaller after adding the manual scan
code (easier to understand...?), but it does result in a ~34k
reduction of 5.3's generated .c file...

This fixes Bug #46817, as well as a better, more proper fix for the
older Bug #42767, both related to ending comments.

Now inline HTML chunks aren't broken up when a tag starting with "s"
is encountered (<script> for JS, <span>, etc.), since it's unlikely to
be a long PHP <script> tag.

If an opening PHP <SCRIPT> tag was used with a capital "S", it was
missed if it wasn't the first thing scanned:

var_dump(token_get_all("HTML... <SCRIPT language=php>"));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all("<?php // Comment\r\n?>"));

Finally, part of the optimized scanning is that, for double quoted
strings, when the first variable is encountered (making it
non-constant), the amount that's been scanned up to that point is
remembered, which can then be skipped over (up to the variable) after
returning the quote token. Previously that initial part of the string
was rescanned -- the cost dependent on how far "into" the string the
first var is.

I think that's about all -- I'll send another message if I forgot to
mention anything... Just wanted to send this along quick for to you
guys to look at or whatever. It was basically done last week, I just
had to do a couple finishing touches and verify that everything was OK.

http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't
test yet.)
http://realplain.com/php/scanner_diet_5_3.diff

Thanks,
Matt

16 years ago by Richard Quadling — view source

unread

2009/4/30 Scott MacVicar scottmac@php.net:

[^] is a special case to write a portable match any character in re2c.

Scott

Dmitry Stogov wrote:

Hi Matt,

Does this patch fix EOF handling issues related to mmap()? (e.g. parsing
of files with size 4096, 8192, ...). Now we have two dirty fixes to
handle them correctly.

The patch is quite big to understand it quickly. I'll probably take a
look on weekend.

-ANY_CHAR [^\x00]
+ANY_CHAR [^]

Is [^] a correct regular expression?

Thanks. Dmitry.

Matt Wilmas wrote:

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible
way to work around the re2c EOF handling issues.

The primary change is to do a "manual scan" like I talked about in
areas that match large amounts and can contain NULL bytes
(strings/comments, which are now scanned faster too), as is done for
inline HTML. I called it a "diet" :-) because it removes my
complicated string regex patterns from a couple years ago, which
doesn't make the .l file much smaller after adding the manual scan
code (easier to understand...?), but it does result in a ~34k
reduction of 5.3's generated .c file...

This fixes Bug #46817, as well as a better, more proper fix for the
older Bug #42767, both related to ending comments.

Now inline HTML chunks aren't broken up when a tag starting with "s"
is encountered (<script> for JS, <span>, etc.), since it's unlikely to
be a long PHP <script> tag.

If an opening PHP <SCRIPT> tag was used with a capital "S", it was
missed if it wasn't the first thing scanned:

var_dump(token_get_all("HTML... <SCRIPT language=php>"));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all("<?php // Comment\r\n?>"));

Finally, part of the optimized scanning is that, for double quoted
strings, when the first variable is encountered (making it
non-constant), the amount that's been scanned up to that point is
remembered, which can then be skipped over (up to the variable) after
returning the quote token. Previously that initial part of the string
was rescanned -- the cost dependent on how far "into" the string the
first var is.

I think that's about all -- I'll send another message if I forgot to
mention anything... Just wanted to send this along quick for to you
guys to look at or whatever. It was basically done last week, I just
had to do a couple finishing touches and verify that everything was OK.

http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't
test yet.)
http://realplain.com/php/scanner_diet_5_3.diff

Thanks,
Matt

--

Aha - bottom of section at http://re2c.org/manual.html#lbAJ

--

Richard Quadling
Zend Certified Engineer : http://zend.com/zce.php?c=ZEND002498&r=213474731
"Standing on the shoulders of some very clever giants!"

16 years ago by Antony Dovgal — view source

unread

That's what I call 'overquoting'.

2009/4/30 Scott MacVicar scottmac@php.net:

[^] is a special case to write a portable match any character in re2c.

Scott

Dmitry Stogov wrote:

Hi Matt,

Does this patch fix EOF handling issues related to mmap()? (e.g. parsing
of files with size 4096, 8192, ...). Now we have two dirty fixes to
handle them correctly.

The patch is quite big to understand it quickly. I'll probably take a
look on weekend.

-ANY_CHAR [^\x00]
+ANY_CHAR [^]

Is [^] a correct regular expression?

Thanks. Dmitry.

Matt Wilmas wrote:

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible
way to work around the re2c EOF handling issues.

The primary change is to do a "manual scan" like I talked about in
areas that match large amounts and can contain NULL bytes
(strings/comments, which are now scanned faster too), as is done for
inline HTML. I called it a "diet" :-) because it removes my
complicated string regex patterns from a couple years ago, which
doesn't make the .l file much smaller after adding the manual scan
code (easier to understand...?), but it does result in a ~34k
reduction of 5.3's generated .c file...

This fixes Bug #46817, as well as a better, more proper fix for the
older Bug #42767, both related to ending comments.

Now inline HTML chunks aren't broken up when a tag starting with "s"
is encountered (<script> for JS, <span>, etc.), since it's unlikely to
be a long PHP <script> tag.

If an opening PHP <SCRIPT> tag was used with a capital "S", it was
missed if it wasn't the first thing scanned:

var_dump(token_get_all("HTML... <SCRIPT language=php>"));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all("<?php // Comment\r\n?>"));

Finally, part of the optimized scanning is that, for double quoted
strings, when the first variable is encountered (making it
non-constant), the amount that's been scanned up to that point is
remembered, which can then be skipped over (up to the variable) after
returning the quote token. Previously that initial part of the string
was rescanned -- the cost dependent on how far "into" the string the
first var is.

I think that's about all -- I'll send another message if I forgot to
mention anything... Just wanted to send this along quick for to you
guys to look at or whatever. It was basically done last week, I just
had to do a couple finishing touches and verify that everything was OK.

http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't
test yet.)
http://realplain.com/php/scanner_diet_5_3.diff

Thanks,
Matt

--

Aha - bottom of section at http://re2c.org/manual.html#lbAJ

--
Wbr,
Antony Dovgal

16 years ago by Matt Wilmas — view source

unread

Hi Dmitry,

----- Original Message -----
From: "Dmitry Stogov"
Sent: Thursday, April 30, 2009

Hi Matt,

Does this patch fix EOF handling issues related to mmap()? (e.g. parsing
of files with size 4096, 8192, ...). Now we have two dirty fixes to handle
them correctly.

I'm not really sure about the mmap() stuff, and haven't followed the
workarounds that you or Brian have done... But as far as scanning
strings/comments, there shouldn't be an EOF problem with them after my
changes. Those were the ones that could contain NULL and got the EOF
NULL(s) confused and scan too far. I guess that's the ZEND_MMAP_AHEAD
padding...? So I don't know if reverting the "dirty fixes" would then still
cause any problems with re2c scanning other tokens (that can't contain
NULL).

Brian would know more I guess, sooner than I could look into what to revert,
when I don't fully understand that part.

The patch is quite big to understand it quickly. I'll probably take a look
on weekend.

OK. I think the patch looks more complicated and is larger because of so
many removed lines. :-)

-ANY_CHAR [^\x00]
+ANY_CHAR [^]

Is [^] a correct regular expression?

In re2c, yes, as Scott said. [^] is what it was before Brian excluded NULL.

Thanks. Dmitry.

Matt

16 years ago by Nuno Lopes — view source

unread

The patch looks generally ok. However I'll need a few more days to review it
carefully and throughly. (you can merge it in the meantime if you want).
I'm just slighty concern with the amount of parsing we are now doing by
hand, and with the possible (local) security bugs we might be introducing..

Nuno

----- Original Message -----

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible way to
work around the re2c EOF handling issues.

The primary change is to do a "manual scan" like I talked about in areas
that match large amounts and can contain NULL bytes (strings/comments,
which are now scanned faster too), as is done for inline HTML. I called
it a "diet" :-) because it removes my complicated string regex patterns
from a couple years ago, which doesn't make the .l file much smaller after
adding the manual scan code (easier to understand...?), but it does result
in a ~34k reduction of 5.3's generated .c file...

This fixes Bug #46817, as well as a better, more proper fix for the older
Bug #42767, both related to ending comments.

Now inline HTML chunks aren't broken up when a tag starting with "s" is
encountered (<script> for JS, <span>, etc.), since it's unlikely to be a
long PHP <script> tag.

If an opening PHP <SCRIPT> tag was used with a capital "S", it was missed
if it wasn't the first thing scanned:

var_dump(token_get_all("HTML... <SCRIPT language=php>"));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all("<?php // Comment\r\n?>"));

Finally, part of the optimized scanning is that, for double quoted
strings, when the first variable is encountered (making it non-constant),
the amount that's been scanned up to that point is remembered, which can
then be skipped over (up to the variable) after returning the quote token.
Previously that initial part of the string was rescanned -- the cost
dependent on how far "into" the string the first var is.

I think that's about all -- I'll send another message if I forgot to
mention anything... Just wanted to send this along quick for to you guys
to look at or whatever. It was basically done last week, I just had to do
a couple finishing touches and verify that everything was OK.

http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't
test yet.)
http://realplain.com/php/scanner_diet_5_3.diff

Thanks,
Matt

16 years ago by Lukas Kahwe Smith — view source

unread

The patch looks generally ok. However I'll need a few more days to
review it carefully and throughly. (you can merge it in the meantime
if you want).
I'm just slighty concern with the amount of parsing we are now doing
by hand, and with the possible (local) security bugs we might be
introducing..

Am I understanding this properly, that this addresses the re2c EOF
bug? So we have an RC planned for next week (freeze Monday evening).
Can you get this fixed and released by then as Marcus is unable to do
this himself?

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

16 years ago by Nuno Lopes — view source

unread

The patch looks generally ok. However I'll need a few more days to
review it carefully and throughly. (you can merge it in the meantime if
you want).
I'm just slighty concern with the amount of parsing we are now doing by
hand, and with the possible (local) security bugs we might be
introducing..

Am I understanding this properly, that this addresses the re2c EOF bug?
So we have an RC planned for next week (freeze Monday evening). Can you
get this fixed and released by then as Marcus is unable to do this
himself?

So this addresses some of the re2c EOF problems, but I don't know if it
addresses all of them or not. I haven't had the time yet for a full review.
Anyway, Matt can surelly comment on this.

Nuno

16 years ago by Matt Wilmas — view source

unread

Hi guys,

----- Original Message -----
From: "Nuno Lopes"
Sent: Thursday, April 30, 2009

The patch looks generally ok. However I'll need a few more days to
review it carefully and throughly. (you can merge it in the meantime if
you want).
I'm just slighty concern with the amount of parsing we are now doing by
hand, and with the possible (local) security bugs we might be
introducing..

Am I understanding this properly, that this addresses the re2c EOF bug?
So we have an RC planned for next week (freeze Monday evening). Can you
get this fixed and released by then as Marcus is unable to do this
himself?

So this addresses some of the re2c EOF problems, but I don't know if it
addresses all of them or not. I haven't had the time yet for a full
review.
Anyway, Matt can surelly comment on this.

Yes, it addresses the re2c EOF issues for strings and comments, as they were
the problem ones that allowed NULL bytes, and scanned past the EOF NULL. As
I said to Dmitry, I'm not sure if it's now possible to remove the temporary
mmap() fixes that he wanted removed before the next RC (??), or if there
would still be problems with re2c scanning other tokens, even though they
can't contain NULLs. I didn't attempt to make any changes there, since I'm
not familiar with what's been done.

I just wanted to finally send the patch for others to review, and decide
what to do, so I won't commit any changes yet in the meantime. :-)

Nuno

Matt

16 years ago by shire — view source

unread

Hey Matt,

Thanks for posting, sorry for not having a chance to reply to this sooner. Maybe couple things from the patch,

+/* To save initial string length after scanning to first variable, CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)

(minor) Maybe we should rename this var if we're going to use it for other purposes, this doesn't really save any typing. Also if we do want the define maybe we should upper case it so it's more obvious?

while (YYCURSOR < YYLIMIT) {
   switch (*YYCURSOR++) {

In the example above, which we have a couple examples of here, we don't obey the YYFILL macro to detect if we have exceeded our EOF. This might be a problem, but only really depends on if we intend to use the YYFILL as a solution for exceeding our mmap bounds.

Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in we need to find a solution to that, perhaps I can play with that this week too as I think I'm seeing some related issues in my testing of 5.3. Essentially we abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to the mmap call which isn't at all valid and only really works up to PAGESIZE. We could possibly use YYFILL to re-allocate more space as necessary past the end of file to fix this.

I don't see anything glaring in the patch that's a major issue, I can probably test more on a larger code base in the next 2-3 days. As I've said before this seems to be crossing the line of us writing a scanner by hand rather than letting re2c do the heavy lifting, but without a modification to re2c to handle EOF I don't have an alternative solution currently. (If we had some way to detect which regex we where matching against in the YYFILL that would likely be able to handle these bugs, but I didn't see a way to do that easily).

-shire

16 years ago by Arnaud Le Blanc — view source

unread

Hi,

Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in we
need to find a solution to that, perhaps I can play with that this week too
as I think I'm seeing some related issues in my testing of 5.3. Essentially
we abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to the
mmap call which isn't at all valid and only really works up to PAGESIZE. We
could possibly use YYFILL to re-allocate more space as necessary past the
end of file to fix this.

I was thinking of doing something like that with YYFILL too. However
there is a bunch of pointers to take in to account and to update (e.g.
yy_marker, yy_text, etc).

There is mmap(MAP_FIXED) as an other temporary solution. The idea is
to map the entire file rounded up to a whole page size, and to map the
last (already mapped) page to anonymous memory using MAP_FIXED. In the
end we get a readable contiguous memory region with the file data and
ZEND_MMAP_AHEAD.
This should work on many systems as long as the requested address is
already part of the process's address space (which is always the case
here). When this fails we can fallback to malloc/read.

Regards,

Arnaud

16 years ago by shire — view source

unread

Arnaud Le Blanc wrote:

Hi,

Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in we
need to find a solution to that, perhaps I can play with that this week too
as I think I'm seeing some related issues in my testing of 5.3. Essentially
we abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to the
mmap call which isn't at all valid and only really works up to PAGESIZE. We
could possibly use YYFILL to re-allocate more space as necessary past the
end of file to fix this.

I was thinking of doing something like that with YYFILL too. However
there is a bunch of pointers to take in to account and to update (e.g.
yy_marker, yy_text, etc).

Yeah, I'm pretty sure that's how most of the example re2c code is setup:

#define YYFILL(n) {cursor = fill(s, cursor);}

uchar *fill(Scanner *s, uchar *cursor){
if(!s->eof){
unint cnt = s->lim - s->tok;
uchar *buf = malloc((cnt + 1)*sizeof(uchar));
memcpy(buf, s->tok, cnt);
cursor = &buf[cursor - s->tok];
s->pos = &buf[s->pos - s->tok];
s->ptr = &buf[s->ptr - s->tok];
s->lim = &buf[cnt];
s->eof = s->lim; *(s->eof)++ = '\n';
s->tok = buf;
}
return cursor;
}

-shire

16 years ago by Arnaud Le Blanc — view source

unread

Arnaud Le Blanc wrote:

Hi,
On Mon, May 4, 2009 at 9:36 AM, shireshire@php.net wrote:

Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in
we
need to find a solution to that, perhaps I can play with that this week
too
as I think I'm seeing some related issues in my testing of 5.3.
Essentially
we abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to
the
mmap call which isn't at all valid and only really works up to PAGESIZE.
We
could possibly use YYFILL to re-allocate more space as necessary past the
end of file to fix this.

I was thinking of doing something like that with YYFILL too. However
there is a bunch of pointers to take in to account and to update (e.g.
yy_marker, yy_text, etc).

Yeah, I'm pretty sure that's how most of the example re2c code is setup:

#define YYFILL(n) {cursor = fill(s, cursor);}

uchar *fill(Scanner *s, uchar *cursor){
if(!s->eof){
unint cnt = s->lim - s->tok;
uchar *buf = malloc((cnt + 1)*sizeof(uchar));
memcpy(buf, s->tok, cnt);
cursor = &buf[cursor - s->tok];
s->pos = &buf[s->pos - s->tok];
s->ptr = &buf[s->ptr - s->tok];
s->lim = &buf[cnt];
s->eof = s->lim; *(s->eof)++ = '\n';
s->tok = buf;
}
return cursor;
}

-shire

This is what I seen too, but this is not always applicable. The
scanner have code that refers to yy_text, yy_start, yy_cursor,
yy_marker, etc. All those pointers point to the original buffer and
must be updated by fill(). At each point in time the scanner may
rollback to yy_marker or a rule may want to fetch yy_text or yy_start
at any time. So the buffer must be large enough to contain all data
from min(all_of_them) to max(all_of_them). That makes things a little
complicated and potentially less efficient than a big buffer for the
whole file.

Regards,

Arnaud

16 years ago by Matt Wilmas — view source

unread

Hi Brian,

----- Original Message -----
From: "shire"
Sent: Monday, May 04, 2009

Hey Matt,

Thanks for posting, sorry for not having a chance to reply to this sooner.
Maybe couple things from the patch,

+/* To save initial string length after scanning to first variable,
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)

(minor) Maybe we should rename this var if we're going to use it for other
purposes, this doesn't really save any typing. Also if we do want the
define maybe we should upper case it so it's more obvious?

Yeah, I tried to think of other ways to do it, but just left it trying to
look like another variable (not to save typing). Well, it can easily be
changed later if a "cleaner" way is decided...

while (YYCURSOR < YYLIMIT) {

switch (*YYCURSOR++) {

In the example above, which we have a couple examples of here, we don't
obey the YYFILL macro to detect if we have exceeded our EOF. This
might be a problem, but only really depends on if we intend to use the
YYFILL as a solution for exceeding our mmap bounds.

I don't understand what the problem might be? The YYCURSOR < YYLIMIT check
is what the YYFILL has been doing. If you mean after changes later, as long
as the the whole thing is mmap()'d (which I'm assuming would be the case?),
it just "looks" like a standard string, with terminating '\0', right? And
there's no reading past YYLIMIT.

[...]
-shire

Matt

16 years ago by shire — view source

unread

Hey Matt,

Matt Wilmas wrote:

+/* To save initial string length after scanning to first variable,
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)

(minor) Maybe we should rename this var if we're going to use it for
other
purposes, this doesn't really save any typing. Also if we do want the
define maybe we should upper case it so it's more obvious?

Yeah, I tried to think of other ways to do it, but just left it trying
to look like another variable (not to save typing). Well, it can easily
be changed later if a "cleaner" way is decided...

Yeah I would just prefer if it was more obvious that it is not a variable ;-)

while (YYCURSOR < YYLIMIT) {

switch (*YYCURSOR++) {

In the example above, which we have a couple examples of here, we don't
obey the YYFILL macro to detect if we have exceeded our EOF. This
might be a problem, but only really depends on if we intend to use the
YYFILL as a solution for exceeding our mmap bounds.

I don't understand what the problem might be? The YYCURSOR < YYLIMIT
check is what the YYFILL has been doing. If you mean after changes
later, as long as the the whole thing is mmap()'d (which I'm assuming
would be the case?), it just "looks" like a standard string, with
terminating '\0', right? And there's no reading past YYLIMIT.

Sorry yeah this wouldn't be a problem currently, but only if we try to fix the mmap issue by using YYFILL to realloc more space into the buffer. Then that macro would change to something more complicated. (per my previous replies with Arnaud)

Have you considered using the lexer STATES and regex's instead of the manual C code for scanning the rest. It seems like if we have a one-char regex match for what the C code is doing we could handle this in the lexer without a lot of manual intervention (need to look at it more, just a thought I had earlier, the expressions are clearer now with your patch applied) ;-)

-shire

16 years ago by Matt Wilmas — view source

unread

Hi Brian,

----- Original Message -----
From: "shire"
Sent: Monday, May 04, 2009

Hey Matt,

Matt Wilmas wrote:

+/* To save initial string length after scanning to first variable,
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)

(minor) Maybe we should rename this var if we're going to use it for
other
purposes, this doesn't really save any typing. Also if we do want the
define maybe we should upper case it so it's more obvious?

Yeah, I tried to think of other ways to do it, but just left it trying
to look like another variable (not to save typing). Well, it can easily
be changed later if a "cleaner" way is decided...

Yeah I would just prefer if it was more obvious that it is not a
variable ;-)

How about this?

#define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len)
#define GET_DOUBLE_QUOTES_SCANNED_LENGTH() CG(doc_comment_len)

while (YYCURSOR < YYLIMIT) {

switch (*YYCURSOR++) {

In the example above, which we have a couple examples of here, we don't
obey the YYFILL macro to detect if we have exceeded our EOF. This
might be a problem, but only really depends on if we intend to use the
YYFILL as a solution for exceeding our mmap bounds.

I don't understand what the problem might be? The YYCURSOR < YYLIMIT
check is what the YYFILL has been doing. If you mean after changes
later, as long as the the whole thing is mmap()'d (which I'm assuming
would be the case?), it just "looks" like a standard string, with
terminating '\0', right? And there's no reading past YYLIMIT.

Sorry yeah this wouldn't be a problem currently, but only if we try to
fix the mmap issue by using YYFILL to realloc more space into the buffer.
Then that macro would change to something more complicated. (per my
previous replies with Arnaud)

Gotcha. If something changes, YYFILL -- or something to handle what needs
to be done -- could just be added to the manual parts as necessary, right?

Have you considered using the lexer STATES and regex's instead of the
manual C code for scanning the rest. It seems like if we have a one-char
regex match for what the C code is doing we could handle this in the
lexer without a lot of manual intervention (need to look at it more, just
a thought I had earlier, the expressions are clearer now with your patch
applied) ;-)

It seems that matching one-char-at-a-time with re2c would be more
complicated than the manual way, not to mention slower than the old
(current) way.

Do you have any objection (well, you've kinda mentioned some :-)) if I'd
commit the changes in a little while like Dmitry thought could be done?

-shire

Matt

16 years ago by shire — view source

unread

Matt Wilmas wrote:

Hi Brian,

----- Original Message -----
From: "shire"
Sent: Monday, May 04, 2009

Hey Matt,

Matt Wilmas wrote:

+/* To save initial string length after scanning to first variable,
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)

(minor) Maybe we should rename this var if we're going to use it for
other
purposes, this doesn't really save any typing. Also if we do want the
define maybe we should upper case it so it's more obvious?

Yeah, I tried to think of other ways to do it, but just left it trying
to look like another variable (not to save typing). Well, it can easily
be changed later if a "cleaner" way is decided...

Yeah I would just prefer if it was more obvious that it is not a
variable ;-)

How about this?

#define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len)
#define GET_DOUBLE_QUOTES_SCANNED_LENGTH() CG(doc_comment_len)

Sure, works for me ;-)

while (YYCURSOR < YYLIMIT) {

switch (*YYCURSOR++) {

In the example above, which we have a couple examples of here, we don't
obey the YYFILL macro to detect if we have exceeded our EOF. This
might be a problem, but only really depends on if we intend to use
the
YYFILL as a solution for exceeding our mmap bounds.

I don't understand what the problem might be? The YYCURSOR < YYLIMIT
check is what the YYFILL has been doing. If you mean after changes
later, as long as the the whole thing is mmap()'d (which I'm assuming
would be the case?), it just "looks" like a standard string, with
terminating '\0', right? And there's no reading past YYLIMIT.

Sorry yeah this wouldn't be a problem currently, but only if we try to
fix the mmap issue by using YYFILL to realloc more space into the buffer.
Then that macro would change to something more complicated. (per my
previous replies with Arnaud)

Gotcha. If something changes, YYFILL -- or something to handle what
needs to be done -- could just be added to the manual parts as
necessary, right?

Have you considered using the lexer STATES and regex's instead of the
manual C code for scanning the rest. It seems like if we have a one-char
regex match for what the C code is doing we could handle this in the
lexer without a lot of manual intervention (need to look at it more, just
a thought I had earlier, the expressions are clearer now with your patch
applied) ;-)

It seems that matching one-char-at-a-time with re2c would be more
complicated than the manual way, not to mention slower than the old
(current) way.

Do you have any objection (well, you've kinda mentioned some :-)) if I'd
commit the changes in a little while like Dmitry thought could be done?

Well I'm wondering if something more along these lines (just did this on-top of your patch as you cleaned up a lot) might be more appealing. (I'm not sure how much slower this would be than the current implementation, obviously it'll be somewhat slower, I'm basically just doing what you did in C but in the scanner instead of course).

<ST_IN_SCRIPTING>"#"|"//" {
BEGIN(ST_EOL_COMMENT);
yymore();
}

<ST_EOL_COMMENT>({NEWLINE}|"%>"|"?>") {
char tmp = *(YYCURSOR-1);
if ((tmp == '%' && CG(asp_tags)) | tmp == '?') {
YYCURSOR -= 2;
}
CG(zend_lineno)++;
BEGIN(ST_IN_SCRIPTING);
return T_COMMENT;
}

<ST_EOL_COMMENT>{ANY_CHAR} {
if (YYCURSOR >= YYLIMIT) {
BEGIN(ST_IN_SCRIPTING);
return T_COMMENT;
}
yymore();
}

Let me know what the thoughts are on the above, if we don't want that then I say yeah, commit away!

-shire

16 years ago by Matt Wilmas — view source

unread

Hi Brian,

----- Original Message -----
From: "shire"
Sent: Monday, May 04, 2009

Matt Wilmas wrote:

[...]
How about this?

#define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len)
#define GET_DOUBLE_QUOTES_SCANNED_LENGTH() CG(doc_comment_len)

Sure, works for me ;-)

Cool. :-)

[...]

Have you considered using the lexer STATES and regex's instead of the
manual C code for scanning the rest. It seems like if we have a one-char
regex match for what the C code is doing we could handle this in the
lexer without a lot of manual intervention (need to look at it more,
just
a thought I had earlier, the expressions are clearer now with your patch
applied) ;-)

It seems that matching one-char-at-a-time with re2c would be more
complicated than the manual way, not to mention slower than the old
(current) way.

Do you have any objection (well, you've kinda mentioned some :-)) if I'd
commit the changes in a little while like Dmitry thought could be done?

Well I'm wondering if something more along these lines (just did this
on-top of your patch as you cleaned up a lot) might be more appealing.
(I'm not sure how much slower this would be than the current
implementation, obviously it'll be somewhat slower, I'm basically just
doing what you did in C but in the scanner instead of course).

<ST_IN_SCRIPTING>"#"|"//" {
BEGIN(ST_EOL_COMMENT);
yymore();
}

<ST_EOL_COMMENT>({NEWLINE}|"%>"|"?>") {
char tmp = *(YYCURSOR-1);
if ((tmp == '%' && CG(asp_tags)) | tmp == '?') {
YYCURSOR -= 2;
}
CG(zend_lineno)++;
BEGIN(ST_IN_SCRIPTING);
return T_COMMENT;
}

<ST_EOL_COMMENT>{ANY_CHAR} {
if (YYCURSOR >= YYLIMIT) {
BEGIN(ST_IN_SCRIPTING);
return T_COMMENT;
}
yymore();
}

Let me know what the thoughts are on the above, if we don't want that
then I say yeah, commit away!

Wouldn't it be a little more complicated for strings/heredocs than comments?
Or not, haven't thought about it much! :-) And you still need the "manual
use" of YYCURSOR, etc. In other words, to me, the scanner rules are doing
what the manual switch ()'s case statements do, but in a slower,
"roundabout" way.

Well, I'm gonna be away for a bit now, but I guess I can commit away when I
get back.

-shire

Matt

16 years ago by shire — view source

unread

Matt Wilmas wrote:

Gotcha. If something changes, YYFILL -- or something to handle what
needs to be done -- could just be added to the manual parts as
necessary, right?

Sorry forget to reply on this one, but yeah we'd have to do a manual call to YYFILL or a check or whatever we come up with wherever we're scanning ahead manually. (there's probably some other parts that might need this as well).

-shire

16 years ago by Dmitry Stogov — view source

unread

Hi Matt,

I wasn't able to look into all details of the patch, but in general I
like it, as it fixes bugs and makes scanner smaller. I think you can
commit it.

Although this patch doesn't fix the EOF handling related to mmap().

Thanks. Dmitry.

Matt Wilmas wrote:

Hi guys,

----- Original Message -----
From: "Nuno Lopes"
Sent: Thursday, April 30, 2009

The patch looks generally ok. However I'll need a few more days to
review it carefully and throughly. (you can merge it in the
meantime if you want).
I'm just slighty concern with the amount of parsing we are now
doing by hand, and with the possible (local) security bugs we might
be introducing..

Am I understanding this properly, that this addresses the re2c EOF
bug? So we have an RC planned for next week (freeze Monday evening).
Can you get this fixed and released by then as Marcus is unable to
do this himself?

So this addresses some of the re2c EOF problems, but I don't know if
it addresses all of them or not. I haven't had the time yet for a full
review.
Anyway, Matt can surelly comment on this.

Yes, it addresses the re2c EOF issues for strings and comments, as they
were the problem ones that allowed NULL bytes, and scanned past the EOF
NULL. As I said to Dmitry, I'm not sure if it's now possible to remove
the temporary mmap() fixes that he wanted removed before the next RC
(??), or if there would still be problems with re2c scanning other
tokens, even though they can't contain NULLs. I didn't attempt to make
any changes there, since I'm not familiar with what's been done.

I just wanted to finally send the patch for others to review, and decide
what to do, so I won't commit any changes yet in the meantime. :-)

Nuno

Matt

16 years ago by Matt Wilmas — view source

unread

Hi Dmitry,

----- Original Message -----
From: "Dmitry Stogov"
Sent: Monday, May 04, 2009

Hi Matt,

I wasn't able to look into all details of the patch, but in general I like
it, as it fixes bugs and makes scanner smaller. I think you can commit it.

OK, you mean before the freeze for RC2...? I'll go ahead and do that later
unless someone says not to.

Although this patch doesn't fix the EOF handling related to mmap().

:-/

Thanks. Dmitry.

Thanks,
Matt