Hello All,
I am back from my vacation in Tanzania. I will be in Innsbruck over
the weekend for some Frisbee action, but I hope to get back into the
RM business Sunday evening or Monday. I went through all my emails
yesterday and marked several for reading, which I will do on the train
ride if all goes well. That being said, it would also be useful if
anyone who is looking after issues that are still open and that should
be addressed before the next release (beta2 or RC1) can send Johannes
and myself a quick email. If anything needs discussion raise it on
internals ASAP.
regards,
Lukas Kahwe Smith
mls@pooteeweet.org
Hello Lukas,
thanks for the detailed update on your life :-)
Friday, March 6, 2009, 9:13:16 AM, you wrote:
Hello All,
I am back from my vacation in Tanzania. I will be in Innsbruck over
the weekend for some Frisbee action, but I hope to get back into the
RM business Sunday evening or Monday. I went through all my emails
yesterday and marked several for reading, which I will do on the train
ride if all goes well. That being said, it would also be useful if
anyone who is looking after issues that are still open and that should
be addressed before the next release (beta2 or RC1) can send Johannes
and myself a quick email. If anything needs discussion raise it on
internals ASAP.
regards,
Lukas Kahwe Smith
mls@pooteeweet.org
Best regards,
Marcus
Hey Lukas,
Just a heads up that I should have a fix for this soonish, just running some more tests to make sure everything works as expected (I assume nobody else has started work on this):
- tokenizer misses last single-line comment (http://bugs.php.net/bug.php?id=46817)
What are the details on this one?
- memory leak in the scanner when a new state stack is created
-shire
Lukas Kahwe Smith wrote:
Hello All,
I am back from my vacation in Tanzania. I will be in Innsbruck over the
weekend for some Frisbee action, but I hope to get back into the RM
business Sunday evening or Monday. I went through all my emails
yesterday and marked several for reading, which I will do on the train
ride if all goes well. That being said, it would also be useful if
anyone who is looking after issues that are still open and that should
be addressed before the next release (beta2 or RC1) can send Johannes
and myself a quick email. If anything needs discussion raise it on
internals ASAP.
regards,
Lukas Kahwe Smith
mls@pooteeweet.org
Hey Brian,
Hey Lukas,
Just a heads up that I should have a fix for this soonish, just
running some more tests to make sure everything works as expected (I
assume nobody else has started work on this):
- tokenizer misses last single-line comment (http://bugs.php.net/bug.php?id=46817)
What are the details on this one?
- memory leak in the scanner when a new state stack is created
When a file is included, a new state structure is created for the
scanner and pushed onto a stack so it can be freed later on. Most of the
time that happens as soon as the scanner has parsed the included file,
but not quite for the tokenizer extension.
At the moment the destructor for the stack compares pointers, which
are unfortunately different because zend_stack makes a copy of whatever
you push onto it.
I have a fix which just adds an id to the state structure, but this
requires a little bit more memory.
I think the cleaner fix is to sort out the tokenizer extension; I just
need some more time.
Scott
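The pointer mismatch described above can be illustrated with a minimal sketch. It assumes a simplified copying stack and a hypothetical lex_state struct; state_stack, stack_push, find_by_pointer and find_by_id are illustrative names, not the real zend_stack API and not the actual fix.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the scanner state. The `id` field plays the
 * role of the extra identifier the proposed fix would add. */
typedef struct {
    int id;
    const char *filename;
} lex_state;

/* Simplified copying stack (an assumption for this sketch): push stores a
 * copy of the element, so the stored address differs from the caller's. */
typedef struct {
    lex_state items[8];
    int top;
} state_stack;

lex_state *stack_push(state_stack *s, const lex_state *st)
{
    assert(s->top < 8);
    memcpy(&s->items[s->top], st, sizeof(*st));
    return &s->items[s->top++];
}

/* Lookup by address: never matches the caller's original object,
 * because only the copy lives inside the stack. */
int find_by_pointer(const state_stack *s, const lex_state *st)
{
    for (int i = 0; i < s->top; i++)
        if (&s->items[i] == st)
            return i;
    return -1;
}

/* Lookup by id: survives the copy, at the cost of one extra field. */
int find_by_id(const state_stack *s, int id)
{
    for (int i = 0; i < s->top; i++)
        if (s->items[i].id == id)
            return i;
    return -1;
}
```

Pushing a state and then searching for the caller's original address returns -1, while searching by id finds the copy. That is the trade-off mentioned above: a little more memory per state in exchange for a comparison that still works after the copy.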
Hi Brian,
----- Original Message -----
From: "shire"
Sent: Monday, March 09, 2009
Hey Lukas,
Just a heads up that I should have a fix for this soonish, just running
some more tests to make sure everything works as expected (I assume nobody
else has started work on this):
- tokenizer misses last single-line comment
(http://bugs.php.net/bug.php?id=46817)
I was going to take care of that one, as I mentioned in a previous message,
though it's been a while, since I've been delayed much longer than expected
with stuff here. :-( (Nothing is set up for building PHP on this system yet;
I hope to finally do that in the next several hours, and do some things!)
As far as I know there's still the other comment-related issue where no
Warning is given about "Unterminated comment ..." for unclosed /* ... It's
all of course related to the fundamental re2c issue, for now, where when the
scanned input ends while a variable-length part of a rule is being matched,
it just aborts ("return 0;") in YYFILL().
And that applies to the case Lukas gave in the bug report: WHITESPACE
pattern is variable length.
The comment issue just happens to be a more obvious thing that was noticed,
and it doesn't affect actual correct code that ends with a fixed-length ";"
or "?>". Some other bits that won't be returned if they're at the end of a
file are T_LNUMBER, T_DNUMBER, T_STRING, T_VARIABLE, "The last part in
unclosed double-quotes or backticks, 'An unclosed single quoted string, and
so on, likely resulting in a different parse error than in previous versions.
T_INLINE_HTML isn't affected by this because it's matched with a manual
scan, rather than an re2c pattern. The manual scan may well have been used
to work around re2c...? :-)
- Matt
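A rough sketch of the behaviour described above, assuming a hand-written scanner rather than re2c-generated code. The scanner struct, next_token, count_comments and the accept_at_eof switch are hypothetical names used only for illustration; with accept_at_eof set to 0 the scanner gives up when the input ends inside the variable-length part of the comment rule (much like the "return 0;" in YYFILL()), and with 1 it emits what it has matched so far.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOK_EOF = 0, TOK_COMMENT = 1, TOK_OTHER = 2 };

typedef struct {
    const char *cursor;  /* next byte to read (role of YYCURSOR) */
    const char *limit;   /* one past the last byte (role of YYLIMIT) */
    int accept_at_eof;   /* 0: abort at EOF mid-rule; 1: emit the partial match */
} scanner;

void scanner_init(scanner *s, const char *src, int accept_at_eof)
{
    s->cursor = src;
    s->limit = src + strlen(src);
    s->accept_at_eof = accept_at_eof;
}

/* Returns the next token type, or TOK_EOF (0) when scanning stops. */
int next_token(scanner *s)
{
    assert(s->cursor <= s->limit);
    if (s->cursor == s->limit)
        return TOK_EOF;

    if (s->limit - s->cursor >= 2 && s->cursor[0] == '/' && s->cursor[1] == '/') {
        s->cursor += 2;
        /* Variable-length part: consume everything up to the newline. */
        while (s->cursor < s->limit && *s->cursor != '\n')
            s->cursor++;
        if (s->cursor == s->limit) {
            /* Input ended in the middle of the rule. */
            return s->accept_at_eof ? TOK_COMMENT : TOK_EOF;
        }
        s->cursor++;  /* consume the newline */
        return TOK_COMMENT;
    }

    s->cursor++;  /* any other single byte */
    return TOK_OTHER;
}

/* Counts the comment tokens produced for `src`. */
int count_comments(const char *src, int accept_at_eof)
{
    scanner s;
    int tok, n = 0;
    scanner_init(&s, src, accept_at_eof);
    while ((tok = next_token(&s)) != TOK_EOF)
        if (tok == TOK_COMMENT)
            n++;
    return n;
}
```

For an input with two single-line comments where the second has no trailing newline, the aborting variant reports one comment and the accepting variant reports two. Input that ends with a fixed-length token is unaffected either way.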
Hey Matt,
Matt Wilmas wrote:
- tokenizer misses last single-line comment
(http://bugs.php.net/bug.php?id=46817)
I was going to take care of that one, as I mentioned in a previous
message, though it's been awhile since I've been delayed much longer
with stuff here. :-( (Nothing set up for building PHP on this system
yet; hope to in the next several hours finally, and do some things!)
Sorry I missed your earlier email. I saw this sitting on the 5.3 todo list and it was breaking some of our parsing, so I figured I'd take a stab at it. Here is my current patch: http://tekrat.com/downloads/bits/php53.scanner_eof.patch. Please let me know if you have any suggestions/changes. It sounds like you commented on this initially, so please let me know what you/we should do, i.e. merging my patch and your work, committing this, or going with a better fix if you had one in mind, etc. My biggest complaint is that my current patch requires adding \x00 to any exclusion rules ("[^").
These changes for handling EOF should probably be ported to the INI scanner as well for the above reason and to keep them similar.
As far as I know there's still the other comment-related issue where no
Warning is given about "Unterminated comment ..." for unclosed /* ...
It's all of course related to the fundamental re2c issue, for now, where
when the scanned input ends while a variable length part of a rule is
being matched, it just aborts ("return 0;") in YYFILL().
I don't seem to see this problem, perhaps I'm not reproducing it correctly?
And that applies to the case Lukas gave in the bug report: WHITESPACE
pattern is variable length.
Didn't see/find this; is there a bug # or link?
-shire
Hi Brian,
----- Original Message -----
From: "shire"
Sent: Monday, March 09, 2009
Hey Matt,
Matt Wilmas wrote:
- tokenizer misses last single-line comment
(http://bugs.php.net/bug.php?id=46817)
I was going to take care of that one, as I mentioned in a previous
message, though it's been awhile since I've been delayed much longer
with stuff here. :-( (Nothing set up for building PHP on this system
yet; hope to in the next several hours finally, and do some things!)
Sorry I missed your earlier email. I saw this sitting on the 5.3 todo
list and it was breaking some of our parsing so I figured I'd take a stab
at it. Here is my current patch
http://tekrat.com/downloads/bits/php53.scanner_eof.patch, please let me
know if you have some suggestions/changes. It sounds like you commented
on this initially so please let me know what you/we should do ie: merging
my patch/your work, committing this, or if you had a better fix in mind
etc. My biggest complaint is that my current patch requires adding \x00
to any exclusion rules ("[^").
These changes for handling EOF should probably be ported to the INI
scanner as well for the above reason and to keep them similar.
I don't have much time right now, but looked at it quick, and see that
you're actually trying to work around the re2c issues in general. :-) I was
only thinking of putting a "band-aid" on the comment symptom(s), since those
are about the only ones that occur with valid code (is the tokenizer ext.
supposed to handle all tokens in code that wouldn't really compile?). And
yeah, about excluding \x00 from ANY_CHAR, it could change things, since it's
always been allowed, although it seems strange that code would have literal
NULLs in it (generated eval()'d code?). That was part of the reason I
couldn't come up with a generic fix while keeping all behavior. If re2c
would just remember the last matching state it was in at EOF like Flex!
Otherwise, I don't know what to do. :-/ I'm going to do something else
before trying to implement what I was going to do, so there's no patch
yet...
As far as I know there's still the other comment-related issue where no
Warning is given about "Unterminated comment ..." for unclosed /* ...
It's all of course related to the fundamental re2c issue, for now, where
when the scanned input ends while a variable length part of a rule is
being matched, it just aborts ("return 0;") in YYFILL().
I don't seem to see this problem, perhaps I'm not reproducing it
correctly?
As for the Warning: with "<?php /* blah " do you get "Unterminated
comment ..."? Of course your patch would restore it, because it was missing
the last time I checked (I'm not able to check right now).
And that applies to the case Lukas gave in the bug report: WHITESPACE
pattern is variable length.
Didn't see/find this is there a bug # or link?
I meant the "could be related if not the same problem" comment added the
other day in Bug #46817.
-shire
- Matt
Matt Wilmas wrote:
I don't have much time right now, but looked at it quick, and see that
you're actually trying to work around the re2c issues in general. :-) I
was only thinking of putting a "band-aid" on the comment symptom(s),
since those are about the only ones that occur with valid code (is the
tokenizer ext. supposed to handle all tokens in code that wouldn't
really compile?).
Yeah, I figured I should try to fix as much as I could; specifically, YYLIMIT not enforcing the availability of 'n' chars makes me nervous. ;-)
I would expect that the tokenizer should handle all tokens in code as long as they pass the scanner phase (not the parser phase), but I'm not sure what the intention is here.
And yeah, about excluding \x00 from ANY_CHAR, it could
change things, since it's always been allowed, although it seems strange
that code would have literal NULLs in it (generated eval()'d code?).
That was part of the reason I couldn't come up with a generic fix while
keeping all behavior. If re2c would just remember the last matching
state it was in at EOF like Flex!
It seems to me that the crux of the problem here is that we can't integrate an EOF check (such as checking the length of the data) into the regular expression. While flex allows <<EOF>>, here we are expected to provide a unique identifier/token to match on. This assumes that we have a unique character, or that the data is in good form so that we can detect a token, etc. Perhaps a good feature to add to re2c would be the ability to include a special regex/token match that identifies special conditions programmatically, such as (YYCURSOR == YYLIMIT), etc.
In defense of re2c, I think it can be useful in some situations to have to handle EOF explicitly, as it allows you more freedom for processing different data types.
I'll have to look closer at the multi-byte processing as well. I don't see a lot of cases where we would run into \x00 values in code. (Perhaps someone can provide a suggested use case that we need to watch out for?) Perhaps if someone is including binary data strings within code?
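One way to picture the sentinel idea discussed above, as a hedged sketch only: it assumes the input buffer carries a NUL byte at the limit and uses a hand-written loop in place of a generated rule. match_until_newline is a hypothetical helper and is not taken from the patch. The inner loop corresponds to a class that excludes \x00 (here [^\n\x00]), and the cursor/limit comparison is what tells the sentinel apart from a literal NUL inside the code.

```c
#include <assert.h>
#include <stddef.h>

/* Matches a "rest of line" run starting at `cursor` and returns the number
 * of bytes matched. Assumption for this sketch: the buffer holds a NUL
 * sentinel at `limit`, i.e. *limit == '\0'. */
size_t match_until_newline(const char *cursor, const char *limit)
{
    const char *start = cursor;
    assert(*limit == '\0');  /* the sentinel must be present */
    for (;;) {
        /* Equivalent of the class [^\n\x00]: stop on newline or NUL. */
        while (*cursor != '\n' && *cursor != '\0')
            cursor++;
        if (*cursor == '\0' && cursor < limit) {
            /* A literal NUL inside the input, not the sentinel: keep going. */
            cursor++;
            continue;
        }
        break;  /* newline, or the sentinel at cursor == limit */
    }
    return (size_t)(cursor - start);
}
```

Without the NUL exclusion the inner loop would run past the end of the buffer on input that lacks a trailing newline; without the cursor/limit check a literal \x00 in the source would be mistaken for the end of input. Whether the real scanner should treat embedded NUL bytes this way is exactly the open question raised above.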
Otherwise, I don't know what to do. :-/ I'm going to do something else
before trying to implement what I was going to do, so there's no patch
yet...
OK, I'll keep working on this then, I guess, as there are a couple more tests I want to run and some things I want to fix before I commit (like ensuring that YYLIMIT actually ensures there are 'n' bytes available to read, etc.).
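The 'n' bytes guarantee mentioned above could look roughly like the following. This is an illustrative sketch under the assumption that the limit points one past the last valid byte; bytes_available and looking_at are hypothetical helpers, not functions from the scanner or from re2c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 when at least `n` bytes can be read starting at `cursor`
 * without reaching `limit` (one past the last valid byte), else 0. */
int bytes_available(const char *cursor, const char *limit, size_t n)
{
    assert(cursor <= limit);
    return (size_t)(limit - cursor) >= n;
}

/* Guarded lookahead: compares the next `n` bytes with `lit` only if
 * that many bytes actually exist. */
int looking_at(const char *cursor, const char *limit, const char *lit, size_t n)
{
    return bytes_available(cursor, limit, n) && memcmp(cursor, lit, n) == 0;
}
```

The point of the guard is that a multi-byte lookahead near the end of the buffer reports "no match" instead of reading past the limit.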
As far as the Warning, with "<?php /* blah " do you get "Unterminated
comment ..." ? Of course your patch would restore it, because it's
missing last I checked (not able to right now).
I didn't see this in the current, un-patched, php-5.3 build but I'll double check to make sure I wasn't still using my new binaries.
And that applies to the case Lukas gave in the bug report: WHITESPACE
pattern is variable length.
Didn't see/find this is there a bug # or link?
I meant the "could be related if not the same problem" comment added the
other day in Bug #46817.
Ah, I see. Yes, it was actually my friend who brought this to my attention so that it would get fixed ;-)
-shire
shire wrote:
Hey Lukas,
Just a heads up that I should have a fix for this soonish, just running
some more tests to make sure everything works as expected (I assume
nobody else has started work on this):
- tokenizer misses last single-line comment
(http://bugs.php.net/bug.php?id=46817)
OK, just checked this in; let me know if I missed anything or if there are any issues. I noticed that CVS HEAD seemed to have DOS line feeds and some other bad merge stuff, so those were fixed as a side effect of my commit. This should also correct some highlighting bugs that were introduced (see the fix to the highlight_file test).
-shire