In Bug #54089 1 a patch was applied that cuts of token_get_all()
output after a T_HALT_COMPILER
token. This was done because otherwise
PHP would keep on lexing after that and would generate errors because
of binary data (which is not valid PHP, mostly.)
The problem with the patch is, that there are some tokens after
T_HALT_COMPILER
that are of interest, namely the '(' ')' ';'. After
the patch it is impossible to get those tokens, without either
relexing the code after T_HALT_COMPILER
(that way you get the binary
data problem back, just with much more complex code) or writing a
regular expression to match it (which is really hard, as there may be
any token dropped by the PHP parser in there, i.e. whitespace,
comments, PHP tags).
This issue was pointed out by the creator of the bug report, but was
those comments were ignored for some reason.
I would like this patch to be reverted on the 5.4 and trunk branches.
I assume it's too late to revert it on 5.3, as it has been there for
some time already. It is just counterproductive. (Alternatively one
could fix token_get_all to return the (); tokens after
__halt_compiler, too, but that would be hard, probably.)
Nikita
In Bug #54089 [1] a patch was applied that cuts of
token_get_all()
output after aT_HALT_COMPILER
token. This was done because otherwise
PHP would keep on lexing after that and would generate errors because
of binary data (which is not valid PHP, mostly.)The problem with the patch is, that there are some tokens after
T_HALT_COMPILER
that are of interest, namely the '(' ')' ';'. After
the patch it is impossible to get those tokens, without either
relexing the code afterT_HALT_COMPILER
(that way you get the binary
data problem back, just with much more complex code) or writing a
regular expression to match it (which is really hard, as there may be
any token dropped by the PHP parser in there, i.e. whitespace,
comments, PHP tags).This issue was pointed out by the creator of the bug report, but was
those comments were ignored for some reason.I would like this patch to be reverted on the 5.4 and trunk branches.
I assume it's too late to revert it on 5.3, as it has been there for
some time already. It is just counterproductive. (Alternatively one
could fix token_get_all to return the (); tokens after
__halt_compiler, too, but that would be hard, probably.)
I think that it wouldn't be too hard.
From a quick glance on the code, we should introduce a new local
variable, set that to true where we break now (
http://svn.php.net/viewvc/php/php-src/trunk/ext/tokenizer/tokenizer.c?view=markup#l155
) and don't break there but for the next ';'. another maybe less
confusing solution would be to explicitly add '(', ')' and ';' to the
result in the T_HALT_COMPILER
condition before breking out of the
loop.
I will create a patch for this afternoon.
or could there be other important tokens after the __halt_compiler()
which should be present in the token_get_all()
result?
--
Ferenc Kovács
@Tyr43l - http://tyrael.hu
Thank you Nikita for take this subject here!
don't break there but for the next ';'.
You can also just count the number of semantic token after T_HALT_COMPILER
(ie excluding whitespace and comments) and once you hit 3, halt.
less confusing solution would be to explicitly add '(', ')' and ';' to the
result in the
T_HALT_COMPILER
condition before breking out of the
loop.
If you mean verifying that '(', ')' and (';' or T_CLOSE_TAG) are effectively
following T_HALT_COMPILER, I think that's part of the syntax analyser's job,
not tokenizer's.
If you're ok with this argument, then just couting 3 tokens is really the
most basic "syntax analysis" we have to do to fix the pb, don't you think?
could there be other important tokens after the __halt_compiler()
which should be present in thetoken_get_all()
result?
Maybe the binary data itself, as a big T_INLINE_HTML
for example ?
Also, if token_get_all you be made binary safe, that would be very cool !
(no more eating of \x00-\x1F inside regular code) :)
Nicolas
On Fri, Sep 9, 2011 at 10:46 AM, Nicolas Grekas
nicolas.grekas+php@gmail.com wrote:
You can also just count the number of semantic token after
T_HALT_COMPILER
(ie excluding whitespace and comments) and once you hit 3, halt.
[...]
Maybe the binary data itself, as a bigT_INLINE_HTML
for example ?
In favor of both proposals!
Returning the next 3 tokens should be quite easy 1.
Returning the rest as anT_INLINE_HTML
makes sense too, as extracting
the data is probably what you want. Though I have no idea how to
implement that ^^
I just set up an PHP environment and wrote a proper patch (including
test changes) to make it collect the next three tokens. It's a git
patch and I'm not sure whether it's compatible with SVN patches. I
would love it if this would go into 5.4 before beta. I didn't know how
one could fetch the rest into T_INLINE_HTML, so I'm hoping on help
here from someone who actually knowns C there :)
Thanks, Nikita
I just set up an PHP environment and wrote a proper patch (including test
changes) to make it collect the next three tokens. It's a git patch and I'm
not sure whether it's compatible with SVN patches. I would love it if this
would go into 5.4 before beta. I didn't know how one could fetch the rest
into T_INLINE_HTML, so I'm hoping on help here from someone who actually
knowns C there :)
I just reopened the bug and posted an excerpt of this discussion with your
patch :
https://bugs.php.net/bug.php?id=54089
Nicolas
I changed my previous patch to an SVN patch, so it is easier to apply
and added another patch (the one called "tokenizer_patch_full.txt"),
which additionally fetches the rest into a T_INLINE_HTML. (The "_full"
patch thus contains both changes. I didn't know how I could separate
them.)
Could somebody apply the patch or give me advice how to improve it, so
it can be applied?
Thanks.
On Tue, Sep 13, 2011 at 9:57 AM, Nicolas Grekas
nicolas.grekas+php@gmail.com wrote:
I just set up an PHP environment and wrote a proper patch (including test
changes) to make it collect the next three tokens. It's a git patch and I'm
not sure whether it's compatible with SVN patches. I would love it if this
would go into 5.4 before beta. I didn't know how one could fetch the rest
into T_INLINE_HTML, so I'm hoping on help here from someone who actually
knowns C there :)I just reopened the bug and posted an excerpt of this discussion with your
patch :https://bugs.php.net/bug.php?id=54089
Nicolas
I changed my previous patch to an SVN patch, so it is easier to apply
and added another patch (the one called "tokenizer_patch_full.txt"),
which additionally fetches the rest into a T_INLINE_HTML. (The "_full"
patch thus contains both changes. I didn't know how I could separate
them.)
I haven't seen much "buy-in" about your email. You might want to wait
until after Beta and revisit the topic later. If you add some concise
explanation it would be easier for people to weigh the pros and cons,
see what breaks and what gets fixed etc.
Chris
Could somebody apply the patch or give me advice how to improve it, so
it can be applied?
Thanks.On Tue, Sep 13, 2011 at 9:57 AM, Nicolas Grekas
nicolas.grekas+php@gmail.com wrote:I just set up an PHP environment and wrote a proper patch (including test
changes) to make it collect the next three tokens. It's a git patch and I'm
not sure whether it's compatible with SVN patches. I would love it if this
would go into 5.4 before beta. I didn't know how one could fetch the rest
into T_INLINE_HTML, so I'm hoping on help here from someone who actually
knowns C there :)I just reopened the bug and posted an excerpt of this discussion with your
patch :https://bugs.php.net/bug.php?id=54089
Nicolas
--
Email: christopher.jones@oracle.com
Tel: +1 650 506 8630
Blog: http://blogs.oracle.com/opal/
On Tue, Sep 13, 2011 at 9:56 PM, Christopher Jones
christopher.jones@oracle.com wrote:
I changed my previous patch to an SVN patch, so it is easier to apply
and added another patch (the one called "tokenizer_patch_full.txt"),
which additionally fetches the rest into a T_INLINE_HTML. (The "_full"
patch thus contains both changes. I didn't know how I could separate
them.)I haven't seen much "buy-in" about your email. You might want to wait
until after Beta and revisit the topic later. If you add some concise
explanation it would be easier for people to weigh the pros and cons,
see what breaks and what gets fixed etc.
I agree that the lack of response is kind of disturbing:
no response in the bugreport, no response on the mailing list, almost
zero comment on irc. :/
I can accept that if it won't make into 5.3, or even 5.4.0, but at
least it could be merged to trunk, if the patch is fine.
I think that Nicolas and Nikita did more than enough on their part,
could somebody please review the patch?
--
Ferenc Kovács
@Tyr43l - http://tyrael.hu
On Tue, Sep 13, 2011 at 9:56 PM, Christopher Jones
christopher.jones@oracle.com wrote:I changed my previous patch to an SVN patch, so it is easier to apply
and added another patch (the one called "tokenizer_patch_full.txt"),
which additionally fetches the rest into a T_INLINE_HTML. (The "_full"
patch thus contains both changes. I didn't know how I could separate
them.)I haven't seen much "buy-in" about your email. You might want to wait
until after Beta and revisit the topic later. If you add some concise
explanation it would be easier for people to weigh the pros and cons,
see what breaks and what gets fixed etc.I agree that the lack of response is kind of disturbing:
no response in the bugreport, no response on the mailing list, almost
zero comment on irc. :/
I can accept that if it won't make into 5.3, or even 5.4.0, but at
least it could be merged to trunk, if the patch is fine.
I think that Nicolas and Nikita did more than enough on their part,
could somebody please review the patch?
the change to ext/tokenizer/tests/token_get_all_variation16.phpt seems
really weird?
And I'm unsure how this should actually work.
A file containing:
<?php
__halt_compiler()
?>
stuff
Will return "stuff" after this patch.. that doesn't seem expected.
-Hannes
Hi Hannes, thanks for your response!
On Thu, Sep 15, 2011 at 9:22 AM, Hannes Magnusson
hannes.magnusson@gmail.com wrote:
the change to ext/tokenizer/tests/token_get_all_variation16.phpt seems
really weird?
I changed the test to not include a __halt_compiler statement anymore,
as it didn't make sense in this context, and included a function call
instead. The file would have been just a parse error, as
__halt_compiler can only occur in the outermost scope or in namespace
declaration score. Additionally the behavior of token_get_all on
__halt_compiler is already tested in the other test (bug54089.phpt),
so I though it wasn't necessary to test it there again.
And I'm unsure how this should actually work.
A file containing:
<?php__halt_compiler()
?>
stuffWill return "stuff" after this patch.. that doesn't seem expected.
First of all: The second patch (the _full one) isn't that important to
me, as you can fetch the data after __halt_compiler(); quite easily in
userland code, too (concat all tokens and calculate the offset from
that, then substr it from the code). So if you think it doesn't make
sense, I would be okay without it. The original reasoning was, that
fetching the contents after __halt_compiler(); is the most probable
thing you want to do if you are using token_get_all on a file with
__halt_compiler. But I'm not sure that it is true and can imagine that
in many cases you write scripts that don't really care about
__halt_compiler statements and the additionalT_INLINE_HTML
would only
hurt. Maybe Nicolas can provide further input on this.
Thanks for looking at the patch. If you have further questions, please ask :)
Nikita
PS: I updated both patches to fix an issue with which tokens are
considered dropped. I now explicitly state them in the if-statement,
as the destroy
variable wasn't quite the same (it also included the
tokens that are translated into another token.)
Hi Hannes, thanks for your response!
On Thu, Sep 15, 2011 at 9:22 AM, Hannes Magnusson
hannes.magnusson@gmail.com wrote:the change to ext/tokenizer/tests/token_get_all_variation16.phpt seems
really weird?
I changed the test to not include a __halt_compiler statement anymore,
as it didn't make sense in this context, and included a function call
instead. The file would have been just a parse error, as
__halt_compiler can only occur in the outermost scope or in namespace
declaration score. Additionally the behavior of token_get_all on
__halt_compiler is already tested in the other test (bug54089.phpt),
so I though it wasn't necessary to test it there again.And I'm unsure how this should actually work.
A file containing:
<?php__halt_compiler()
?>
stuffWill return "stuff" after this patch.. that doesn't seem expected.
First of all: The second patch (the _full one) isn't that important to
me, as you can fetch the data after __halt_compiler(); quite easily in
userland code, too (concat all tokens and calculate the offset from
that, then substr it from the code). So if you think it doesn't make
sense, I would be okay without it. The original reasoning was, that
fetching the contents after __halt_compiler(); is the most probable
thing you want to do if you are using token_get_all on a file with
Wait wait wait. Thats the point here?
COMPILER_HALT_OFFSET already tells you where the data starts.
-Hannes
On Thu, Sep 15, 2011 at 5:05 PM, Hannes Magnusson
hannes.magnusson@gmail.com wrote:
Wait wait wait. Thats the point here?
COMPILER_HALT_OFFSET already tells you where the data starts.
COMPILER_HALT_OFFSET tells you the offset if you are running the
file. It is only available in the file that does the actual
__halt_compiler(); statement. When using the Tokenizer (this is what
this is about; this is not about PHP's own lexing/parsing behavior)
you don't have that constant available.
Though, again, I don't think the second patch is strictly necessary.
I'd be happy without, if it creates too many problems / is not
intuitive.
Nikita
Hello,
On Thu, Sep 15, 2011 at 5:05 PM, Hannes Magnusson
hannes.magnusson@gmail.com wrote:Wait wait wait. Thats the point here?
COMPILER_HALT_OFFSET already tells you where the data starts.
COMPILER_HALT_OFFSET tells you the offset if you are running the
file. It is only available in the file that does the actual
__halt_compiler(); statement. When using the Tokenizer (this is what
this is about; this is not about PHP's own lexing/parsing behavior)
you don't have that constant available.Though, again, I don't think the second patch is strictly necessary.
I'd be happy without, if it creates too many problems / is not
intuitive.Nikita
What I am really hearing in this thread is some people have interest
in a way to easily access the COMPILER_HALT_OFFSET offset from
another file's context. Why don't we simply create a function for
this? Something like get_compiler_halt_offset(filename) or a similar
function specifically for this, maybe a function which returns the
data after the offset or a combination of the two. Whatever solves
these use cases. Then the tokenizer may stay precisely what it is, a
tokenizer. Changing the behavior of the tokenizer based off the
contents of the file seem very incorrect to me.
-Chris
What I am really hearing in this thread is some people have interest in a
way to easily access the COMPILER_HALT_OFFSET
Well, not exactly for me : I'm not interested strictly on getting the
offset. I'm more interested in doing generic static code analysis, and part
of that is getting the halt offset, amongst many other data coming from
tokens.
The big pb with the fix applied since 5.3.6 is that we can not do that
anymore without many complications.
So in this way, I agree with Nikita, the second aspect of the patch (getting
the binary data in a T_INLINE_HTML) is not a necessity, thought that may
help.
But the first aspect of the patch (counting 3 semantic tokens after
T_HALT_COMPILER) is really needed. That would just give us back the
power/feature we have lost since 5.3.6.
So my request would be : pleeeease apply the patch counting 3 semantic
tokens for 5.4 asap, then we can discuss the remaining later for the shake
of completeness of the discussion.
Nicolas
Wait wait wait. Thats the point here?
COMPILER_HALT_OFFSET already tells you where the data starts.-Hannes
I didn't sent this message first, but after reading the mail from
Chris, I think maybe it would clear the confusion:
It is about tokenizing a file which has __halt_compiler(); in it.
before the fix of the original bugreport, one could get the warning
"Unexpected character in input" if he tried to token_get_all()
a
script which had binary data after the __halt_compiler();
iliaa's fix was trivial: break from the tokenizer if __halt_compiler
token is found.
but it isn't good enough, because as the original bugreporter pointed out:
1, now the token_get_all()
won't return the (); after __halt_compiler,
which means that if you rebuild the code from the tokens, you will
have invalid php code.
2, you have no way to get the binary data after the __halt_compiler
via the tokenizer, so you can't rebuild the original file using only
the tokenizer. (for example one could use the tokenizer to strip the
whitespaces and comments from a given file in-place)
both problems could be hacked around from userland, but imo it still
worth fixing those.
--
Ferenc Kovács
@Tyr43l - http://tyrael.hu
So, is there any consensus about this?
Nikita
Wait wait wait. Thats the point here?
COMPILER_HALT_OFFSET already tells you where the data starts.-Hannes
I didn't sent this message first, but after reading the mail from
Chris, I think maybe it would clear the confusion:It is about tokenizing a file which has __halt_compiler(); in it.
before the fix of the original bugreport, one could get the warning
"Unexpected character in input" if he tried totoken_get_all()
a
script which had binary data after the __halt_compiler();
iliaa's fix was trivial: break from the tokenizer if __halt_compiler
token is found.
but it isn't good enough, because as the original bugreporter pointed out:
1, now thetoken_get_all()
won't return the (); after __halt_compiler,
which means that if you rebuild the code from the tokens, you will
have invalid php code.
2, you have no way to get the binary data after the __halt_compiler
via the tokenizer, so you can't rebuild the original file using only
the tokenizer. (for example one could use the tokenizer to strip the
whitespaces and comments from a given file in-place)both problems could be hacked around from userland, but imo it still
worth fixing those.--
Ferenc Kovács
@Tyr43l - http://tyrael.hu