Request for opinions: bug vs feature - change in tokenization of yield from

1 year ago by Juliette Reinders Folmer — view source

unread

Hi all,

I recently discovered a change was made to the Tokenizer in PHP 8.3.0
which now allows for a comment to exist between the yield and from
keyword from the yield from keyword.
Before PHP 8.3, this was a parse error.

This change was not documented (anywhere) nor publicized, which is why I
only recently came across it and opened a bug report on GitHub about it [1].

This change is a breaking change for projects consuming the token
stream, but more than that, it introduces an inconsistency in the
Tokenizer as there is no other single token in PHP which can contain a
comment (aside from comments tokens of course).
Comments were even explicitly forbidden in the PHP 8.0 RFC which changed
the tokenization of namespaced names [2].
Some more detailed explanation and examples can be found in the GH
thread [3] and adetailed analysis of the tokenization change can be
found in a ticket in the PHP_CodeSniffer repo [4] (details in a fold out
in the comment).

In my opinion the change is a bug and should be reverted, but opinions
on this are divided.
After a discussion in the ticket on GitHub, it was suggested to ask for
a third/fourth/fifth opinion here on the list.

Pros for reverting it:

Consistency with other tokens in the token stream, none of which allow
comments within the token.
Consistency with the 7 years before in which this was a parse error.
The change was never documented, not in the changelog, news, nor in
the upgrading guide.

Cons against reverting it:

The change has been in PHP 8.3 releases for a little over seven months.
The odd few people may have accidentally discovered the bug and taken
advantage by introducing code containing yield /*comment*/ from into
their project, which with a revert would become a parse error again.

Opinions ?

Of course, the ticket yielded discussion on whether the tokenization of
yield from should be changed anyhow, but please consider that
discussion out of scope.
The current question is only about reverting the bug versus regarding it
as a new feature and properly documenting it.

Smile,
Juliette

1: https://github.com/php/php-src/issues/14926
2:
https://wiki.php.net/rfc/namespaced_names_as_token#backward_incompatible_changes
3:
https://github.com/PHPCSStandards/PHP_CodeSniffer/issues/529#issuecomment-2209337419
4: https://github.com/php/php-src/issues/14926#issuecomment-2227152061

1 year ago by Bob Weinand — view source

unread

Am 18.07.2024 um 01:15 schrieb Juliette Reinders Folmer php-internals_nospam@adviesenzo.nl:

Hi all,

I recently discovered a change was made to the Tokenizer in PHP 8.3.0 which now allows for a comment to exist between the yield and from keyword from the yield from keyword.
Before PHP 8.3, this was a parse error.

This change was not documented (anywhere) nor publicized, which is why I only recently came across it and opened a bug report on GitHub about it [1].

This change is a breaking change for projects consuming the token stream, but more than that, it introduces an inconsistency in the Tokenizer as there is no other single token in PHP which can contain a comment (aside from comments tokens of course).
Comments were even explicitly forbidden in the PHP 8.0 RFC which changed the tokenization of namespaced names [2].
Some more detailed explanation and examples can be found in the GH thread [3] and a detailed analysis of the tokenization change can be found in a ticket in the PHP_CodeSniffer repo [4] (details in a fold out in the comment).

In my opinion the change is a bug and should be reverted, but opinions on this are divided.
After a discussion in the ticket on GitHub, it was suggested to ask for a third/fourth/fifth opinion here on the list.

Pros for reverting it:

Consistency with other tokens in the token stream, none of which allow comments within the token.
Consistency with the 7 years before in which this was a parse error.
The change was never documented, not in the changelog, news, nor in the upgrading guide.

Cons against reverting it:

The change has been in PHP 8.3 releases for a little over seven months.
The odd few people may have accidentally discovered the bug and taken advantage by introducing code containing yield /*comment*/ from into their project, which with a revert would become a parse error again.

Opinions ?

Of course, the ticket yielded discussion on whether the tokenization of yield from should be changed anyhow, but please consider that discussion out of scope.
The current question is only about reverting the bug versus regarding it as a new feature and properly documenting it.

Smile,
Juliette

1: https://github.com/php/php-src/issues/14926
2: https://wiki.php.net/rfc/namespaced_names_as_token#backward_incompatible_changes
3: https://github.com/PHPCSStandards/PHP_CodeSniffer/issues/529#issuecomment-2209337419
4: https://github.com/php/php-src/issues/14926#issuecomment-2227152061

Thanks for bringing this up.

I'm honestly not sure what the best behavior for PHP 8.3 is.

But I'm strongly thinking the behavior of allowing comments between yield and from is natural and expected. So, for PHP 8.4 I propose we ensure proper tokenization of yield /* comment */ from.
Changing this in PHP 8.3 is likely too much of a compatibility break.

So, assuming we want the proper tokenization behavior for PHP 8.4, I think changing the behavior intermittently to disallow comments between yield and from in PHP 8.3 is counterproductive.
Moreover, it can - at least - be worked around in tooling by special casing the T_YIELD_FROM token and extracting the comment from the raw parsed string:

var_dump(token_get_all('<?php yield /* comment */ from $foo;'));

will contain:

[1]=> array(3) { [0]=> int(270) [1]=> string(24) "yield /* comment */ from" [2]=> int(1) }

It's not optimal, but probably the least bad solution to leave it unchanged in PHP 8.3, have tooling special case it and properly fix it in PHP 8.4.

Bob

1 year ago by Christoph M. Becker — view source

unread

Hi Bob!

Moreover, it can - at least - be worked around in tooling by special casing the T_YIELD_FROM token and extracting the comment from the raw parsed string:

var_dump(token_get_all('<?php yield /* comment */ from $foo;'));

will contain:

[1]=> array(3) { [0]=> int(270) [1]=> string(24) "yield /* comment */ from" [2]=> int(1) }

It's not optimal, but probably the least bad solution to leave it unchanged in PHP 8.3, have tooling special case it and properly fix it in PHP 8.4.

And what about "code" like https://3v4l.org/4CLhM? Is Codesniffer
supposed to scan the result of https://3v4l.org/dKDcs for possible CS
violations?

Cheers,
Christoph

1 year ago by Bob Weinand — view source

unread

Hey Christoph,

Am 19.07.2024 um 00:51 schrieb Christoph M. Becker cmbecker69@gmx.de:

Hi Bob!

Moreover, it can - at least - be worked around in tooling by special casing the T_YIELD_FROM token and extracting the comment from the raw parsed string:

var_dump(token_get_all('<?php yield /* comment */ from $foo;'));

will contain:

[1]=> array(3) { [0]=> int(270) [1]=> string(24) "yield /* comment */ from" [2]=> int(1) }

It's not optimal, but probably the least bad solution to leave it unchanged in PHP 8.3, have tooling special case it and properly fix it in PHP 8.4.

And what about "code" like https://3v4l.org/4CLhM? Is Codesniffer
supposed to scan the result of https://3v4l.org/dKDcs for possible CS
violations?

Cheers,
Christoph

I suppose you mean https://3v4l.org/IMi8Y, (you missed the <?php tag).
If you want to scan that, it's quite easy to strip the leading yield and trailing from, and tokenize that again to extract all comments.

Sure, it's a hack, but it'll work: https://3v4l.org/8eAiV.

Bob

1 year ago by Juliette Reinders Folmer — view source

unread

Hey Christoph,

Am 19.07.2024 um 00:51 schrieb Christoph M. Becker cmbecker69@gmx.de:

Hi Bob!

Moreover, it can - at least - be worked around in tooling by special
casing the T_YIELD_FROM token and extracting the comment from the
raw parsed string:

var_dump(token_get_all('<?php yield /* comment */ from $foo;'));

will contain:

[1]=> array(3) { [0]=> int(270) [1]=> string(24) "yield /* comment
*/ from" [2]=> int(1) }

It's not optimal, but probably the least bad solution to leave it
unchanged in PHP 8.3, have tooling special case it and properly fix
it in PHP 8.4.

And what about "code" like https://3v4l.org/4CLhM? Is Codesniffer
supposed to scan the result of https://3v4l.org/dKDcs for possible CS
violations?

Cheers,
Christoph

I suppose you mean https://3v4l.org/IMi8Y, (you missed the <?php tag).
If you want to scan that, it's quite easy to strip the leading yield
and trailing from, and tokenize that again to extract all comments.

Sure, it's a hack, but it'll work: https://3v4l.org/8eAiV.

Bob

Hi Bob,

Of course, everything can be hacked around, but that still leaves the
question what should be the "proper tokenization". Having this change in
PHP 8.3 and then - as you suggest - yet another in PHP 8.4, makes it
mighty hard to have a consistent token stream in tooling, especially as
it is unclear what the "proper tokenization" should/would be.

More than anything, I find it concerning that this change sets a
precedent for tokens to include comments.

Just as an example: what does this mean for the PHP 8.0 nullsafe object
operator ? Should we now suddenly allow that to be written as ? /*comment*/ -> ?
Or what about a cast token ? Should that be allowed to be (string /*for reasons*/) ?

Allowing this change to stay in, without having the discussion about
what the "proper tokenization" should be, feels off and random to me and
opens the door for more random changes.

As for the impact on tooling: a change in the tokenization of any token
has an impact not only on tooling like PHPCS itself, but also on every
single external standard build on top of it and is a breaking change.
To give you some perspective - for PHPCS we even went as far as to
"undo" the PHP 8.0 tokenization of namespaced names for the time being
(in the PHPCS 3.x releases) and we'll only change the PHPCS tokenizer to
use the PHP 8.0 tokenization in the PHPCS 4.0 release as it would
otherwise break too many existing sniffs. [1]

Smile,
Juliette

1: https://github.com/squizlabs/PHP_CodeSniffer/issues/3041

1 year ago by tim@bastelstu.be — view source

unread

More than anything, I find it concerning that this change sets a
precedent for tokens to include comments.

Just as an example: what does this mean for the PHP 8.0 nullsafe object
operator ? Should we now suddenly allow that to be written as ? /*comment*/ -> ?
Or what about a cast token ? Should that be allowed to be (string /*for reasons*/) ?

The difference between yield from and ?-> is that the former looks
and feels like it would be two separate keywords, because of the
required whitespace between the yield and the from. The fact that
a yield keyword actually exists also contributes to that. ?-> on the
other hand looks and feels like a single operator, just like ++,
!==, <=> and others.

Except for yield from the rule where comments may be placed as far as
I can tell is "comments may appear where whitespace may appear", which
is easy enough to explain and understand. So it makes sense to allow for
comments between yield and from, but I agree that ideally those
would be emitted as separate tokens.

Best regards
Tim Düsterhus

1 year ago by Juliette Reinders Folmer — view source

unread

Hi

More than anything, I find it concerning that this change sets a
precedent for tokens to include comments.

Just as an example: what does this mean for the PHP 8.0 nullsafe object
operator ? Should we now suddenly allow that to be written as ? /*comment*/ -> ?
Or what about a cast token ? Should that be allowed to be (string /*for reasons*/) ?

The difference between yield from and ?-> is that the former looks
and feels like it would be two separate keywords, because of the
required whitespace between the yield and the from. The fact
that a yield keyword actually exists also contributes to that. ?->
on the other hand looks and feels like a single operator, just like
++, !==, <=> and others.

Except for yield from the rule where comments may be placed as far
as I can tell is "comments may appear where whitespace may appear",
which is easy enough to explain and understand. So it makes sense to
allow for comments between yield and from, but I agree that
ideally those would be emitted as separate tokens.

Tim,

"comments may appear where whitespace may appear" ?

You'd think so, except it isn't true.

I already mentioned cast tokens before. Whitespace is perfectly
acceptable within the parentheses of these. Comments are not:
https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8

Now you may argue that cast tokens "feel like" a single operator, but
that's subjective and there's even a sniff to enforce no spacing within
cast parentheses as apparently people do pad them with spaces - and
doing so is allowed in PHP. *
* Qualifier: spaces and tabs are allowed inside cast parentheses, but
new lines are not...

Along the same lines and I'm beginning to repeat myself, the PHP 8.0 RFC
which changed the tokenization of namespaced names explicitly disallowed
whitespace and comments inside namespaced names tokenized as a single
token, while in the previous, multi-token situation, whitespace and
comments were allowed in namespaced names.

So to get back to my original point, as of PHP 8.3 is the only token
which allows for a comment to be tokenized as part of the token. There
is no other token which allows that.

Whitespace is one thing, comments is a different matter. And even the
whitespace is an interesting one as I've seen bug reports in PHPCS about
a sniff breaking on yield from with a new line and indentation between
the keywords. PHP allows this, the sniff in question does not handle it
correctly.

So, what "feels" natural (whitespace-wise) to one person may not be the
same for the next, but comments within tokens is different thing and
should in my opinion, not be allowed.

Smile,
Juliette

1 year ago by tim@bastelstu.be — view source

unread

I already mentioned cast tokens before. Whitespace is perfectly
acceptable within the parentheses of these. Comments are not:
https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8

Now you may argue that cast tokens "feel like" a single operator, but
that's subjective and there's even a sniff to enforce no spacing within
cast parentheses as apparently people do pad them with spaces - and
doing so is allowed in PHP. *
* Qualifier: spaces and tabs are allowed inside cast parentheses, but
new lines are not...

I stand corrected. Though that indeed is another odd special case with
not allowing newlines. Looking at the lexer, there's another case where
tabs and spaces may appear, but not other whitespace. Between '<<<' and
the delimiter of a heredoc.

So to get back to my original point, as of PHP 8.3 is the only token
which allows for a comment to be tokenized as part of the token. There
is no other token which allows that.

PHP users have no idea what a token is internally. I'm looking at this
from a PHP user perspective. It looks like two keywords, it walks like
two keywords and it quacks like two keywords. I find it reasonable for
users to consider this as two keywords and not care about how it's
implemented internally.

So, what "feels" natural (whitespace-wise) to one person may not be the
same for the next, but comments within tokens is different thing and
should in my opinion, not be allowed.

As I've said: I agree that the current situation is unfortunate. But the
correct solution is not "disallow comments", but "split the T_YIELD_FROM
into T_YIELD T_WHITESPACE T_FROM_FOR_YIELD_FROM".

Best regards
Tim Düsterhus

1 year ago by Juliette Reinders Folmer — view source

unread

PHP users have no idea what a token is internally. I'm looking at this
from a PHP user perspective. It looks like two keywords, it walks like
two keywords and it quacks like two keywords. I find it reasonable for
users to consider this as two keywords and not care about how it's
implemented internally.

So, what "feels" natural (whitespace-wise) to one person may not be the
same for the next, but comments within tokens is different thing and
should in my opinion, not be allowed.

As I've said: I agree that the current situation is unfortunate. But
the correct solution is not "disallow comments", but "split the
T_YIELD_FROM into T_YIELD T_WHITESPACE T_FROM_FOR_YIELD_FROM".

Tim, you're making my point for me. This is exactly why the current
change should be reverted.

I'm not against changing the tokenization of "yield from" and the GH
ticket thread also contains an alternative proposal for this from Bob
[1], but like Matthew also said [2]: if that's something we want to do,
let's have a proper discussion about it and let it go through an RFC and
be a documented change.

And not, like it is now, an undocumented, random change creating an
inconsistency in the Tokenizer.

Smile,
Juliette

1: https://github.com/php/php-src/issues/14926#issuecomment-2228855422
2: https://externals.io/message/124462#124515

1 year ago by tim@bastelstu.be — view source

unread

Tim, you're making my point for me. This is exactly why the current
change should be reverted.

I am not sure how you read "PHP users have no idea what a token is" as
an argument in favor of reverting the change, because reverting the
change means that completely reasonable code suddenly stops working with
a parser error in a patch version and PHP users will rightfully come to
PHP's issue tracker to complain.

And not, like it is now, an undocumented, random change creating an
inconsistency in the Tokenizer.

The tokenizer is doing the right thing: It tokenizes the PHP source
code. It is absolutely normal that PHP first and second-digit updates
make changes to the token stream. New tokens are added, old tokens are
removed, tokens may appear in places where they previously could not
appear for well-formed PHP programs. Tools working on the token stream
need to adapt and this change is no different from any other change to
PHP's syntax in that regard (except that documenting the change was
forgotten).

Best regards
Tim Düsterhus

1 year ago by Deleu — view source

unread

Hi

Tim, you're making my point for me. This is exactly why the current
change should be reverted.

I am not sure how you read "PHP users have no idea what a token is" as
an argument in favor of reverting the change, because reverting the
change means that completely reasonable code suddenly stops working with
a parser error in a patch version and PHP users will rightfully come to
PHP's issue tracker to complain.

Is there any evidence that PHP users are relying on code that:

Was released just 7 months ago
Was not documented
Nobody knew about it until very recently

And furthermore, why should undocumented, unintentional, unapproved change
to PHP be supported? Even if a handful of folks come to PHP's issue tracker
to complain, the answer is plain and simple: that behavior was not approved
by PHP's RFC process, which is the only way to get a behavior change
introduced into the language. Realistically, it's highly questionable that
such hypothetical users would even show up.

--
Marco Deleu

1 year ago by tim@bastelstu.be — view source

unread

Is there any evidence that PHP users are relying on code that:

Was released just 7 months ago

Was not documented

Nobody knew about it until very recently

That is the wrong question to ask (see at the bottom).

And furthermore, why should undocumented, unintentional, unapproved change
to PHP be supported? Even if a handful of folks come to PHP's issue tracker

I do not see an issue supporting this from PHP's side. The issue was
raised by a user of PHP, not by a PHP maintainer.

The change has also been explicitly acked by Gina and Christoph before
Ilija committed it and Bob also participated in the PR without raising
concerns, so it's also not an unapproved change. The fact that none of
them anticipated this side effect, doesn't make it unapproved.

to complain, the answer is plain and simple: that behavior was not approved
by PHP's RFC process, which is the only way to get a behavior change
introduced into the language. Realistically, it's highly questionable that

This is false. Small self-contained feature additions do not need to go
through the RFC process. This is generally understood as "no one
requests an RFC being written". The deadline for such a request is not
defined as far as I know, but I believe it would be reasonable to assume
that it implicitly is no later than the release of the PHP version in
question, because otherwise any feature added this way would be at risk
of arbitrary removal.

such hypothetical users would even show up.

I agree, but you can't say this for certain.

Weighting "breaking end user programs with a syntax error in a patch
version" vs. "a power-user function emits unanticipated values in a
field containing free-form data, but still follows its defined
specification of emitting the token stream as PHP sees it", the former
clearly has a much larger possible impact.

Best regards
Tim Düsterhus

1 year ago by Rowan Tommins [IMSoP] — view source

unread

The change has also been explicitly acked by Gina and Christoph before Ilija committed it and Bob also participated in the PR without raising concerns, so it's also not an unapproved change. The fact that none of them anticipated this side effect, doesn't make it unapproved.

I note that Gina commented:

This is kinda cursed... but sure why not.

And Christoph wrote:

Let's say that I'm generally not happy with complicating the parser with work-arounds, but this is okay for me.

So, hardly a confident consensus that this was the right approach. Regardless, we are where we are, and there's general agreement that the current implementation is not ideal, and that we can and should improve it, just not on how and in what version.

Oddly, the crux of this debate seems to be less that all options have major impact, and more that none of them do. If the implementation had caused a serious security or performance regression, I don't think we'd have any hesitation in backing it out if a better implementation wasn't ready; or contrarily, if it had been a headline new feature, we wouldn't even be considering that option.

As it is, an improved implementation is proposed: https://github.com/php/php-src/pull/15041 Comparing this purely against 8.2, it seems a reasonable compromise: consumers of the token stream still need to make a change, but it's the slightly more intuitive one of "treat yield and from as separate tokens, which might have other tokens between". Personally, I think it's reasonable to consider that a bug fix to the original change, and push it into 8.3.

Projects using the token stream could detect the versions with the imperfect implementation, and emit an error explaining the "bug" if a T_YIELD_FROM token doesn't match /yield\s+from/ Projects parsing the source code on their own will have to handle this however they handle contexts where comments have always been allowed.

The Ubuntu LTS situation is unfortunate, but maybe Ondřej Surý will have an opinion on what to do with the patch there.

Regards,
Rowan Tommins
[IMSoP]

1 year ago by Deleu — view source

unread

On Sun, Jul 21, 2024 at 12:14 PM Rowan Tommins [IMSoP] imsop.php@rwec.co.uk
wrote:

Oddly, the crux of this debate seems to be less that all options have
major impact, and more that none of them do.

My only disagreement with your statement (and Tim's messages) is that one
option MAY impact some imaginary PHP users while the other option IS
impacting community members that are key contributors to the success of the
language. I don't agree with the sentiment of "given the 2 options
available, we prefer the option where we know we are negatively impacting
community members consuming token_get_all() in favor of protecting
imaginary PHP users that would have been adventurous enough to make use of
comments between yield and from."

--
Marco Deleu

1 year ago by Rowan Tommins [IMSoP] — view source

unread

I don't agree with the sentiment of "given the 2 options
available, we prefer the option where we know we are negatively impacting
community members consuming token_get_all() in favor of protecting
imaginary PHP users that would have been adventurous enough to make use of
comments between yield and from."

I understand what you're saying, but I think there are more than two groups of people, and more than two options:

As well as users who have already made use of the feature, we are implicitly talking about users who will use it in future. If nobody ever wants to use it, then it makes no difference if we revert or not, and no difference if CodeSniffer can support it or not. If they do want to use it, and we revert, they suffer. If they want to use it, and use CodeSniffer, then they will suffer if CodeSniffer decides it cannot be sensibly handled. Bob's proposed alternative implementation might mean that these users get everything they want (if JRF and others say they can consume that token stream).

On the other hand, Bob's proposal affects a different set of users: those whose code contains "parse from" without extra comments, which will now emit two tokens instead of one. They will suffer if tools are not updated to handle this change.

Reverting now, and later implementing the feature a different way will impact multiple of these groups, just at different times.

Regards,
Rowan Tommins
[IMSoP]

1 year ago by Juliette Reinders Folmer — view source

unread

Oddly, the crux of this debate seems to be less that all options have
major impact, and more that none of them do. If the implementation had
caused a serious security or performance regression, I don't think
we'd have any hesitation in backing it out if a better implementation
wasn't ready; or contrarily, if it had been a headline new feature, we
wouldn't even be considering that option.

The crux - to me - is that it is an undocumented breaking change, which
by definition is a bug.

As I've said before, I'm not against changing the tokenization, what I'm
speaking up about is that it was done in an inconsistent, semi-random
and undocumented way.

As it is, an improved implementation is proposed:
https://github.com/php/php-src/pull/15041 Comparing this purely
against 8.2, it seems a reasonable compromise: consumers of the token
stream still need to make a change, but it's the slightly more
intuitive one of "treat yield and from as separate tokens, which might
have other tokens between". Personally, I think it's reasonable to
consider that a bug fix to the original change, and push it into 8.3.

Projects using the token stream could detect the versions with the
imperfect implementation, and emit an error explaining the "bug" if a
T_YIELD_FROM token doesn't match /yield\s+from/ Projects parsing the
source code on their own will have to handle this however they handle
contexts where comments have always been allowed.

I agree the improved tokenization proposed in the PR makes better sense.

Having said that, it will make the breaking change much much larger as
it now would no longer just be a change which can be handled in the
PHPCS token stream, but one which will impact individual sniffs, what
with the removal and introduction of new tokens.

If that PR gets merged, it will mean that PHPCS will need to undo the
PHP 8.3 and PHP 8.4 tokenization for the time being (in the 3.x major)
and will only "polyfill" the new tokenization as of PHPCS 4.0. In
practice, we would be moving the breaking change to PHPCS 4.0 version to
not break sniffs.

Smile,
Juliette

1 year ago by Rowan Tommins [IMSoP] — view source

unread

The crux - to me - is that it is an undocumented breaking change, which by definition is a bug.

There are two parts of this which are bugs, in my opinion:

That it wasn't documented, e.g. with a line in UPGRADING listing the affected tokens.
That the tokenisation consumes the comment as part of the token, rather than just performing a lookahead.

One is easily fixed; the other is more subtle, but maybe fixable.

As I've said before, I'm not against changing the tokenization, what I'm speaking up about is that it was done in an inconsistent, semi-random and undocumented way.

As others have said, there is nothing unusual in the process that was followed here. A minor change was proposed via Pull Request, discussed with multiple core contributors, and wasn't deemed significant enough for a wider discussion or RFC.

The documentation probably should have been caught during that review, because it's a common checklist item. The behaviour of the token stream could have been, but we got unlucky and nobody thought of it. There's no guarantee that a different process would have done better - there have been changes which went through a whole RFC process, then a year later someone points out a flaw that could have been avoided; that's life.

Now that we have spotted it, we need to decide what to do.

Regards,
Rowan Tommins
[IMSoP]

1 year ago by Christoph M. Becker — view source

unread

And not, like it is now, an undocumented, random change creating an
inconsistency in the Tokenizer.

The tokenizer is doing the right thing: It tokenizes the PHP source
code. It is absolutely normal that PHP first and second-digit updates
make changes to the token stream. New tokens are added, old tokens are
removed, tokens may appear in places where they previously could not
appear for well-formed PHP programs. Tools working on the token stream
need to adapt and this change is no different from any other change to
PHP's syntax in that regard (except that documenting the change was
forgotten).

If the tokenizer would tokenize a whole file as a single token, would
that also be correct? Of course, I'm exaggerating, but
https://3v4l.org/qIf2c doesn't look correct to me – "yield /* comment
*/ from" shouldn't be a single token.

Cheers,
Christoph

1 year ago by tim@bastelstu.be — view source

unread

If the tokenizer would tokenize a whole file as a single token, would
that also be correct? Of course, I'm exaggerating, but

The function is documented as:

token_get_all() parses the given code string into PHP language tokens using the Zend engine's lexical scanner.

So if Zend engine's lexical scanner would tokenize a complete PHP file
as a single token, then the output would be correct, yes.

https://3v4l.org/qIf2c doesn't look correct to me – "yield /* comment
*/ from" shouldn't be a single token.

Looking at this from a user expectation perspective and not a technical
perspective, neither should yield from be a single token in:

 var_dump(token_get_all('&lt;?php yield from $foo;'));

Best regards
Tim Düsterhus

1 year ago by Christoph M. Becker — view source

unread

If the tokenizer would tokenize a whole file as a single token, would
that also be correct? Of course, I'm exaggerating, but

The function is documented as:

token_get_all() parses the given code string into PHP language tokens
using the Zend engine's lexical scanner.

So if Zend engine's lexical scanner would tokenize a complete PHP file
as a single token, then the output would be correct, yes.

Apparently we cannot even agree that there is a bug, and as such we
cannot fix it in PHP 8.3 according to the release process documentation.

https://3v4l.org/qIf2c doesn't look correct to me – "yield /* comment
*/ from" shouldn't be a single token.

Looking at this from a user expectation perspective and not a technical
perspective, neither should yield from be a single token in:

var_dump(token_get_all('<?php yield from $foo;'));

ACK

Cheers,
Christoph

1 year ago by Larry Garfield — view source

unread

Hi

Tim, you're making my point for me. This is exactly why the current
change should be reverted.

I am not sure how you read "PHP users have no idea what a token is" as
an argument in favor of reverting the change, because reverting the
change means that completely reasonable code suddenly stops working with
a parser error in a patch version and PHP users will rightfully come to
PHP's issue tracker to complain.

And not, like it is now, an undocumented, random change creating an
inconsistency in the Tokenizer.

The tokenizer is doing the right thing: It tokenizes the PHP source
code. It is absolutely normal that PHP first and second-digit updates
make changes to the token stream. New tokens are added, old tokens are
removed, tokens may appear in places where they previously could not
appear for well-formed PHP programs. Tools working on the token stream
need to adapt and this change is no different from any other change to
PHP's syntax in that regard (except that documenting the change was
forgotten).

Best regards
Tim Düsterhus

Yes, any syntax change means tools need to adapt, but that doesn't mean tokenization can change randomly and accidentally. Syntax changes require an RFC. If an RFC passes that necessitates SA tools update, so be it. (That happens almost every version.) But that's still an RFC change.

I would agree with reverting this change for now. The odds of it breaking something for someone are vanishingly small, and it's a bug, not a feature. If we want it to be a feature, we can make an RFC for it.

--Larry Garfield

1 year ago by tim@bastelstu.be — view source

unread

Yes, any syntax change means tools need to adapt, but that doesn't mean tokenization can change randomly and accidentally. Syntax changes require an RFC. If an RFC passes that necessitates SA tools update, so be it. (That happens almost every version.) But that's still an RFC change.

I would agree with reverting this change for now. The odds of it breaking something for someone are vanishingly small, and it's a bug, not a feature. If we want it to be a feature, we can make an RFC for it.

It has become a feature no later than November, 23rd, 2023, when PHP
8.3.0 has been released.

There is no bug, token_get_all() still is working within its specification.

We can't just break backwards compatibility with well-formed PHP scripts
when it causes an inconvenience for tool authors and at the same time
bend over backwards with an impact analysis to find out if the
deprecation (not removal!) of a property that literally always is null
would be too much of a backwards compatibility impact.

The PHP users writing PHP scripts accepted by PHP 8.3 and visually
fitting into PHP’s syntax with regard to comment placement are not at
fault here and shouldn't bear the consequences.

Best regards
Tim Düsterhus

1 year ago by Ilija Tovilo — view source

unread

Hi Juliette

On Sat, Jul 20, 2024 at 6:40 PM Juliette Reinders Folmer
php-internals_nospam@adviesenzo.nl wrote:

As I've said: I agree that the current situation is unfortunate. But the correct solution is not "disallow comments", but "split the T_YIELD_FROM into T_YIELD T_WHITESPACE T_FROM_FOR_YIELD_FROM".

Tim, you're making my point for me. This is exactly why the current change should be reverted.

I'm not against changing the tokenization of "yield from" and the GH ticket thread also contains an alternative proposal for this from Bob [1], but like Matthew also said [2]: if that's something we want to do, let's have a proper discussion about it and let it go through an RFC and be a documented change.

And not, like it is now, an undocumented, random change creating an inconsistency in the Tokenizer.

Thank you for raising this issue. Let me share my viewpoint as the
person who has made the change.

First of all, while allowing comments between yield and from was
obviously intentional, the side-effect of embedding the comment in the
yield_from token was something I did not sufficiently consider. That's
my bad.

I agree that it's not very likely that many people started placing
comments between yield and from, but certainly not impossible. The
main problem with reverting this change is that such code will
completely stop working. Patch updates are routinely installed on
servers without additional testing, so it's very much possible that we
would brick peoples production environments.

To compare the impact on PHP_CodeSniffer: From what I understand
(please correct me if I'm wrong), there are really only two possible
scenarios when encountering comments between yield and from:

PHP_CodeSniffer doesn't see the comment and acts as if it weren't
there, and potentially removes it when fixing errors automatically.
PHP_CodeSniffer thinks this is a syntax error and thus reports it.

While that's not optimal, it is limited to a subset of the already
small set of people using this new feature, namely the ones also using
PHP_CodeSniffer. It is also limited to development environments, where
PHP_CodeSniffer is executed.

What I'd suggest instead is fixing this for master in a way that makes
it simple for PHP_CodeSniffer to add support, or to just revert it
there. If it is indeed true that nobody uses this feature after 8
months, then waiting another 4 for a proper fix doesn't seem like an
issue.

In the likely case of not finding consensus on this issue: Setting up
a vote doesn't seem straight-forward to me either. Would we consider
8.3 the baseline, with reverting the change requiring the 2/3
majority, or do we consider 8.2 the baseline?

Ilija

1 year ago by Juliette Reinders Folmer — view source

unread

Hi Ilja,

I agree that it's not very likely that many people started placing
comments between yield and from, but certainly not impossible. The
main problem with reverting this change is that such code will
completely stop working. Patch updates are routinely installed on
servers without additional testing, so it's very much possible that we
would brick peoples production environments.
That can be said about any bug fix though. The space bar heating
anecdote comes to mind.

To compare the impact on PHP_CodeSniffer: From what I understand
(please correct me if I'm wrong), there are really only two possible
scenarios when encountering comments between yield and from:

PHP_CodeSniffer doesn't see the comment and acts as if it weren't
there, and potentially removes it when fixing errors automatically.

PHP_CodeSniffer thinks this is a syntax error and thus reports it.

Actually, the impact is a bit larger than that (false
positives/negatives for indentation checks, annotations being ignored,
comments not being checked etc), but that's a different discussion.

The more problematic impact is that this undocumented change in
tokenization can currently cause the autofixer of PHPCS to create
parse errors in code.

While that's not optimal, it is limited to a subset of the already
small set of people using this new feature, namely the ones also using
PHP_CodeSniffer. It is also limited to development environments, where
PHP_CodeSniffer is executed.

While I understand the interest in minimizing the impact, there are
millions of people using PHP_CodeSniffer every day.

In the likely case of not finding consensus on this issue: Setting up
a vote doesn't seem straight-forward to me either. Would we consider
8.3 the baseline, with reverting the change requiring the 2/3
majority, or do we consider 8.2 the baseline?

For a potential vote, I would think more along the lines of consistency
rules, i.e. "what is allowed in a token" (whitespace, new lines,
comments) ? And then, depending on the proposal, what tokens would be
impacted if this is changed (namespaced names, cast tokens,
heredoc/nowdoc openers, yield from) and how will this impact these tokens ?

Smile,
Juliette

1 year ago by deleugyn@gmail.com — view source

unread

Hi all,

I recently discovered a change was made to the Tokenizer in PHP 8.3.0 which now allows for a comment to exist between the yield and from keyword from the yield from keyword.
Before PHP 8.3, this was a parse error.

This change was not documented (anywhere) nor publicized, which is why I only recently came across it and opened a bug report on GitHub about it [1].

This change is a breaking change for projects consuming the token stream, but more than that, it introduces an inconsistency in the Tokenizer as there is no other single token in PHP which can contain a comment (aside from comments tokens of course).
Comments were even explicitly forbidden in the PHP 8.0 RFC which changed the tokenization of namespaced names [2].
Some more detailed explanation and examples can be found in the GH thread [3] and a detailed analysis of the tokenization change can be found in a ticket in the PHP_CodeSniffer repo [4] (details in a fold out in the comment).

In my opinion the change is a bug and should be reverted, but opinions on this are divided.
After a discussion in the ticket on GitHub, it was suggested to ask for a third/fourth/fifth opinion here on the list.

Pros for reverting it:

Consistency with other tokens in the token stream, none of which allow comments within the token.

Consistency with the 7 years before in which this was a parse error.

The change was never documented, not in the changelog, news, nor in the upgrading guide.

Cons against reverting it:

The change has been in PHP 8.3 releases for a little over seven months.

The odd few people may have accidentally discovered the bug and taken advantage by introducing code containing yield /*comment*/ from into their project, which with a revert would become a parse error again.

Opinions ?

Of course, the ticket yielded discussion on whether the tokenization of yield from should be changed anyhow, but please consider that discussion out of scope.
The current question is only about reverting the bug versus regarding it as a new feature and properly documenting it.

Smile,
Juliette

1: https://github.com/php/php-src/issues/14926
2: https://wiki.php.net/rfc/namespaced_names_as_token#backward_incompatible_changes
3: https://github.com/PHPCSStandards/PHP_CodeSniffer/issues/529#issuecomment-2209337419
4: https://github.com/php/php-src/issues/14926#issuecomment-2227152061

My opinion would be to revert it. Chances of someone accidentally discovering and actually using it seems unlikely at best. Even then, the parser error could be easily resolved by moving the comment out of the middle of the token with insignificant readability change. Forcing all tooling that uses token_get_all() to handle this unintentional change seems to generate more unnecessary and real busywork for something only theoretical possible to break.

1 year ago by tim@bastelstu.be — view source

unread

Forcing all tooling that uses token_get_all() to handle this unintentional change seems to generate more unnecessary and real busywork for something only theoretical possible to break.

The tools are required to handle this either way, because there are
released version with this specific tokenization and they are not going
away.

Ubuntu 24.04 LTS ships with PHP 8.3.6 and generally Ubuntu backports
security fixes instead of upgrading to newer patch versions. As an
example, Ubuntu 22.04 LTS ships with PHP 8.1.2 + security fixes, not
with 8.1.29 (which is the newest 8.1.x as of now).

Thus the ship has effectively sailed due to the inclusion in Ubuntu
24.04 LTS as the arguably most widely used Linux distro.

Best regards
Tim Düsterhus

1 year ago by deleugyn@gmail.com — view source

unread

I can't speak for Juliette's plans but I would advocate for pinning PHP 8.3.{fixed} on Composer and having an error message for PHAR. It's also worth mentioning that just because it's possible to add comments in those versions doesn't mean we must assume that it wi be done and it must be supported. If a Ubuntu LTS user gets an error, pointing out that they're relying on an accidental behavior only present in 9 patch releases of the entire PHP lifecycle seems good enough.

Hi

Forcing all tooling that uses token_get_all() to handle this unintentional change seems to generate more unnecessary and real busywork for something only theoretical possible to break.

The tools are required to handle this either way, because there are released version with this specific tokenization and they are not going away.

Ubuntu 24.04 LTS ships with PHP 8.3.6 and generally Ubuntu backports security fixes instead of upgrading to newer patch versions. As an example, Ubuntu 22.04 LTS ships with PHP 8.1.2 + security fixes, not with 8.1.29 (which is the newest 8.1.x as of now).

Thus the ship has effectively sailed due to the inclusion in Ubuntu 24.04 LTS as the arguably most widely used Linux distro.

Best regards
Tim Düsterhus

1 year ago by Christoph M. Becker — view source

unread

Hi Tim!

Forcing all tooling that uses token_get_all() to handle this
unintentional change seems to generate more unnecessary and real
busywork for something only theoretical possible to break.

The tools are required to handle this either way, because there are
released version with this specific tokenization and they are not going
away.

Well, these tools could reject such code, and tell users to update to a
version where this is no longer valid syntax.

Ubuntu 24.04 LTS ships with PHP 8.3.6 and generally Ubuntu backports
security fixes instead of upgrading to newer patch versions. As an
example, Ubuntu 22.04 LTS ships with PHP 8.1.2 + security fixes, not
with 8.1.29 (which is the newest 8.1.x as of now).

Thus the ship has effectively sailed due to the inclusion in Ubuntu
24.04 LTS as the arguably most widely used Linux distro.

I have to agree that this is a strong argument, but I don't think we're
absolutely obliged to stick with what have right now. We still can
claim that we've made an unintended behavioral change, aka. introduced a
bug, and fix it – downstream projects have to deal with this (the same
as we may have to deal with unwelcome upstream fixes).

And frankly, how much code would be affected? I mean, does anybody
actually put a comment between yield and from? Is there a case
where this may make sense? "Because we can" isn't a strong argument, in
my opinion.

Cheers,
Christoph

1 year ago by tim@bastelstu.be — view source

unread

And frankly, how much code would be affected? I mean, does anybody
actually put a comment between yield and from? Is there a case
where this may make sense? "Because we can" isn't a strong argument, in
my opinion.

I don't really follow this line of argumentation:

If folks do not use the syntax anyways, then we do not need to have this
discussion, because the tools can just ignore it existing. That also
means we do not need to revert the change in PHP.

If folks use the syntax, then reverting the change is a breaking change
for them.

So either the revert is not doing anything at all, or the revert is
actively harmful. I do not see a situation where reverting the change is
a value-add.

Best regards
Tim Düsterhus

1 year ago by deleugyn@gmail.com — view source

unread

Hi

And frankly, how much code would be affected? I mean, does anybody
actually put a comment between yield and from? Is there a case
where this may make sense? "Because we can" isn't a strong argument, in
my opinion.

I don't really follow this line of argumentation:

If folks do not use the syntax anyways, then we do not need to have this discussion, because the tools can just ignore it existing. That also means we do not need to revert the change in PHP.

If folks use the syntax, then reverting the change is a breaking change for them.

So either the revert is not doing anything at all, or the revert is actively harmful. I do not see a situation where reverting the change is a value-add.

Best regards
Tim Düsterhus

The value add of the revert is because time is a moving target. We don't think anyone is using it yet, given the circumstances that made this happen. Wait long enough and the only guarantee we have is entropy.

1 year ago by Rob Landers — view source

unread

Hi

And frankly, how much code would be affected? I mean, does anybody
actually put a comment between yield and from? Is there a case
where this may make sense? "Because we can" isn't a strong argument, in
my opinion.

I don't really follow this line of argumentation:

If folks do not use the syntax anyways, then we do not need to have this discussion, because the tools can just ignore it existing. That also means we do not need to revert the change in PHP.

If folks use the syntax, then reverting the change is a breaking change for them.

So either the revert is not doing anything at all, or the revert is actively harmful. I do not see a situation where reverting the change is a value-add.

Best regards
Tim Düsterhus

The value add of the revert is because time is a moving target. We don't think anyone is using it yet, given the circumstances that made this happen. Wait long enough and the only guarantee we have is entropy.

Even if people are using it, if fixing it is better than leaving it… Oh well.

A perfect example is the GMP class being left not “final” that allows for some really nice semantics. It’s currently being voted on and it appears an unanimous vote to make it final will pass.

The language can and will change. Sometimes in ways we don’t like.

— Rob

1 year ago by Matthew Weier O'Phinney — view source

unread

Hi

And frankly, how much code would be affected? I mean, does anybody
actually put a comment between yield and from? Is there a case
where this may make sense? "Because we can" isn't a strong argument, in
my opinion.

I don't really follow this line of argumentation:

If folks do not use the syntax anyways, then we do not need to have this
discussion, because the tools can just ignore it existing. That also
means we do not need to revert the change in PHP.

If folks use the syntax, then reverting the change is a breaking change
for them.

By this line of reasoning, any bug that makes it into the engine that
somebody finds a use for shouldn't be fixed.

I totally understand wanting to keep this in, but because it made it in
without an RFC, it's (a) inconsistent with other areas of the language, and
(b) does not have a full design which includes tokenizer support.

Considering the brief time it's been in the engine, and the fact that it's
not documented, I feel it should be reverted and only re-added if someone
takes the time and effort to propose the change properly.