Escape \0 in var_dump() output

4 years ago by Nikita Popov — view source

unread

Hi internals,

A regular annoyance with our test suite is that var_dump() prints null
bytes literally, those null bytes get into our test expectations, the test
is interpreted as a binary file by git, and diffs will not be shown on
GitHub.

Of course, the null bytes are also confusing to the reader, as many editors
will not display them, and the only indication that they are there is the
string length.

https://github.com/php/php-src/pull/7059 does a surgical change to replace
null bytes with \0.

A potential problem is that this no longer allows an easy distinction
between "\0" and "\0", as both will be printed as "\0". We could resolve
this by additionally escaping \ (and presumably "). See
https://gist.github.com/nikic/d3d74d88178ea6946305e0b2d38a84f9 for an rough
idea on the impact on Zend and ext/standard tests.

So, I'd be interested in knowing
a) whether var_dump() should be performing any escaping at all, and if yes
b) how much we should be escaping. Just \0, both \0 and \ to avoid
ambiguity, or a larger set of control characters?

Regards,
Nikita

4 years ago by Harm Smits — view source

unread

Hey,

whether var_dump() should be performing any escaping at all, and if yes

Well, since var_dump is essentially for debugging purposes, it makes sense
to escape certain characters. Like you mentioned, some IDEs do not
print/support
null bytes. Hence, I am all for it, as it simplifies debugging in that
scenario (which
is the main objective).

how much we should be escaping. Just \0, both \0 and \ to avoid ambiguity,
or a larger set of control characters?

Probably a larger set of control characters as well. As mentioned before,
var_dump
is for debugging, if you can not get full debugging information, it will
only lead to
headaches and frustrations.

Regardless, I do believe we should provide some kind of 'setting' for this
in
php.ini (like precision for example) just so the user can decide what they
want to do.

4 years ago by Sara Golemon — view source

unread

https://github.com/php/php-src/pull/7059 does a surgical change to replace
null bytes with \0.

Before delving into any other observations, I first want to state
emphatically that we should not ONLY escape chr(0). We either apply full
on addcslashes() (or similar) or do nothing at all. Half-escaped is worse
than not escaped. Sadly, this presents more BC issues than just chr(0) or
'\0' do alone since there are many more potential places where output will
change for anyone using it programmatically.

So should we?

On the one hand: The intent of the var_dump() function is for human
readable debugging. Anyone using this in a programmatic way is Doing It
Wrong™, so I'm not too fussed about changing the output of this function.
On the other hand: WE use this function programmatically in our tests
across many thousands of occurrences. Others probably rightly do as well.
Yes, the dissonance is real.

The most prudent and BC-safe thing would be to add another function
var_dump_escaped() or even to prefer/advise json_encode() when content
safety is relevant, but additional type information is not (most of the
uses of var_dump we have). Unfortunately, this doesn't actually fix the
initial problem you stated, which is just getting useful data out of CI
failures.

The PR you offered is the lightest touch and fixes the issue you cited
without causing any likely damage to current users, but I don't think we
should ignore the ambiguity of the output. Additionally, I think I could
make an easy argument in favor of escaping CR and NL at the very least. At
this point the urge to escape backslash is extra-real as with windows paths
an innocent \n is far more likely.

Actually. Maybe we're thinking of this wrong.

Rather than change the output at all, why not just have a post-process step
in our test runner that transforms the output in the test report? Then we
could be as aggressive as we want, going as far as escaping all
non-printables plus backslash since at that point we're in it for the human
readability (and knowing specific byte sequences is essential) and we break
zero BC for anyone else?

-Sara

4 years ago by Nikita Popov — view source

unread

https://github.com/php/php-src/pull/7059 does a surgical change to
replace
null bytes with \0.

Before delving into any other observations, I first want to state
emphatically that we should not ONLY escape chr(0). We either apply full
on addcslashes() (or similar) or do nothing at all. Half-escaped is worse
than not escaped. Sadly, this presents more BC issues than just chr(0) or
'\0' do alone since there are many more potential places where output will
change for anyone using it programmatically.

So should we?

On the one hand: The intent of the var_dump() function is for human
readable debugging. Anyone using this in a programmatic way is Doing It
Wrong™, so I'm not too fussed about changing the output of this function.
On the other hand: WE use this function programmatically in our tests
across many thousands of occurrences. Others probably rightly do as well.
Yes, the dissonance is real.

FWIW, updating our own tests is easy, because the bulk of the change is
automated, just need to fix up individual tests afterwards. It may be an
inconvenience for extension tests though, which require compatibility with
multiple PHP versions.

The most prudent and BC-safe thing would be to add another function
var_dump_escaped() or even to prefer/advise json_encode() when content
safety is relevant, but additional type information is not (most of the
uses of var_dump we have). Unfortunately, this doesn't actually fix the
initial problem you stated, which is just getting useful data out of CI
failures.

Well, we had https://wiki.php.net/rfc/readable_var_representation a few
weeks ago, which is basically a better var_dump() with escaping and all --
but you can tell how it went ;)

The PR you offered is the lightest touch and fixes the issue you cited
without causing any likely damage to current users, but I don't think we
should ignore the ambiguity of the output. Additionally, I think I could
make an easy argument in favor of escaping CR and NL at the very least. At
this point the urge to escape backslash is extra-real as with windows paths
an innocent \n is far more likely.

Actually. Maybe we're thinking of this wrong.

Rather than change the output at all, why not just have a post-process
step in our test runner that transforms the output in the test report?
Then we could be as aggressive as we want, going as far as escaping all
non-printables plus backslash since at that point we're in it for the human
readability (and knowing specific byte sequences is essential) and we break
zero BC for anyone else?

Transforming the test output (such that the transformed output is checked
in EXPECT) would indeed be a way to address the original motivation.
There's two caveats though:

It does not address the issue for other uses of var_dump(). People being
confused by var_dump((array) $object) output because of the null bytes in
the property names is practically a PHP classic. Of course, fixing this
would take away from the authentic PHP experience :)
Changes to run-tests.php behavior also affect extensions, and I think
this kind of change would be even harder to address for PECL extensions
than a simple var_dump() behavior change (where you can at least easily
fork two versions of the file for the tests that need it).

Regards,
Nikita

4 years ago by Sara Golemon — view source

unread

The most prudent and BC-safe thing would be to add another function

var_dump_escaped() or even to prefer/advise json_encode() when content
safety is relevant, but additional type information is not (most of the
uses of var_dump we have). Unfortunately, this doesn't actually fix the
initial problem you stated, which is just getting useful data out of CI
failures.

Well, we had https://wiki.php.net/rfc/readable_var_representation a few
weeks ago, which is basically a better var_dump() with escaping and all --
but you can tell how it went ;)

I don't specifically remember that thread, but suddenly I feel like I can
guess everything I need to know about the outcome. :(

Rather than change the output at all, why not just have a post-process step

in our test runner that transforms the output in the test report? Then we
could be as aggressive as we want, going as far as escaping all
non-printables plus backslash since at that point we're in it for the human
readability (and knowing specific byte sequences is essential) and we break
zero BC for anyone else?

Transforming the test output (such that the transformed output is checked
in EXPECT) would indeed be a way to address the original motivation.
There's two caveats though:

It does not address the issue for other uses of var_dump(). People
being confused by var_dump((array) $object) output because of the null
bytes in the property names is practically a PHP classic. Of course, fixing
this would take away from the authentic PHP experience :)

True, but neither (IMO) would just escaping null byte, which again I regard
as a half-measure. It has all the negative impact of breaking BC (your
point about extension phpts is a good one), without going far enough to
address the human-readability problem for many common cases (CR/NL chief
among them).

Insert Karate Kid "squash like a bug" reference here. Escape it all, or
escape none of it.

Changes to run-tests.php behavior also affect extensions, and I think
this kind of change would be even harder to address for PECL extensions
than a simple var_dump() behavior change (where you can at least easily
fork two versions of the file for the tests that need it).

Would it? The report at the end of the run is for human eyes, generally
speaking. A human can interpret escaping (or not) as needed.
Also, we could make the post-processing optional.

run-test.php --escape-test-output

Then we use that mode in CI where we need git-friendly diffs, and it works
"normally" elsewhere.

-Sara

4 years ago by Nikita Popov — view source

unread

The most prudent and BC-safe thing would be to add another function

var_dump_escaped() or even to prefer/advise json_encode() when content
safety is relevant, but additional type information is not (most of the
uses of var_dump we have). Unfortunately, this doesn't actually fix the
initial problem you stated, which is just getting useful data out of CI
failures.

Well, we had https://wiki.php.net/rfc/readable_var_representation a few
weeks ago, which is basically a better var_dump() with escaping and all --
but you can tell how it went ;)

I don't specifically remember that thread, but suddenly I feel like I can
guess everything I need to know about the outcome. :(

Rather than change the output at all, why not just have a post-process

step in our test runner that transforms the output in the test report?
Then we could be as aggressive as we want, going as far as escaping all
non-printables plus backslash since at that point we're in it for the human
readability (and knowing specific byte sequences is essential) and we break
zero BC for anyone else?

Transforming the test output (such that the transformed output is checked
in EXPECT) would indeed be a way to address the original motivation.
There's two caveats though:

It does not address the issue for other uses of var_dump(). People
being confused by var_dump((array) $object) output because of the null
bytes in the property names is practically a PHP classic. Of course, fixing
this would take away from the authentic PHP experience :)

True, but neither (IMO) would just escaping null byte, which again I
regard as a half-measure. It has all the negative impact of breaking BC
(your point about extension phpts is a good one), without going far enough
to address the human-readability problem for many common cases (CR/NL chief
among them).

Insert Karate Kid "squash like a bug" reference here. Escape it all, or
escape none of it.

Changes to run-tests.php behavior also affect extensions, and I think
this kind of change would be even harder to address for PECL extensions
than a simple var_dump() behavior change (where you can at least easily
fork two versions of the file for the tests that need it).

Would it? The report at the end of the run is for human eyes, generally
speaking. A human can interpret escaping (or not) as needed.
Also, we could make the post-processing optional.

run-test.php --escape-test-output

Then we use that mode in CI where we need git-friendly diffs, and it works
"normally" elsewhere.

Ah, I think I explained the original issue badly: The test runner output
isn't really a problem, or at least I never perceived it to be.

The problem is that if you change a test that contains a null byte anywhere
(read: in var_dump output), then github will not show a diff for that file.
You can see this nicely in the proposed PR itself:
https://github.com/php/php-src/pull/7059/files Most of the updated test
files show up as "Binary file not shown." As you can imagine, this makes
review hard. Using the .patch file doesn't help either, because it only
contains entries like:

diff --git a/Zend/tests/bug60569.phpt b/Zend/tests/bug60569.phpt
index
480c9c8f8a1285956616a119825520471bbe88b5..c658fa28a00017412a3a840498605efa7f8fe10b
100644
GIT binary patch
delta 25
gcmZ3<w3=zcAEp?CiGLM&Vhj{=Q;Ule(^HkW0DWKx>i_@%

delta 23
ecmZ3@w32DUA4Z0W{}j0y6mnCGixbmRmAC+9#0Sy<

To avoid this issue, we need the transformation to apply the actual
EXPECTed output, otherwise our test files will still be considered as
binary files.

Regards,
Nikita

4 years ago by Sara Golemon — view source

unread

Ah, I think I explained the original issue badly: The test runner output
isn't really a problem, or at least I never perceived it to be.

The problem is that if you change a test that contains a null byte
anywhere (read: in var_dump output), then github will not show a diff for
that file. You can see this nicely in the proposed PR itself:
https://github.com/php/php-src/pull/7059/files Most of the updated test
files show up as "Binary file not shown." As you can imagine, this makes
review hard. Using the .patch file doesn't help either, because it only
contains entries like:

AH! Yeah, I completely misunderstood the problem space. I didn't imagine
for a second we had non-ASCII in any test because that's bad and it should
feel bad.

Okay, then I still disagree with changing var_dump()'s output, but for
different reasons.

So take Zend/tests/bug60569.phpt for example, which has this EXPECT section
(where ^@ is actually the null byte):

--EXPECT--
string(20) "Some error ^@ message"

Instead of changing how var_dump() works in order to produce different
output, I'd change the EXPECT to and EXPECTF and add a new %0 pattern

--EXPECTF--
string(20) "Some error %0 message"

This creates zero BC breaks, allows extensions to continue working just
fine with raw nulls or update to use %0 if they'd like, and creates no
ambiguities.

-Sara

4 years ago by Nikita Popov — view source

unread

Ah, I think I explained the original issue badly: The test runner output
isn't really a problem, or at least I never perceived it to be.

The problem is that if you change a test that contains a null byte
anywhere (read: in var_dump output), then github will not show a diff for
that file. You can see this nicely in the proposed PR itself:
https://github.com/php/php-src/pull/7059/files Most of the updated test
files show up as "Binary file not shown." As you can imagine, this makes
review hard. Using the .patch file doesn't help either, because it only
contains entries like:

AH! Yeah, I completely misunderstood the problem space. I didn't imagine
for a second we had non-ASCII in any test because that's bad and it should
feel bad.

Okay, then I still disagree with changing var_dump()'s output, but for
different reasons.

So take Zend/tests/bug60569.phpt for example, which has this EXPECT
section (where ^@ is actually the null byte):

--EXPECT--
string(20) "Some error ^@ message"

Instead of changing how var_dump() works in order to produce different
output, I'd change the EXPECT to and EXPECTF and add a new %0 pattern

--EXPECTF--
string(20) "Some error %0 message"

This creates zero BC breaks, allows extensions to continue working just
fine with raw nulls or update to use %0 if they'd like, and creates no
ambiguities.

I like that approach. Implemented in
https://github.com/php/php-src/pull/7069.

Regards,
Nikita