PHP's handling of BOM (byte order mark)

9 years ago by Sammy Kaye Powers

Hey internals!

In a recent discussion on PHP Roundtable, we talked about the byte
order mark in php files. If you create a php file with the following:

<?php
header("X-foo: Bar");
echo "Foo!".PHP_EOL;

And save it as UTF-8 with BOM, interesting things happen depending on
the SAPI & configuration.

If you run it from the CLI you get an error:

PHP Warning: Cannot modify header information - headers already sent by (output started at %s:1) in %s on line %d

But it doesn't seem to return the BOM to std out (but I could be doing
this part wrong). If you run it from php -S, and load it in a
browser, the web server returns a code point \u{feff} as the first
code point of the response body.

BOM's should not be treated as characters and should not be sent to
the output. Is there any reason this should be considered the expected
behavior? If not, I'd like to create an RFC to change it. :)
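
A minimal sketch of one way to check what actually reaches the output, assuming default ini settings (in particular zend.multibyte off). The temp file and variable names are made up for illustration. Output buffering stands in for the terminal or browser so the leading bytes can be inspected as hex; a terminal typically renders U+FEFF as nothing, which may be why the BOM seemed to be missing on the CLI.

```php
<?php
// Sketch: include a BOM-prefixed script and inspect the raw bytes it emits.
// Assumes default ini settings (zend.multibyte off); the temp file is hypothetical.
$bom  = "\xEF\xBB\xBF"; // UTF-8 encoding of U+FEFF
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, $bom . "<?php\necho \"Foo!\";\n");

ob_start();               // capture output instead of sending it
include $path;
$output = ob_get_clean();
unlink($path);

// Print the first three bytes as hex so an invisible BOM becomes visible.
echo bin2hex(substr($output, 0, 3)), PHP_EOL;
```

If the hex dump shows efbbbf, the BOM was treated as literal output ahead of the opening tag.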

Thanks,
Sammy Kaye Powers
sammyk.me

9 years ago by Stanislav Malyshev

Hi!

> BOM's should not be treated as characters and should not be sent to
> the output. Is there any reason this should be considered the expected
> behavior?

The reason would be PHP does not know where surrounding output ends and
the code starts, beyond <?php. That means if there is some stuff in the
file before <?php, it would be output - and it's an intended behavior,
and so will happen with BOM too. Particular sequence of bytes being BOM
and whether it is desired or not depends on context, but PHP engine does
not have this context. Remember that pure HTML page is also a valid PHP
file.
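
The point that anything before <?php counts as surrounding output can be illustrated with the tokenizer. This is only a sketch: it assumes the tokenizer extension is loaded and zend.multibyte is off, and the sample source string is invented for the example.

```php
<?php
// Sketch: ask the tokenizer how it classifies bytes that precede the open tag.
// Requires the tokenizer extension; assumes zend.multibyte is off.
$source = "\xEF\xBB\xBF<?php echo 'Foo!';";
$tokens = token_get_all($source);

// The first token is an array of [token id, text, line number].
[$id, $text] = $tokens[0];
echo token_name($id), ' ', bin2hex($text), PHP_EOL;
```

Under those assumptions the BOM bytes come back as a T_INLINE_HTML token, the same token type used for literal HTML in a template.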

Stas Malyshev
smalyshev@gmail.com

9 years ago by Sara Golemon

>> BOM's should not be treated as characters and should not be sent to
>> the output. Is there any reason this should be considered the expected
>> behavior?

> The reason would be PHP does not know where surrounding output ends and
> the code starts, beyond <?php. That means if there is some stuff in the
> file before <?php, it would be output - and it's an intended behavior,
> and so will happen with BOM too. Particular sequence of bytes being BOM
> and whether it is desired or not depends on context, but PHP engine does
> not have this context. Remember that pure HTML page is also a valid PHP
> file.

I'm with Sammy on the principle that being able to have a BOM in a
given file is important to any non-ascii code development. Though we
can argue whether that's good or even necessary, I honestly don't know
how prevalent non-english coding is among PHP developers.

In fact, the idea of stripping content from a script file isn't
without precedent. Shebang lines are routinely removed from
cli/cgi/fpm, and if you want to properly output it, you need to do so
in a coded echo statement. (The stripping only applies to a literal,
non-scripting line in the file, not dynamic output).

So can we apply the same to the BOM? There's the obvious BC danger of
files which might depend on this behavior (declaring their encoding
via BOM, which happens to be the same as the script encoding).

So how about declare statement?

{U+FEFF}<?php
declare(strip_bom=true);

code(); code(); code();

It's got the advantage of being per-file (a view template might
actually want the BOM included, while some business logic piece
doesn't, for example). It's a compile-time strip, so it has no runtime
cost. It's non-surprising, since it's stated in every file for which
the BOM strip is intentional.

-Sara

9 years ago by Stanislav Malyshev

Hi!

> In fact, the idea of stripping content from a script file isn't
> without precedent. Shebang lines are routinely removed from
> cli/cgi/fpm, and if you want to properly output it, you need to do so

True, because in the context of CLI we know what is expected - a CLI
script which can start with #!. It is very unlikely that we'd have a
template run directly as CLI script and we would have this template
starting with #! which we want to output. But we lack such context in a
generic script - namely, the context that would tell us if it's safe to
drop the BOM.

> So can we apply the same to the BOM? There's the obvious BC danger of
> files which might depend on this behavior (declaring their encoding
> via BOM, which happens to be the same as the script encoding).

Given that BOM in script files is mostly useless, and BOM in UTF-8 is
useless and not recommended for use either, I don't see why we need to.

In general, I don't think BOM is a real issue worth messing with the
lexer. Surely, from time to time somebody would use weird editor which
produces BOMs, like editing PHP scripts in Word. Surely, they'd have
weird effects that would force them to spend 5 minutes googling and
fixing it. I don't think it is the reason to spend day-persons of our
collective time to find a fix to this very niche problem and risk
potential BC issues.

If it is really becoming an issue, we could probably make the lexer
treat BOM+<? the same as <?, but I'm not convinced it is a serious
enough issue.
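
A userland approximation of the "BOM+<? is the same as <?" rule, purely as a sketch. This is not how the engine behaves today, it only handles the UTF-8 BOM, and the function name is made up for illustration.

```php
<?php
// Hypothetical userland approximation of the "BOM+<? equals <?" rule.
// Not engine behavior; dropBomBeforeOpenTag() is an invented name.
function dropBomBeforeOpenTag(string $source): string
{
    $bom = "\xEF\xBB\xBF";
    // Strip only when the BOM is immediately followed by an opening tag.
    if (strncmp($source, $bom . '<?', 5) === 0) {
        return substr($source, 3);
    }
    // Anything else is left alone, e.g. a template that starts with
    // a BOM followed by literal text.
    return $source;
}
```

With this rule a file starting with BOM + "<?php" loses the BOM, while a file starting with BOM + "Hello <?= ... ?>" keeps it as output.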

> So how about declare statement?
>
> {U+FEFF}<?php
> declare(strip_bom=true);

That presumes you know there's BOM in the beginning of your file. If so,
why don't you just delete it instead of typing a long declare directive?
If you don't know it, you'd be forced to add it to every (non-template)
file in your codebase - which sounds a bit excessive.

--
Stas Malyshev
smalyshev@gmail.com

9 years ago by Sara Golemon

>> In fact, the idea of stripping content from a script file isn't
>> without precedent. Shebang lines are routinely removed from
>> cli/cgi/fpm, and if you want to properly output it, you need to do so

> True, because in the context of CLI we know what is expected - a CLI
> script which can start with #!. It is very unlikely that we'd have a
> template run directly as CLI script and we would have this template
> starting with #! which we want to output. But we lack such context in a
> generic script - namely, the context that would tell us if it's safe to
> drop the BOM.

That was the idea of the declare(), to provide that context, since it
can't be reliably inferred.

>> So can we apply the same to the BOM? There's the obvious BC danger of
>> files which might depend on this behavior (declaring their encoding
>> via BOM, which happens to be the same as the script encoding).

> Given that BOM in script files is mostly useless, and BOM in UTF-8 is
> useless and not recommended for use either, I don't see why we need to.
>
> In general, I don't think BOM is a real issue worth messing with the
> lexer. Surely, from time to time somebody would use weird editor which
> produces BOMs, like editing PHP scripts in Word. Surely, they'd have
> weird effects that would force them to spend 5 minutes googling and
> fixing it. I don't think it is the reason to spend day-persons of our
> collective time to find a fix to this very niche problem and risk
> potential BC issues.

Agreed it's niche, and agreed that it's mostly the editor's fault for
putting the BOM in place to begin with. Disagree on the value of the
time that would be needed to provide some sort of benefit.

I will say though, that you're almost certainly right that it's not a
significant problem (if it's one at all), and I'd want to hear from
people who encounter this on a regular basis for which there isn't a
much simpler fix available (such as disabling BOM emission in their
editor of choice).

> If it is really becoming an issue, we could probably make the lexer
> treat BOM+<? the same as <?, but I'm not convinced it is a serious
> enough issue.

That's probably a reasonable compromise on the context issue. It
provides a clean escape hatch for intentional BOMs by echoing those
bytes from script, even if it is magic behavior which is generally to
be avoided.

> That presumes you know there's BOM in the beginning of your file. If so,
> why don't you just delete it instead of typing a long declare directive?

Dunno. I just like to argue.

-Sara

9 years ago by Andreas Heigl

Hi All.

As the BOM is only relevant for UTF-16 and UTF-32 encoded files, and
UTF-8 encoded files are strongly discouraged from having one[1] ("Use
of a BOM is neither required nor recommended for UTF-8"), two
questions arise IMO:

1. Does PHP support files encoded in UTF-16 or UTF-32? If so, we need to
   handle the BOM somehow. If not, is that a requirement?
2. Wouldn't it be an easier approach to have a userland lib that scans
   files for a BOM and raises a warning? Like an add-on to
   php-cs-fixer or something like that? Especially the UTF-8 BOM
   (\xEF\xBB\xBF) right at the start of a file would be easy to spot.
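
A rough sketch of what such a userland scanner could look like. The function names are hypothetical and this is not an existing php-cs-fixer API; it only checks for the UTF-8 BOM and only looks at files with a .php extension.

```php
<?php
// Sketch of a userland BOM scanner. Function names are hypothetical;
// this is not part of php-cs-fixer or any other existing tool.
function startsWithUtf8Bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false; // unreadable files are simply skipped
    }
    $head = fread($handle, 3);
    fclose($handle);
    return $head === "\xEF\xBB\xBF";
}

/** @return string[] sorted paths of .php files under $dir that start with a UTF-8 BOM */
function findBomFiles(string $dir): array
{
    $found = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if ($file->isFile()
            && $file->getExtension() === 'php'
            && startsWithUtf8Bom($file->getPathname())
        ) {
            $found[] = $file->getPathname();
        }
    }
    sort($found);
    return $found;
}
```

A build step could call findBomFiles() on the source directory and print a warning for each returned path.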

Just my 0.02€

Cheers

Andreas

[1] www.unicode.org/versions/Unicode5.0.0/ch02.pdf#page=30

(Replying to Sara Golemon's message of 31.05.16, 05:52.)

--
,,,
(o o)
+---------------------------------------------------------ooO-(_)-Ooo-+
| Andreas Heigl |
| mailto:andreas@heigl.org N 50°22'59.5" E 08°23'58" |
| http://andreas.heigl.org http://hei.gl/wiFKy7 |
+---------------------------------------------------------------------+
| http://hei.gl/root-ca |
+---------------------------------------------------------------------+

9 years ago by Derick Rethans

> If it is really becoming an issue, we could probably make the lexer
> treat BOM+<? the same as <?, but I'm not convinced it is a serious
> enough issue.

That would break the case when somebody is trying to serve/generate
a file which starts with a BOM though...

cheers,
Derick

--
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine

9 years ago by Sammy Kaye Powers

Hey Stanislav!

> In general, I don't think BOM is a real issue worth messing with the
> lexer. Surely, from time to time somebody would use weird editor which
> produces BOMs, like editing PHP scripts in Word. Surely, they'd have
> weird effects that would force them to spend 5 minutes googling and
> fixing it. I don't think it is the reason to spend day-persons of our
> collective time to find a fix to this very niche problem and risk
> potential BC issues.

The issue is that the BOM causes errors that are not easy to Google.
Some developers will have issues with their sessions not working.
Others with their custom headers not being sent. Others with "strange
characters" showing up everywhere. There are myriad reasons why any
one of those things could be happening that are not BOM related all
the while a BOM is sitting there in their files wearing an
"invisibility cloak" so-to-speak. :) So they potentially try 10 things
from Stack Overflow that don't fix the issue and give up.
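
One way to make the "invisible" cause visible is the built-in headers_sent(), which can report the file and line where output began. The sketch below adds a hypothetical helper, describeEarlyOutput(), that turns that location into a BOM hint; the helper and its message wording are assumptions for illustration only.

```php
<?php
// Sketch: turn the location reported by headers_sent() into a BOM hint.
// describeEarlyOutput() is a hypothetical helper, not a PHP built-in.
function describeEarlyOutput(string $file, int $line): string
{
    // Read at most the first three bytes of the reported file.
    $head = is_readable($file)
        ? (string) file_get_contents($file, false, null, 0, 3)
        : '';
    if ($line === 1 && $head === "\xEF\xBB\xBF") {
        return "Output started at $file:1 and the file begins with a UTF-8 BOM";
    }
    return "Output started at $file:$line";
}

// headers_sent() fills $file and $line with the place where output began.
if (headers_sent($file, $line)) {
    error_log(describeEarlyOutput($file, $line));
}
```

Placed just before a failing header() or session_start() call, a check like this would point at the offending file instead of leaving the developer guessing.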

I checked GitHub for issues related to this and a few quick searches
turned up a handful of issues possibly related to the BOM output:

https://github.com/search?l=PHP&q=%22byte+order+mark%22+headers+sent&ref=searchresults&type=Issues&utf8=%E2%9C%93
https://github.com/search?l=PHP&q=bom+headers+sent&ref=searchresults&type=Issues&utf8=%E2%9C%93

But the real hum-dinger was from Stack Overflow:

http://stackoverflow.com/search?q=php+bom

It does seem to be tripping up a lot of people, especially newbies. As
low as the learning curve is for PHP already, I'm curious if you folks
think it's advantageous to have PHP ignore the BOM in std out in the
case of {U+FEFF}<?php to remove another stumbling block.

9 years ago by Lester Caine

> But the real hum-dinger was from Stack Overflow:
>
> http://stackoverflow.com/search?q=php+bom
>
> It does seem to be tripping up a lot of people, especially newbies. As
> low as the learning curve is for PHP already, I'm curious if you folks
> think it's advantageous to have PHP ignore the BOM in std out in the
> case of {U+FEFF}<?php to remove another stumbling block.

BUT is this actually anything to do with BOM?
From stackoverflow
http://stackoverflow.com/questions/35549518/php-import-csv-file-utf8-with-bom
Check the answer.

I'm not saying PHP handles UTF-8 properly, but most of the problems tend
to be more to do with the source file encoding rather than PHP? The
majority of the answers, even from years back, are 'don't use BOM', but I
am curious where the \u{feff} comes from in your original post since the
UTF-8 BOM is \u{efbbbf} and anything else should be stripped. The
starting point should perhaps be http://unicode.org/faq/utf_bom.html
which acknowledges that its use can be a problem and specifically ...

> Some byte oriented protocols expect ASCII characters at the beginning of
> a file. If UTF-8 is used with these protocols, use of the BOM as
> encoding form signature should be avoided.

Which is where the advice not to use them comes from ...
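
On the \u{feff} versus EF BB BF question, a short sketch may help: U+FEFF is the code point, and EF BB BF is that same code point once encoded as UTF-8. The variable names are only illustrative; the "\u{...}" escape needs PHP 7.0 or later.

```php
<?php
// U+FEFF is the code point; EF BB BF is that code point encoded as UTF-8.
// The "\u{...}" escape (PHP 7.0+) emits UTF-8 bytes in double-quoted strings.
$bomFromCodePoint = "\u{FEFF}";
$bomFromBytes     = "\xEF\xBB\xBF";

echo bin2hex($bomFromCodePoint), PHP_EOL;      // efbbbf
var_dump($bomFromCodePoint === $bomFromBytes); // bool(true)
```

So a browser reporting U+FEFF as the first code point and a hex dump showing EF BB BF are describing the same three bytes.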

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

9 years ago by keisial@gmail.com

> BOM's should not be treated as characters and should not be sent to
> the output. Is there any reason this should be considered the expected
> behavior? If not, I'd like to create an RFC to change it. :)

What about
«Hello Foo!
Today is <?= date("d F Y") ?>» ?

If there's a BOM, should it be sent?

9 years ago by Andrea Faulds

Hi Sammy,

Sammy Kaye Powers wrote:

> If you create a php file with the following:
>
> <?php
> header("X-foo: Bar");
> echo "Foo!".PHP_EOL;
>
> And save it as UTF-8 with BOM, interesting things happen depending on
> the SAPI & configuration.
>
> If you run it from the CLI you get an error:
>
> PHP Warning: Cannot modify header information - headers already sent by (output started at %s:1) in %s on line %d
>
> But it doesn't seem to return the BOM to std out (but I could be doing
> this part wrong). If you run it from php -S, and load it in a
> browser, the web server returns a code point \u{feff} as the first
> code point of the response body.
>
> BOM's should not be treated as characters and should not be sent to
> the output. Is there any reason this should be considered the expected
> behavior? If not, I'd like to create an RFC to change it. :)

I suspect that this part of the Zend Engine is much-neglected, but PHP
actually can detect the BOM, and strip it from the output, if you have
zend.multibyte turned on:

https://github.com/php/php-src/blob/3b0a6dfeb2896fb204db48d11364c09942b1ad01/Zend/zend_language_scanner.l#L292

I haven't tried this myself, though.

Thanks.

Andrea Faulds
https://ajf.me/
