Hi everybody!
There are several bug reports regarding "broken" fputcsv()
behavior in
our tracker, namely, because the $escape parameter causes unexpected
results. For instance:
<?php
$row = ['a\\"', 'bbb'];
$fp = fopen('php://memory', 'w+');
fputcsv($fp, $row);
rewind($fp);
echo stream_get_contents($fp);
fclose($fp);
?>
outputs
"a\"",bbb
instead of the expected
"a\""",bbb
I don't think the current behavior is a bug, but rather the escape
character is an extension to the CSV "standard" (RFC 7111). One can
practically disable the escape character by passing any character that
is not contained in any of the strings in the array; "\0" is usually a
good choice, so changing line 5 in the script above to the following
gives the desired behavior:
fputcsv($fp, $row, ',', '"', "\0");
Cf. https://3v4l.org/InlUv vs. https://3v4l.org/tVFBo.
It is, however, not possible to pass an empty string as $escape
argument, because fputcsv()
bails out in this case raising
Warning: fputcsv()
: escape must be a character
I suggest to allow an empty string instead, and to consider making this
the default in a future version, probably after some time of deprecating
any other $escape argument.
Thoughts?
--
Christoph M. Becker
So empty string would enable the standard behavior RFC 7111 with no escape char.
If so, I support this.
I don't know if '' or true / false / null should be this "special value".
It has to be something that was not legal before, and that people
should use intentionally and not by accident.
I guess '' is good enough for this, or not worse than other options.
One question: Does this also require a change in fgetcsv()
?
So it can read csv without escape character?
I remember that fgetcsv()
does currently tolerate some broken CSV. It
should continue to do so for BC reasons.
For the record, here are some links:
https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv/46342634#46342634
https://bugs.php.net/bug.php?id=74713&edit=2
Hi everybody!
There are several bug reports regarding "broken"
fputcsv()
behavior in
our tracker, namely, because the $escape parameter causes unexpected
results. For instance:<?php $row = ['a\\"', 'bbb']; $fp = fopen('php://memory', 'w+'); fputcsv($fp, $row); rewind($fp); echo stream_get_contents($fp); fclose($fp); ?>
outputs
"a\"",bbb
instead of the expected
"a\""",bbb
I don't think the current behavior is a bug, but rather the escape
character is an extension to the CSV "standard" (RFC 7111). One can
practically disable the escape character by passing any character that
is not contained in any of the strings in the array; "\0" is usually a
good choice, so changing line 5 in the script above to the following
gives the desired behavior:fputcsv($fp, $row, ',', '"', "\0");
Cf. https://3v4l.org/InlUv vs. https://3v4l.org/tVFBo.
It is, however, not possible to pass an empty string as $escape
argument, becausefputcsv()
bails out in this case raisingWarning:
fputcsv()
: escape must be a characterI suggest to allow an empty string instead, and to consider making this
the default in a future version, probably after some time of deprecating
any other $escape argument.Thoughts?
--
Christoph M. Becker
So empty string would enable the standard behavior RFC 7111 with no
escape char.
If so, I support this.
Just a note regarding standards: please make sure the behaviour of common applications, most notably Excel, is considered and tested. In my experience it has its own quirks, and it's likely that a large proportion of users require interoperability with it. It may be there's nothing relevant to consider, but I thought I'd mention it, in case people get too caught up with a specification that nobody actually follows.
Regards,
--
Rowan Collins
[IMSoP]
So empty string would enable the standard behavior RFC 7111 with no
escape char.
If so, I support this.Just a note regarding standards:
Actually, there is no accepted standard regarding CSV. RFC 7111 is just
an update to RFC 4180 (sorry, I've messed that up), but anyway both are
just informational.
please make sure the behaviour of common applications, most notably
Excel, is considered and tested. In my experience it has its own quirks,
and it's likely that a large proportion of users require
interoperability with it. It may be there's nothing relevant to
consider, but I thought I'd mention it, in case people get too caught up
with a specification that nobody actually follows.
That's exactly the point of this proposal. As it is, fputcsv()
outputs
CSV that won't be understood by Excel (or any other CSV aware
implementation I'm aware of), as soon as the escape character is
actually part of any string. So being able to avoid any such escaping
would be helpful wrt. major CSV implementations, and making that the
default even more so.
The only other issue that I can see is that currently our fputcsv()
hard-codes the line endings to LF (while RFC 4180 demands CRLF), but
that may not be a real problem anymore. (Well, there might be issues
with non ASCII compatible character encodings as well).
--
Christoph M. Becker
please make sure the behaviour of common applications, most notably
Excel, is considered and tested. In my experience it has its own quirks,
and it's likely that a large proportion of users require
interoperability with it. It may be there's nothing relevant to
consider, but I thought I'd mention it, in case people get too caught up
with a specification that nobody actually follows.That's exactly the point of this proposal. As it is,
fputcsv()
outputs
CSV that won't be understood by Excel (or any other CSV aware
implementation I'm aware of), as soon as the escape character is
actually part of any string. So being able to avoid any such escaping
would be helpful wrt. major CSV implementations, and making that the
default even more so.
The other problem that this proposal wants to fix is that CSV cannot
currently be used reliably to store or transport data between
different PHP scripts, due to this bug:
https://bugs.php.net/bug.php?id=74713&edit=2
If one cell of the data sent to
fputcsv()
contains "{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}", this cell will be split after a round trip offputcsv()
+fgetcsv()
.
The only other issue that I can see is that currently our
fputcsv()
hard-codes the line endings to LF (while RFC 4180 demands CRLF), but
that may not be a real problem anymore. (Well, there might be issues
with non ASCII compatible character encodings as well).
I have opened a lot of php-generated CSV files with LibreOffice Calc,
and the line endings have not been a problem.
I assume Excel and OpenOffice would also be ok with it.
I'd say stick with existing behavior for BC reasons, or discuss it separately.
Hi everybody!
There are several bug reports regarding "broken"
fputcsv()
behavior in
our tracker, namely, because the $escape parameter causes unexpected
results. For instance:
I looked at fixing some of the CSV related bugs about a year or so
ago. My conclusions were:
i) There is no way to fix the problems that wouldn't cause horrible BC
breaks for code that is only coincidentally working currently.
ii) Handling strings in C is much more error prone than handling them in PHP.
I'm reasonably certain that trying to fix the current functions is the
wrong approach, and one of the following would be much better.
Either, find a C library that has already been proven to handle CSV
parsing/generating 'correctly' and bring that into PHP core under
either new function names or namespace.
Or, write the code in PHP (or just use
https://github.com/thephpleague/csv) and find a way to make that fast
enough for people to use.
Touching the existing code is pretty certain to bring a lot of pain,
without resulting in a fully compliant csv parser/generator.
cheers
Dan
Ack
Hi everybody!
There are several bug reports regarding "broken"
fputcsv()
behavior in
our tracker, namely, because the $escape parameter causes unexpected
results. For instance:I looked at fixing some of the CSV related bugs about a year or so
ago. My conclusions were:i) There is no way to fix the problems that wouldn't cause horrible BC
breaks for code that is only coincidentally working currently.
How so?
What we can do:
If $escape_char !== '', fputcsv()
should work exactly as it does now.
It can even use the same C code.
If $escape_char === '', fputcsv()
will have a different behavior than
currently, where no character is treated as special.
If we want to be 100% sure, these two cases can use completely
separate C code, with one if() at the entry point.
ii) Handling strings in C is much more error prone than handling them in PHP.
I'm reasonably certain that trying to fix the current functions is the
wrong approach, and one of the following would be much better.Either, find a C library that has already been proven to handle CSV
parsing/generating 'correctly' and bring that into PHP core under
either new function names or namespace.
Yeah why not. But it will require a lot more work and design decisions
than just introducing a new allowed value for $enclosure parameter.
Or, write the code in PHP (or just use
https://github.com/thephpleague/csv) and find a way to make that fast
enough for people to use.
It can never be as fast as native functions.
I vaguely remember something like factor 2 or 3 when I tried it, but
for now treat this as hearsay.
Touching the existing code is pretty certain to bring a lot of pain,
without resulting in a fully compliant csv parser/generator.cheers
Dan
Ack
There are several bug reports regarding "broken"
fputcsv()
behavior in
our tracker, namely, because the $escape parameter causes unexpected
results. For instance:I looked at fixing some of the CSV related bugs about a year or so
ago. My conclusions were:i) There is no way to fix the problems that wouldn't cause horrible BC
breaks for code that is only coincidentally working currently.ii) Handling strings in C is much more error prone than handling them in PHP.
I'm reasonably certain that trying to fix the current functions is the
wrong approach, and one of the following would be much better.Either, find a C library that has already been proven to handle CSV
parsing/generating 'correctly' and bring that into PHP core under
either new function names or namespace.Or, write the code in PHP (or just use
https://github.com/thephpleague/csv) and find a way to make that fast
enough for people to use.Touching the existing code is pretty certain to bring a lot of pain,
without resulting in a fully compliant csv parser/generator.
I agree that php_fgetcsv() has serious issues, and it might not be
possible to fix it without causing severe BC breaks. php_fputcsv(), on
the other hand, is less of a problem, though.
Overall, the most demanding issue is that both functions try to regard
the current locale, but already fail that generally, since several
parameters are declared as char, which can't work for (some) multibyte
encodings. For instance, it is impossible to generate proper UTF-16
encoded CSV files, or to read them. This issue continues, because
several (mostly?) whitespace characters are hard-coded assuming an ASCII
compatible character encoding.
A minor issue are the hard-coded record terminators, which are currently
LF (RFC 4180 specifies CRLF). Apparently, this isn't a real issue
nowadays (besides the missing support for non ASCII compatible character
encodings).
Another issue concerns the escape character. Frankly, I don't even have
the slightest clue how that is supposed to work, and why it even has
been introduced in the first place. Maybe it has been introduced for
compatibility with some application requiring it; maybe it has been
introduced to support "DSV style"[1]. If the latter, at least
php_fputcsv() doesn't support it (anymore). Unless it's clear what the
escape character exactly is supposed to do, we cannot even hope to
fix the implementation.
Introducing new functions with a clearly defined behavior would be nice,
but appears to me somewhat as pie in the sky (somebody would have to do
the actual work!). But even if we do so, at least the actual behavior
of the existing functions would have to be documented.
And frankly, I don't see why it would be a problem to allow to use no
escape character for fputcsv()
. That certainly wouldn't be a BC break,
since currently the function bails out if escape_str_len < 1
. Of
course, that wouldn't fix all issues, but it appears to make the
function work as expected for ASCII compatible character encodings (for
other character encodings the function appears to be broken anyway).
Ad league/csv: rather impressive! However, including this functionality
into ext/standard is totally over the top, in my opinion. I guess that
fgetcsv()
and fputcsv()
are mostly used for importing from and exporting
to CSV, respectively, but not as replacement for an SQL database engine.
[1] http://www.catb.org/~esr/writings/taoup/html/ch05s02.html
--
Christoph M. Becker
I don't think the current behavior is a bug, but rather the escape
character is an extension to the CSV "standard" (RFC 7111).
Are you sure you mean RFC 7111 ?
I was just parrotting the number in my previous email, but now I
looked it up and only find this:
https://tools.ietf.org/html/rfc7111
This talks about uri fragments with CSV, not CSV itself.
RFC 4180 seems to be closer to what we are looking for:
https://tools.ietf.org/html/rfc4180#section-2
fputcsv()
and fgetcsv()
already have a number of extensions to this
format, which I think are not harmful and that we should keep:
- option to choose a different delimiter
- option to choose a different enclosure
- option to have a different number of cells per row.
-
fgetcsv()
has some tolerance for broken CSV, that we should continue
to support.
In the stackoverflow discussion, someone argues that line breaks are
not part of the standard / not portable:
https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv/46342634#comment75882780_44427926
However, the RFC 4180 that I found clearly mentions them:
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
So, all we would change is the escape behavior. In RFC 4180, it says:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.