Hi,
Since version 5.2.5, the htmlspecialchars() and htmlentities() functions
return an empty string when the input contains even a single invalid or
incomplete Unicode sequence.
As I understand it, this change was made to avoid reading more chars from
the buffer than it actually contains.
Should these functions really discard the whole string for a single
incomplete sequence?
I made a patch which changes the behavior of these functions to skip invalid
sequences without discarding the whole string. It involves very few
changes and makes the behavior of these functions more consistent with
previous PHP versions, while keeping the fixes that were made in the
get_next_char() internal function.
The patch: http://s3.amazonaws.com/arnaud.lb/php_htmlentities_utf.patch
The bug entry: http://bugs.php.net/bug.php?id=43896
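For readers who want to reproduce the behavior described in this message, a minimal sketch follows. It assumes a current PHP, where the flags have to be passed explicitly because the default flags of htmlspecialchars() changed after this thread was written; the byte strings are arbitrary examples.

```php
<?php
// "abcd" followed by 0xE0, a UTF-8 lead byte with no continuation bytes.
$broken = "abcd" . chr(0xE0);
$valid  = "a < b";

// Flags and charset are passed explicitly so the result does not depend on
// version-specific defaults.
$escapedBroken = htmlspecialchars($broken, ENT_QUOTES, 'UTF-8');
$escapedValid  = htmlspecialchars($valid, ENT_QUOTES, 'UTF-8');

var_dump($escapedBroken); // the whole string is discarded
var_dump($escapedValid);  // only the "<" is escaped
```

Note that the four valid leading bytes are lost along with the single broken one, which is the behavior this thread is about.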
> Should these functions really discard the whole string for a single
> incomplete sequence?
I think that since it is not possible to recover the true content of the
string, it is OK to return a failure value. Cutting it in random places or
ignoring problems doesn't seem like a good idea - it might lead to all kinds
of nasty things, such as security filtering checking one piece of data and
the database getting entirely different data.
--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
> > Should these functions really discard the whole string for a single
> > incomplete sequence?
> I think that since it is not possible to recover the true content of the
> string, it is OK to return a failure value. Cutting it in random places or
> ignoring problems doesn't seem like a good idea - it might lead to all kinds
> of nasty things, such as security filtering checking one piece of data and
> the database getting entirely different data.
Instead of using a simple sanitizing function, users are forced to check for
errors. How good is that? It makes code complex or unreliable.
htmlspecialchars() and htmlentities() are not used to sanitize database
data. What kind of errors do you expect in htmlspecialchars()? I think the
supported charsets don't have alternative symbols at 0x22, 0x26, 0x27,
0x3C, 0x3E. Only CJK charsets and htmlentities() might have issues. With any
other charset you know the start and end byte of each symbol. If you think
that broken UTF-8 can cause issues, strip or sanitize the broken symbols.
If users detect an error in htmlspecialchars(), they will use str_replace()
to provide some failsafe instead of losing the whole text, and that
won't solve security issues.
--
Tomas
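The str_replace()-style fallback mentioned above might look like the following sketch. It illustrates the argument and is not a recommendation; the function name escape_bytes() is hypothetical, and strtr() is used instead of str_replace() so that the order of the replacements does not matter. It relies on the fact that in UTF-8 the bytes 0x22, 0x26, 0x27, 0x3C and 0x3E only ever occur as the ASCII characters themselves, never inside a multibyte sequence.

```php
<?php
// Hypothetical helper: escape the five HTML special characters byte by byte,
// without validating the encoding. Invalid bytes pass through untouched.
function escape_bytes(string $s): string
{
    return strtr($s, [
        '&' => '&amp;',
        '"' => '&quot;',
        "'" => '&#039;',
        '<' => '&lt;',
        '>' => '&gt;',
    ]);
}

// The stray 0xE0 byte is kept as-is; the markup characters are still escaped.
$out = escape_bytes("a<b" . chr(0xE0) . ">");
echo bin2hex($out), "\n";
```

As the message points out, such a fallback keeps the text but leaves the invalid bytes in the output, so it does not address the security concerns raised elsewhere in the thread.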
> Instead of using a simple sanitizing function, users are forced to check
> for errors. How good is that? It makes code complex or unreliable.
Explain to me again how checking for errors makes code unreliable?
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com
> > Instead of using a simple sanitizing function, users are forced to check
> > for errors. How good is that? It makes code complex or unreliable.
> Explain to me again how checking for errors makes code unreliable?
OR unreliable. If you check for errors, the sanitizing code is complex. If
you don't check and expect it to work, the code can cause data loss (it is
unreliable).
--
Tomas
> > Should these functions really discard the whole string for a single
> > incomplete sequence?
> I think that since it is not possible to recover the true content of the
> string, it is OK to return a failure value. Cutting it in random places or
> ignoring problems doesn't seem like a good idea - it might lead to all kinds
> of nasty things, such as security filtering checking one piece of data and
> the database getting entirely different data.
I don't think so. The job of htmlspecialchars() is to replace character
sequences which may be interpreted as HTML special characters by the browser.
Its job is not to validate a string or to check whether it will be passed
correctly to a DB. With my patch, htmlspecialchars() does just that.
There are many ways for an invalid Unicode sequence to end up in user input.
In normal situations, text typed in a form element will be sent in the correct
encoding by the browser, but what about file uploads? What if the browser
itself sends invalid sequences (e.g. copy/paste of Word documents into a form
and/or WYSIWYG-enabled elements using IE)? Bugs 43896, 43294 and 43549 also
report these problems.
This new htmlspecialchars() version will be a nightmare for many PHP users if
it is left as is.
On Fri, 25 Jan 2008 14:22:52 -0800, in php.internals stas@zend.com
(Stanislav Malyshev) wrote:
> > Should these functions really discard the whole string for a single
> > incomplete sequence?
> I think that since it is not possible to recover the true content of the
> string, it is OK to return a failure value. Cutting it in random places or
> ignoring problems doesn't seem like a good idea - it might lead to all kinds
> of nasty things, such as security filtering checking one piece of data and
> the database getting entirely different data.
On the other hand, utf8_decode() also expects the input to be UTF-8
encoded, but it replaces incomplete sequences with the character "?".
I don't know if this is a recommended standard for invalid input, but I
have seen this conversion in a couple of other applications as well,
e.g. Firefox.
--
- Peter Brodersen
Peter Brodersen wrote:
> On Fri, 25 Jan 2008 14:22:52 -0800, in php.internals stas@zend.com
> (Stanislav Malyshev) wrote:
> > > Should these functions really discard the whole string for a single
> > > incomplete sequence?
> > I think that since it is not possible to recover the true content of the
> > string, it is OK to return a failure value. Cutting it in random places or
> > ignoring problems doesn't seem like a good idea - it might lead to all kinds
> > of nasty things, such as security filtering checking one piece of data and
> > the database getting entirely different data.
> On the other hand, utf8_decode() also expects the input to be UTF-8
> encoded, but it replaces incomplete sequences with the character "?".
> I don't know if this is a recommended standard for invalid input, but I
> have seen this conversion in a couple of other applications as well,
> e.g. Firefox.
utf8_decode() doesn't replace invalid chars with a "?", e.g.:
php -r '$a="abcd".chr(0xE0);echo
iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
0000000 61 62 63 64 0a 61 62 63 64 03
So iconv(), when told to take UTF-8 as input and spit out UTF-8 as
output, strips out invalid UTF-8 chars, whereas utf8_decode() does who
knows what. 0xE0 gets converted to 0x03?
It would be a horrendously bad idea to replace invalid chars with some
other valid char. Way worse than returning nothing. Think about what
would happen in a regex, for example, if a user was able to inject a '?'
by sending an invalid utf-8 sequence that ends up in a regular expression.
If we are going to do anything here, it would be to strip the invalid
utf-8 bytes, but technically that's not a great solution from a security
perspective. The results could be quite unexpected. The most secure
approach is to fail on invalid input. It's your job to validate input
and feed the function the input it expects.
-Rasmus
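One way to follow the "validate input first" advice is sketched below. It is only an illustration: the helper name escape_validated() is hypothetical, and the validity check uses preg_match() with the /u modifier, which fails on subjects that are not valid UTF-8. mb_check_encoding() would be an alternative where the mbstring extension is available.

```php
<?php
// Hypothetical helper: reject invalid UTF-8 up front instead of relying on
// the empty-string return value of htmlspecialchars().
function escape_validated(string $input): string
{
    // With the /u modifier, preg_match() returns false when the subject is
    // not valid UTF-8; the empty pattern matches any valid string.
    if (preg_match('//u', $input) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
}

echo escape_validated('1 < 2'), "\n";

try {
    escape_validated("abcd" . chr(0xE0));
} catch (InvalidArgumentException $e) {
    echo 'rejected: ', $e->getMessage(), "\n";
}
```

Failing loudly here keeps the decision about what to do with bad input (reject, log, re-request) in the caller, rather than hiding it inside the escaping function.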
On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals
rasmus@lerdorf.com (Rasmus Lerdorf) wrote:
> > On the other hand, utf8_decode() also expects the input to be UTF-8
> > encoded, but it replaces incomplete sequences with the character "?".
> utf8_decode() doesn't replace invalid chars with a "?", e.g.:
> php -r '$a="abcd".chr(0xE0);echo
> iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
> 0000000 61 62 63 64 0a 61 62 63 64 03
Yes it does, but not in your case :-)
However:
$ php -r '$a="abcd".chr(0xE0)."e"; echo
iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);'|hd
00000000 61 62 63 64 0a 61 62 63 64 3f |abcd.abcd?|
$ php -r 'print utf8_decode("Fløde på æblegrød");'
Fl?p?blegr?
> It would be a horrendously bad idea to replace invalid chars with some
> other valid char. Way worse than returning nothing. Think about what
> would happen in a regex, for example, if a user was able to inject a '?'
> by sending an invalid utf-8 sequence that ends up in a regular expression.
I don't disagree with you, and I have thought of the same issue
(although I suppose any sanitization should happen after any given
conversion; charsets other than UTF-8 might be able to encode low bytes
such as "?" as well - but this is beside the point).
I'm not fond of the "?" feature either, but it is present in utf8_decode()
and in other non-PHP applications that do UTF-8 conversion.
My guess is still that some standard recommends this conversion as a
possible fallback for error handling.
--
- Peter Brodersen
On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals rasmus@lerdorf.com
(Rasmus Lerdorf) wrote:
> php -r '$a="abcd".chr(0xE0);echo
> iconv("utf-8","utf-8",$a)."\n".utf8_decode($a);' | od -t x1
> 0000000 61 62 63 64 0a 61 62 63 64 03
By the way, the 03 in your result is a bit spurious. For me it seems to
differ every time I run that code.
It happens with 0xE0 but not with e.g. 0xE6 (æ). The value seems to be
consistent within a single run, though:
$ php -r 'for($a=0;$a<20;$a++)printf("%02x ",utf8_decode(chr(0xE0)));'
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
$ php -r 'for($a=0;$a<20;$a++)printf("%02x ",utf8_decode(chr(0xE0)));'
09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09
$ php -r 'for($a=0;$a<20;$a++)printf("%02x ",utf8_decode(chr(0xE0)));'
05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05
$ php -r 'for($a=0;$a<20;$a++)printf("%02x ",utf8_decode(chr(0xE0)));'
02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
$ for a in `seq 1 20`; do php -r 'printf("%02x ",utf8_decode(chr(0xE0)));'; done
07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 09 00
$ for a in `seq 1 20`; do php -r 'printf("%02x ",utf8_decode(chr(0xE0)));'; done
08 00 00 02 00 00 00 00 00 05 00 00 00 05 00 00 07 00 09 00
$ for a in `seq 1 20`; do php -r 'printf("%02x ",utf8_decode(chr(0xE0)));'; done
00 00 00 00 00 00 00 00 04 00 08 00 00 00 00 05 00 00 01 00
I don't think there is any reason for this behaviour. I'll file a bug.
--
- Peter Brodersen
On Mon, 28 Jan 2008 17:26:48 -0800, in php.internals rasmus@lerdorf.com
(Rasmus Lerdorf) wrote:
> It would be a horrendously bad idea to replace invalid chars with some
> other valid char. Way worse than returning nothing. Think about what
> would happen in a regex, for example, if a user was able to inject a '?'
> by sending an invalid utf-8 sequence that ends up in a regular expression.
By the way, Unicode characters that don't exist in ISO-8859-1 are also
replaced with a question mark:
$ php -r 'print utf8_decode(pack("c*",0xe2,0x98,0x83));'|od -t x1
0000000 3f
http://php.net/xml also documents this replacement:
If PHP encounters characters in the parsed XML document that can not be
represented in the chosen target encoding, the problem characters will be
"demoted". Currently, this means that such characters are replaced by a
question mark.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" and
"characters that are not within the adopted subset shall be indicated
to the user" by a receiving device. A quite commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. It might be a good idea to
visually distinguish a malformed UTF-8 sequence from a correctly
encoded Unicode character that is just not available in the current
font but otherwise fully legal, even though ISO 10646-1 doesn't
mandate this. In any case, just ignoring malformed sequences or
unavailable characters does not conform to ISO 10646, will make
debugging more difficult, and can lead to user confusion.
--
- Peter Brodersen
Peter Brodersen wrote:
> http://php.net/xml also documents this replacement:
> If PHP encounters characters in the parsed XML document that can not be
> represented in the chosen target encoding, the problem characters will be
> "demoted". Currently, this means that such characters are replaced by a
> question mark.
That was back in the expat days. We don't use that XML parser anymore.
> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt mentions:
> According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
> receiving UTF-8 shall interpret a "malformed sequence in the same way
> that it interprets a character that is outside the adopted subset" and
> "characters that are not within the adopted subset shall be indicated
> to the user" by a receiving device. A quite commonly used approach in
> UTF-8 decoders is to replace any malformed UTF-8 sequence by a
> replacement character (U+FFFD), which looks a bit like an inverted
> question mark, or a similar symbol. It might be a good idea to
> visually distinguish a malformed UTF-8 sequence from a correctly
> encoded Unicode character that is just not available in the current
> font but otherwise fully legal, even though ISO 10646-1 doesn't
> mandate this. In any case, just ignoring malformed sequences or
> unavailable characters does not conform to ISO 10646, will make
> debugging more difficult, and can lead to user confusion.
That part is completely different. That's at the display level.
Replacing it in the backend makes no sense to me. Don't use
utf8_decode(). Use iconv() so you know what the heck is going on.
-Rasmus
> That part is completely different. That's at the display level.
> Replacing it in the backend makes no sense to me. Don't use
> utf8_decode(). Use iconv() so you know what the heck is going on.
:)
iconv() will stop on a conversion error and return a partial string or an
empty string. It will require even more complex code than the 5.2.5
htmlspecialchars() does. With htmlspecialchars() you check for an empty
string before and after the call. With iconv() you check for PHP errors
during the iconv() call.
--
Tomas
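The "check for an empty string before and after the call" pattern described above could be sketched as follows. The helper name escape_or_null() is hypothetical, and the flags are passed explicitly so that no ignore or substitute behavior from later PHP versions is in effect.

```php
<?php
// Hypothetical helper: returns null when htmlspecialchars() discarded the
// input, so callers can tell failure apart from a genuinely empty string.
function escape_or_null(string $input): ?string
{
    if ($input === '') {
        return ''; // empty in, empty out: not an error
    }
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    // A non-empty input that comes back empty means the input was rejected.
    return $escaped === '' ? null : $escaped;
}

var_dump(escape_or_null(''));
var_dump(escape_or_null('Tom & Jerry'));
var_dump(escape_or_null("abcd" . chr(0xE0)));
```

The extra branch for the empty input is the "before" check: without it, an empty string and a rejected string would be indistinguishable.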
> iconv() will stop on a conversion error and return a partial string or an
> empty string. It will require even more complex code than the 5.2.5
> htmlspecialchars() does. With htmlspecialchars() you check for an empty
> string before and after the call. With iconv() you check for PHP errors
> during the iconv() call.
The current implementation of the iconv() wrapper in PHP is a pain in the
@$$: I can only get a truncated string if there's an invalid char, or rely
on the glibc "//IGNORE" flag. "string".encode() in Python is much better:
you can pass 'ignore', 'replace', 'xmlcharrefreplace' or 'backslashreplace'
as the 2nd param of encode().
I agree that ignoring invalid chars in all cases is not good, but
truncating in all cases isn't either. There are some acceptable cases,
like: the user posts -> the server accepts it, ignoring invalid chars -> and
the user has a chance to review his text later.
Too bad that the conversion is implicit, like __set/__get, so that you can't
add one more optional parameter. I hope there's some nice way to get this
problem solved.
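An explicit error-handling parameter in the spirit of Python's encode() could be sketched as below. This is only an illustration: the function escape_html() and its mode names are hypothetical, and the ENT_IGNORE and ENT_SUBSTITUTE flags it maps to were added to PHP after this thread was written.

```php
<?php
// Hypothetical wrapper: choose how invalid UTF-8 is handled, loosely modelled
// on the error modes of Python's str.encode().
function escape_html(string $input, string $mode = 'strict'): string
{
    $flags = ENT_QUOTES;
    switch ($mode) {
        case 'strict':   // invalid input makes the call return ''
            break;
        case 'ignore':   // invalid bytes are dropped silently
            $flags |= ENT_IGNORE;
            break;
        case 'replace':  // invalid bytes become U+FFFD
            $flags |= ENT_SUBSTITUTE;
            break;
        default:
            throw new InvalidArgumentException("Unknown mode: $mode");
    }
    return htmlspecialchars($input, $flags, 'UTF-8');
}

// 0xE9 is "é" in Latin-1 but an incomplete sequence when read as UTF-8.
$input = "caf" . chr(0xE9) . " <b>";

foreach (['strict', 'ignore', 'replace'] as $mode) {
    echo $mode, ': ', bin2hex(escape_html($input, $mode)), "\n";
}
```

The PHP manual discourages ENT_IGNORE because silently dropping bytes can have security implications, which matches the concern raised earlier in the thread; the 'replace' mode keeps a visible marker where the bad byte was.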