libxml2's charset encoding auto-detection mode is broken with the push
parser in current versions of libxml2, I found that recently:
http://bugzilla.gnome.org/show_bug.cgi?id=162613
but trying to force it can trigger infinite loops in libxml2, which is
what happens in http://bugs.php.net/?id=32001
So I think it's best to not force this mode. Future versions of libxml2
will set parser->charset to XML_CHAR_ENCODING_NONE by default with the
push parser and will hence work as desired with no explicit setting of
parser->charset required.
Is this patch OK?
http://www.apache.org/~jorton/php_xmlenc.diff
Index: ext/xml/compat.c
RCS file: /repository/php-src/ext/xml/compat.c,v
retrieving revision 1.32.2.7
diff -u -r1.32.2.7 compat.c
--- ext/xml/compat.c 17 Dec 2004 12:21:34 -0000 1.32.2.7
+++ ext/xml/compat.c 17 Feb 2005 11:12:08 -0000
@@ -379,8 +379,6 @@
}
if (encoding != NULL) {
parser->parser->encoding = xmlStrdup(encoding);
- } else {
-
}parser->parser->charset = XML_CHAR_ENCODING_NONE;
parser->parser->replaceEntities = 1;
parser->parser->wellFormed = 0;
Index: ext/xml/tests/bug32001.phpt
===================================================================
RCS file: ext/xml/tests/bug32001.phpt
diff -N ext/xml/tests/bug32001.phpt
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ ext/xml/tests/bug32001.phpt 17 Feb 2005 11:12:08 -0000
@@ -0,0 +1,40 @@
+--TEST--
+Bug #32001 (infinite loop in libxml character encoding detection)
+--FILE--
+<?php
+$myparser = xml_parser_create('');
+$simple = "<para><note>simple note</note></para>";
+xml_parse_into_struct($myparser, $simple, $myvals, $mytags);
+var_dump($myvals);
+--EXPECT--
+array(3) {
- [0]=>
- array(3) {
- ["tag"]=>
- string(4) "PARA"
- ["type"]=>
- string(4) "open"
- ["level"]=>
- int(1)
- }
- [1]=>
- array(4) {
- ["tag"]=>
- string(4) "NOTE"
- ["type"]=>
- string(8) "complete"
- ["level"]=>
- int(2)
- ["value"]=>
- string(11) "simple note"
- }
- [2]=>
- array(3) {
- ["tag"]=>
- string(4) "PARA"
- ["type"]=>
- string(5) "close"
- ["level"]=>
- int(1)
- }
+}
libxml2's charset encoding auto-detection mode is broken with the push
parser in current versions of libxml2, I found that recently:http://bugzilla.gnome.org/show_bug.cgi?id=162613
but trying to force it can trigger infinite loops in libxml2, which is
what happens in http://bugs.php.net/?id=32001So I think it's best to not force this mode. Future versions of libxml2
will set parser->charset to XML_CHAR_ENCODING_NONE by default with the
push parser and will hence work as desired with no explicit setting of
parser->charset required.
Any BC breaks with that? Do I have to know now the encoding of the XML
document, before I can use the push parser? But reading your bugreport
at gnome.org, I assume it just defaults to UTF-8, right?
chregu
Is this patch OK?
http://www.apache.org/~jorton/php_xmlenc.diff
Index: ext/xml/compat.c
RCS file: /repository/php-src/ext/xml/compat.c,v
retrieving revision 1.32.2.7
diff -u -r1.32.2.7 compat.c
--- ext/xml/compat.c 17 Dec 2004 12:21:34 -0000 1.32.2.7
+++ ext/xml/compat.c 17 Feb 2005 11:12:08 -0000
@@ -379,8 +379,6 @@
}
if (encoding != NULL) {
parser->parser->encoding = xmlStrdup(encoding);
- } else {
}parser->parser->charset = XML_CHAR_ENCODING_NONE;
parser->parser->replaceEntities = 1;
parser->parser->wellFormed = 0;
Index: ext/xml/tests/bug32001.phpt
===================================================================
RCS file: ext/xml/tests/bug32001.phpt
diff -N ext/xml/tests/bug32001.phpt
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ ext/xml/tests/bug32001.phpt 17 Feb 2005 11:12:08 -0000
@@ -0,0 +1,40 @@
+--TEST--
+Bug #32001 (infinite loop in libxml character encoding detection)
+--FILE--
+<?php
+$myparser = xml_parser_create('');
+$simple = "<para><note>simple note</note></para>";
+xml_parse_into_struct($myparser, $simple, $myvals, $mytags);
+var_dump($myvals);
+--EXPECT--
+array(3) {
- [0]=>
- array(3) {
- ["tag"]=>
- string(4) "PARA"
- ["type"]=>
- string(4) "open"
- ["level"]=>
- int(1)
- }
- [1]=>
- array(4) {
- ["tag"]=>
- string(4) "NOTE"
- ["type"]=>
- string(8) "complete"
- ["level"]=>
- int(2)
- ["value"]=>
- string(11) "simple note"
- }
- [2]=>
- array(3) {
- ["tag"]=>
- string(4) "PARA"
- ["type"]=>
- string(5) "close"
- ["level"]=>
- int(1)
- }
+}
--
christian stocker | Bitflux GmbH | schoeneggstrasse 5 | ch-8004 zurich
phone +41 1 240 56 70 | mobile +41 76 561 88 60 | fax +41 1 240 56 71
http://www.bitflux.ch | chregu@bitflux.ch | gnupg-keyid 0x5CE1DECB
It looks like there would be BC breaks unless libxml with the bug fix is
used as the encoding is detected properly and no infinite loop if an xml
declaration or BOM is used in the xml. So basically with the patch there
is no more autodetecting if used with any other libxml versions (though
no more possibilities of inifinte loops).
i.e.
$simple = "<?xml version='1.0'?><para><note>simple note</note></para>";
$simple = "\xEF\xBB\xBF<para><note>simple note</note></para>";
These both currently work without going into an infinite loop.
Rob
Christian Stocker wrote:
Any BC breaks with that? Do I have to know now the encoding of the XML
document, before I can use the push parser? But reading your bugreport
at gnome.org, I assume it just defaults to UTF-8, right?
It looks like there would be BC breaks unless libxml with the bug fix is
used as the encoding is detected properly and no infinite loop if an xml
declaration or BOM is used in the xml. So basically with the patch there
is no more autodetecting if used with any other libxml versions (though
no more possibilities of inifinte loops).
That's not quite right: detection based on an ASCII <?xml prolog with an
explicit encoding= still works fine with the patch applied (e.g. for
encoding=ISO-8859-1 documents). It's only documents which have a BOM
which will then fail to parse.
So it is a bit of a tricky trade-off...
So it is a bit of a tricky trade-off...
How about #ifdef'ifying it? It's lame though...
Moriyoshi
Right, as long as you explicity include the encoding.
So with the patch if you include a BOM, it wont parse (previous behavior
worked fine) and if you include a prolog without explicit encoding it
will always use UTF-8 (previous behavior was to autodetect encoding
based on the charset used for first few chars of the prolog).
Would it be possible to ifdef it then and for older libxml (only needed
when trying to use autoencoding) see if its possible to see if
xmlDetectEndocding can be used prior to sending off to parser and if it
returns no encoding then set the charset to utf-8 to avoid the infinite
loop? This would preserver BC for anyone using prolog with no explicit
encoding or BOM.
Rob
Joe Orton wrote:
It looks like there would be BC breaks unless libxml with the bug fix is
used as the encoding is detected properly and no infinite loop if an xml
declaration or BOM is used in the xml. So basically with the patch there
is no more autodetecting if used with any other libxml versions (though
no more possibilities of inifinte loops).That's not quite right: detection based on an ASCII <?xml prolog with an
explicit encoding= still works fine with the patch applied (e.g. for
encoding=ISO-8859-1 documents). It's only documents which have a BOM
which will then fail to parse.So it is a bit of a tricky trade-off...
What about this patch? Really hacky as the charset is checked before the
inital parse and it basically duplicates the libxml code with the
correct fix, but seems to work ok. Havent tried it with any large
datasets yet which require multiple parse calls, but it should work.
Rob
Joe Orton wrote:
That's not quite right: detection based on an ASCII <?xml prolog with an
explicit encoding= still works fine with the patch applied (e.g. for
encoding=ISO-8859-1 documents). It's only documents which have a BOM
which will then fail to parse.So it is a bit of a tricky trade-off...