20 years ago by Joe Orton — view source — reply

unread

libxml2's charset encoding auto-detection mode is broken with the push
parser in current versions of libxml2, I found that recently:

http://bugzilla.gnome.org/show_bug.cgi?id=162613

but trying to force it can trigger infinite loops in libxml2, which is
what happens in http://bugs.php.net/?id=32001

So I think it's best to not force this mode. Future versions of libxml2
will set parser->charset to XML_CHAR_ENCODING_NONE by default with the
push parser and will hence work as desired with no explicit setting of
parser->charset required.

Is this patch OK?

http://www.apache.org/~jorton/php_xmlenc.diff

Index: ext/xml/compat.c

RCS file: /repository/php-src/ext/xml/compat.c,v
retrieving revision 1.32.2.7
diff -u -r1.32.2.7 compat.c
--- ext/xml/compat.c 17 Dec 2004 12:21:34 -0000 1.32.2.7
+++ ext/xml/compat.c 17 Feb 2005 11:12:08 -0000
@@ -379,8 +379,6 @@
}
if (encoding != NULL) {
parser->parser->encoding = xmlStrdup(encoding);

} else {
```
  parser->parser->charset = XML_CHAR_ENCODING_NONE;
```
}
parser->parser->replaceEntities = 1;
parser->parser->wellFormed = 0;
Index: ext/xml/tests/bug32001.phpt
===================================================================
RCS file: ext/xml/tests/bug32001.phpt
diff -N ext/xml/tests/bug32001.phpt
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ ext/xml/tests/bug32001.phpt 17 Feb 2005 11:12:08 -0000
@@ -0,0 +1,40 @@
+--TEST--
+Bug #32001 (infinite loop in libxml character encoding detection)
+--FILE--
+<?php
+$myparser = xml_parser_create('');
+$simple = "<para><note>simple note</note></para>";
+xml_parse_into_struct($myparser, $simple, $myvals, $mytags);
+var_dump($myvals);
+--EXPECT--
+array(3) {

[0]=>
array(3) {
["tag"]=>
string(4) "PARA"
["type"]=>
string(4) "open"
["level"]=>
int(1)
}
[1]=>
array(4) {
["tag"]=>
string(4) "NOTE"
["type"]=>
string(8) "complete"
["level"]=>
int(2)
["value"]=>
string(11) "simple note"
}
[2]=>
array(3) {
["tag"]=>
string(4) "PARA"
["type"]=>
string(5) "close"
["level"]=>
int(1)
}
+}

20 years ago by Christian Stocker — view source — reply

unread

libxml2's charset encoding auto-detection mode is broken with the push
parser in current versions of libxml2, I found that recently:

http://bugzilla.gnome.org/show_bug.cgi?id=162613

but trying to force it can trigger infinite loops in libxml2, which is
what happens in http://bugs.php.net/?id=32001

So I think it's best to not force this mode. Future versions of libxml2
will set parser->charset to XML_CHAR_ENCODING_NONE by default with the
push parser and will hence work as desired with no explicit setting of
parser->charset required.

Any BC breaks with that? Do I have to know now the encoding of the XML
document, before I can use the push parser? But reading your bugreport
at gnome.org, I assume it just defaults to UTF-8, right?

chregu

Is this patch OK?

http://www.apache.org/~jorton/php_xmlenc.diff

Index: ext/xml/compat.c

RCS file: /repository/php-src/ext/xml/compat.c,v
retrieving revision 1.32.2.7
diff -u -r1.32.2.7 compat.c
--- ext/xml/compat.c 17 Dec 2004 12:21:34 -0000 1.32.2.7
+++ ext/xml/compat.c 17 Feb 2005 11:12:08 -0000
@@ -379,8 +379,6 @@
}
if (encoding != NULL) {
parser->parser->encoding = xmlStrdup(encoding);
} else {
parser->parser->charset = XML_CHAR_ENCODING_NONE;
}
parser->parser->replaceEntities = 1;
parser->parser->wellFormed = 0;
Index: ext/xml/tests/bug32001.phpt
===================================================================
RCS file: ext/xml/tests/bug32001.phpt
diff -N ext/xml/tests/bug32001.phpt
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ ext/xml/tests/bug32001.phpt 17 Feb 2005 11:12:08 -0000
@@ -0,0 +1,40 @@
+--TEST--
+Bug #32001 (infinite loop in libxml character encoding detection)
+--FILE--
+<?php
+$myparser = xml_parser_create('');
+$simple = "<para><note>simple note</note></para>";
+xml_parse_into_struct($myparser, $simple, $myvals, $mytags);
+var_dump($myvals);
+--EXPECT--
+array(3) {
[0]=>

array(3) {

["tag"]=>

string(4) "PARA"

["type"]=>

string(4) "open"

["level"]=>

int(1)

}

[1]=>

array(4) {

["tag"]=>

string(4) "NOTE"

["type"]=>

string(8) "complete"

["level"]=>

int(2)

["value"]=>

string(11) "simple note"

}

[2]=>

array(3) {

["tag"]=>

string(4) "PARA"

["type"]=>

string(5) "close"

["level"]=>

int(1)

}
+}

20 years ago by Rob Richards — view source — reply

unread

It looks like there would be BC breaks unless libxml with the bug fix is
used as the encoding is detected properly and no infinite loop if an xml
declaration or BOM is used in the xml. So basically with the patch there
is no more autodetecting if used with any other libxml versions (though
no more possibilities of inifinte loops).

i.e.
$simple = "<?xml version='1.0'?><para><note>simple note</note></para>";
$simple = "\xEF\xBB\xBF<para><note>simple note</note></para>";

These both currently work without going into an infinite loop.

Rob

Christian Stocker wrote:

Any BC breaks with that? Do I have to know now the encoding of the XML
document, before I can use the push parser? But reading your bugreport
at gnome.org, I assume it just defaults to UTF-8, right?

20 years ago by Joe Orton — view source — reply

unread

It looks like there would be BC breaks unless libxml with the bug fix is
used as the encoding is detected properly and no infinite loop if an xml
declaration or BOM is used in the xml. So basically with the patch there
is no more autodetecting if used with any other libxml versions (though
no more possibilities of inifinte loops).

That's not quite right: detection based on an ASCII <?xml prolog with an
explicit encoding= still works fine with the patch applied (e.g. for
encoding=ISO-8859-1 documents). It's only documents which have a BOM
which will then fail to parse.

So it is a bit of a tricky trade-off...

20 years ago by Moriyoshi Koizumi — view source — reply

unread

So it is a bit of a tricky trade-off...

How about #ifdef'ifying it? It's lame though...

Moriyoshi

20 years ago by Rob Richards — view source — reply

unread

Right, as long as you explicity include the encoding.
So with the patch if you include a BOM, it wont parse (previous behavior
worked fine) and if you include a prolog without explicit encoding it
will always use UTF-8 (previous behavior was to autodetect encoding
based on the charset used for first few chars of the prolog).

Would it be possible to ifdef it then and for older libxml (only needed
when trying to use autoencoding) see if its possible to see if
xmlDetectEndocding can be used prior to sending off to parser and if it
returns no encoding then set the charset to utf-8 to avoid the infinite
loop? This would preserver BC for anyone using prolog with no explicit
encoding or BOM.

Rob

Joe Orton wrote:

It looks like there would be BC breaks unless libxml with the bug fix is
used as the encoding is detected properly and no infinite loop if an xml
declaration or BOM is used in the xml. So basically with the patch there
is no more autodetecting if used with any other libxml versions (though
no more possibilities of inifinte loops).

That's not quite right: detection based on an ASCII <?xml prolog with an
explicit encoding= still works fine with the patch applied (e.g. for
encoding=ISO-8859-1 documents). It's only documents which have a BOM
which will then fail to parse.

So it is a bit of a tricky trade-off...

20 years ago by Rob Richards — view source — reply

unread

What about this patch? Really hacky as the charset is checked before the
inital parse and it basically duplicates the libxml code with the
correct fix, but seems to work ok. Havent tried it with any large
datasets yet which require multiple parse calls, but it should work.

Rob

Joe Orton wrote:

That's not quite right: detection based on an ASCII <?xml prolog with an
explicit encoding= still works fine with the patch applied (e.g. for
encoding=ISO-8859-1 documents). It's only documents which have a BOM
which will then fail to parse.

So it is a bit of a tricky trade-off...

[PATCH] ext/xml/compat.c fix for #32001

Index: ext/xml/compat.c

Index: ext/xml/compat.c