Run-tests.php JUnit format issue

13 years ago by Anthony Ferrara — view source

unread

Hey all,

I've run into an issue with run-tests.php with the junit format. The XML
that it generates can be invalid because of invalid UTF-8 characters and
invalid XML characters. This means that trying to parse it using something
like Jenkins gives a huge stack-trace because of invalid XML. I've been
digging through how to fix it, and I think I've come up with a solution.
But I'm not too happy with it, so I'd like some feedback.

https://github.com/php/php-src/blob/master/run-tests.php#L2096

Right now, the diff for a failed test is just injected in cdata tags, and
stuck unencoded in the result XML. For tests that are testing invalid UTF-8
bytes (or other character sets), that diff can contain bad byte sequences.

$diff = empty($diff) ? '' : "<![CDATA[\n " . preg_replace('/\e/',

'<esc>', $diff) . "\n]]>";

What I'm proposing is to escape all non-UTF8 and non-XML safe bytes with
their value wrapped by <>. So chr(0xFF) (which is invalid in UTF8) would
become <xFF>

Now, to implement it is a bit more interesting. I've come up with a single
regex that will do it:

    $diff = preg_replace_callback(

'/(
[\x0-\x8] # Control Characters
| [\xB-\xC] # Invalid XML Characters
| [\xE-\x19] # Invalid XML Characters
| [\xF8-\xFF] # Invalid UTF-8 Bytes
| \xC0-\xDF # Invalid UTF-8 Sequence Start
| \xE0-\xEF # Invalid UTF-8 Sequence Start
| \xF0-\xF7 # Invalid UTF-8 Sequence Start
| (?<=[\x0-\x7F\xF8-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<!
[\xC0-\xDF] # Not Byte 2 of 2 Byte Sequence
| [\xE0-\xEF] # Not Byte 2 of 3 Byte Sequence
| [\xE0-\xEF][\x80-\xBF] # Not Byte 3 of 3 Byte Sequence
| [\xF0-\xF7] # Not Byte 2 of 4 Byte Sequence
| [\xF0-\xF7][\x80-\xBF] # Not Byte 3 of 4 Byte Sequence
| [\xF0-\xF7][\x80-\xBF]{2} # Not Byte 4 of 4 Byte Sequence
)[\x80-\xBF] #
Overlong Sequence
| (?<=[\xE0-\xEF])\x80-\xBF # Short 3 byte
sequence
| (?<=[\xF0-\xF7])\x80-\xBF # Short 4 byte
sequence
| (?<=[\xF0-\xF7][\x80-\xBF])\x80-\xBF # Short 4 byte
sequence

)/x',
function($match) { return sprintf('<x%02x>', ord($match[1])); },
$diff
);

But given the size and complexity of it, I'm hesitant to go with it.

What do you think?

Anthony

12 years ago by Ferenc Kovacs — view source

unread

Hey all,

I've run into an issue with run-tests.php with the junit format. The XML
that it generates can be invalid because of invalid UTF-8 characters and
invalid XML characters. This means that trying to parse it using something
like Jenkins gives a huge stack-trace because of invalid XML. I've been
digging through how to fix it, and I think I've come up with a solution.
But I'm not too happy with it, so I'd like some feedback.

https://github.com/php/php-src/blob/master/run-tests.php#L2096

Right now, the diff for a failed test is just injected in cdata tags, and
stuck unencoded in the result XML. For tests that are testing invalid UTF-8
bytes (or other character sets), that diff can contain bad byte sequences.
    $diff = empty($diff) ? '' : "<![CDATA[\n " . preg_replace('/\e/',
'<esc>', $diff) . "\n]]>";

What I'm proposing is to escape all non-UTF8 and non-XML safe bytes with
their value wrapped by <>. So chr(0xFF) (which is invalid in UTF8) would
become <xFF>

Now, to implement it is a bit more interesting. I've come up with a single
regex that will do it:
    $diff = preg_replace_callback(
'/(
[\x0-\x8] # Control Characters
| [\xB-\xC] # Invalid XML Characters
| [\xE-\x19] # Invalid XML Characters
| [\xF8-\xFF] # Invalid UTF-8 Bytes
| \xC0-\xDF # Invalid UTF-8 Sequence Start
| \xE0-\xEF # Invalid UTF-8 Sequence Start
| \xF0-\xF7 # Invalid UTF-8 Sequence Start
| (?<=[\x0-\x7F\xF8-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<!
[\xC0-\xDF] # Not Byte 2 of 2 Byte Sequence
| [\xE0-\xEF] # Not Byte 2 of 3 Byte Sequence
| [\xE0-\xEF][\x80-\xBF] # Not Byte 3 of 3 Byte Sequence
| [\xF0-\xF7] # Not Byte 2 of 4 Byte Sequence
| [\xF0-\xF7][\x80-\xBF] # Not Byte 3 of 4 Byte Sequence
| [\xF0-\xF7][\x80-\xBF]{2} # Not Byte 4 of 4 Byte Sequence
)[\x80-\xBF] #
Overlong Sequence
| (?<=[\xE0-\xEF])\x80-\xBF # Short 3 byte
sequence
| (?<=[\xF0-\xF7])\x80-\xBF # Short 4 byte
sequence
| (?<=[\xF0-\xF7][\x80-\xBF])\x80-\xBF # Short 4 byte
sequence

)/x',
function($match) { return sprintf('<x%02x>', ord($match[1]));
},
$diff
);

But given the size and complexity of it, I'm hesitant to go with it.

What do you think?

Anthony

we bumped into a similar problem a while back, Felipe and myself tried to
come up with an elegant solution, but failed to do so, so we ended up doing
a replace as well (
https://github.com/php/php-src/commit/78742a33a7a3c43b71d20e4b06665694c89b4c11
)
as far as I know, there is no better solution, that convert those
characters outside of the valid xml chars into some kind of textual
representation.
(and your suggestion is similar to what for example the subversion devs
implemented for their own suite:
dev@subversion.apache.org/msg00397.html" rel="nofollow" target="_blank">http://www.mail-archive.com/dev@subversion.apache.org/msg00397.html)
so I think it is ok.

ps: there are a few control characters which are allowed by the 1.0 xml
spec, see http://www.w3.org/TR/2006/REC-xml-20060816/#charsets
ps2: I think there would be still one hiccup in the code: afair we didn't
handled the case when a CDATA closure happens to be in the test output,
that could be handled by
http://www.lshift.net/blog/2007/10/25/xml-cdata-and-escaping I guess.

--
Ferenc Kovács
@Tyr43l - http://tyrael.hu

12 years ago by Gustavo Lopes — view source

unread

Em Sat, 28 Jul 2012 01:40:33 +0200, Ferenc Kovacs tyra3l@gmail.com
escreveu:

ps2: I think there would be still one hiccup in the code: afair we didn't
handled the case when a CDATA closure happens to be in the test output,
that could be handled by
http://www.lshift.net/blog/2007/10/25/xml-cdata-and-escaping I guess.

This same solution (temporarily leaving CDATA) can be applied to encoding
those special characters. See:

http://ideone.com/csB6b

If you turned say, \xF8 to ø, that would mean you be considering \xF8
to be representing U+00F8, which is actually represented with a different
byte sequence in UTF-8. Of course, the underlying problem is that many
tests are not encoded in UTF-8.

I think Anthony's solution is fine.

--
Gustavo Lopes