Newsgroups: php.internals
Xref: news.php.net php.internals:61847
Date: Sat, 28 Jul 2012 01:40:33 +0200
Subject: Re: [PHP-DEV] Run-tests.php JUnit format issue
From: tyra3l@gmail.com (Ferenc Kovacs)
To: Anthony Ferrara
Cc: internals@lists.php.net

On Sat, Jul 7, 2012 at 4:41 PM, Anthony Ferrara wrote:

> Hey all,
>
> I've run into an issue with run-tests.php with the junit format. The XML
> that it generates can be invalid because of invalid UTF-8 characters and
> invalid XML characters. This means that trying to parse it using something
> like Jenkins gives a huge stack-trace because of invalid XML. I've been
> digging through how to fix it, and I think I've come up with a solution.
> But I'm not too happy with it, so I'd like some feedback.
>
> https://github.com/php/php-src/blob/master/run-tests.php#L2096
>
> Right now, the diff for a failed test is just injected in cdata tags, and
> stuck unencoded in the result XML. For tests that are testing invalid UTF-8
> bytes (or other character sets), that diff can contain bad byte sequences.
>
> $diff = empty($diff) ? '' : " '', $diff) . "\n]]>";
>
> What I'm proposing is to escape all non-UTF8 and non-XML safe bytes with
> their value wrapped by <>. So chr(0xFF) (which is invalid in UTF8) would
> become
>
> Now, to implement it is a bit more interesting. I've come up with a single
> regex that will do it:
>
> $diff = preg_replace_callback(
>     '/(
>         [\x0-\x8]                      # Control Characters
>       | [\xB-\xC]                      # Invalid XML Characters
>       | [\xE-\x19]                     # Invalid XML Characters
>       | [\xF8-\xFF]                    # Invalid UTF-8 Bytes
>       | [\xC0-\xDF](?![\x80-\xBF])     # Invalid UTF-8 Sequence Start
>       | [\xE0-\xEF](?![\x80-\xBF]{2})  # Invalid UTF-8 Sequence Start
>       | [\xF0-\xF7](?![\x80-\xBF]{3})  # Invalid UTF-8 Sequence Start
>       | (?<=[\x0-\x7F\xF8-\xFF])[\x80-\xBF]  # Invalid UTF-8 Sequence Middle
>       | (?<!
>             [\xC0-\xDF]                # Not Byte 2 of 2 Byte Sequence
>           | [\xE0-\xEF]                # Not Byte 2 of 3 Byte Sequence
>           | [\xE0-\xEF][\x80-\xBF]     # Not Byte 3 of 3 Byte Sequence
>           | [\xF0-\xF7]                # Not Byte 2 of 4 Byte Sequence
>           | [\xF0-\xF7][\x80-\xBF]     # Not Byte 3 of 4 Byte Sequence
>           | [\xF0-\xF7][\x80-\xBF]{2}  # Not Byte 4 of 4 Byte Sequence
>         )[\x80-\xBF]                   # Overlong Sequence
>       | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])     # Short 3 byte sequence
>       | (?<=[\xF0-\xF7])[\x80-\xBF](?![\x80-\xBF]{2})  # Short 4 byte sequence
>       | (?<=[\xF0-\xF7][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF])  # Short 4 byte sequence
>     )/x',
>     function($match) { return sprintf('', ord($match[1])); },
>     $diff
> );
>
> But given the size and complexity of it, I'm hesitant to go with it.
>
> What do you think?
>
> Anthony
>

We bumped into a similar problem a while back. Felipe and I tried to come
up with an elegant solution but failed, so we ended up doing a replace as
well
(https://github.com/php/php-src/commit/78742a33a7a3c43b71d20e4b06665694c89b4c11).

As far as I know, there is no better solution than converting the
characters outside the valid XML range into some kind of textual
representation. Your suggestion is also similar to what, for example, the
Subversion devs implemented for their own suite:
http://www.mail-archive.com/dev@subversion.apache.org/msg00397.html
So I think it is OK.

ps: a few control characters are allowed by the XML 1.0 spec, see
http://www.w3.org/TR/2006/REC-xml-20060816/#charsets

ps2: I think there would still be one hiccup in the code: AFAIR we didn't
handle the case where a CDATA closure (]]>) happens to be in the test
output. I guess that could be handled as described in
http://www.lshift.net/blog/2007/10/25/xml-cdata-and-escaping

-- 
Ferenc Kovács
@Tyr43l - http://tyrael.hu
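
The two postscripts above can be illustrated with a minimal sketch. It is not the code in run-tests.php: the function names and the <xNN> output format are assumptions made for illustration. The first helper replaces only the control bytes that XML 1.0 forbids, leaving tab, LF and CR alone. The second wraps text in CDATA and splits any embedded "]]>" across two CDATA sections. Invalid UTF-8 sequences are not handled here.

```php
<?php
// Hypothetical helpers for illustration; neither name exists in run-tests.php.

/**
 * Replace control bytes that XML 1.0 does not allow in character data
 * with a textual form. Tab (0x09), LF (0x0A) and CR (0x0D) are kept,
 * since the spec permits them. The <xNN> format is an assumption.
 */
function junit_escape_control_chars(string $text): string
{
    return preg_replace_callback(
        '/[\x00-\x08\x0B\x0C\x0E-\x1F]/',
        function (array $m): string {
            // $m[0] is the single offending byte.
            return sprintf('<x%02X>', ord($m[0]));
        },
        $text
    );
}

/**
 * Wrap text in a CDATA section. Any "]]>" inside the text is split so
 * that "]]" ends one section and ">" starts the next one.
 */
function junit_cdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Usage: escape first, then wrap.
$diff = "expected\x1Bactual ]]> tail";
echo junit_cdata(junit_escape_control_chars($diff)), "\n";
```

Splitting the terminator keeps the original text intact once a parser joins the adjacent CDATA sections, so nothing has to be un-escaped when reading the report.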