Newsgroups: php.internals
Date: Sat, 7 Jul 2012 10:41:58 -0400
To: internals@lists.php.net
Subject: 
Run-tests.php JUnit format issue
From: ircmaxell@gmail.com (Anthony Ferrara)

Hey all,

I've run into an issue with the JUnit output format of run-tests.php. The XML it generates can be invalid because the output may contain invalid UTF-8 byte sequences and characters that are not allowed in XML. As a result, a consumer such as Jenkins fails to parse the file and produces a huge stack trace.

I've been digging into how to fix it, and I think I've come up with a solution. But I'm not too happy with it, so I'd like some feedback.

https://github.com/php/php-src/blob/master/run-tests.php#L2096

Right now, the diff for a failed test is just wrapped in CDATA tags and written unencoded into the result XML. For tests that exercise invalid UTF-8 bytes (or other character sets), that diff can contain bad byte sequences.

    $diff = empty($diff) ? '' : "<![CDATA[\n " . preg_replace('/\e/', '<esc>', $diff) . "\n]]>";

What I'm proposing is to escape all non-UTF-8 and non-XML-safe bytes with their value wrapped in <>. For example, chr(0xFF), which is invalid in UTF-8, would be escaped this way.

Now, implementing it is a bit more interesting. I've come up with a single regex that will do it:

    $diff = preg_replace_callback(
        '/(
            [\x0-\x8]                               # Control Characters
            | [\xB-\xC]                             # Invalid XML Characters
            | [\xE-\x19]                            # Invalid XML Characters
            | [\xF8-\xFF]                           # Invalid UTF-8 Bytes
            | [\xC0-\xDF](?![\x80-\xBF])            # Invalid UTF-8 Sequence Start
            | [\xE0-\xEF](?![\x80-\xBF]{2})         # Invalid UTF-8 Sequence Start
            | [\xF0-\xF7](?![\x80-\xBF]{3})         # Invalid UTF-8 Sequence Start
            | (?<=[\x0-\x7F\xF8-\xFF])[\x80-\xBF]   # Invalid UTF-8 Sequence Middle
            | (?', ord($match[1]));
        },
        $diff
    );

But given its size and complexity, I'm hesitant to go with it. What do you think?

Anthony
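
For comparison, here is a minimal sketch of the same idea written the other way around: it lists the byte sequences that are both well-formed UTF-8 and allowed by XML 1.0, keeps those, and escapes every other byte individually. This is not the code from the message above. The function name escape_invalid_bytes is hypothetical, and the <xNN> escape format is an assumption, since the message only says the value is wrapped in <>.

```php
<?php
// Hypothetical helper, not part of run-tests.php.
// Keeps sequences that are valid UTF-8 and XML 1.0 characters,
// and replaces every other byte with an assumed <xNN> marker.
function escape_invalid_bytes(string $diff): string
{
    $pattern = '/
        (                                      # group 1: sequences to keep
            [\x09\x0A\x0D\x20-\x7F]            # XML-safe ASCII
          | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
          | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
          | [\xE1-\xEC\xEE][\x80-\xBF]{2}      # 3-byte
          | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
          | \xEF[\x80-\xBE][\x80-\xBF]         # 3-byte
          | \xEF\xBF[\x80-\xBD]                # 3-byte, excluding U+FFFE and U+FFFF
          | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
          | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
          | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
        )
      | (.)                                    # group 2: any other single byte
    /xs';

    $result = preg_replace_callback(
        $pattern,
        static function (array $m): string {
            // Group 2 is non-empty only when the byte was not part of a kept sequence.
            if (($m[2] ?? '') !== '') {
                return sprintf('<x%02X>', ord($m[2]));
            }
            return $m[0];
        },
        $diff
    );

    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed');
    }
    return $result;
}

echo escape_invalid_bytes("ok\xFF"), "\n"; // prints ok<xFF>
```

Because the pattern is an allow-list, the control bytes 0x0E through 0x1F and stray continuation bytes fall through to the second group without needing lookarounds. The trade-off is that every valid character also goes through the callback, which may be slower on large diffs.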