Newsgroups: php.internals
Message-ID: <48ADF86F.2070106@rowe-clan.net>
Date: Thu, 21 Aug 2008 18:21:19 -0500
To: David Zülke
CC: Rasmus Lerdorf, Stanislav Malyshev, 'PHP Internals'
In-Reply-To: <158D158E-8A72-4DE2-81D1-49A01BC948B2@bitextender.com>
Subject: Re: [PHP-DEV] bug #43941
From: wrowe@rowe-clan.net ("William A. Rowe, Jr.")

David Zülke wrote:
> On 21.08.2008 at 18:50, Rasmus Lerdorf wrote:
>
>> David Zülke wrote:
>>> On 21.08.2008 at 18:41, Rasmus Lerdorf wrote:
>>>
>>>> David Zülke wrote:
>>>>> On 21.08.2008 at 18:08, Rasmus Lerdorf wrote:
>>>>>
>>>>>> David Zülke wrote:
>>>>>>> On 21.08.2008 at 03:34, William A. Rowe, Jr. wrote:
>>>>>>>
>>>>>>>> Stanislav Malyshev wrote:
>>>>>>>>> Hi!
>>>>>>>>> Are there any objections to incorporating bugfix for #43941 (fix for
>>>>>>>>> how json handles invalid UTF-8 sequences) into 5.2? I had some
>>>>>>>>> requests about it, right now it's only in 5.3+.
>>>>>>>>
>>>>>>>> Is there the alternative of substituting an unmappable character FFFD
>>>>>>>> in place of the invalid sequence? This is a reasonable alternative
>>>>>>>> behavior for some less stringent cases.
>>>>>>>>
>>>>>>>> (Yes, the fix is better than the status quo, but just taking this a
>>>>>>>> step further).
>>>>>>>
>>>>>>> I agree, that would be quite reasonable and also more consistent with
>>>>>>> how UTF-8 works in other apps (browsers etc).
>>>>>>
>>>>>> Well, using browsers as the benchmark here is a bad idea. IE is
>>>>>> absolutely braindead about dealing with illegal UTF-8 chars. It will
>>>>>> accept just about any sequence of bytes as a valid UTF-8 char which
>>>>>> causes all sorts of problems.
>>>>>
>>>>> I was talking about the common representation of an invalid sequence.
>>>>> That's the question mark sign you usually see in a browser when the
>>>>> encoding is incorrect.
>>>>
>>>> Yes, but it all comes down to how you do it. Say you have a 3 byte
>>>> sequence that starts with 0xE0 (E0 indicates the start of a 3-byte
>>>> utf-8 char) but the 3 bytes together don't actually make up a valid
>>>> utf-8 char. If you substitute those 3 bytes with a ? or some other
>>>> character you have just created a nasty XSS vector for web apps.
>>>
>>> You don't substitute it with "a ? or some other character", you replace
>>> it with U+FFFD (0xEF 0xBF 0xBD in UTF-8). I'd love to hear how that
>>> causes an attack vector.
>>
>> It doesn't matter what you replace it with. If the byte sequence is:
>>
>> 0xE0 " >
>>
>> And you replace those bytes with some other byte in this sort of context:
>>
>>
>>
>>
>> Now do your silly replacement:
>>
>>
>>
>> That now means that IE interprets the value attribute of the foo
>> element as: value="0xEF 0xBF 0xBD
>
> should never be regarded as a valid sequence since neither " nor > are in
> the range above 0x7F...

This is (obviously) open to multiple interpretations. But when I suggested
the feature, I said it was for "less stringent apps". Rasmus' case, the URL,
should be more stringent and reject input that contains invalid UTF-8
sequences: short sequences, overlong sequences and outright unmappable bytes.
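
To make the two positions concrete, here is a minimal sketch in Python (chosen only for illustration; this is not the json extension's code, and the function names decode_naive, decode_lenient and decode_strict are hypothetical). It assumes the risky substitution Rasmus describes is one that trusts the lead byte's declared length, and contrasts it with a substitution that replaces only the offending byte and with outright rejection.

```python
from typing import Optional

REPLACEMENT = "\ufffd"  # U+FFFD, encoded as 0xEF 0xBF 0xBD in UTF-8


def decode_naive(data: bytes) -> str:
    """Hypothetical risky decoder: trusts the lead byte's declared length
    and replaces the whole span, swallowing any ASCII bytes inside it."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >> 5 == 0b110:
            n = 2
        elif b >> 4 == 0b1110:
            n = 3
        elif b >> 3 == 0b11110:
            n = 4
        else:
            n = 1  # ASCII, or a stray byte that cannot start a sequence
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i += n
    return "".join(out)


def decode_lenient(data: bytes) -> str:
    """Less stringent case: only the invalid bytes become U+FFFD;
    bytes below 0x80 are never consumed as continuation bytes."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data: bytes) -> Optional[str]:
    """Stringent case: reject short, overlong and unmappable input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    payload = b'\xe0">x'  # 0xE0 followed by the ASCII bytes " and >
    print(repr(decode_naive(payload)))    # quote and > are swallowed
    print(repr(decode_lenient(payload)))  # quote and > survive
    print(decode_strict(payload))         # rejected
```

With the naive version the closing quote disappears along with the bad lead byte, which is the situation Rasmus objects to; with the lenient version the quote and the > are left where they were, which is David's point that bytes below 0x80 can never be part of a multi-byte sequence. Which of the lenient and strict behaviours is appropriate depends on how stringent the consumer needs to be.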