Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:40064
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: error (pb1.pair.com: domain rowe-clan.net from 64.202.165.33 cause and error)
Message-ID: <48ADF86F.2070106@rowe-clan.net>
Date: Thu, 21 Aug 2008 18:21:19 -0500
User-Agent: Thunderbird 2.0.0.16 (X11/20080723)
MIME-Version: 1.0
To: =?UTF-8?B?RGF2aWQgWsO8bGtl?= <david.zuelke@bitextender.com>
CC: Rasmus Lerdorf <rasmus@lerdorf.com>, 
 Stanislav Malyshev <stas@zend.com>,
 'PHP Internals' <internals@lists.php.net>
References: <48ACC389.2030801@zend.com> <48ACC638.1030904@rowe-clan.net> <7C51580F-C656-47D9-9269-CA140AA9EBC2@bitextender.com> <48AD9312.9050903@lerdorf.com> <1D87B84E-1502-4BBA-8CDB-0A9E73A8196F@bitextender.com> <48AD9AA6.9040805@lerdorf.com> <D33CCE4A-E4B1-493A-8479-7659898F005D@bitextender.com> <48AD9CCC.8070208@lerdorf.com> <158D158E-8A72-4DE2-81D1-49A01BC948B2@bitextender.com>
In-Reply-To: <158D158E-8A72-4DE2-81D1-49A01BC948B2@bitextender.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [PHP-DEV] bug #43941
From: wrowe@rowe-clan.net ("William A. Rowe, Jr.")

David Zülke wrote:
> Am 21.08.2008 um 18:50 schrieb Rasmus Lerdorf:
> 
>> David Zülke wrote:
>>> Am 21.08.2008 um 18:41 schrieb Rasmus Lerdorf:
>>>
>>>> David Zülke wrote:
>>>>> Am 21.08.2008 um 18:08 schrieb Rasmus Lerdorf:
>>>>>
>>>>>> David Zülke wrote:
>>>>>>> Am 21.08.2008 um 03:34 schrieb William A. Rowe, Jr.:
>>>>>>>
>>>>>>>> Stanislav Malyshev wrote:
>>>>>>>>> Hi!
>>>>>>>>> Are there any objections to incorporating bugfix for #43941
>>>>>>>>> (fix for how json handles invalid UTF-8 sequences) into 5.2? I
>>>>>>>>> had some requests about it, right now it's only in 5.3+.
>>>>>>>>
>>>>>>>> Is there the alternative of substituting an unmappable character
>>>>>>>> FFFD in place of the invalid sequence? This is a reasonable
>>>>>>>> alternative behavior for some less stringent cases.
>>>>>>>>
>>>>>>>> (Yes, the fix is better than the status quo, but just taking this
>>>>>>>> a step further).
>>>>>>>
>>>>>>> I agree, that would be quite reasonable and also more consistent
>>>>>>> with how UTF-8 works in other apps (browsers etc).
>>>>>>
>>>>>> Well, using browsers as the benchmark here is a bad idea. IE is
>>>>>> absolutely braindead about dealing with illegal UTF-8 chars. It will
>>>>>> accept just about any sequence of bytes as a valid UTF-8 char which
>>>>>> causes all sorts of problems.
>>>>>
>>>>> I was talking about the common representation of an invalid sequence.
>>>>> That's the question mark sign you usually see in a browser when the
>>>>> encoding is incorrect.
>>>>
>>>> Yes, but it all comes down to how you do it. Say you have a 3 byte
>>>> sequence that starts with 0xE0 (E0 indicates the start of a 3-byte
>>>> utf-8 char) but the 3 bytes together don't actually make up a valid
>>>> utf-8 char. If you substitute those 3 bytes with a ? or some other
>>>> character you have just created a nasty XSS vector for web apps.
>>>
>>> You don't substitute it with "a ? or some other character", you replace
>>> it with U+FFFD (0xEF 0xBF 0xBD in UTF-8). I'd love to hear how that
>>> causes an attack vector.
>>
>> It doesn't matter what you replace it with.  If the byte sequence is:
>>
>> 0xE0 " >
>>
>> And you replace those bytes with some other byte in this sort of context:
>>
>> <input type=text name=foo value="0xE0">
>> <input type=text name=bar value="$data">
>>
>> Now do your silly replacement:
>>
>> <input type=text name=foo value="0xEF 0xBF 0xBD
>> <input type=text name=bar value="$data">
>>
>> That now means that IE interprets the value attribute of the foo 
>> element as: value="0xEF 0xBF 0xBD <input type=text name=bar value="
>> And now $data is suddenly outside the quoted value attribute!  Oops! 
>> Major XSS.  Google Groups and Yahoo were both hit by this last year.
> 
> Interesting. I assume that was a weakness in the respective 
> implementation, right? Since
> 
> 0xE0 " >
> 
> should never be regarded as a valid sequence since neither " nor > are in
> the range above 0x7F...

This is (obviously) open to multiple interpretations.

But when I suggested the feature, I specified it for "less stringent apps".
Rasmus' case, the URL, should be more stringent and reject input that
contains any invalid UTF-8 sequence: short sequences, overlong sequences,
and outright unmappable bytes.
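
To make the two policies concrete, here is a minimal sketch in Python
(not PHP, and not the json extension's actual code). The function names
decode_strict, decode_lenient and is_valid_utf8 are hypothetical; the
only real API used is bytes.decode() with its "strict" and "replace"
error handlers.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, encoded in UTF-8 as EF BF BD


def decode_strict(raw: bytes) -> str:
    """Stringent policy: refuse any input that is not well-formed UTF-8."""
    # Raises UnicodeDecodeError for truncated (short) sequences,
    # overlong encodings, and bytes that can never appear in UTF-8.
    return raw.decode("utf-8", errors="strict")


def decode_lenient(raw: bytes) -> str:
    """Less stringent policy: substitute U+FFFD for invalid sequences."""
    # Python's decoder replaces only the offending bytes; a valid ASCII
    # byte that follows a stray lead byte is decoded normally.
    return raw.decode("utf-8", errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if raw would pass the stringent policy."""
    try:
        decode_strict(raw)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = b'\xe0">'  # lead byte of a 3-byte sequence, then " and >
    print(is_valid_utf8(sample))          # False
    print(repr(decode_lenient(sample)))   # '\ufffd">'
```

Note that in the lenient case only the 0xE0 byte is replaced; the " and >
that follow it survive as themselves, which is the property that matters
in the HTML attribute example quoted above.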