Newsgroups: php.internals
Message-ID: <503655C8.9070406@ajf.me>
In-Reply-To: <5036551E.1030804@lerdorf.com>
Date: Thu, 23 Aug 2012 17:09:44 +0100
From: ajf@ajf.me (Andrew Faulds)
To: Rasmus Lerdorf
CC: PHP internals
Subject: Re: [PHP-DEV] Default input encoding for htmlspecialchars/htmlentities

On 23/08/12 17:06, Rasmus Lerdorf wrote:
> htmlspecialchars(), htmlentities(), html_entity_decode() and
> get_html_translation_table() all take an encoding parameter that used to
> default to iso-8859-1. We changed the default in PHP 5.4 to UTF-8.
> This is a much more sensible default and, in the case of the encoding
> functions, more secure as it prevents invalid UTF-8 from getting through.
> If you use 8859-1 as the default but your app is actually in UTF-8 or,
> worse, some encoding that isn't low-ascii compatible, then
> htmlspecialchars()/htmlentities() aren't doing what you think they are
> and you have a glaring security hole in your app.
>
> However, people are understandably lazy and don't want to think about
> this stuff. They don't want to explicitly provide their input encoding
> to these calls. We provided a solution to this and a way to write
> portable apps, and that was to pass in an empty string "" as the
> encoding. If we saw this we would set the input encoding to match the
> output encoding specified by the "default_charset" ini setting. We
> couldn't just default to this default_charset because input and output
> encodings may very well be different and we would risk making existing
> apps insecure. For example, an app using BIG5/CJK for its output encoding
> might very well be pulling data from 8859/UTF-8 data sources, and if we
> invisibly switched htmlspecialchars/entities to match their output
> encoding we would have problems. Invisibly switching them from 8859-1 to
> UTF-8 could still be problematic, but at least it fails safe in that
> it doesn't let invalid UTF-8 through and encodes low-ascii the same way
> it did before.
>
> The problem is that there is a lot of legacy code out there that doesn't
> explicitly set the encoding on those calls and it is a lot of work to go
> through and specify it on each call. I still personally prefer to have
> people be explicit here, but I think it is slowing 5.4 adoption (see bug
> 61354).
>
> In PHP 6 we tried to introduce separate input, script and output
> encoding settings.
> Currently in 5.4 we don't have that, but we have
> those 3 separately for mbstring and for iconv:
>
> iconv.input_encoding
> iconv.internal_encoding
> iconv.output_encoding
> mbstring.http_input
> mbstring.internal_encoding
> mbstring.http_output
>
> Ideally we should be getting rid of the per-feature encoding settings
> and have a single set of them that we refer to when we need them. This
> is one of these places where we really need a default input encoding
> setting. We could have it check mbstring.http_input, but there is a
> wrinkle here that it has a fancy "auto" setting which we don't really
> want in this case. So we could set it to iconv.input_encoding, but that
> seems rather random and unintuitive.
>
> So do we create a new default_input_encoding ini directive mid-stream in
> 5.4 for this? Of course with the longer-term in mind that this will be
> part of a unified set of encoding settings in 5.5 and beyond.
>
> -Rasmus
>

Personally, I think you should have just two encodings: page_encoding
and internal_encoding. The former is for form input and page output
(could be latin-1, for instance), and internal_encoding is the internal
representation (default to utf-8 - you can deal with all of, say,
latin-1, as well as unicode entities). Input and output, on the web at
least, are almost always going to match.

-- 
Andrew Faulds
http://ajf.me/
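
To make the "fails safe" point from the quoted message concrete, here is a minimal sketch of what being explicit about the input encoding looks like. It assumes a current PHP (7.1 or later, for the nullable type); escape_html() is a hypothetical helper invented for this illustration, not an existing PHP function, and its fallback to the default_charset ini setting is only one possible policy, not what the thread settled on.

```php
<?php
// Hypothetical helper (not part of PHP): escape with an explicit input
// encoding, falling back to the default_charset ini setting when the
// caller does not supply one.
function escape_html(string $text, ?string $encoding = null): string
{
    if ($encoding === null || $encoding === '') {
        // Assumption for this sketch: use UTF-8 if default_charset is unset.
        $encoding = ini_get('default_charset') ?: 'UTF-8';
    }
    // Flags are passed explicitly, without ENT_SUBSTITUTE or ENT_IGNORE, so
    // input that is invalid in a multibyte encoding yields an empty string.
    return htmlspecialchars($text, ENT_QUOTES, $encoding);
}

$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1; not valid UTF-8

var_dump(escape_html('<b>', 'UTF-8') === '&lt;b&gt;');      // bool(true)
var_dump(escape_html($latin1, 'UTF-8') === '');             // bool(true)
var_dump(escape_html($latin1, 'ISO-8859-1') === $latin1);   // bool(true)
```

The second call shows the fail-safe case: Latin-1 bytes treated as UTF-8 are rejected rather than passed through. The third shows that the same bytes go through untouched once the encoding is declared as ISO-8859-1.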