Newsgroups: php.internals
From: rasmus@lerdorf.com (Rasmus Lerdorf)
To: PHP internals
Subject: Default input encoding for htmlspecialchars/htmlentities
Date: Thu, 23 Aug 2012 09:06:54 -0700
Message-ID: <5036551E.1030804@lerdorf.com>

htmlspecialchars(), htmlentities(), html_entity_decode() and get_html_translation_table() all take an encoding parameter that used to default to ISO-8859-1. We changed the default in PHP 5.4 to UTF-8. This is a much more sensible default and, in the case of the encoding functions, more secure, as it prevents invalid UTF-8 from getting through. If you use 8859-1 as the default but your app is actually in UTF-8 or, worse, some encoding that isn't low-ASCII compatible, then htmlspecialchars()/htmlentities() aren't doing what you think they are and you have a glaring security hole in your app.

However, people are understandably lazy and don't want to think about this stuff. They don't want to explicitly provide their input encoding to these calls. We provided a solution to this, and a way to write portable apps: pass in an empty string "" as the encoding. If we saw this, we would set the input encoding to match the output encoding specified by the "default_charset" ini setting.

We couldn't just default to this default_charset because input and output encodings may very well be different and we would risk making existing apps insecure.
For example, an app using BIG5/CJK for its output encoding might very well be pulling data from 8859/UTF-8 data sources, and if we invisibly switched htmlspecialchars/entities to match their output encoding we would have problems. Invisibly switching them from 8859-1 to UTF-8 could still be problematic, but at least it fails safe in that it doesn't let invalid UTF-8 through and encodes low-ASCII the same way it did before.

The problem is that there is a lot of legacy code out there that doesn't explicitly set the encoding on those calls, and it is a lot of work to go through and specify it on each call. I still personally prefer to have people be explicit here, but I think it is slowing 5.4 adoption (see bug 61354).

In PHP 6 we tried to introduce separate input, script and output encoding settings. Currently in 5.4 we don't have that, but we have those 3 separately for mbstring and for iconv:

    iconv.input_encoding
    iconv.internal_encoding
    iconv.output_encoding

    mbstring.http_input
    mbstring.internal_encoding
    mbstring.http_output

Ideally we should be getting rid of the per-feature encoding settings and have a single set of them that we refer to when we need them. This is one of those places where we really need a default input encoding setting. We could have it check mbstring.http_input, but there is a wrinkle here: it has a fancy "auto" setting which we don't really want in this case. So we could set it to iconv.input_encoding, but that seems rather random and unintuitive.

So do we create a new default_input_encoding ini directive mid-stream in 5.4 for this? Of course, with the longer term in mind that this will be part of a unified set of encoding settings in 5.5 and beyond.

-Rasmus
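
For reference, here is a minimal sketch of the fail-safe behaviour described above. escape_html() is a hypothetical helper, not part of PHP: it always passes an encoding to htmlspecialchars() and, when the caller gives none, falls back to the default_charset ini setting, which is roughly what the "" convention was meant to do. The flags are spelled out because the default flags differ between PHP versions.

```php
<?php
// Hypothetical helper (not part of PHP): always passes an explicit
// input encoding instead of relying on the version-dependent default.
function escape_html($s, $encoding = null)
{
    if ($encoding === null) {
        // Assumption: fall back to default_charset, or UTF-8 if it is unset.
        $encoding = ini_get('default_charset') ?: 'UTF-8';
    }
    // No ENT_IGNORE / ENT_SUBSTITUTE: invalid input yields an empty string.
    return htmlspecialchars($s, ENT_QUOTES | ENT_HTML401, $encoding);
}

// 0xC0 can never start a valid UTF-8 sequence.
$bad = "\xC0<script>";

$asUtf8   = escape_html($bad, 'UTF-8');       // rejected as invalid UTF-8
$asLatin1 = escape_html($bad, 'ISO-8859-1');  // every byte is valid here
```

With 'UTF-8' the invalid byte makes the whole call return an empty string; with 'ISO-8859-1' the byte passes through untouched and only the markup characters are escaped.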