From: rasmus@lerdorf.com (Rasmus Lerdorf)
Date: Sun, 11 Mar 2012 23:49:27 -0700
To: PHP internals
Subject: default charset confusion

I caused this situation myself by not explicitly differentiating between the default charset for the internal htmlspecialchars() and htmlentities() functions and the output charset ini directive, default_charset. The idea behind default_charset was for it to act as the charset that gets specified in the HTTP Content-type header if you do not explicitly send your own Content-type header with the header() function.

This has been muddied a bit by the fact that htmlspecialchars()/htmlentities() can take it into account when choosing which encoding to use for the data passed to them. This isn't done by default, since it actually makes little sense. It is only done if you pass an empty string as the encoding argument. If you don't pass an encoding at all, the default is UTF-8 in 5.4. In 5.3 it was ISO-8859-1.

And here is where the confusion comes in. We, myself included, have told people that they can get the 5.3 behaviour back by setting the default_charset ini directive to iso-8859-1. But this is only true if they are forcing htmlspecialchars()/htmlentities() to check that setting by passing an empty string as the encoding arg. Most apps just do htmlspecialchars($str) and nothing else.

Plus, it is really not a good idea to tie the internal encoding of data being passed to these functions to the output charset.
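To make the three call forms concrete, here is a minimal sketch. The helper name h() and the sample string are hypothetical, chosen only for illustration; the comments on the second and third calls restate the 5.3/5.4 behaviour described in this message rather than asserting an output.

```php
<?php
// "\xE9" is "é" encoded as ISO-8859-1; as a lone byte it is not valid UTF-8.
$latin1 = "caf\xE9 <b>";

// 1. Explicit encoding: the result does not depend on the PHP version
//    or on any ini setting.
$explicit = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');

// 2. No encoding argument: treated as ISO-8859-1 in 5.3 and as UTF-8 in 5.4.
//    default_charset is not consulted here.
$implicit = htmlspecialchars($latin1);

// 3. Empty string as the encoding: the only form that makes the function
//    look at settings such as default_charset.
$detected = htmlspecialchars($latin1, ENT_QUOTES, '');

// Hypothetical helper (not part of PHP): pins the input encoding in one
// place so legacy call sites can be switched with a search and replace.
function h($str, $encoding = 'ISO-8859-1')
{
    return htmlspecialchars($str, ENT_QUOTES, $encoding);
}

echo h($latin1), "\n";
```

Only the first form and the helper give the same result on 5.3 and 5.4 for ISO-8859-1 input, which is why existing htmlspecialchars($str) calls are affected by the upgrade.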
You should be able to change the output charset without worrying about your runtime encoding at that level.

What this effectively means is that we are asking people to go through their code and add an explicit charset to all htmlspecialchars() and htmlentities() calls. I think this will be a hurdle for 5.4 adoption.

What we really need is what we added in PHP 6: a runtime encoding ini setting, distinct from the output charset, which we can use here. That would allow people to pin all their legacy code to a specific runtime encoding with a single ini setting instead of changing thousands of lines of code. I propose that we add such a directive to 5.4.1 to ease migration.

See https://bugs.php.net/61354 for the first signs of grumbling about this one. As more people migrate, I have a feeling this will end up being the most difficult part of the migration.

-Rasmus