Newsgroups: php.internals
Message-ID: <4F5DA894.8060606@lerdorf.com>
Date: Mon, 12 Mar 2012 00:41:08 -0700
From: rasmus@lerdorf.com (Rasmus Lerdorf)
To: Stas Malyshev
CC: PHP internals
Subject: Re: [PHP-DEV] default charset confusion
References: <4F5D9C77.3030000@lerdorf.com> <4F5DA152.10109@sugarcrm.com>
In-Reply-To: <4F5DA152.10109@sugarcrm.com>

On 03/12/2012 12:10 AM, Stas Malyshev wrote:
> Hi!
>
>> What we really need is what we added in PHP 6. A runtime encoding ini
>> setting that is distinct from the output charset which we can use here.
>> That would allow people to fix all their legacy code to a specific
>> runtime encoding with a single ini setting instead of changing thousands
>> of lines of code. I propose that we add such a directive to 5.4.1 to
>> ease migration.
>
> One more charset INI setting? I'm not sure I like this. We have tons of
> INIs already, and adding a new one each time we change something makes
> both writing applications and configuring servers harder.
> But as the manual says, ISO-8859-1 and UTF-8 are the same for
> htmlspecialchars() - is it wrong? If yes, what exactly is the difference
> between old and new behavior? I tried to read #61354 but could make
> little sense out of it, it lacks expected result and I have a hard time
> understanding what is the problem there. Could you explain?

Yes, it is a bit hard to understand from the bug report because
bugs.php.net is all UTF-8, but we are talking about non-UTF-8 apps here.
This script should illustrate it: ( https://gist.github.com/2020502 )

    $gb2312 = iconv('UTF-8','GB2312','我是测试');
    $string = "<pre><p>$gb2312</p></pre>";
    echo htmlspecialchars($string);

If you run that in PHP 5.3 you get:

    &lt;pre&gt;&lt;p&gt;���Dz���&lt;/p&gt;&lt;/pre&gt;

The garbage-like chars there (if you don't see them, see
https://gist.github.com/2020442) are the expected output. In PHP 5.4 the
output is nothing. The function recognizes that this is not valid UTF-8
and discards the entire string.

Ignoring 5.4 for a second, if you do this in 5.3:

    echo htmlspecialchars($string);
    echo htmlspecialchars($string, NULL, "ISO-8859-1");
    echo htmlspecialchars($string, NULL, "UTF-8");

you will see that the first two output the escaped string with the GB2312
bytes intact within it, and the UTF-8 call returns an empty string because
it correctly recognizes that GB2312 is not UTF-8. We don't have any such
check for 8859-1, so yes, saying UTF-8 and 8859-1 are the same for
htmlspecialchars() is wrong for PHP 5.3 as well as for 5.4.

And as expected, under 5.4, because the default is now the UTF-8
behaviour, only the second echo gives a result.

-Rasmus