Newsgroups: php.internals
Message-ID: <4F5DAFCE.8020600@lerdorf.com>
Date: Mon, 12 Mar 2012 01:11:58 -0700
From: rasmus@lerdorf.com (Rasmus Lerdorf)
To: Stas Malyshev
CC: PHP internals
In-Reply-To: <4F5DAB49.3030808@sugarcrm.com>
Subject: Re: [PHP-DEV] default charset confusion

On 03/12/2012 12:52 AM, Stas Malyshev wrote:
> Hi!
>
>> Ignoring 5.4 for a second, if you in 5.3 do this:
>>
>> echo htmlspecialchars($string);
>> echo htmlspecialchars($string, NULL, "ISO-8859-1");
>> echo htmlspecialchars($string, NULL, "UTF-8");
>>
>> You will see that the first two output the escaped string with the
>> GB2312 bytes intact within it and the UTF-8 calls returns false because
>> it correctly recognizes that GB2312 is not UTF-8. We don't have any such
>> check for 8859-1, so yes, saying UTF-8 and 8859-1 are the same for
>> htmlspecialchars() is wrong for PHP 5.3 as well as for 5.4.
>
> So the difference is that ISO8859-1 does not validate but UTF-8 validates?
> I'm not sure what GB2312 encoding does but isn't it dangerous to do
> htmlspecialchars() with wrong encoding? Wouldn't htmlentities() also
> produce wrong result when used with wrong encoding?

Not sure you can validate 8859-1 since it isn't multibyte, can you? Is
there any byte that is explicitly forbidden in 8859-1?
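To make the validation question concrete, here is a minimal sketch in Python rather than PHP, so it does not depend on any particular PHP version. It assumes the common convention (used by Python's codec) that ISO-8859-1 maps every byte value 0x00-0xFF to a code point; the helper name is_valid and the sample string are made up for illustration.

```python
# Sample text encoded as GB2312; the sample string is an assumption for illustration.
gb2312_bytes = "中文".encode("gb2312")


def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes without error in `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Every possible byte value decodes as ISO-8859-1, so there is nothing to reject.
print(is_valid(bytes(range(256)), "iso-8859-1"))

# The same GB2312 bytes are accepted as ISO-8859-1 but rejected as UTF-8,
# because they do not form well-formed UTF-8 sequences.
print(is_valid(gb2312_bytes, "iso-8859-1"))
print(is_valid(gb2312_bytes, "utf-8"))
```

A single-byte charset treated this way can only pass data through, which is why the 8859-1 path cannot catch a mismatch. A UTF-8 check is not airtight either: some GB2312 byte pairs happen to be well-formed UTF-8, so validation catches many wrong-charset inputs but not all of them.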
And yes, it may very well be dangerous to use the wrong charset. Now
that we have better support for GB2312 and other Asian charsets in the
entities functions in 5.4, it is even more prudent to choose the right
one, so we should provide some way to help people get it right short of
changing every call.

Gustavo suggested we could use the multibyte encoding setting.
Unfortunately only zend.script_encoding is available, and I think
internal_encoding is closer to what we need here, but that is only
available as mbstring.internal_encoding.

-Rasmus