Newsgroups: php.internals
Message-ID: <4F5E5893.9030903@lerdorf.com>
Date: Mon, 12 Mar 2012 13:12:03 -0700
From: rasmus@lerdorf.com (Rasmus Lerdorf)
To: Stas Malyshev
CC: PHP internals
In-Reply-To: <4F5E53C3.8060502@sugarcrm.com>
Subject: Re: [PHP-DEV] default charset confusion

On 03/12/2012 12:51 PM, Stas Malyshev wrote:
> Hi!
>
>> But you can't necessarily hardcode the encoding if you are writing
>> portable code. That's a bit like hardcoding a timezone. In order to
>> write portable code you need to give people the ability to localize it.
>
> No, it's not like timezone at all. I have to support all timezones in a
> global app, but I don't have to internally support every encoding on
> Earth - having everything internally in UTF-8 works quite well, and a
> lot of applications do exactly that - they have everything internally in
> UTF-8 and only may convert when importing or exporting the data. I don't
> see anything in using UTF-8 throughout the app/library that makes it
> non-portable. However, if we allow to change defaults in
> htmlspecialchars() etc. that essentially makes having defaults useless
> as I'd have to explicitly specify UTF-8 each time - otherwise it's a
> gamble what encoding I'd actually get.

If everything was UTF-8 we wouldn't have any of these issues.
Unfortunately that isn't the case. The question is what to do with apps
that need to deal with non UTF-8 data.
Are we going to provide any help to them beyond just telling them to
convert everything to UTF-8?

We took steps in 5.4 to improve htmlspecialchars to understand more
encodings, and we have the concepts of script_encoding and
internal_encoding, which are used both in the engine and in mbstring.
Currently internal_encoding isn't checked by htmlspecialchars. If you
pass it '' as the encoding, it checks script_encoding and default_charset,
which is a bit odd since neither directly relates to the encoding of the
internal data you are feeding to it.

So maybe a way to tackle this is to have htmlspecialchars default to the
mbstring internal encoding, when that is set, whenever it is called
without an encoding arg.

-Rasmus
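
For illustration only, here is a rough userland sketch of the fallback order suggested above. It is not how the engine does it or would do it. The helper names pick_escape_encoding() and escape_html() are made up for this example, and treating '' the same as "no encoding given" is an assumption. Only ini_get() and htmlspecialchars() are real PHP functions.

```php
<?php
// Hypothetical helper (not part of PHP): choose the encoding to hand to
// htmlspecialchars(). Order: explicit argument, then mbstring's internal
// encoding if one is set, then default_charset, then UTF-8 as a last resort.
function pick_escape_encoding($explicit, $mbInternal, $defaultCharset)
{
    foreach (array($explicit, $mbInternal, $defaultCharset) as $candidate) {
        // ini_get() returns false for unknown settings and '' for unset
        // ones, so both count as "not set" here (an assumption).
        if (is_string($candidate) && $candidate !== '') {
            return $candidate;
        }
    }
    return 'UTF-8';
}

// Hypothetical wrapper showing where the fallback would plug in.
function escape_html($string, $encoding = null)
{
    $charset = pick_escape_encoding(
        $encoding,
        ini_get('mbstring.internal_encoding'),
        ini_get('default_charset')
    );
    return htmlspecialchars($string, ENT_QUOTES, $charset);
}
```

With this, escape_html($s) follows whatever the site has configured, while library code that wants a fixed result can still call escape_html($s, 'UTF-8'). One caveat: mbstring encoding names do not always match the charset names htmlspecialchars accepts, so a real implementation would need some name mapping.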