Newsgroups: php.internals
From: ajf@ajf.me (Andrea Faulds)
To: Alain Williams
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Unicode Escape Syntax
Date: Tue, 25 Nov 2014 11:25:17 +0000
In-Reply-To: <20141125112050.GF6315@phcomp.co.uk>
References: <24EE758F-BF8F-4AE9-B793-20739CD9875D@ajf.me> <20141125112050.GF6315@phcomp.co.uk>

> On 25 Nov 2014, at 11:20, Alain Williams wrote:
>
> I think that we need to clarify what we are talking about.
>
> What Andrea has proposed is a way of writing string constants. These characters
> in these strings will still be 8 bits big, this means that there needs to be
> some way of encoding characters with code points that will not fit in 8 bits.
> The only way of avoiding that would be to use, internally, 32 bit characters --
> which would be a huge change.
>
> So: we need to have some form of encoding.
>
> As I started ''a way of writing string constants'' - ie a *compile* time action.
>
> With the code below it is likely that at *run-time* mb_internal_encoding() has
> been called before the echo is executed or the 'Content-Type:' header specifies
> some encoding.
>
>> echo "mañana \u{1F602}"; // won't output anything useful if script
>> encoding is not UTF-8
>
> This is not something that the compiler can guess.

Well, we *do* already have a compile-time system for declaring encoding, the declare() construct.

> It is even worse if my proposal of \U{arabic letter alef} types is added, how is
> that encoded ? UTF-8 or iso-8859-6 or .... ?
>
> So, how do we fix the problem ?
>
> * mb_internal_encoding($new_encoding) finds every string (variable and constant)
> and converts from the previous encoding to the $new_encoding.
>
> Possible, but horribly slow and would prob break things (eg strings that
> contain binary data).
>
> Not a good idea.

I also agree this isn’t a good idea.

> * Decide that UTF-8 is king.
> That is what I have decided - but I do not have any legacy code to worry about
> -- being a Brit I don't have to worry much.
>
> * Rely on the programmer to understand encoding and know what the eventual
> output encoding will be and if it is not UTF-8 write characters using \Xxx or
> use mb_convert_encoding($string, $output_encoding, 'utf-8').
>
> If we decide to support non-utf-8 encoding at compile time then we could extend
> the syntax a bit to allow the encoding to be specified, eg:
>
> \U{utf-8: arabic letter alef}
>
> \U{iso-8859-6: arabic letter alef}
>
> Ie, allow this to be optionally specified and terminated by ':'. If not
> specified then assume utf-8.

There are only two sane options:

* Always UTF-8
* Whatever source file encoding we’ve specified with declare()

Of those, I’d prefer UTF-8, as nobody’s using UTF-16 or UTF-32.

--
Andrea Faulds
http://ajf.me/
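
P.S. To make the "always UTF-8" option concrete, here is a minimal sketch of what the escape would boil down to at compile time: a code point goes in, a fixed sequence of UTF-8 bytes comes out, whatever mb_internal_encoding() is set to later. The helper name utf8_from_codepoint() is made up for illustration and is not part of the RFC or of PHP; the code assumes PHP 7 or later for the scalar type declarations. Converting to something like ISO-8859-6 then stays an explicit run-time step, shown here only if mbstring happens to be loaded.

```php
<?php
// Hypothetical helper (illustration only, not part of the RFC):
// encode one Unicode code point as a UTF-8 byte string.
function utf8_from_codepoint(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        throw new InvalidArgumentException('Code point out of range');
    }
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$alef = utf8_from_codepoint(0x0627);  // ARABIC LETTER ALEF
$joy  = utf8_from_codepoint(0x1F602); // the code point from the echo example

echo bin2hex($alef), "\n"; // d8a7
echo bin2hex($joy), "\n";  // f09f9882

// A non-UTF-8 output encoding is the programmer's explicit choice at run time.
if (function_exists('mb_convert_encoding')) {
    echo bin2hex(mb_convert_encoding($alef, 'ISO-8859-6', 'UTF-8')), "\n";
}
```

The point of the sketch is that the bytes of the literal are fixed by the code point alone, so the compiler has nothing to guess.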