Newsgroups: php.internals
Date: Tue, 25 Nov 2014 11:18:57 +0000
To: Dmitry Stogov
Cc: PHP Internals
Subject: Re: [PHP-DEV] [RFC] Unicode Escape Syntax
From: ajf@ajf.me (Andrea Faulds)

> On 25 Nov 2014, at 10:41, Dmitry Stogov wrote:
>
> u8"string" tells that the whole string is UTF-8 encoded.
> Your escape Unicode proposal assumes just UTF-8 codepoint, but the whole string encoding is still undefined.

True. There’s an assumption there that you’re using a UTF-8-compatible source file. Actually, for other encodings, do we even guarantee that “\n” produces an ASCII LF just now? It certainly will on most Windows and Unix systems, but since we’re just using C’s ‘\n’ (http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_scanner.l#885), it might produce the newline character of some other encoding like EBCDIC in the right environment.

> > If you're using other encodings, why do you want to use a Unicode codepoints? Most Unicode codepoints will not supported by another character set.
>
> Agree, this Unicode escapes are not going to be used for anything except UTF-8 encoded strings.
> I'm not completely against it. It's just an incomplete solution.
>
> echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8
>
> echo "Привет \u{1F602}"; // won't output anything useful if script encoding is not UTF-8
> The second problem present even for European counties that use Windows-1250 codepage.
> echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8
>
> Thanks. Dmitry.

Yeah, that’s unfortunate. Although I don’t think there’s much we can do about it here. We can’t really convert, as most Unicode characters won’t be available in the codepage you’re using.

Even if we did have Unicode strings like the fabled PHP6 would have had, you still have this problem when you’re outputting in non-Unicode encodings.
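To make the problem concrete, here is a rough sketch rather than anything definitive. It assumes a PHP build where the \u{...} escape is available (PHP 7.0 or later) and the mbstring extension is loaded. Windows-1252 is only a stand-in for "some non-Unicode codepage", chosen because it contains ñ, and mb_convert_encoding() is used here merely to simulate emitting the string in that codepage.

```php
<?php
// Sketch only: assumes PHP 7.0+ (for \u{...}) and ext/mbstring.

// The escape expands to the UTF-8 bytes of the codepoint, whatever
// encoding the script file itself happens to be saved in.
$emoji = "\u{1F602}";
echo bin2hex($emoji), "\n"; // f09f9882

// A character that exists in the target codepage survives conversion.
$word = "ma\u{F1}ana"; // "mañana" as UTF-8
$wordCp = mb_convert_encoding($word, 'Windows-1252', 'UTF-8');
echo bin2hex($wordCp), "\n"; // 6d61f1616e61

// U+1F602 has no Windows-1252 equivalent, so the original bytes
// cannot come through the conversion intact.
$emojiCp = mb_convert_encoding($emoji, 'Windows-1252', 'UTF-8');
var_dump($emojiCp === $emoji); // bool(false)
```

The ñ round-trips because the target codepage has a slot for it; the emoji cannot, which is the case where no amount of converting helps.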
Although it’s worth noting that mbstring *should* handle this: if you have an internal encoding of UTF-8 and an output encoding of, say, Windows-1250, you can use UTF-8 in your strings and it should convert that for you on output. How well this works in practice, however, I have no idea.

--
Andrea Faulds
http://ajf.me/