Newsgroups: php.internals
Subject: Re: [PHP-DEV] [RFC] Unicode Escape Syntax
From: dmitry@zend.com (Dmitry Stogov)
To: Andrea Faulds
Cc: PHP Internals
Date: Tue, 25 Nov 2014 16:06:10 +0400
References: <24EE758F-BF8F-4AE9-B793-20739CD9875D@ajf.me>

On Tue, Nov 25, 2014 at 2:18 PM, Andrea Faulds wrote:

> > On 25 Nov 2014, at 10:41, Dmitry Stogov wrote:
> >
> > u8"string" tells that the whole string is UTF-8 encoded.
> > Your Unicode escape proposal assumes just a UTF-8 codepoint, but the encoding of the whole string is still undefined.
>
> True. There’s an assumption there that you’re using a UTF-8-compatible source file. Actually, for other encodings, do we even guarantee that “\n” produces an ASCII LF just now? It certainly will on most Windows and Unix systems, but since we’re just using C’s ‘\n’ (http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_scanner.l#885), it might produce the newline character of some other encoding like EBCDIC in the right environment.
>
> > If you're using other encodings, why do you want to use Unicode codepoints? Most Unicode codepoints will not be supported by another character set.
> >
> > Agree, these Unicode escapes are not going to be used for anything except UTF-8 encoded strings.
> > I'm not completely against it. It's just an incomplete solution.
> >
> > echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8
> >
> > echo "Привет \u{1F602}"; // won't output anything useful if script encoding is not UTF-8
> >
> > The second problem is present even for European countries that use the Windows-1250 codepage.
> >
> > echo "mañana \u{1F602}"; // won't output anything useful if script encoding is not UTF-8
> >
> > Thanks. Dmitry.
>
> Yeah, that’s unfortunate. Although I don’t think there’s much we can do about it here. We can’t really convert, as most Unicode characters won’t be available in the codepage you’re using.

If a character is not available in the codepage, it's replaced with "?" or something similar, but in your case we will get an unexpected UTF-8 sequence.

> Even if we did have Unicode strings like the fabled PHP6 would have had, you still have this problem when you’re outputting in non-Unicode encodings.

Right, but just for output we already have HTML entities:

echo "&#x1F602;"; // HTML entities already work independently of encodings.

I know, it's not completely the same as "\u{1F602}", but "\u{...}" assumes UTF-8 is used everywhere, and that's not true.
PHP6 was able to use Unicode escapes with any script encoding, because it converted all the strings into some internal encoding anyway.
If we convert all strings from the script encoding into the same internal encoding (e.g. UTF-8 or user defined), then "\u{...}" will really work.

Thanks. Dmitry.

> Although it’s worth noting that mbstring *should* handle this, since if you have an internal encoding of UTF-8 and an output encoding of, say, Windows-1250, you can use UTF-8 in your strings and it should convert that for you on output. How well this works in practice, however, I have no idea.
> --
> Andrea Faulds
> http://ajf.me/
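
To make the mixed-encoding concern in this exchange concrete, here is a minimal sketch. It is not taken from the RFC or from either author. It assumes PHP 7.0+ (for the "\u{...}" syntax) with the mbstring extension, and it simulates a Windows-1251 script file by converting text at run time. Windows-1251 stands in for the legacy codepage because it is an encoding mbstring is known to support. All variable names are illustrative.

```php
<?php
// Sketch only: assumes PHP 7.0+ and the mbstring extension.

// Stand-in for the text "Привет " as it would sit in a Windows-1251
// source file (one byte per Cyrillic letter). The letters are written
// as escapes so this snippet does not depend on its own file encoding.
$utf8Text   = "\u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442} ";
$legacyText = mb_convert_encoding($utf8Text, 'Windows-1251', 'UTF-8');

// The escape always expands to the UTF-8 bytes of U+1F602.
$emoji = "\u{1F602}";

// Concatenation yields one string holding bytes in two encodings.
$mixed = $legacyText . $emoji;
echo bin2hex($mixed), "\n";                     // cff0e8e2e5f220f09f9882
var_dump(mb_check_encoding($mixed, 'UTF-8'));   // bool(false)

// Converting the legacy part to a single internal encoding first
// (UTF-8 here) gives a coherent string.
$unified = mb_convert_encoding($legacyText, 'UTF-8', 'Windows-1251') . $emoji;
var_dump(mb_check_encoding($unified, 'UTF-8')); // bool(true)

// Converting the emoji into the legacy codepage is lossy: the
// character does not exist there, so the substitute is emitted.
mb_substitute_character(0x3F);                  // make "?" explicit
$lossy = mb_convert_encoding($emoji, 'Windows-1251', 'UTF-8');
var_dump($lossy);                               // string(1) "?"

// A numeric HTML entity is plain ASCII, so it survives any
// ASCII-compatible script or output encoding unchanged.
$entity = '&#x1F602;';
var_dump(mb_check_encoding($entity, 'ASCII'));  // bool(true)
```

The first case is the "unexpected UTF-8 sequence" situation, the second corresponds to the "convert everything into one internal encoding" idea, and the third is the "replaced with ?" behaviour on conversion to a codepage that lacks the character.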