Newsgroups: php.internals
Date: Mon, 24 Nov 2014 23:29:11 +0000
To: internals@lists.php.net
Message-ID: <20141124232911.GB6315@phcomp.co.uk>
Mail-Followup-To: internals@lists.php.net
References: <C2A085AA-3E3A-405F-954B-4C1F68A46012@ajf.me>
 <CAESVnVo=mHD1yNSdmHAaho4Emg2qLLB6hrb_pvVXo8h68-OpVw@mail.gmail.com>
In-Reply-To: <CAESVnVo=mHD1yNSdmHAaho4Emg2qLLB6hrb_pvVXo8h68-OpVw@mail.gmail.com>
Organization: Parliament Hill Computers Ltd
User-Agent: Mutt/1.5.20 (2009-12-10)
Subject: Re: [PHP-DEV] [RFC] Unicode Escape Syntax
From: addw@phcomp.co.uk (Alain Williams)

On Mon, Nov 24, 2014 at 02:21:37PM -0800, Sara Golemon wrote:
> On Mon, Nov 24, 2014 at 2:09 PM, Andrea Faulds <ajf@ajf.me> wrote:
> > Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape
> >
> I'm okay with producing UTF-8 even though our strings are technically
> binary.  As you state, UTF-8 is the de-facto encoding, and recognizing
> this is pretty reasonable.
> 
> You may want to make it a requirement that strings containing \u
> escapes are denoted as:   u"blah blah"    We set aside this format
> back in the PHP6 days (note that b"blah" is equivalent to "blah" for
> binary strings).
> 
> On the BMP versus SMP issue of \uXXXX styles, we addressed this in
> PHP6 by making \u denote 4 hexit BMP codepoints, while \U denoted six
> hexit codepoints.   e.g.    "\u1234" === "\U001234"   I'd rather
> follow this style than making \u special and different from hex and
> octal notations by using braces.

There is a big difference between \u or \U and the \x or octal escapes: the
number of characters that follow the escape. \x takes at most 2 hex digits and
an octal escape at most 3 digits; both are short and easy to count by eye.
\U012345 is quite long, and it is not visually obvious where it should end.

Ergo: I prefer Andrea's "\u{0123}" as it is going to be more robust against typos.
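
To illustrate, here is a rough userland sketch. The function name
expand_braced_escapes is made up for this mail, and this is not how the RFC
would implement it in the lexer. It only shows that with braces the end of the
code point is explicit, so hex-looking text that follows cannot be swallowed
by mistake:

```php
<?php
// Hypothetical helper, for illustration only: expands \u{XXXX} sequences
// at run time and encodes each code point as UTF-8.
function expand_braced_escapes($s)
{
    return preg_replace_callback(
        '/\\\\u\{([0-9A-Fa-f]{1,6})\}/',
        function (array $m) {
            $cp = hexdec($m[1]);
            // Outside the Unicode range, or a surrogate: leave untouched.
            if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
                return $m[0];
            }
            if ($cp < 0x80) {
                return chr($cp);
            }
            if ($cp < 0x800) {
                return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
            }
            if ($cp < 0x10000) {
                return chr(0xE0 | ($cp >> 12))
                     . chr(0x80 | (($cp >> 6) & 0x3F))
                     . chr(0x80 | ($cp & 0x3F));
            }
            return chr(0xF0 | ($cp >> 18))
                 . chr(0x80 | (($cp >> 12) & 0x3F))
                 . chr(0x80 | (($cp >> 6) & 0x3F))
                 . chr(0x80 | ($cp & 0x3F));
        },
        $s
    );
}

// Single quotes, so PHP itself leaves the backslash sequence alone.
// The "ab" after the closing brace is plainly not part of the code point.
echo expand_braced_escapes('\u{e9}ab'), "\n";   // U+00E9 followed by "ab"
```

With a fixed-width form the same text would have to be written with exactly
the right number of leading zeros, and one zero too few would silently pull
the "a" into the code point.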


One other thing that we could do is to allow code points to be named, using \U
(capital 'U'), e.g.:

echo "\U{arabic letter alef}\n";

If you think that this is a bad idea, please update the RFC to say why, and so
why it is not going to happen, at least for now.

It would be nice, since a code point is just a big number without any really
obvious meaning, whereas a name makes for greater clarity.
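
The lookup itself can be sketched in userland, assuming a PHP build whose intl
extension provides the IntlChar class (PHP 7.0 or later). The function name
expand_named_escapes is invented for this mail; it is only meant to show what
a name-to-code-point lookup involves, not to propose an implementation:

```php
<?php
// Hypothetical helper, for illustration only: expands \U{name} sequences
// by looking the name up through ICU (requires the intl extension).
function expand_named_escapes($s)
{
    return preg_replace_callback(
        '/\\\\U\{([^}]+)\}/',
        function (array $m) {
            // Unicode character names are upper case; normalise so the
            // lookup does not depend on how the caller typed the name.
            $cp = IntlChar::charFromName(strtoupper($m[1]));
            // Unknown name: leave the sequence untouched.
            return $cp === null ? $m[0] : IntlChar::chr($cp);
        },
        $s
    );
}

// Single quotes, so PHP itself leaves the backslash sequence alone.
echo expand_named_escapes('\U{arabic letter alef}'), "\n";   // U+0627
```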

However, I suspect that interpreting the names might be considerably slower,
which means slower compilation.

Regards

-- 
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256  http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>