From: addw@phcomp.co.uk (Alain Williams)
Date: Mon, 24 Nov 2014 23:29:11 +0000
To: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Unicode Escape Syntax

On Mon, Nov 24, 2014 at 02:21:37PM -0800, Sara Golemon wrote:
> On Mon, Nov 24, 2014 at 2:09 PM, Andrea Faulds wrote:
> > Here’s a new RFC: https://wiki.php.net/rfc/unicode_escape
> >
> I'm okay with producing UTF-8 even though our strings are technically
> binary. As you state, UTF-8 is the de-facto encoding, and recognizing
> this is pretty reasonable.
>
> You may want to make it a requirement that strings containing \u
> escapes are denoted as: u"blah blah"
> We set aside this format
> back in the PHP6 days (note that b"blah" is equivalent to "blah" for
> binary strings).
>
> On the BMP versus SMP issue of \uXXXX styles, we addressed this in
> PHP6 by making \u denote 4 hexit BMP codepoints, while \U denoted six
> hexit codepoints. e.g. "\u1234" === "\U001234" I'd rather
> follow this style than making \u special and different from hex and
> octal notations by using braces.

There is a big difference between \u or \U on the one hand and \x or octal
escapes on the other: the number of characters that follow the escape. \x
takes at most 2 hex digits and an octal escape at most 3; both are short and
easy to count by eye. \U012345 is quite long, and it is not visually obvious
where it should end.

Ergo: I prefer Andrea's "\u{0123}", as it is going to be more robust against
typos.

One other thing that we could do is allow code points to be named, with \U
(capital 'U'), e.g.:

    echo "\U{arabic letter alef}\n";

If you think that this is a bad idea, please update the RFC to say why, and so
why it is not going to happen, for now.

It would be nice, since a code point is just a big number without any obvious
meaning, whereas a name makes for greater clarity. However, I suspect that
interpreting names might be considerably slower, which means slower
compilation.

Regards

-- 
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256  http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
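
P.S. To make the two braced forms concrete, here is a rough sketch of how they
could be expanded. It is written in Python rather than PHP purely so that it
can be run today; the function name expand_escapes and the regular expression
are my own invention for illustration and are not taken from the RFC or from
any PHP implementation. It assumes the name form is matched case-insensitively
against the standard Unicode character names.

```python
import re
import unicodedata

# Hypothetical pattern: \u{HEX} (any number of hex digits) or \U{character name}.
# The closing brace marks the end of the escape, so no digit counting is needed.
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}|\\U\{([^}]+)\}")


def expand_escapes(text):
    """Expand braced code point escapes in text (illustrative sketch only)."""

    def replace(match):
        hex_digits, name = match.group(1), match.group(2)
        if hex_digits is not None:
            # Numeric form: chr() raises ValueError above U+10FFFF.
            return chr(int(hex_digits, 16))
        # Named form: lookup() raises KeyError for an unknown name.
        return unicodedata.lookup(name.upper())

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    # Both spellings should yield the same character.
    print(expand_escapes(r"\u{0627}") == expand_escapes(r"\U{arabic letter alef}"))
```

The name lookup is a table search rather than a simple hex conversion, which
is where I would expect any extra compile-time cost to come from.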