Newsgroups: php.internals
Date: Tue, 25 Nov 2014 11:20:50 +0000
From: addw@phcomp.co.uk (Alain Williams)
To: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Unicode Escape Syntax
Message-ID: <20141125112050.GF6315@phcomp.co.uk>
References: <24EE758F-BF8F-4AE9-B793-20739CD9875D@ajf.me>
Organization: Parliament Hill Computers Ltd

On Tue, Nov 25, 2014 at 02:41:48PM +0400, Dmitry Stogov wrote:

> I'm not completely against it. It's just an incomplete solution.
>
> echo "\u{1F602}"; // won't output 😂
> if the output encoding is not UTF-8
>
> echo "Привет \u{1F602}"; // won't output anything useful if script
> encoding is not UTF-8
>
> The second problem is present even for European countries that use the
> Windows-1250 codepage.

I think that we need to clarify what we are talking about.

What Andrea has proposed is a way of writing string constants. The characters in
these strings will still be 8 bits wide, which means that there needs to be some
way of encoding characters whose code points will not fit in 8 bits. The only way
of avoiding that would be to use, internally, 32-bit characters -- which would be
a huge change.

So: we need to have some form of encoding.

As I stated, this is ''a way of writing string constants'', i.e. a *compile*-time
action. With the code below it is likely that at *run time*
mb_internal_encoding() has been called before the echo is executed, or the
'Content-Type:' header specifies some encoding.

> echo "mañana \u{1F602}"; // won't output anything useful if script
> encoding is not UTF-8

This is not something that the compiler can guess.

It is even worse if my proposal of \U{arabic letter alef} types is added: how is
that encoded? UTF-8 or iso-8859-6 or ....?

So, how do we fix the problem?

* Make mb_internal_encoding($new_encoding) find every string (variable and
  constant) and convert it from the previous encoding to $new_encoding.
  Possible, but horribly slow, and it would probably break things (e.g. strings
  that contain binary data). Not a good idea.

* Decide that UTF-8 is king. That is what I have decided -- but I do not have any
  legacy code to worry about -- being a Brit I don't have to worry much.

* Rely on the programmer to understand encoding and to know what the eventual
  output encoding will be; if it is not UTF-8, write characters using \xNN byte
  escapes or use mb_convert_encoding($string, $output_encoding, 'utf-8').
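To make the third option concrete, here is a minimal sketch. It assumes a PHP build that supports the proposed \u{...} escape (PHP 7.0 and later do) and that the mbstring extension is loaded; the variable names are mine, purely for illustration. It shows that the escape becomes UTF-8 bytes in the string constant, and that getting any other encoding is an explicit run-time conversion by the programmer:

```php
<?php
// Assumption: PHP 7.0+ (for "\u{...}") with the mbstring extension loaded.

// The escape is resolved when the script is compiled and always yields UTF-8.
$face = "\u{1F602}";
echo bin2hex($face), "\n";   // f09f9882
echo strlen($face), "\n";    // 4 (bytes, not characters)

// U+0627 ARABIC LETTER ALEF is two bytes in UTF-8 ...
$alef_utf8 = "\u{0627}";
echo bin2hex($alef_utf8), "\n";     // d8a7

// ... and a single byte once explicitly converted to ISO-8859-6.
$alef_8859_6 = mb_convert_encoding($alef_utf8, 'ISO-8859-6', 'UTF-8');
echo bin2hex($alef_8859_6), "\n";   // c7

// The same ISO-8859-6 byte written directly with a byte escape.
var_dump($alef_8859_6 === "\xC7");  // bool(true)
```

Nothing here depends on mb_internal_encoding() or on the 'Content-Type:' header; the compiler never sees either, which is why the conversion has to be spelled out.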
If we decide to support non-UTF-8 encodings at compile time then we could extend
the syntax a bit to allow the encoding to be specified, e.g.:

    \U{utf-8: arabic letter alef}
    \U{iso-8859-6: arabic letter alef}

I.e., allow the encoding to be optionally specified as a prefix terminated by
':'. If it is not specified then assume UTF-8.

-- 
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256  http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include