Newsgroups: php.internals
To: internals@lists.php.net
Date: Sun, 17 Sep 2017 14:52:52 +0100
Message-ID: <7d8ff8dc-ac6c-0021-b5e8-720e6fd35115@lsces.co.uk>
In-Reply-To: <7E527061-26D5-4E0C-BAF7-A6F1A940053B@gmail.com>
Subject: Re: [PHP-DEV] Progress or just 'a mess'?
From: lester@lsces.co.uk (Lester Caine)

On 17/09/17 11:53, Rowan Collins wrote:
> On 17 September 2017 09:54:54 BST, Lester Caine wrote:
>> Just what character set is PHP7 designed to work with.
>
> Focusing on the answerable part of this, PHP actually allows a very wide variety of characters in identifiers (names of variables, classes, functions, etc).
>
> I checked the PHP lang-spec repo expecting to find a set of Unicode classes, but it currently mentions "U+0080-U+00FF": https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names
> That seems wrong to me, unless I'm looking at the wrong definition - the first part of that range is control characters, and you can have variables called things like $? (with an emoji as the entire name).
>
> That would definitely be the place to document the allowed characters, though, and a rigorous definition of "case insensitive" could also be added. I was wrong, by the way, to say that using "to case fold" rather than "to lower case" would solve the Turkish I problem - the key for that is to define a single locale whose case folding you are using, independent of runtime locale settings.

I think this is actually the problem. Unicode is simply NOT a general solution! Normalization is another aspect, and combining it with case folding can change whether two strings compare equal. On top of that, one has to add the collation used to provide sort order, which is another can of worms. Sorting array keys in order depends on the character set used ... which is perhaps why there seems to be a drive to replace associative arrays with simple numeric ones?

"U+0020-U+007F" gives the printable Basic Latin (ASCII) characters, plus DEL at U+007F.
"U+0080-U+00FF" adds the "Latin-1 Supplement".

The problem is that the second 128 characters avoid overlaying the "U+0000-U+001F" control character block (U+0080-U+009F are reserved as a second control block), while single-byte character sets WOULD be more productive if they used those positions for extra characters instead.
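
To make the ranges above concrete, here is a minimal Python sketch (Python is used purely for illustration, and the helper name `control_code_points` is hypothetical). It uses the standard `unicodedata` module to list the control characters in each half of the U+0000-U+00FF range, and shows that byte 0x80 is a control character in Latin-1 but a printable character in Windows-1252. The last step rests on an assumption that is not stated in the spec text quoted above: that the "U+0080-U+00FF" range might be matched against bytes rather than code points, in which case a UTF-8 encoded emoji would fit.

```python
import unicodedata


def control_code_points(first, last):
    """Return the code points in [first, last] whose category is Cc (control)."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cc"]


# C0 controls plus DEL sit in the first 128 code points.
c0 = control_code_points(0x00, 0x7F)
# The second 128 code points start with the C1 control block.
c1 = control_code_points(0x80, 0xFF)

print(len(c0), hex(c0[-1]))     # 33 0x7f
print(hex(c1[0]), hex(c1[-1]))  # 0x80 0x9f

# The same byte means different things in different single-byte sets.
latin1_char = b"\x80".decode("latin-1")  # a C1 control, U+0080
cp1252_char = b"\x80".decode("cp1252")   # the euro sign, U+20AC

# Assumption: if the identifier range is matched against bytes, a UTF-8
# encoded emoji fits, because every one of its bytes is >= 0x80.
emoji_bytes = "\U0001F600".encode("utf-8")
all_high = all(0x80 <= b <= 0xFF for b in emoji_bytes)
```
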
One of the irritating compromises made by Unicode?

It would perhaps also be nice if the file naming convention used 'nbsp' for spaces rather than 'sp', eliminating the need for quotes around file and directory names. But adding quotes is used by SQL to indicate 'case-sensitive' strings, yet another convention to be given a nod to? If you get an associative key from a quoted field name it is NOT case-insensitive, and while a second field with the same combination of characters would be 'silly', it is something that can happen for many reasons ... and explode() falls over in some instances as a result.

-- 
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
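
Returning to the point about quoted, case-sensitive field names ending up as associative keys: the following is a minimal Python sketch, not a description of what PHP or any SQL driver does. The names `fold_key` and `find_collisions` and the sample field names are hypothetical. It applies a fixed, locale-independent fold (normalize, case fold, normalize again) and reports which distinct field names would land on the same key.

```python
import unicodedata


def fold_key(name):
    """Locale-independent caseless key: NFD, then case fold, then NFD again."""
    decomposed = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", decomposed.casefold())


def find_collisions(field_names):
    """Group field names by folded key and keep only the groups that clash."""
    groups = {}
    for name in field_names:
        groups.setdefault(fold_key(name), []).append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical field names: two differ only by case, two only by
# normalization form (precomposed e-acute versus e + combining acute).
fields = ["Name", "NAME", "Caf\u00e9", "Cafe\u0301", "id"]
collisions = find_collisions(fields)

for key, names in collisions.items():
    print(repr(key), "<-", names)

# The default fold is fixed regardless of runtime locale, so it does not
# apply the Turkish rule that pairs "I" with dotless "ı" (U+0131).
turkish_pair_matches = fold_key("I") == fold_key("\u0131")
```

Used as exact keys, all five names are distinct; a case-insensitive consumer sees only three. Whether that is the failure described above depends on how the keys are later compared, which this sketch does not attempt to model.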