Progress or just 'a mess'?

8 years ago by Lester Caine

It's a question I've asked before, but there still does not seem to be a
proper answer ... just where is PHP in relation to Unicode? The thread
on 'case-insensitive constants' cherry-picks a particular aspect without
picking up on the base problem: just what character set is PHP7 designed
to work with?

The SQL standard provides a working solution to the problem, and one that
is still applied 25 years on ... it lists the subset of characters
available for writing SQL code: essentially the Latin character set with
well-defined special characters. The irritating part, of course, is that
this standard is one you have to pay for copies of, but the principle
can easily be copied, along perhaps with some of the extensions relating
to handling Unicode data within the constrained framework.

Everything in SQL is essentially 'upper case', although I still have fun
moving datasets to PHP arrays where the keys end up as 'lower case'
versions of the default UPPER CASE returned by the standard. THIS is an
area where case-insensitive operations would be very useful, but that is
not going to happen any time soon.
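
A minimal sketch of the key-case mismatch described above, assuming a
driver that hands back upper-case column names. The row contents and the
helper name row_get() are hypothetical; array_change_key_case() is the
existing PHP built-in. The example sticks to ASCII column names, and
nothing here is claimed for keys containing other characters.

```php
<?php
// Hypothetical row as a driver might return it, with upper-case column names.
$row = ['CUSTOMER_ID' => 42, 'NAME' => 'Smith'];

// Hypothetical helper: look a column up regardless of the case of its name.
function row_get(array $row, string $column)
{
    // Fold every string key to upper case, then fold the requested name the same way.
    $folded = array_change_key_case($row, CASE_UPPER);
    $key = strtoupper($column);

    return array_key_exists($key, $folded) ? $folded[$key] : null;
}

var_dump(row_get($row, 'customer_id')); // int(42)
var_dump(row_get($row, 'Name'));        // string(5) "Smith"
```

Folding on lookup keeps the stored keys exactly as the metadata defines
them, so nothing is lost if the original case matters elsewhere.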

For PHP8 is it not time to lay out a similar set of rules as provided by
SQL and identify just what 'case-insensitive' means and where it does apply?

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

8 years ago by Rowan Collins

Just what character set is PHP7 designed to work with.

Focusing on the answerable part of this, PHP actually allows a very wide variety of characters in identifiers (names of variables, classes, functions, etc).

I checked the PHP lang-spec repo expecting to find a set of Unicode classes, but it currently mentions "U+0080-U+00FF": https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names That seems wrong to me, unless I'm looking at the wrong definition - the first part of that range is control characters, and you can have variables called things like $? (with an emoji as the entire name).

That would definitely be the place to document the allowed characters, though, and a rigorous definition of "case insensitive" could also be added. I was wrong, by the way, to say that using "to case fold" rather than "to lower case" would solve the Turkish I problem - the key for that is to define a single locale whose case folding you are using, independent of runtime locale settings.

Regards,

--
Rowan Collins
[IMSoP]

8 years ago by Christoph M. Becker

I checked the PHP lang-spec repo expecting to find a set of Unicode classes, but it currently mentions "U+0080-U+00FF": https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names That seems wrong to me, unless I'm looking at the wrong definition - the first part of that range is control characters, and you can have variables called things like $? (with an emoji as the entire name).

The specification in the PHP manual[1] appears to be more appropriate
for our current implementation:

| As a regular expression, it would be expressed thus:
| '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

With regard to control characters: that depends on the chosen character
encoding; for instance in Windows-1252 the ¢ character is mapped to \xA2.

[1] http://php.net/manual/en/language.variables.basics.php

--
Christoph M. Becker

8 years ago by Rowan Collins

I checked the PHP lang-spec repo expecting to find a set of Unicode
classes, but it currently mentions "U+0080-U+00FF":
https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names
That seems wrong to me, unless I'm looking at the wrong definition -
the first part of that range is control characters, and you can have
variables called things like $? (with an emoji as the entire name).

The specification in the PHP manual[1] appears to be more appropriate
for our current implementation:

| As a regular expression, it would be expressed thus:
| '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

With regard to control characters: that depends on the chosen character
encoding; for instance in Windows-1252 the ¢ character is mapped to
\xA2.

[1] http://php.net/manual/en/language.variables.basics.php

Ah, so the mistake in the spec is that these aren't actually Unicode code points at all, but allowed bytes, which happen to allow for the UTF-8 encoding of pretty much any Unicode code point.
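
To make the bytes-versus-code-points distinction concrete, here is a small sketch that applies the pattern quoted from the manual to UTF-8 encoded names. The helper name is_valid_name() and the constant NAME_PATTERN are made up for illustration, and the anchors (^ and \z) are added here so the whole string is tested; this checks names against the documented pattern only, not against the engine's lexer.

```php
<?php
// The pattern quoted from the PHP manual, anchored at both ends. Without the
// /u modifier PCRE matches byte by byte, so \x7f-\xff is a range of bytes.
const NAME_PATTERN = '/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*\z/';

// Hypothetical helper: does $name fit the documented pattern (without the $)?
function is_valid_name(string $name): bool
{
    return preg_match(NAME_PATTERN, $name) === 1;
}

// "\xC3\xA9" is the UTF-8 encoding of U+00E9; both bytes are >= 0x80.
var_dump(is_valid_name("caf\xC3\xA9"));      // bool(true)

// "\xF0\x9F\x98\x80" is the UTF-8 encoding of U+1F600, far outside U+0080-U+00FF.
var_dump(is_valid_name("\xF0\x9F\x98\x80")); // bool(true)

// A leading digit is rejected by the first character class.
var_dump(is_valid_name("9lives"));           // bool(false)
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or above, which is why the byte range admits the emoji even though its code point is nowhere near U+00FF.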

That makes much more sense, but doesn't answer the other question: whether there's a working definition of what we mean by "case insensitive".

Regards,

--
Rowan Collins
[IMSoP]

8 years ago by Christoph M. Becker

That makes much more sense, but doesn't answer the other question, of if there's a working definition of what we mean by "case insensitive".

For case-insensitive constants zend_register_constant() uses
zend_str_tolower_copy() which uses zend_tolower_ascii() which looks up
in tolower_map:
https://github.com/php/php-src/blob/php-7.0.23/Zend/zend_operators.c#L46-L63.
As the name already says, this is a simple ASCII lower case mapping
(A-Z are mapped to a-z; all others map to themselves). So only
identifiers consisting solely of ASCII characters can actually be
case-insensitive.
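
A userland sketch of the mapping just described may help; it is an
illustration under the assumption that the description above is
complete, not the engine's code. The function name ascii_lower() is
made up here; strtr() is the existing PHP built-in.

```php
<?php
// ASCII-only lower-case mapping: A-Z map to a-z, every other byte maps to itself.
function ascii_lower(string $s): string
{
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// Pure ASCII names fold to the same string.
var_dump(ascii_lower('MY_CONST') === ascii_lower('my_const'));       // bool(true)

// "\xC3\x89" and "\xC3\xA9" are the UTF-8 encodings of U+00C9 and U+00E9
// (upper- and lower-case e with acute). Neither contains a byte in A-Z,
// so the mapping leaves the two names different.
var_dump(ascii_lower("CAF\xC3\x89") === ascii_lower("caf\xC3\xA9")); // bool(false)
```

Because the mapping never looks at anything but the 26 bytes A-Z, it
gives the same answer whatever the runtime locale is set to.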

I presume that this map is also used for other case-insensitive identifiers.

--
Christoph M. Becker

8 years ago by Christoph M. Becker

See also Sara's reply to the other thread:
http://news.php.net/php.internals/100602.

--
Christoph M. Becker

8 years ago by Lester Caine

Just what character set is PHP7 designed to work with.

Focusing on the answerable part of this, PHP actually allows a very wide variety of characters in identifiers (names of variables, classes, functions, etc).

I checked the PHP lang-spec repo expecting to find a set of Unicode classes, but it currently mentions "U+0080-U+00FF": https://github.com/php/php-langspec/blob/master/spec/09-lexical-structure.md#names That seems wrong to me, unless I'm looking at the wrong definition - the first part of that range is control characters, and you can have variables called things like $? (with an emoji as the entire name).

That would definitely be the place to document the allowed characters, though, and a rigorous definition of "case insensitive" could also be added. I was wrong, by the way, to say that using "to case fold" rather than "to lower case" would solve the Turkish I problem - the key for that is to define a single locale whose case folding you are using, independent of runtime locale settings.

I think this is actually the problem. Unicode is simply NOT a general
solution! Normalizing is another aspect, and that can result in
differences between strings if one also 'case folds'. On top of which
one has to add the collation one is using to provide sort order which is
another can of worms? Sorting array keys in order depends on the
character set used ... which is perhaps why there seems to be a drive to
replace associative arrays with simple numeric ones?

"U+0020-U+007F" gives the Basic Latin set of characters (ASCII)
"U+0080-U+00FF" adds the "Latin-1 Supplement"
The problem is that the second 128 characters are avoiding overlaying the
"U+0000-U+001F" control character block, while single byte character
sets WOULD be more productive if they followed the extra character
convention instead. One of the irritating compromises made by Unicode?

It would perhaps also be nice if the file naming convention used 'nbsp'
for spaces rather than 'sp' and eliminate the need for quotes around
file and directory names, but adding quotes is used by SQL to indicate
'case-sensitive' strings, yet another convention to be given a nod to?
If you get an associative key from a quoted field name it is NOT
case-insensitive and while a second field with the same combination of
characters would be 'silly' it is something that can happen for many
reasons ... and explode() falls over in some instances as a result.

--
Lester Caine - G8HFL

8 years ago by Stanislav Malyshev

Hi!

picking up on the base problem? Just what character set is PHP7 designed
to work with.

What do you mean by "work with"?

For PHP8 is it not time to lay out a similar set of rules as provided by
SQL and identify just what 'case-insensitive' means and where it does apply?

I'm not sure which problem you are trying to solve here. Could you
explain what you'd be using these rules for?

Stas Malyshev
smalyshev@gmail.com

8 years ago by Lester Caine

picking up on the base problem? Just what character set is PHP7 designed
to work with.

What do you mean by "work with"?

Actually that HAS already been identified in this thread, and it is only
the basic ASCII character set, but this is not actually specified anywhere?

For PHP8 is it not time to lay out a similar set of rules as provided by
SQL and identify just what 'case-insensitive' means and where it does apply?

I'm not sure which problem you are trying to solve here. Could you
explain what you'd be using these rules for?

Having established that the only characters that are case-insensitive in
PHP7 ... the unicode basic latin set ... the discussion SHOULD be on
either expanding that to cover all case folding or simply removing this
rather limited case? Tony Marston is making an impassioned demand to
retain this very limited case, and therefore expand it to cover all
character sets, and as a fellow 'English only' coder, I can accept that
argument. However many of my clients do not use English as a first
language so any data handling has to be unicode based, and case in that
data can be important, so is case-insensitive really as universal as
Tony thinks? Certainly we need data case-insensitivity to handle unicode
properly and not just a few english characters ( should I really add a
capital 'E' to english just to please the spell checker? )

People are using their own languages when writing PHP variables and
function names, and apart from a few edge cases this does seem to be
working for them. As with SQL, the key programming words are in English,
and I don't think anybody would suggest adding aliases for them, so
restricting keywords to 'unicode basic latin set' can be defined, but
does THEN making that case-insensitive add to the problems of making PHP
more user friendly in handling unicode names elsewhere? I am seeing SQL
field names coming in with unicode content, and these are then array
keys in PHP ... the latin characters get lower cased at times and this
DOES cause a problem if the metadata defines upper case and I suspect
that is something that will never be changed now, but the actual rules
applied would be nice to know?

--
Lester Caine - G8HFL

8 years ago by Stanislav Malyshev

Hi!

Having established that the only characters that are case-insensitive in
PHP7 ... the unicode basic latin set ... the discussion SHOULD be on
either expanding that to cover all case folding or simply removing this
rather limited case?

Why? Does anybody seriously need Russian case folding in PHP constants?
I mean, sure, nice demo, but does anybody need it? I don't see much
code on github - in any language - that uses Russian identifiers, for
example.

argument. However many of my clients do not use English as a first
language so any data handling has to be unicode based, and case in that

You seem to be mixing data and code here. So what are you talking
about: data or code?

--
Stas Malyshev
smalyshev@gmail.com
