Unicode strings?

11 years ago by Lester Caine

I'm slowly working through a long list of things relating to unicode strings
trying to work out just where the main problems are.

The very first problem I hit is ICU's limitation to 32-bit string lengths. How
does the switch to 64-bit string lengths on 64-bit platforms impinge on this?
While I can see the advantage of this particular change, would it also now
require our own version of ICU capable of handling longer strings? This
probably falls out in the wash of my next point ...

Currently strings are simply strings? I'm sure we have already had this
discussion, and it will be necessary to switch from simple strings to a string
object which can handle the intricacies of unicode?

Pierre - I presume that this distinction is where I'm crossing over: variable
and similar names just remain simple strings, while 'data' that is unicode is
provided by string objects. These then need to work nicely with areas that
expect a simple string? Where a string object is returned, an ASCII version
will be created when a simple string is necessary?

The 'leak' of unicode currently into name strings is simply that there is
nothing currently stopping them from storing UTF-8? That this works is more by
luck than design, but it results in subtle problems with case conversion and
the like, which do not expect unicode strings? BUT people can currently use
data in any format in a string, even one using a 64 bit pointer, as long as it
does not go through a path that expects ASCII?

If the simple string is isolated from UTF-8 and unicode is kept to its own data
type, such as an improved integrated mbstring package, then this would make a
suitable 'half way' house for PHP6?

I don't NEED unicode variable names, but I can see that this would be a
nice-to-have in non-English speaking countries. In much the same way we provide
translated versions of web pages, I can even see the advantage of function name
aliases in different languages as having more relevance than simply changing the
current English names for picky reasons, but that is not likely to happen in my
lifetime! Perhaps PHP10 :)

--
Lester Caine - G8HFL

Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

11 years ago by Crypto Compress

Hi,

On 11.03.2014 11:31, Lester Caine wrote:

I'm slowly working through a long list of things relating to unicode
strings trying to work out just where the main problems are.

The very first problem I hit is ICU's limitation to 32bit string
lengths. How does the switch to 64bit string length on 64 bit
platforms impinge on this. While I can see the advantage of this
particular change, would that also now require our own version of ICU
capable of also handling longer strings? This probably falls out in
the wash of my next point ...

Where have you found this information? Can you please provide source for
this?

Currently strings are simply strings? I'm sure we have already had
this discussion, and it will be necessary to switch from simple
strings to a string object which can handle the intricacies of unicode?

Yes, currently we have so-called binary strings (simple bytes, 8 bits).
No, we should not create a string object to handle all the intricacies of
unicode.

Pierre - I presume that this distinction is where I'm crossing over:
variable and similar names just remain simple strings, while 'data'
that is unicode is provided by string objects. These then need to work
nicely with areas that expect a simple string? Where a string object is
returned, an ASCII version will be created when a simple string is
necessary?

The 'leak' of unicode currently into name strings is simply that there
is nothing currently stopping them from storing UTF-8? That this works
is more by luck than design, but results in subtle problems with case
conversion and the like which does not expect unicode strings? BUT
people can currently use any format data in a string even one using a
64 bit pointer as long as it does not go through a path that does
expect ASCII?

If the simple string is isolated from UTF-8 and unicode is kept to
its own data type, such as an improved integrated mbstring package,
then this would make a suitable 'half way' house for PHP6?

I don't NEED unicode variable names, but I can see that this would be
a nice-to-have in non-English speaking countries. In much the same way
we provide translated versions of web pages, I can even see the
advantage of function name aliases in different languages as having
more relevance than simply changing the current English names for
picky reasons, but that is not likely to happen in my lifetime!
Perhaps PHP10 :)

We should not discuss this until Pierre and the rest of us have clarified the
core problems.

cryptocompress

11 years ago by Lester Caine

Crypto Compress wrote:

I'm slowly working through a long list of things relating to unicode strings
trying to work out just where the main problems are.

The very first problem I hit is ICU's limitation to 32bit string lengths. How
does the switch to 64bit string length on 64 bit platforms impinge on this.
While I can see the advantage of this particular change, would that also now
require our own version of ICU capable of also handling longer strings? This
probably falls out in the wash of my next point ...

Where have you found this information? Can you please provide source for this?

This information has already been published in several places on the list and
in the wiki ...
http://userguide.icu-project.org/strings/utf-8 for ICU, and the RFCs here
for the 64 bit improvements to PHP ...

Currently strings are simply strings? I'm sure we have already had this
discussion, and it will be necessary to switch from simple strings to a string
object which can handle the intricacies of unicode?

Yes, currently we have so-called binary strings (simple bytes, 8 bits).
No, we should not create a string object to handle all the intricacies of unicode.

How do you provide a holder for the various additional items required for a
unicode 'object'? While I can see one would get away with calling functions all
the time on a single string object, having calculated different versions of the
same string or complex character counts, they need to be cached so they can be
used again? Or does one maintain each answer in different variables?

--
Lester Caine - G8HFL

11 years ago by Andrea Faulds

How do you provide a holder for the various additional items required for a unicode 'object'? While I can see one would get away with calling functions all the time on a single string object, having calculated different versions of the same string or complex character counts, they need to be cached so they can be used again? Or does one maintain each answer in different variables?

????

What other data? You could stick a Unicode string in the same space we currently stick a byte string, just encode it in UTF-8. That’s all you need.

--
Andrea Faulds
http://ajf.me/

11 years ago by Crypto Compress

Hi,

On 11.03.2014 13:27, Lester Caine wrote:

Crypto Compress wrote:

I'm slowly working through a long list of things relating to unicode
strings trying to work out just where the main problems are.

The very first problem I hit is ICU's limitation to 32bit string
lengths. How does the switch to 64bit string length on 64 bit platforms
impinge on this. While I can see the advantage of this particular
change, would that also now require our own version of ICU capable of
also handling longer strings? This probably falls out in the wash of my
next point ...

Where have you found this information? Can you please provide source
for this?

This information has been published in several places on the list and
in the wiki already ...
http://userguide.icu-project.org/strings/utf-8 for the ICU, and the
RFC's here for 64 bit improvements to PHP ...

Quote #1: "You can request 64 or 32 bits with the --with-library-bits=
option, ..."
Quote #2: "Strings are represented as UChar * as the base string type."

http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-

String length is platform dependent.

Currently strings are simply strings? I'm sure we have already had this
discussion, and it will be necessary to switch from simple strings to a
string object which can handle the intricacies of unicode?

Yes, currently we have so-called binary strings (simple bytes, 8 bits).
No, we should not create a string object to handle all the intricacies
of unicode.

How do you provide a holder for the various additional items required
for a unicode 'object'? While I can see one would get away with
calling functions all the time on a single string object, having
calculated different versions of the same string or complex character
counts, they need to be cached so they can be used again? Or does one
maintain each answer in different variables?

I think of this as an "immutable ValueObject". If a string is converted,
there is no reason to cache the old string.

binary
=> convert to utf-8 as de_de.iso-8859-15@euro
=> {"utf-8", "de_DE_EURO", binary}
=> convert to utf-32
=> {"utf-32", "de_DE_EURO", bigger-binary}

What other data is needed in here for it to be unambiguously unicode?
Is the locale needed at all? Should it be nullable? Case-(in)sensitive?

cryptocompress

11 years ago by Lester Caine

Crypto Compress wrote:

The very first problem I hit is ICU's limitation to 32bit string lengths. How
does the switch to 64bit string length on 64 bit platforms impinge on this.
While I can see the advantage of this particular change, would that also now
require our own version of ICU capable of also handling longer strings? This
probably falls out in the wash of my next point ...

Where have you found this information? Can you please provide source for this?

This information has been published in several places on the list and in the
wiki already ...
http://userguide.icu-project.org/strings/utf-8 for the ICU, and the RFC's here
for 64 bit improvements to PHP ...

Quote #1: "You can request 64 or 32 bits with the --with-library-bits= option, ..."
Quote #2: "Strings are represented as UChar * as the base string type."

http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-

String length is platform dependent.

It is not only PHP that has hidden gems of information buried in the
documentation, but ...
"For UTF-8 strings, ICU normally uses (const) char * pointers and int32_t lengths"

The question here is how the UTF-8 default works in ICU, as we actually want to
avoid using UChar altogether and use UText instead - I think?
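
To put a number on the int32_t length quoted above, here is a small Python sketch of the arithmetic. It assumes the signed 32-bit length really is the binding limit for a UTF-8 buffer handed to ICU; the helper name `fits_int32_length` is made up for the illustration and is not an ICU or PHP function.

```python
# Largest value a signed 32-bit length (int32_t) can hold.
INT32_MAX = 2**31 - 1


def fits_int32_length(byte_length: int) -> bool:
    """Hypothetical check: can this byte length be passed as an int32_t?"""
    return 0 <= byte_length <= INT32_MAX


print(INT32_MAX)                                        # one byte short of 2 GiB
print(fits_int32_length(2 * 1024**3))                   # exactly 2 GiB is too long
print(fits_int32_length(len('café'.encode('utf-8'))))   # an ordinary string fits
```

So any string of 2 GiB or more, which a 64-bit length field in PHP would permit, could not be described by such a length in a single call.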

--
Lester Caine - G8HFL

11 years ago by Crypto Compress

On 12.03.2014 11:16, Lester Caine wrote:

Crypto Compress wrote:

The very first problem I hit is ICU's limitation to 32bit string
lengths. How does the switch to 64bit string length on 64 bit platforms
impinge on this. While I can see the advantage of this particular
change, would that also now require our own version of ICU capable of
also handling longer strings? This probably falls out in the wash of my
next point ...

Where have you found this information? Can you please provide
source for this?

This information has been published in several places on the list
and in the wiki already ...
http://userguide.icu-project.org/strings/utf-8 for the ICU, and the
RFC's here for 64 bit improvements to PHP ...

Quote #1: "You can request 64 or 32 bits with the
--with-library-bits= option, ..."
Quote #2: "Strings are represented as UChar * as the base string type."

http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-

String length is platform dependent.

It is not only PHP that has hidden gems of information buried in the
documentation, but ...
"For UTF-8 strings, ICU normally uses (const) char * pointers and
int32_t lengths"

The question here is how UTF-8 default works in ICU as we want to
actually avoid using UChar altogether using UText instead - I think?

http://www.icu-project.org/apiref/icu4c/utext_8h.html

int64_t utext_nativeLength (UText *ut) Get the length of the text.

Looks like UText is utf-16.

11 years ago by Crypto Compress

On 12.03.2014 11:27, Crypto Compress wrote:

On 12.03.2014 11:16, Lester Caine wrote:

Crypto Compress wrote:

The very first problem I hit is ICU's limitation to 32bit string
lengths. How does the switch to 64bit string length on 64 bit platforms
impinge on this. While I can see the advantage of this particular
change, would that also now require our own version of ICU capable of
also handling longer strings? This probably falls out in the wash of my
next point ...

Where have you found this information? Can you please provide
source for this?

This information has been published in several places on the list
and in the wiki already ...
http://userguide.icu-project.org/strings/utf-8 for the ICU, and the
RFC's here for 64 bit improvements to PHP ...

Quote #1: "You can request 64 or 32 bits with the
--with-library-bits= option, ..."
Quote #2: "Strings are represented as UChar * as the base string type."

http://userguide.icu-project.org/icufaq#TOC-How-do-I-get-32--or-64-bit-versions-of-the-ICU-libraries-

String length is platform dependent.

It is not only PHP that has hidden gems of information buried in the
documentation, but ...
"For UTF-8 strings, ICU normally uses (const) char * pointers and
int32_t lengths"

The question here is how UTF-8 default works in ICU as we want to
actually avoid using UChar altogether using UText instead - I think?

http://www.icu-project.org/apiref/icu4c/utext_8h.html

int64_t utext_nativeLength (UText *ut) Get the length of the
text.

Looks like UText is utf-16.

ICU Text Access allows other formats, such as UTF-8 or non-contiguous
UTF-16 strings, to be placed in a UText wrapper and then passed to ICU
services.

11 years ago by Pierre Joye

On Wed, Mar 12, 2014 at 11:33 AM, Crypto Compress
cryptocompress@googlemail.com wrote:

ICU Text Access allows other formats, such as UTF-8 or non-contiguous
UTF-16 strings, to be placed in a UText wrapper and then passed to ICU
services.

This is running in circles and does not really help us move forward...

Lester has a point with the UTF-8 testing. I am almost done with the
tests code and will publish it soonish.

Also, I do not get your argument earlier in this discussion that we should
not implement objects or pseudo-objects for unicode support. Where is the
problem? It can work with existing functions as well, does not break BC, and
does not introduce weird syntax that prevents code from running in both 5.x
and 6.x (u"foo" would, for example). The more I look at it, the more I think
it is the way.

Cheers,

Pierre

@pierrejoye | http://www.libgd.org

11 years ago by Crypto Compress

On 12.03.2014 11:54, Pierre Joye wrote:

Also I do not get your argument earlier in this discussion saying that
we should not implement objects or pseudo-objects for unicode support.
where is the problem? It can work with existing functions as well,
does not break BC, does not introduce weird syntax that prevents code
from running in 5.x and 6.x (u"foo" will f.e.). The more I look at it,
the more I think it is the way.

On 06.03.2014 08:56, Crypto Compress wrote:

Would "type juggling" allow for autoboxing into a second string type
where needed (unicode-aware functions)?

No, we should not create a string object to handle all the intricacies of
unicode.
I'm against heavyweight objects, not objects per se.

11 years ago by Lester Caine

Pierre Joye wrote:

ICU Text Access allows other formats, such as UTF-8 or non-contiguous
UTF-16 strings, to be placed in a UText wrapper and then passed to ICU
services.

This is running in circle and does not really help to move forwards...

Lester has a point with the UTF-8 testing. I am almost done with the
tests code and will publish it soonish.

Also I do not get your argument earlier in this discussion saying that
we should not implement objects or pseudo-objects for unicode support.
where is the problem? It can work with existing functions as well,
does not break BC, does not introduce weird syntax that prevents code
from running in 5.x and 6.x (u"foo" will f.e.). The more I look at it,
the more I think it is the way.

I think we are both heading to the same point from different ends Pierre? That
is as far as handling unicode data is concerned. It's not so much running in a
circle as the chicken and egg. Select any 3 out of four options to get to the
final answer?

I'm back on windows platform looking at problems there and I had forgotten just
how badly Borland C++ handles widestring, but running ICU there and stripping
that code will work for me! I'm not sure that a library in the middle is needed,
JUST some pseudo-objects to smooth the transition? ICU running in UTF-8 mode
does seem to be the answer, but while I can test C++ builds I'm just not into
the PHP codebase enough to do the sort of testing that is needed :( Conversion
to C++ is something I could deal with ...

Unicode variable names ARE secondary, but if the handling of unicode works as
well as it seems to be for me then it may be an option that can be considered.

--
Lester Caine - G8HFL

11 years ago by Crypto Compress

Unicode variable names ARE secondary, but if the handling of unicode
works as well as it seems to be for me then it may be an option that
can be considered.

http://3v4l.org/kWb0U
Please help me, what is this about?

11 years ago by Lester Caine

Crypto Compress wrote:

Unicode variable names ARE secondary, but if the handling of unicode works as
well as it seems to be for me then it may be an option that can be considered.

http://3v4l.org/kWb0U
Please help me, what is this about?

Exactly what has already been discussed?
You can use unicode strings in many areas of PHP, not by design, but
rather as the result of 'holes' in the design.
Pierre would like to prevent that happening and I can support the principle, but
I suspect that too many people already use this, so there will be something of a
problem :(

--
Lester Caine - G8HFL

11 years ago by Crypto Compress

You can use unicode strings in many areas of PHP, but it is not by
design, but rather as the result of 'holes' in the design.

http://3v4l.org/bJYDM
Oh, i see. Thanks!

11 years ago by Yasuo Ohgaki

Hi all,

On Wed, Mar 12, 2014 at 9:41 PM, Crypto Compress <
cryptocompress@googlemail.com> wrote:

You can use unicode strings in many areas of PHP, but it is not by design,
but rather as the result of 'holes' in the design.

http://3v4l.org/bJYDM
Oh, i see. Thanks!

The spec for variable names is already defined.

As a regular expression, it would be expressed thus:
'[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'
http://www.php.net/manual/en/language.variables.basics.php
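
As a rough illustration of why that pattern lets UTF-8 through, here is a Python sketch (not PHP's actual lexer) that applies the quoted expression to the raw bytes of a name. Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so it falls inside the `\x7f-\xff` range. The helper name `is_valid_php_label` is invented for this example.

```python
import re

# Byte-oriented version of the pattern quoted from the PHP manual.
LABEL = re.compile(rb'[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*')


def is_valid_php_label(name: str) -> bool:
    """Hypothetical helper: test a name (without the leading $) as UTF-8 bytes."""
    return LABEL.fullmatch(name.encode('utf-8')) is not None


print(is_valid_php_label('café'))   # accented letter: bytes 0xC3 0xA9 are in range
print(is_valid_php_label('変数'))    # Japanese name: all bytes are >= 0x80
print(is_valid_php_label('1abc'))   # rejected: starts with a digit
print(is_valid_php_label('a b'))    # rejected: space (0x20) is outside the class
```

The pattern knows nothing about characters, only bytes, which is why it accepts any encoding whose non-ASCII bytes are 0x7f or above.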

Therefore, people are using something more meaningful like

http://3v4l.org/78ek4

It's useful for teaching kids to program in their native language, for
example. I see a lot of code that uses Japanese strings in unit tests.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Rasmus Lerdorf

Crypto Compress wrote:

Unicode variable names ARE secondary, but if the handling of unicode
works as well as it seems to be for me then it may be an option that can
be considered.

http://3v4l.org/kWb0U
Please help me, what is this about?

Exactly what has already been discussed?
You can use unicode strings in many areas of PHP, but it is not by
design, but rather as the result of 'holes' in the design.

That's not a hole in the design. It was quite deliberate and it had
little to do with Unicode at the time. It was a deliberate effort to not
artificially limit identifiers beyond that which the language syntax
naturally prevented. Think <space> ; , { } ( ) etc.

-Rasmus

11 years ago by Crypto Compress

On 13.03.2014 01:01, Rasmus Lerdorf wrote:

Crypto Compress wrote:

Unicode variable names ARE secondary, but if the handling of unicode
works as well as it seems to be for me then it may be an option that can
be considered.

http://3v4l.org/kWb0U
Please help me, what is this about?

Exactly what has already been discussed?
You can use unicode strings in many areas of PHP, but it is not by
design, but rather as the result of 'holes' in the design.

That's not a hole in the design. It was quite deliberate and it had
little to do with Unicode at the time. It was a deliberate effort to not
artificially limit identifiers beyond that which the language syntax
naturally prevented. Think <space> ; , { } ( ) etc.

-Rasmus

IMHO it was the right decision to not artificially limit identifiers, and
it is a fair trade-off for case-insensitivity without unicode (class ß{}
class SS{}).
With unicode identifiers there is at least one more problem to consider,
caused by normalization. Somewhat simplified: $☀☁ and $⛅ (=== in unicode)

11 years ago by Yasuo Ohgaki

Hi all,

On Thu, Mar 13, 2014 at 10:22 AM, Crypto Compress <
cryptocompress@googlemail.com> wrote:

On 13.03.2014 01:01, Rasmus Lerdorf wrote:

Crypto Compress wrote:

Unicode variable names ARE secondary, but if the handling of unicode
works as well as it seems to be for me then it may be an option that can
be considered.

http://3v4l.org/kWb0U
Please help me, what is this about?

Exactly what has already been discussed?
You can use unicode strings in many areas of PHP, but it is not by
design, but rather as the result of 'holes' in the design.

That's not a hole in the design. It was quite deliberate and it had
little to do with Unicode at the time. It was a deliberate effort to not
artificially limit identifiers beyond that which the language syntax
naturally prevented. Think <space> ; , { } ( ) etc.

-Rasmus

IMHO it was the right decision to not artificially limit identifiers and it
is a fair trade-off for case-insensitivity without unicode (class ß{} class
SS{}).
With unicode identifiers there is at least one more problem through
normalization to consider. somewhat simplified: $☀☁ and $⛅ (=== in unicode)

Good point, but users should use NFC UTF-8 without BOM for
variable/function names.
It would be a documentation issue.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Crypto Compress

Hi Yasuo,

That's not a hole in the design. It was quite deliberate and it had
little to do with Unicode at the time. It was a deliberate effort to not
artificially limit identifiers beyond that which the language syntax
naturally prevented. Think <space> ; , { } ( ) etc.

IMHO it was the right decision to not artificially limit identifiers and
it is a fair trade-off for case-insensitivity without unicode (class ß{}
class SS{}).
With unicode identifiers there is at least one more problem through
normalization to consider. somewhat simplified: $☀☁ and $⛅ (=== in unicode)

Good point, but users should use NFC UTF-8 without BOM for
variable/function names.
It would be a documentation issue.

In the languages I know, combining diacritics are not common, so I can't
evaluate how practical it is to type those. Would it be impossible to
change code with a dumb editor?

$café !== $café
0x63 0x61 0x66 0xC3 0xA9
0x63 0x61 0x66 0x65 0xCC 0x81
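
The two byte sequences above are the NFC (precomposed) and NFD (decomposed) forms of the same visible name. A minimal Python sketch using the standard `unicodedata` module reproduces them; `hex_bytes` is a helper invented for the example.

```python
import unicodedata

# Same visible text, two normalization forms.
composed = unicodedata.normalize('NFC', 'cafe\u0301')    # é as one code point, U+00E9
decomposed = unicodedata.normalize('NFD', 'caf\u00e9')   # e followed by combining U+0301


def hex_bytes(text: str) -> str:
    """Hypothetical helper: render the UTF-8 bytes of a string as 0xNN tokens."""
    return ' '.join(f'0x{b:02X}' for b in text.encode('utf-8'))


print(hex_bytes(composed))      # 0x63 0x61 0x66 0xC3 0xA9
print(hex_bytes(decomposed))    # 0x63 0x61 0x66 0x65 0xCC 0x81
print(composed == decomposed)   # False: a byte-wise comparison sees two names
print(unicodedata.normalize('NFC', decomposed) == composed)  # True after normalizing
```

A byte-oriented identifier table treats these as two different variables; only an explicit normalization step makes them compare equal.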

cryptocompress

11 years ago by Lester Caine

Crypto Compress wrote:

Good point, but users should use NFC UTF-8 without BOM for variable/function
names.
It would be a documentation issue.

In the languages I know, combining diacritics are not common, so I can't evaluate
how practical it is to type those. Would it be impossible to change code with a
dumb editor?

$café !== $café
0x63 0x61 0x66 0xC3 0xA9
0x63 0x61 0x66 0x65 0xCC 0x81

'cryptocompress' (is that really on your passport :( )

This is exactly the area we need to agree on a plan moving forward.

There are a number of options on the table

1 - Limit variable and other names to 'ASCII' only characters so that case
folding can be maintained.

2 - Remove 'case insensitivity' but not just for point 1 reasons.
( I see this as your example being two different strings ;) )

3 - Allow unicode names to be used in places where they currently cause problems.

Not actually using unicode variable names myself, I still don't understand where
the problems are with '3' except for the simple comparison case where
normalizing and case conversion creates a minefield? People are currently using
unicode in these areas and understand many of the restrictions?

--
Lester Caine - G8HFL

11 years ago by Crypto Compress

On 13.03.2014 10:18, Lester Caine wrote:

Crypto Compress wrote:

Good point, but users should use NFC UTF-8 without BOM for
variable/function names.
It would be a documentation issue.

In the languages I know, combining diacritics are not common, so I can't
evaluate how practical it is to type those. Would it be impossible to
change code with a dumb editor?

$café !== $café
0x63 0x61 0x66 0xC3 0xA9
0x63 0x61 0x66 0x65 0xCC 0x81

'cryptocompress' (is that really on your passport :( )

This is exactly the area we need to agree on a plan moving forward.

There are a number of options on the table

1 - Limit variable and other names to 'ASCII' only characters so that
case folding can be maintained.

2 - Remove 'case insensitivity' but not just for point 1 reasons.
( I see this as your example being two different strings ;) )

3 - Allow unicode names to be used in places where they currently
cause problems.

Not actually using unicode variable names myself, I still don't
understand where the problems are with '3' except for the simple
comparison case where normalizing and case conversion creates a
minefield? People are currently using unicode in these areas and
understand many of the restrictions?

My unverified assumption is: the performance impact (CLI without
opcache) is too big for us to get this right. If we do not get this right,
people will complain. How shall we die?

Виталий

11 years ago by Stas Malyshev

Hi!

Good point, but users should use NFC UTF-8 without BOM for
variable/function names.

Given that the user has no means to distinguish normal forms and other
invisible details, what you're saying is "stuff would randomly not work,
but we'll make it user's fault because we'd have a small print note in
our docs that says the user has to satisfy conditions for which he has
no means to check". Not the best idea.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

11 years ago by Yasuo Ohgaki

Hi Stas,

On Fri, Mar 14, 2014 at 4:59 AM, Stas Malyshev smalyshev@sugarcrm.com wrote:

Good point, but users should use NFC UTF-8 without BOM for
variable/function names.

Given that the user has no means to distinguish normal forms and other
invisible details, what you're saying is "stuff would randomly not work,
but we'll make it user's fault because we'd have a small print note in
our docs that says the user has to satisfy conditions for which he has
no means to check". Not the best idea.

I agree with your argument in general, but restricting the pattern will break
existing scripts which are working now. If

$café !== $café
0x63 0x61 0x66 0xC3 0xA9
0x63 0x61 0x66 0x65 0xCC 0x81

is an issue, it is already an issue today. Do we really have to care about this
and introduce a BC break? It seems documentation would be enough.

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Andrea Faulds

I agree with your argument in general, but restricting pattern will break
existing scripts which are working now. If

$café !== $café
0x63 0x61 0x66 0xC3 0xA9
0x63 0x61 0x66 0x65 0xCC 0x81

is issue. It's current issue. Do we really have to care about this and
introduce BC? It seems documentation would be enough.

Writing programs in other languages is generally bad practice anyway. I don’t think PHP should encourage it if it will cause problems.

This would introduce BC breaks, yes, but I suppose you could probably write a program to change identifier names.

--
Andrea Faulds
http://ajf.me/

11 years ago by Yasuo Ohgaki

Andrea,

I agree with your argument in general, but restricting pattern will break
existing scripts which are working now. If

$café !== $café
0x63 0x61 0x66 0xC3 0xA9
0x63 0x61 0x66 0x65 0xCC 0x81

is issue. It's current issue. Do we really have to care about this and
introduce BC? It seems documentation would be enough.

Writing programs in other languages is generally bad practise anyway. I
don’t think PHP should encourage it if it will cause problems.

This would introduce BC breaks, yes, but I suppose you could probably
write a program to change identifier names.

Writing test programs in one's native language is common now.
I think this is the influence of RSpec (Ruby's unit test framework).

Regards,

--
Yasuo Ohgaki
yohgaki@ohgaki.net

11 years ago by Lester Caine

Rasmus Lerdorf wrote:

Unicode variable names ARE secondary, but if the handling of unicode
works as well as it seems to be for me then it may be an option that can
be considered.

http://3v4l.org/kWb0U
Please help me, what is this about?

Exactly what has already been discussed?
You can use unicode strings in many areas of PHP, but it is not by
design, but rather as the result of 'holes' in the design.

That's not a hole in the design. It was quite deliberate and it had
little to do with Unicode at the time. It was a deliberate effort to not
artificially limit identifiers beyond that which the language syntax
naturally prevented. Think <space> ; , { } ( ) etc.

It is a 'hole' in so far as it allows unicode strings to be used in places
that one of the proposals to tidy up unicode support would close off.
The design decision at the time, which let unicode through while preventing
other 'invalid' identifiers, was probably correct, but there are a number of
those 'holes' (not just unicode ones) which people are now trying to close,
and identifying the new rules is what we are trying to do for PHP6?

My memory these days is not as good as it used to be and 30 years ago I could
identify line numbers of programs while on the phone to customers. Today I have
trouble remembering the customer rang :( Another reason why these subtle changes
to how things work become more annoying, habit is the long term memory. But I
think that one of the previous discussions on the very point of variable names
was to allow limited use of them? So closing that hole now would be a major BC
break?

--
Lester Caine - G8HFL

11 years ago by Pierre Joye

On Mar 11, 2014 12:06 PM, "Crypto Compress"

We should not discuss this till Pierre/we clarified the core problems.

I seriously hope "we" more than me alone :)

11 years ago by Yasuo Ohgaki

Hi all,

Just FYI.
http://www.w3.org/TR/encoding/
This would be the list of legacy encodings for HTML5.

Regards

--
Yasuo Ohgaki
yohgaki@ohgaki.net
