Is there a PHP 6 wiki page for co-ordinating and collecting ideas for PHP6
development?
There are a lot of ideas being thrown around; it would be nice if we had a
page (there might be one, but I can't find it).
Thanks.
--
Andrew Faulds
http://ajf.me/
Is there a PHP 6 wiki page for co-ordinating and collecting ideas for PHP6
development?
There are a lot of ideas being thrown around; it would be nice if we had a
page (there might be one, but I can't find it).
Thanks.
We had one for the last php6 attempt: https://wiki.php.net/todo/php60, but
we didn't start a new one about/after the recent '6.0 And Moving Forward'
thread.
--
Ferenc Kovács
@Tyr43l - http://tyrael.hu
Ferenc Kovacs wrote:
Is there a PHP 6 wiki page for co-ordinating and collecting ideas for PHP6
development?
There are a lot of ideas being thrown around; it would be nice if we had a
page (there might be one, but I can't find it).
We had one for the last php6 attempt: https://wiki.php.net/todo/php60, but
we didn't start a new one about/after the recent '6.0 And Moving Forward'
thread.
Yes I know this is an old thread, but it is probably indicative of the state of
play? It was last modified in 2006!
Is it worth going through the points there, or would it be better simply to
start with a clean piece of paper and leave this as history?
Yasuo has already brought up, in the phar file name thread, one of the unicode
problems that will need to be addressed. But more fundamentally, I don't think
there was agreement on whether we simply standardise on unicode in the core or
allow a single byte mode? 8 years on, given the amount of utf8 material
floating around, I feel the easiest route IS unicode only? Perhaps with a
compiler switch to disable it if people want the option, rather than an ini
setting? And I only use the 127-character English set :) but probably half the
emails I'm processing have unicode of some sort these days? Certainly all of my
websites require utf8 without any discussion.
Even 64bit integers get a mention near the bottom! Although it is probably worth
flagging that most higher end processors these days do have 256bit SIMD
registers, even if not general 256bit integer arithmetic. So while 128bit has
been mentioned in recent posts, perhaps a little more in-depth coverage is
appropriate if PHP6 is going to be the base for the next 10 years?
Many of the bullet points certainly need updating, but how many have actually
already been addressed in PHP5.X?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Lester Caine wrote (on 14/02/2014):
But more fundamentally, I don't think there was agreement on whether we
simply standardise on unicode in the core or allow a single byte mode?
8 years on, given the amount of utf8 material floating around, I feel the
easiest route IS unicode only?
The question is not whether to be "Unicode only", it's how to
implement Unicode. It's not just a case of making all your strings
wider: every function that manipulates a string in any way has to be
thought through, and every input and output has to be converted to/from
whatever encoding is chosen internally.
While updating the Wikipedia article [1] I came across this slide set
[2], which has a fairly decent explanation of the issues and why the
previous implementation was abandoned.
If somebody comes up with an implementation proposal of Unicode strings,
whether to have a mode that doesn't use it can be discussed, but right
now there doesn't seem to be such a live proposal.
[1] http://en.wikipedia.org/wiki/PHP#PHP_6_and_Unicode
[2]
http://www.slideshare.net/andreizm/the-good-the-bad-and-the-ugly-what-happened-to-unicode-and-php-6
Regards,
Rowan Collins
[IMSoP]
Rowan Collins wrote:
If somebody comes up with an implementation proposal of Unicode strings, whether
to have a mode that doesn't use it can be discussed, but right now there doesn't
seem to be such a live proposal.
I think it is now accepted that the mistake was UTF16?
Personally I always thought it was the wrong choice, as other projects had
already shown, and so it was never likely to work.
If we look at UTF8 as a starting point, then in the large majority of places all
that results is longer strings? Modern tools will just display them, and the
bulk of PHP simply works without a problem already? I've mentioned one point
already in other threads ... if you are simply looking to match a string, then
equal/not-equal is all that is required. The current compare supplies 'order'
as well, but in many cases this is simply not needed? Drop the 'order' and it
does not matter if the string has strange characters in it.
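For illustration, a minimal C sketch of that equality-only idea (assuming
both inputs are already valid UTF-8; note that canonically equivalent strings
in different normalization forms will still compare unequal, which ties into
the NFC/NFD point below):

    #include <string.h>

    /* UTF-8 encodes each code point in exactly one way, so plain byte
       equality is code-point equality - no decoding, no collation data
       and no locale needed. */
    static int utf8_equal(const char *a, size_t alen,
                          const char *b, size_t blen)
    {
        return alen == blen && memcmp(a, b, alen) == 0;
    }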
The slide show recognises that converting to UTF8 and then back again is
something that simply slots in on the periphery, and so should just work. I
would throw out any of the date and currency styling ... that is a different
problem, is already covered well, and can return unicode strings if required!
To be honest I can't see why these are bundled with the unicode problem at all?
This just leaves 'sort' and more complex string handling? I accept that the
major brick wall here is 'case-insensitivity', since unicode string length may
well change when making a conversion. That is not exactly a 'PHP' problem, but a
fact of life with the languages we are dealing with? strtolower already has
problems with the more complex single byte character sets, but there is no
reason it could not follow the unicode-defined rules as a starting point?
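As a strictly limited starting point, something like this hypothetical helper
would already be safe on UTF-8 input, precisely because UTF-8 continuation
bytes are all >= 0x80 and are left untouched (only ASCII letters are folded;
real Unicode case rules need tables, e.g. from ICU or utf8proc):

    #include <ctype.h>

    /* ASCII-only case-insensitive compare of two NUL-terminated UTF-8
       strings: non-ASCII bytes are compared verbatim, so multi-byte
       sequences are never corrupted. */
    static int utf8_ascii_casecmp(const char *a, const char *b)
    {
        while (*a && *b) {
            unsigned char ca = (unsigned char)*a, cb = (unsigned char)*b;
            if (ca < 0x80) ca = (unsigned char)tolower(ca);
            if (cb < 0x80) cb = (unsigned char)tolower(cb);
            if (ca != cb) return ca - cb;
            a++; b++;
        }
        return (unsigned char)*a - (unsigned char)*b;
    }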
Moving to 'character handling' specifically, I have always viewed that as a
place where, 'under the hood', the string being handled becomes UTF32, so that
we are back looking at individual characters rather than 'bytes'? But I am
getting out of my own depth when the character is fabricated from more than one
unicode code point - the bit that Yasuo outlined earlier re NFC/NFD
normalization? This is another variation on the 'case-insensitive' problem? If
an accent is added separately to a base character, then I would tend to default
to combining them when processing a string, but that may not be correct ...
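For what it's worth, a minimal C sketch of that 'UTF32 under the hood' view
(a hypothetical helper that assumes already-validated UTF-8, with no error
recovery or overlong-sequence checks):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one UTF-8 sequence to a code point (effectively UTF-32).
       Returns the number of bytes consumed, or 0 on a bad lead byte. */
    static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
    {
        if (s[0] < 0x80) { *cp = s[0]; return 1; }
        if ((s[0] & 0xE0) == 0xC0) {
            *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0) {
            *cp = ((uint32_t)(s[0] & 0x0F) << 12)
                | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        }
        if ((s[0] & 0xF8) == 0xF0) {
            *cp = ((uint32_t)(s[0] & 0x07) << 18)
                | ((uint32_t)(s[1] & 0x3F) << 12)
                | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
            return 4;
        }
        return 0;
    }

Iterating this across a string gives the code-point view; it still says
nothing about combining sequences, which is exactly the NFC/NFD issue.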
The bottom line is - I think - that we already know what works and what does
not, and we can define a 'default' simple sort, where case-sensitivity is
restricted to fixed-length string conversions, to get a base system. Moving to
sort routines that respect different locales is then a layer on top? But all of
the groundwork for a default system does already exist? We have the filters to
convert to and from UTF8, and we have all the basic processes for handling utf8
strings. It is just a matter of agreeing on a method of pulling it together?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Lester Caine wrote (on 14/02/2014):
But more fundamentally, I don't think there was agreement on whether we
simply standardise on unicode in the core or allow a single byte mode?
8 years on, given the amount of utf8 material floating around, I feel the
easiest route IS unicode only?
The question is not whether to be "Unicode only", it's how to
implement Unicode. It's not just a case of making all your strings
wider: every function that manipulates a string in any way has to be
thought through, and every input and output has to be converted to/from
whatever encoding is chosen internally.
While updating the Wikipedia article [1] I came across this slide set
[2], which has a fairly decent explanation of the issues and why the
previous implementation was abandoned.
If somebody comes up with an implementation proposal of Unicode strings,
whether to have a mode that doesn't use it can be discussed, but right
now there doesn't seem to be such a live proposal.
What we really need is an awesome small and fast Unicode library that
does everything ICU does but faster and in less code while using UTF-8
as its internal storage so we don't have to convert on each and every
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for example.
-Rasmus
Rasmus Lerdorf wrote:
What we really need is an awesome small and fast Unicode library that
does everything ICU does but faster and in less code while using UTF-8
as its internal storage so we don't have to convert on each and every
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for example.
Surely the bottom line is that, to cover every fine detail, ICU has to be used,
as the smaller libraries tend to make simplifying assumptions to make life
easy? But my point was that most of the time you only need the simple stuff?
Simply using UTF8 strings in place of the byte-based ones in all of the
relevant string functions? Remove the need to 'lowercase' by dropping
case-insensitivity and things are simplified somewhat? I've finally found the
comment I was looking for while searching around ... "UTF-8 is specially
designed so that many byte-oriented string functions continue to work or only
need minor modifications."
This is why people can put unicode characters in many places in PHP now without
it actually breaking?
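A trivial C demonstration of that property - because continuation bytes are
all >= 0x80, a byte-oriented search can never produce a false match inside a
multi-byte character:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *haystack = "caf\xC3\xA9 au lait"; /* "cafe" + acute */
        const char *hit = strstr(haystack, "au");     /* plain byte search */
        printf("found at byte offset %ld\n", (long)(hit - haystack));
        return 0; /* prints 6: the 2-byte e-acute just shifts the offset */
    }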
I've seen a few comments about switching to C++, and
http://utfcpp.sourceforge.net/ caught my eye, but
http://www.public-software-group.org/utf8proc-documentation came to light when
I started looking at NFD/NFC, and I've been looking for a suitable unicode
string handler for doing substring clipping and all of that. Am I right in
thinking that mbstring is basically overkill if everything being worked with
has already been converted to UTF8? While I was aware of accent code points,
I'd not quite appreciated how complicated they can get. Up until now I've just
been looking at text cut and pasted from UTF8 messages.
If one simply ignores the transcoding in and out, leaving the core to handle
only clean UTF8 strings, what non-trivial things are left? Could this be a
candidate for a SOC project?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Lester Caine wrote:
If one simply ignores the transcoding in and out, leaving the core to handle
only clean UTF8 strings, what non-trivial things are left? Could this be a
candidate for a SOC project?
Has anybody looked at the U_CHARSET_IS_UTF8 flag in ICU?
The only reference I've found to it is on the UTF-8 page of the ICU site,
http://userguide.icu-project.org/strings/utf-8, but it would seem to be their
attempt to remove the overheads of UTF-16 conversions when the base character
set is already UTF-8? The bit I'm having trouble with is its link to
UCONFIG_NO_CONVERSION, which would seem to disable any conversion filter; but
we still want to convert into and export from UTF-8 in the outside world, so I
don't see why that is appropriate?
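From that page, it appears to be a plain compile-time define rather than a
runtime setting - something like the following (an assumption based on the
linked documentation, not tested here):

    /* set before any ICU header is included, or globally via
       CPPFLAGS=-DU_CHARSET_IS_UTF8=1 when building */
    #define U_CHARSET_IS_UTF8 1
    #include <unicode/ustring.h>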
U_WCHAR_IS_UTF32 would seem to simplify codepoint-based activity by using
UTF-32 'strings' when looking at character-based processing. This is how I've
been viewing 'character'-based string handling anyway, rather than introducing
the problems UTF-16 seems to create, but I'm not sure what happens on
Windows-based platforms: it seems UTF-16 is the default for the Windows API
in ICU.
My simplistic view is that there are basically three string lengths (a quick
sketch of the first two follows the list) ...
1/ Number of bytes for the buffer
2/ Number of code points ( characters + control and embellishment ? )
3/ Number of glyphs ( option to display or hide control codes as in ASCII )
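To make the first two of those concrete, a minimal C sketch (assuming valid
UTF-8; the glyph count needs real Unicode data, e.g. grapheme rules from ICU
or utf8proc, so it is not attempted here):

    #include <stdio.h>
    #include <string.h>

    /* Code points are counted by skipping UTF-8 continuation bytes
       (10xxxxxx); the byte length is just strlen(). The two diverge
       as soon as any non-ASCII character appears. */
    static size_t utf8_codepoints(const char *s)
    {
        size_t n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        return n;
    }

    int main(void)
    {
        const char *s = "na\xC3\xAFve"; /* "naive" with a diaeresis */
        printf("%zu bytes, %zu code points\n",
               strlen(s), utf8_codepoints(s)); /* 6 bytes, 5 code points */
        return 0;
    }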
But this has now been confused by the introduction of NFD/NFC/NFKC/NFKD, which
will vary all of the above in some cases? Being somewhat linguistically
challenged - while I understand concepts such as accents - would standardising
on say NFD help with actions like lower/upper conversion, or does accenting a
character sometimes change its alphabetical order, so that collations need the
'NFC' form to sort by?
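To illustrate the normalization point with actual bytes - the same visible
character in two canonical forms, which naive byte comparison treats as
different strings:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *nfc = "\xC3\xA9";  /* NFC: U+00E9, one code point */
        const char *nfd = "e\xCC\x81"; /* NFD: 'e' + combining acute  */
        printf("bytes: %zu vs %zu, equal: %s\n",
               strlen(nfc), strlen(nfd),
               strcmp(nfc, nfd) == 0 ? "yes" : "no");
        return 0; /* prints: bytes: 2 vs 3, equal: no */
    }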
I think what is clear is that while there may be a single 'UTF-8' writing
standard, sorting collations are even more diverse than the previous codesets?
Firebird has always managed COLLATION as a separate filter to CHARACTER SET,
and allows individual fields to have their own COLLATION, so we can index on
different languages within the one table. I'm thinking that this may be
required when adding sorting to a UTF-8 based setup? Rather than specifying
'encoding', one simply specifies 'collation' where it varies from the basic
rules?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
What we really need is an awesome small and fast Unicode library that
does everything ICU does but faster and in less code while using UTF-8
as its internal storage so we don't have to convert on each and every
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for
example.
http://www.public-software-group.org/utf8proc-documentation looks
interesting. There are others, but it has to be chosen very carefully :)
Cheers,
Pierre
Pierre Joye wrote:
What we really need is an awesome small and fast Unicode library that
does everything ICU does but faster and in less code while using UTF-8
as its internal storage so we don't have to convert on each and every
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for
example.
http://www.public-software-group.org/utf8proc-documentation looks
interesting. There are others, but it has to be chosen very carefully :)
I'd looked at that last night.
http://lsces.co.uk/wiki/PHP6+unicode+core is the start of a crib sheet, but
searching with Google really is painful, so any other links would be welcome!
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
What we really need is an awesome small and fast Unicode library that
does everything ICU does but faster and in less code while using UTF-8
as its internal storage so we don't have to convert on each and every
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for
example.
http://www.public-software-group.org/utf8proc-documentation looks
interesting. There are others, but it has to be chosen very carefully :)
https://github.com/josephg/librope claims to be fast and compliant. Added
to my list.
Cheers,
Pierre
Pierre Joye wrote:
What we really need is an awesome small and fast Unicode library that
does everything ICU does but faster and in less code while using UTF-8
as its internal storage so we don't have to convert on each and every
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for
example.
http://www.public-software-group.org/utf8proc-documentation looks
interesting. There are others, but it has to be chosen very carefully :)
https://github.com/josephg/librope claims to be fast and compliant. Added
to my list.
If I'm reading that correctly, it does the substring stuff on
already-converted UTF8 strings? It would need even a UTF16 string to be
converted to UTF8 first to work ... I think?
--
Lester Caine - G8HFL
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Hi!
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for example.
Oh yes, and if somebody thinks case sensitivity is weird now, wait until
Unicode gets into play. There, for some chars, the string length changes when
you change the case, and for some the conversion is not roundtrip-safe.
And you have various long-form/short-form combining issues, which means
you need to normalize everything at every corner. So letting Unicode
into things like identifiers opens a huge container of worms.
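A concrete illustration in C, using the classic German sharp-s example (the
uppercase form is hardcoded here for the demo; a real implementation would
take it from the Unicode case-mapping tables):

    #include <stdio.h>

    /* count code points by skipping UTF-8 continuation bytes */
    static int cp_len(const char *s)
    {
        int n = 0;
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80) n++;
        return n;
    }

    int main(void)
    {
        const char *lower = "stra\xC3\x9F" "e"; /* U+00DF sharp s */
        const char *upper = "STRASSE";          /* sharp s uppercases to "SS" */
        printf("%d -> %d code points\n", cp_len(lower), cp_len(upper));
        return 0; /* 6 -> 7, and lowercasing back gives "strasse" */
    }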
Also, if one wants to appreciate what other cans of worms are hiding
there, I recommend this oldie but goodie:
http://stackoverflow.com/a/6163129/214196
It's about Perl, but we'd have many of the same issues.
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227
Hi Stas,
On Mon, Feb 17, 2014 at 12:13 PM, Stas Malyshev smalyshev@sugarcrm.com wrote:
operation. There are a ton of non-obvious things beyond simple string
manipulation. String collation alone is massively complicated, for
example.
Oh yes, and if somebody thinks case sensitivity is weird now, wait until
Unicode gets into play. There, for some chars, the string length changes when
you change the case, and for some the conversion is not roundtrip-safe.
And you have various long-form/short-form combining issues, which means
you need to normalize everything at every corner. So letting Unicode
into things like identifiers opens a huge container of worms.
Also, if one wants to appreciate what other cans of worms are hiding
there, I recommend this oldie but goodie:
http://stackoverflow.com/a/6163129/214196
It's about Perl, but we'd have many of the same issues.
Nice article. I mostly agree.
"Code that converts unknown characters to ? is broken, stupid, braindead,
and runs contrary to the standard recommendation, which says NOT TO DO
THAT! RTFM for why not."
While I agree with this (it's BAD to accept broken text as valid input),
there are situations where a programmer has to handle broken text. Ruby
finally admitted that a scrub method is needed; it's available from Ruby
2.1.0.
Regards,
--
Yasuo Ohgaki
yohgaki@ohgaki.net
Is there a PHP 6 wiki page for co-ordinating and collecting ideas for PHP6
development?
There are a lot of ideas being thrown around; it would be nice if we had a
page (there might be one, but I can't find it).
Feel free to create a new php5++ one.
We should keep the history of what's been done on the old PHP60 page.
Perhaps it is better to rename php60 to php60ld and get php60 back alive?
Julien
Julien Pauli wrote (on 14/02/2014):
Is there a PHP 6 wiki page for co-ordinating and collecting ideas for PHP6
development?
There are a lot of ideas being thrown around; it would be nice if we had a
page (there might be one, but I can't find it).
Feel free to create a new php5++ one.
+1 for calling it "PHP 5++", with the first item on the todo list being
"decide on version number".
I have a book next to me titled "PHP 6 and MySQL 5"...
We should keep the history of what's been done on the old PHP60 page.
Perhaps it is better to rename php60 to php60ld and get php60 back alive?
How about "php6_2010", which is when the Unicode implementation was
officially abandoned?
--
Rowan Collins
[IMSoP]
Julien Pauli wrote (on 14/02/2014):
Is there a PHP 6 wiki page for co-ordinating and collecting ideas for PHP6
development?
There are a lot of ideas being thrown around; it would be nice if we had a
page (there might be one, but I can't find it).
Feel free to create a new php5++ one.
+1 for calling it "PHP 5++", with the first item on the todo list being
"decide on version number".
I have a book next to me titled "PHP 6 and MySQL 5"...
That's what happens when some authors want to run faster than time.
Silly people...
We should keep the history of what's been done on the old PHP60 page.
Perhaps it is better to rename php60 to php60ld and get php60 back alive?
How about "php6_2010", which is when the Unicode implementation was
officially abandoned?
Why not; it's just a name anyway, nothing really important.
Julien