[RFC] Replace the flex-based scanner with an re2c [1] based lexer

18 years ago by Stanislav Malyshev

Hi!

be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere.
So what happens if you wanted to use it on some exotic system where re2c
is not readily available as maintainer-supported software? Also, flex
is available on Windows for example as part of cygwin, while I don't see
re2c there.
I understand this can be of low importance since we keep generated files
in our repositories, but I think we still have to keep it in mind.
I understand also current patch requires non-release version of re2c -
maybe we should have some release version at least until we make PHP
depend on it?

Current state:
Flex has been fully replaced by re2c in Zend. We have also switched to an
mmap-based lexer approach for now. However, we had to drop multibyte support

Were the stream support issues solved?

as well as the encoding declare. The current state can be checked out from
Scott's subversion repository [3] and you can follow the development on his
Trac setup [4]. When you want to build php with re2c, then you need to grab
re2c from its sourceforge subversion repository [5]. You can also check out
the changes in a patch created Sunday 2nd March against a PHP checkout from
14th February [6].

Further steps:
Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
multibyte support with libintl.

Note - pecl/intl does nothing towards multibyte support etc., at least
for now. If there are volunteers to change that, it can be discussed, but
so far it is for doing entirely other things (locale-dependent
functionality mostly).
So, I think before re2c parser can be merged the issue with multibyte
compatibility must be solved - otherwise it will make the users that
rely on it unable to use newer PHP. As cool as 20% faster is, I think we
can't drop support for such feature, especially not in 5.3.

Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
After that is done, decide about multibyte support. Along with the commit to
the 5.3 branch there will be a new re2c version available.

I think we first need to figure out what happens to multibyte support,
and not commit anything before we have it figured out. Multibyte support
is an important piece of functionality for some PHP users, and it works
now. Breaking it without providing any alternative would be a problem,
especially now that 5.3 is mostly ready for the release cycle and solving
the multibyte problems with re2c may take an undefined amount of time, as
far as I understand. I do not think it would be acceptable to release 5.3
without multibyte support, so the options here are either to merge it now
and have 5.3 wait until MB is figured out, or to try to figure it out
before the commit and, if we can't in a reasonable time, go forward with
5.3 and defer the parser change to 5.4.

Again, while I think the speedup is great and congratulate Marcus, Nuno
and Scott on great work, I think we should keep in mind we have working
parser right now and changing it in an incompatible way is very
high-risk and should not be taken hastily.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Rasmus Lerdorf

Stanislav Malyshev wrote:

Hi!

be much easier, switching to re2c promises a much faster lexer.
Actually,
without any specific re2c optimizations we already get around a 20%
scanner

I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere.
So what happens if you wanted to use it on some exotic system where
re2c is not readily available as maintainer-supported software? Also,
flex is available on Windows for example as part of cygwin, while I
don't see re2c there.
I don't think this part is a concern since we have required re2c for
quite a while now to build many critical parts of PHP. People who
actually need to regenerate the parser files are the same people for
whom it is trivial to figure out how to install re2c. And yes, it would
of course be good to use a released version of re2c, but I think by the
time 5.3 is ready to go the version of re2c we need will be out there.
Since it is Marcus' baby, he can just push it out, I don't think this is
a stumbling block either. Some of the new stuff in re2c was
specifically added to make it easier to write a PHP parser, so I don't
think backporting to an older version is really an option.

-Rasmus

18 years ago by Marcus Boerger

Hello Rasmus,

Monday, March 3, 2008, 12:25:52 AM, you wrote:

Stanislav Malyshev wrote:

Hi!

be much easier, switching to re2c promises a much faster lexer.
Actually,
without any specific re2c optimizations we already get around a 20%
scanner

I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere.
So what happens if you wanted to use it on some exotic system where
re2c is not readily available as maintainer-supported software? Also,
flex is available on Windows for example as part of cygwin, while I
don't see re2c there.
I don't think this part is a concern since we have required re2c for
quite a while now to build many critical parts of PHP. People who
actually need to regenerate the parser files are the same people for
whom it is trivial to figure out how to install re2c. And yes, it would
of course be good to use a released version of re2c, but I think by the
time 5.3 is ready to go the version of re2c we need will be out there.
Since it is Marcus' baby, he can just push it out, I don't think this is
a stumbling block either. Some of the new stuff in re2c was
specifically added to make it easier to write a PHP parser, so I don't
think backporting to an older version is really an option.

Right. The current re2c development cycle is dedicated solely to making it
possible to rewrite the PHP scanners. I will update re2c whenever necessary
during the remaining development cycle and publish a new stable release
before we release PHP 5.3.

Best regards,
Marcus

18 years ago by Stanislav Malyshev

I don't think this part is a concern since we have required re2c for
quite a while now to build many critical parts of PHP. People who

Ok, great then - only issue remaining is the multibyte support.

--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Marcus Boerger

Hello Stanislav,

Sunday, March 2, 2008, 11:47:57 PM, you wrote:

Hi!

be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere.
So what happens if you wanted to use it on some exotic system where re2c
is not readily available as maintainer-supported software? Also, flex
is available on Windows for example as part of cygwin, while I don't see
re2c there.
I understand this can be of low importance since we keep generated files
in our repositories, but I think we still have to keep it in mind.
I understand also current patch requires non-release version of re2c -
maybe we should have some release version at least until we make PHP
depend on it?

Well, re2c works on a very large number of systems, can easily be built,
and comes with a ready-to-download Windows executable. Furthermore, all major
distributions have re2c packages. Given that we also store the generated
files in CVS, I see no issue at all in this regard.

Current state:
Flex has been fully replaced by re2c in Zend. We have also switched to an
mmap-based lexer approach for now. However, we had to drop multibyte support

Were the stream support issues solved?

We completely dropped multibyte support. The reason is the way we were
doing it: we constantly switch between the full original and a
recoded duplicate that simply ignores multibyte (or any encoding at all).
Once we have finished the move to re2c, we can support all of those
correctly. The multibyte support also duplicated the encoding tables
otherwise available in ext/mbstring or ext/iconv or pecl/intl.

as well as the encoding declare. The current state can be checked out from
Scott's subversion repository [3] and you can follow the development on his
Trac setup [4]. When you want to build php with re2c, then you need to grab
re2c from its sourceforge subversion repository [5]. You can also check out
the changes in a patch created Sunday 2nd March against a PHP checkout from
14th February [6].

Further steps:
Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
multibyte support with libintl.

Note - pecl/intl does nothing towards multibyte support etc., at least
for now. If there are volunteers to change that, it can be discussed, but
so far it is for doing entirely other things (locale-dependent
functionality mostly).

Yes I know. However pecl/intl brings in a php/icu bridge which we can build
on.

So, I think before re2c parser can be merged the issue with multibyte
compatibility must be solved - otherwise it will make the users that
rely on it unable to use newer PHP. As cool as 20% faster is, I think we
can't drop support for such feature, especially not in 5.3.

Rely on a not supported undocumented feature? I am rather able to build php
and rewrite that support.

Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
After that is done, decide about multibyte support. Along with the commit to
the 5.3 branch there will be a new re2c version available.

I think we first need to figure out what happens to multibyte support,
and not commit anything before we have it figured out. Multibyte support
is an important piece of functionality for some PHP users, and it works
now. Breaking it without providing any alternative would be a problem,
especially now that 5.3 is mostly ready for the release cycle and solving
the multibyte problems with re2c may take an undefined amount of time, as
far as I understand. I do not think it would be acceptable to release 5.3
without multibyte support, so the options here are either to merge it now
and have 5.3 wait until MB is figured out, or to try to figure it out
before the commit and, if we can't in a reasonable time, go forward with
5.3 and defer the parser change to 5.4.

Again, while I think the speedup is great and congratulate Marcus, Nuno
and Scott on great work, I think we should keep in mind we have working
parser right now and changing it in an incompatible way is very
high-risk and should not be taken hastily.

You are free to contribute and make MB support working upfront.

Best regards,
Marcus

18 years ago by Stanislav Malyshev

Hi!

Were the stream support issues solved?

We completely dropped multibyte support. The reason is that the way we were

I wasn't asking about multibyte (that we discuss below), but about other
streams - I think I mentioned it on IRC last time re2c parser was
discussed. I remember re2c used mmap, and not all files PHP can run can
be mmap'ed. Was it fixed?

Once we have finished the move to re2c, we can support all of those
correctly. The multibyte support also duplicated the encoding tables
otherwise available in ext/mbstring or ext/iconv or pecl/intl.

pecl/intl per se doesn't have any encoding tables. ICU does, but that
would mean you have to have ICU to run PHP. That might not be a big
problem, since ICU is supported by IBM (read: good chance more "exotic"
systems would have support), but it is still a dependency on a non-bundled
3rd party library in PHP 5 core. Of course, PHP 6 has this dependency, but
we might want to not have such things in 5.x so that you won't have to
change your system too much while staying on 5.x.

Rely on a not supported undocumented feature? I am rather able to build php
and rewrite that support.

Being undocumented is nothing to be proud of, however as poorly
documented as it is, it is used. I'm all for implementing it in a better
way - and having new parser is a good time to do it. That's exactly the
reason we shouldn't rush with it but do it right this time. There's no
burning need to have a new parser right now, so we can have some moment
to think - ok, how we want multibyte support there to work? And if we
might need some modifications, we'd have time and flexibility to do it,
not having the code in 5.3 which was supposed to go in RC in Q1 (ending
1 month from now).

You are free to contribute and make MB support working upfront.

I know I'm free :) However, as much as I understand the eagerness of
having it in the source tree, I repeat that I do not think dropping
multibyte support in 5.3 is acceptable. Thus, if it is committed right
now, 5.3 would have to be deferred until this is resolved. If this is
resolved timely for 5.3 - great. If not, we better get it in 5.4 right
than in 5.3 wrong.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Marcus Boerger

Hello Stanislav,

Monday, March 3, 2008, 5:39:35 AM, you wrote:

Hi!

Were the stream support issues solved?

We completely dropped multibyte support. The reason is that the way we were

I wasn't asking about multibyte (that we discuss below), but about other
streams - I think I mentioned it on IRC last time re2c parser was
discussed. I remember re2c used mmap, and not all files PHP can run can
be mmap'ed. Was it fixed?

Ah, you didn't write that so I got confused. Anyway, what we are doing is
the following, in order:

1. If mmap is supported, then use it.
2. If mmap is not supported or does not work, then read the whole stream.
3. If that is not possible, read char by char.

The flex based scanner reads in smaller chunks or char by char, so it is
more or less always like case 3.
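
The fallback order above can be sketched roughly as follows. This is only an illustration in Python, not the C code from the patch; the function name `load_source` and the strategy labels it returns are hypothetical, and the real implementation sits inside the Zend stream handlers.

```python
import mmap


def load_source(stream):
    """Return (strategy, data) for a binary stream, cheapest option first.

    Hypothetical helper: the name and the strategy labels are made up
    for this sketch and do not appear in the patch.
    """
    # 1. Try to memory-map the input. This needs a real file descriptor
    #    and a non-empty file, so it fails for pipes, sockets, wrappers...
    try:
        fd = stream.fileno()
        return "mmap", mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    except (AttributeError, OSError, ValueError):
        pass

    # 2. mmap is unavailable or failed: read the whole stream in one call.
    try:
        return "read_all", stream.read()
    except (OSError, MemoryError):
        pass

    # 3. Last resort: pull the input one byte at a time.
    buf = bytearray()
    while True:
        ch = stream.read(1)
        if not ch:
            break
        buf += ch
    return "char_by_char", bytes(buf)
```

In all three cases the scanner ends up with one contiguous buffer, which is what lets it avoid refilling while it scans.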

Once we have finished the move to re2c, we can support all of those
correctly. The multibyte support also duplicated the encoding tables
otherwise available in ext/mbstring or ext/iconv or pecl/intl.

pecl/intl per se doesn't have any encoding tables. ICU does, but that
would mean you have to have ICU to run PHP. That might not be a big
problem, since ICU is supported by IBM (read: good chance more "exotic"
systems would have support), but it is still a dependency on a non-bundled
3rd party library in PHP 5 core. Of course, PHP 6 has this dependency, but
we might want to not have such things in 5.x so that you won't have to
change your system too much while staying on 5.x.

Are you saying we cannot depend on ICU in PHP 6 and have to redo it
completely or what?

Rely on a not supported undocumented feature? I am rather able to build php
and rewrite that support.

Being undocumented is nothing to be proud of, however as poorly
documented as it is, it is used. I'm all for implementing it in a better
way - and having new parser is a good time to do it. That's exactly the
reason we shouldn't rush with it but do it right this time. There's no
burning need to have a new parser right now, so we can have some moment
to think - ok, how we want multibyte support there to work? And if we
might need some modifications, we'd have time and flexibility to do it,
not having the code in 5.3 which was supposed to go in RC in Q1 (ending
1 month from now).

You are free to contribute and make MB support working upfront.

I know I'm free :) However, as much as I understand the eagerness of
having it in the source tree, I repeat that I do not think dropping
multibyte support in 5.3 is acceptable. Thus, if it is committed right
now, 5.3 would have to be deferred until this is resolved. If this is
resolved timely for 5.3 - great. If not, we better get it in 5.4 right
than in 5.3 wrong.

I don't see a problem with redoing multibyte support in a usable way.
Actually, we had better redo it anyway, because the current solution is a
very bad one. That is, it duplicates the input and uses a flattening
filter to always scan an eight-bit input stream. Then, when something
needs to get pushed to the output, we recalculate the position in the
original input and use that part. By changing to re2c we can do a very
easy solution. When requested or detected per BOM, we switch to a second
version of the scanner that works on unsigned int and supports the full
Unicode character set (the only thing to do for re2c is to switch the input
type and, guess what, this is already in production on a lot of systems).

Other approaches are to natively support UTF-8 and UTF-16 besides 8 bit
and UTF-32. Furthermore, we can apply any kind of filtering correctly on
top of the UTF-* scanner.
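
To make the BOM-based switch concrete, here is a minimal sketch in Python. It is an assumption about how such a switch could look, not code from the patch: `detect_bom` and `scanner_symbols` are hypothetical names, and the list of integers stands in for the "unsigned int" input that the second scanner would consume.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def scanner_symbols(data):
    """Return the integer symbols a scanner would see for this input."""
    encoding, skip = detect_bom(data)
    if encoding is None:
        # No BOM: stay on the 8-bit scanner, one byte per symbol.
        return list(data)
    # BOM found: switch to the wide scanner, one code point per symbol.
    return [ord(ch) for ch in data[skip:].decode(encoding)]
```

The point of the design is that the scanner rules stay the same and only the symbol type changes, so no position recalculation against a duplicated input is needed.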

I know there is some work left, but if we do not apply the work now then
we basically have two engines. In that case I'll just rewrite the engine
completely and replace it.

Best regards,
Marcus

18 years ago by Stanislav Malyshev

Hi!

If mmap is supported, then use it

If mmap is not supported or does not work then read the whole stream

If that is not possible read char by char

Why should it read the whole stream into memory? The file could be very
big, maybe it would make more sense to read it in chunks?
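
For comparison, chunked reading could look like the sketch below (Python, purely illustrative; `read_in_chunks` and the default chunk size are hypothetical and not part of the patch). The catch for a lexer is that a token can straddle two chunks, so the scanner has to carry over the unfinished tail, which is the kind of buffer refill that re2c exposes through its YYFILL hook.

```python
def read_in_chunks(stream, chunk_size=8192):
    """Yield successive chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

Memory use stays bounded by the chunk size, at the price of extra bookkeeping at every chunk boundary.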

BTW, I see there are some substantial changes in streams there - new
file type ZEND_HANDLE_MAPPED, new "fsizer" handler. Will they be
described somewhere so other places which deal with streams could be
updated if there's need? Does it have any implication on extensions etc.
or it's isolated?

party library in PHP 5 core. Of course, PHP 6 has this dependency, but
we might want to not have such things in 5.x so that you won't have to
change your system too much while staying on 5.x.

Are you saying we cannot depend on ICU in PHP 6 and have to redo it
completely or what?

Just curious who you were answering to... Anyway, to be clear:

PHP 6 is major version with its major feature being Unicode support.
PHP 5.x is same-major branch, where you are not expected to have to
change your system in order to upgrade.
We do not expect people to take PHP 6 and have absolutely everything
work instantly from PHP 5. We try to minimize upgrade path, but major
version upgrades can take some adjustments.
We expect people to upgrade from 5.2.x to 5.3.x without changing
their systems.

Is it clearer why I think PHP 5.x and 6 are different and why I think
ICU dependency in the 5.3 core might be a problem?

I know there is some work left, but if we do not apply the work now then
we basically have two engines. In that case I'll just rewrite the engine
completely and replace it.

I think better to have two engines one of which we can release than have
no release-able engine at all. Again, if you say you are 100% sure that
can be figured out quickly and not delay 5.3 release cycle to Q2 or Q3 -
great, let's do it. Since we were not going to have any major changes in
the engine until 5.3 is out anyway - the gap between your code and CVS
code won't be substantial. You could use a branch to make it even easier.

As for rewriting the engine - I think that would be just a waste of effort.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Pierre Joye

Hi,

Just curious who you were answering to... Anyway, to be clear:

PHP 6 is major version with its major feature being Unicode support.

PHP 5.x is same-major branch, where you are not expected to have to
change your system in order to upgrade.

We do not expect people to take PHP 6 and have absolutely everything
work instantly from PHP 5. We try to minimize upgrade path, but major
version upgrades can take some adjustments.

We expect people to upgrade from 5.2.x to 5.3.x without changing
their systems.

Is it clearer why I think PHP 5.x and 6 are different and why I think
ICU dependency in the 5.3 core might be a problem?

It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.

--
Pierre
http://blog.thepimp.net | http://www.libgd.org

18 years ago by Stanislav Malyshev

Hi!

It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.

pecl/intl is an extension, there's no surprise that you need external
library when you enable extension. However, adding dependency in core
that you cannot get rid of has a lot of consequences (think distributions,
builds on non-Linux systems, etc., etc.).

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by George Schlossnagle

Hi!

It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.

pecl/intl is an extension, there's no surprise that you need
external library when you enable extension. However, adding
dependency in core that you cannot get rid of has a lot of consequences
(think distributions, builds on non-Linux systems, etc., etc.).

steps in from nowhere

It's just a build dependency, right? And one that's already required
if you want to generate all the internal parsers by hand as part of
your build. If it's really that huge a concern you could ship a
precompiled scanner/lexer.

George

18 years ago by Pierre Joye

Hi!

It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.

pecl/intl is an extension, there's no surprise that you need external
library when you enable extension. However, adding dependency in core
that you cannot get rid of has a lot of consequences (think distributions,
builds on non-Linux systems, etc., etc.).

intl (and related changes) is almost the only reason why one will upgrade to
5.3.x. There is no core (as in zend engine) for 95% of our users.
There is a PHP release with default features which can be relied on.
That's my feeling and experiences on this topic.

That being said, icu is so common these days, I really don't see a
problem to have it as dep. If we were asking for some esoteric
library, I would worry more, obviously :)

--
Pierre
http://blog.thepimp.net | http://www.libgd.org

18 years ago by Stanislav Malyshev

Hi!

intl (and related changes) is almost the only reason why one will upgrade to
5.3.x. There is no core (as in zend engine) for 95% of our users.

From NEWS:

Added and improved PHP syntax and semantics:
. Added NOWDOC. (Gwynne Raskind, Stas, Dmitry)
. Added "?:" operator. (Marcus)
. Added support for namespaces. (Dmitry, Stas, Gregory)
. Added support for Late Static Binding. (Dmitry, Etienne Kneuss)
. Added support for __callstatic() magic method. (Sara)
. Added support for dynamic access of static members using
  $foo::myFunc(). (Etienne Kneuss)
. Improved checks for callbacks. (Marcus)

And that's not counting extension stuff. I of course value a lot the
importance given to intl, but 5.3 IMHO is juicier than just intl :)

--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Pierre Joye

Hi Stan,

Hi!

intl (and related changes) is almost the only reason why one will upgrade to
5.3.x. There is no core (as in zend engine) for 95% of our users.

Sorry I was not clear. I did not say that there is nothing done.
That's why I said "almost" :)

From NEWS:

Added and improved PHP syntax and semantics:
. Added NOWDOC. (Gwynne Raskind, Stas, Dmitry)
. Added "?:" operator. (Marcus)
. Added support for namespaces. (Dmitry, Stas, Gregory)
. Added support for Late Static Binding. (Dmitry, Etienne Kneuss)
. Added support for __callstatic() magic method. (Sara)
. Added support for dynamic access of static members using
  $foo::myFunc(). (Etienne Kneuss)
. Improved checks for callbacks. (Marcus)

And that's not counting extension stuff. I of course value a lot the
importance given to intl, but 5.3 IMHO is juicier than just intl :)

Indeed and that's not yet finished :)

But we have to be realistic, namespaces and icu/intl are the really
appealing language features (and long awaited).

Pierre
http://blog.thepimp.net | http://www.libgd.org

18 years ago by Marcus Boerger

Hello Stanislav,

Monday, March 3, 2008, 8:48:38 PM, you wrote:

Hi!

It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.

pecl/intl is an extension, there's no surprise that you need external
library when you enable extension. However, adding dependency in core
that you cannot get rid of has a lot of consequences (think distributions,
builds on non-Linux systems, etc., etc.).

No one was considering any such move. Having pecl/intl shipped by default
as symlinked into ext would be as optional as --enable-zend-multibyte
or --enable-mbstring are right now. This would be more like bringing in zip
to 5.2. However, it is completely off-topic, as it is just one possible
course of action, while the other is to stick with mbstring.

Best regards,
Marcus

18 years ago by Steph Fox

No one was considering any such move. Having pecl/intl shipped by default
as symlinked into ext would be as optional as --enable-zend-multibyte
or --enable-mbstring are right now. This would be more like bringing in zip
to 5.2. However, it is completely off-topic, as it is just one possible
course of action, while the other is to stick with mbstring.

Intl and mbstring don't share anything like the same functionality...

Steph

18 years ago by Pierre Joye

Hi Marcus,

Hello Stanislav,

Monday, March 3, 2008, 8:48:38 PM, you wrote:

Hi!

It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.

Bad example, it is not symlinked :)

And heh, it would be time to give a break with your zip rant, hmmk? =)

--
Pierre
http://blog.thepimp.net | http://www.libgd.org

18 years ago by Marcus Boerger

Hello Pierre,

Monday, March 3, 2008, 9:31:37 PM, you wrote:

Hi Marcus,

Hello Stanislav,

Monday, March 3, 2008, 8:48:38 PM, you wrote:

Hi!

It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.

Bad example, it is not symlinked :)

And heh, it would be time to give a break with your zip rant, hmmk? =)

Sorry, this wasn't meant at all as a rant. It is just a recent example
where a new extension brought in a new dependency. Though yours comes with a
bundled one, so I actually should have looked for a better example.

Best regards,
Marcus

18 years ago by Marcus Boerger

Hello Stanislav,

Monday, March 3, 2008, 7:59:41 PM, you wrote:

Hi!

If mmap is supported, then use it

If mmap is not supported or does not work then read the whole stream

If that is not possible read char by char

Why should it read the whole stream into memory? The file could be very
big, maybe it would make more sense to read it in chunks?

BTW, I see there are some substantial changes in streams there - new
file type ZEND_HANDLE_MAPPED, new "fsizer" handler. Will they be
described somewhere so other places which deal with streams could be
updated if there's need? Does it have any implication on extensions etc.
or it's isolated?

The streams stuff is meant to integrate with the PHP streams layer, which we
take care of. In the beginning we thought we could live with the old
interface, but that wasn't really a good idea. So I came up with a new
interface, and all that this would break is stuff like Phar (well, there is
as far as I know nothing else like it), and I guess I'll take care of it.
Actually, Phar already works nearly 100%. And since Phar is a moving target
right now, I wouldn't worry too much about it. Also, if an application
decides to do less, then it can provide fewer handlers than it had to earlier.
And well, I could document stuff. But since when is that the PHP way? In
fact, when I document stuff inside the engine I tend to get strange
reactions.

party library in PHP 5 core. Of course, PHP 6 has this dependency, but
we might want to not have such things in 5.x so that you won't have to
change your system too much while staying on 5.x.

Are you saying we cannot depend on ICU in PHP 6 and have to redo it
completely or what?

Just curious who you were answering to... Anyway, to be clear:

PHP 6 is major version with its major feature being Unicode support.

PHP 5.x is same-major branch, where you are not expected to have to
change your system in order to upgrade.

Oh, since when? Where did you read that?

We do not expect people to take PHP 6 and have absolutely everything
work instantly from PHP 5. We try to minimize upgrade path, but major
version upgrades can take some adjustments.

We expect people to upgrade from 5.2.x to 5.3.x without changing
their systems.

Since when? And when did we ever say anything like that? Since when do we
tell people anything about their system?

Is it clearer why I think PHP 5.x and 6 are different and why I think
ICU dependency in the 5.3 core might be a problem?

I don't care at all. So far the plan was to bring in ICU and there is no
need to start a new discussion here. All I said is that we might be able to
do things better. But maybe that is the completely wrong approach and I am
all wrong.

I know there is some work left, but if we do not apply the work now then
we basically have two engines. In that case I'll just rewrite the engine
completely and replace it.

I think better to have two engines one of which we can release than have
no release-able engine at all. Again, if you say you are 100% sure that
can be figured out quickly and not delay 5.3 release cycle to Q2 or Q3 -
great, let's do it.

In software there is nothing like 100%. Or is everything you work on bug
free? Mine isn't. It took me more than two years to make re2c ready for
this task (take this in whatever way you feel).

Since we were not going to have any major changes in
the engine until 5.3 is out anyway - the gap between your code and CVS
code won't be substantial. You could use a branch to make it even easier.

How so? We don't use git or any other CMS that allows merging at all. I
would love to switch to GIT or Subversion but I don't think this is
reasonable given our development group/cycles.

As for rewriting the engine - I think that would be just a waste of effort.

Good, then let's not do that :-)

Best regards,
Marcus

18 years ago by Stanislav Malyshev

Hi!

interface but that wasn't really a good idea. So I came up with a new
interface and all that this would break is stuff like Phar (well there is

If it breaks phar, it may break others too... Anyway, good description
of what was changed won't hurt.

PHP 5.x is same-major branch, where you are not expected to have to
change your system in order to upgrade.
Oh since when? Where did you read that?

Since forever, that's why we have major and minor versions.

I don't care at all. So far the plan was to bring in ICU and there is no

In PHP 6, not 5.3.

In software there is nothing like 100%. Or is everything you work on bug
free? Mine isn't. It took me more than two years to make re2c ready for
this task (take this in whatever way you feel).

I'm not talking about bugs. I'm talking about having compatible engine
implementation. Nobody would require 100% bug-free code, it's not
realistic. Requirement is that scripts that run on 5.2 would run on 5.3,
not counting bugs. I guess you agree having no multibyte support does
not really qualify as a "bug" :) So we are talking about whether putting it
in now might hurt 5.3 release process by postponing it for a long time
or not. If not - great.

How so? We don't use git or any other CMS that allows merging at all. I

CVS allows merging. I did it a lot of times. Of course, there could be
conflicts, but the engine is quite static now, so I don't foresee a lot
of them.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Marcus Boerger

Hello Stanislav,

Monday, March 3, 2008, 8:56:41 PM, you wrote:

Hi!

interface but that wasn't really a good idea. So I came up with a new
interface and all that this would break is stuff like Phar (well there is

If it breaks phar, it may break others too... Anyway, good description
of what was changed won't hurt.

PHP 5.x is same-major branch, where you are not expected to have to
change your system in order to upgrade.
Oh since when? Where did you read that?

Since forever, that's why we have major and minor versions.

I don't care at all. So far the plan was to bring in ICU and there is no

In PHP 6, not 5.3.

http://wiki.pooteeweet.org/PhP53#toc3 item 2.
One cannot go any more official than that. Lukas simply collects stuff
there that we talk about and maybe even decide upon. If we change our minds
later we shift stuff around or even drop them. That said it looks like it
is still on our goal list and I've not yet heard someone saying something
else.

In software there is nothing like 100%. Or is everything you work on bug
free? Mine isn't. It took me more than two years to make re2c ready for
this task (take this in whatever way you feel).

I'm not talking about bugs. I'm talking about having compatible engine
implementation. Nobody would require 100% bug-free code, it's not
realistic. Requirement is that scripts that run on 5.2 would run on 5.3,
not counting bugs. I guess you agree having no multibyte support does
not really qualify as a "bug" :) So we are talking about whether putting it
in now might hurt 5.3 release process by postponing it for a long time
or not. If not - great.

Heck, yeah, guess we proposed this after we got the same 99% PASSes we get
on a normal checkout, besides a few additional Phar tests and, well, three
tests whose output might change. The changes there are things like getting
full error messages rather than cut-down ones in edge cases.

How so? We don't use git or any other CMS that allows merging at all. I

CVS allows merging. I did it a lot of times. Of course, there could be
conflicts, but the engine is quite static now, so I don't foresee a lot
of them.

CVS does merging on its own when there are no conflicts. I am talking about
real merge support.

Best regards,
Marcus

18 years ago by Stanislav Malyshev

Hi!

In PHP 6, not 5.3.

http://wiki.pooteeweet.org/PhP53#toc3 item 2.

ITYM item 1. But that's extension, not engine core. I'm of course
all for having pecl/intl joined :)

CVS does merging on its own when there are no conflicts. I am talking about
real merge support.

As I said, since engine is mostly static right now, I don't believe
there would be too many conflicts - especially taking into account the
main part of the replacement - scanner - is "wholesale" - you just
remove whole module and put other in.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Derick Rethans

We expect people to upgrade from 5.2.x to 5.3.x without changing their
systems.

Is it clearer why I think PHP 5.x and 6 are different and why I think ICU
dependency in the 5.3 core might be a problem?

FWIW... I also think that bringing in ICU in 5.3 so late in the cycle -
or actually at all in 5.3 - is not such a bright idea.

regards,
Derick

--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org

18 years ago by Steph Fox

Is it clearer why I think PHP 5.x and 6 are different and why I think ICU
dependency in the 5.3 core might be a problem?

FWIW... I also think that bringing in ICU in 5.3 so late in the cycle -
or actually at all in 5.3 - is not such a bright idea.

'so late in the cycle'? We haven't had a beta rc yet. I agree intl should've
been moved into core several weeks ago if that helps any...

Steph

18 years ago by Pierre Joye

Hi Stan,

Hi!

be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere.
So what happens if you wanted to use it on some exotic system where re2c
is not readily available as maintainer-supported software? Also, flex
is available on Windows for example as part of cygwin, while I don't see
re2c there.

A quick note about this non problem. re2c works pretty well on windows
and they provide a .exe as far as I remember (much easier than flex
which requires cygwin or gnuwin32, even if both work :). Besides the
portability of re2c, we already use it in some extensions (if I
remember correctly) and nobody complained.

Cheers,

Pierre
http://blog.thepimp.net | http://www.libgd.org

18 years ago by johannes@php.net

Hi,

Hi!

be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere.
So what happens if you wanted to use it on some exotic system where re2c
is not readily available as maintainer-supported software? Also, flex
is available on Windows for example as part of cygwin, while I don't see
re2c there.
I understand this can be of low importance since we keep generated files
in our repositories, but I think we still have to keep it in mind.
I understand also current patch requires non-release version of re2c -
maybe we should have some release version at least until we make PHP
depend on it?

We need a change there anyways, flex 2.5.4 is bundled with fewer systems,
even my Solaris 20 box has 2.5.33 instead of 2.5.4 by default. And I
think changing to something which is maintained by one of our main
contributors might be beneficial for us.

Note - pecl/intl does nothing towards multibyte support etc., at least
for now. If there are volunteers to change that, it can be discussed, but
so far it is for doing entirely other things (locale-dependent
functionality mostly).
So, I think before re2c parser can be merged the issue with multibyte
compatibility must be solved - otherwise it will make the users that
rely on it unable to use newer PHP. As cool as 20% faster is, I think we
can't drop support for such feature, especially not in 5.3.

Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
After that is done, decide about multibyte support. Along with the commit to
the 5.3 branch there will be a new re2c version available.

I think we first need to figure out what happens to multibyte support,
and not commit anything before we have it figured out. Multibyte support
is important piece of functionality for some PHP users, and it works
now. Breaking it without providing any alternative - especially that we
have now 5.3 mostly ready for the release cycle, and solving multibyte
problems with re2c may take undefined amount of time, as far as I
understand. I do not think it would be acceptable to release 5.3 without
multibyte support, so the option here either merge it now and have 5.3
waiting until MB is figured out, or try to figure it out before commit
and if we can't in a reasonable term, go forward with 5.3 and defer the
parser change for 5.4.

Since there's no documentation about zend-multibyte stuff I spent some
time searching for other resources about it, but except bug reports I
found nothing where it was required. I'm sure there are some but comments
like "TODO: support widechars" in the code give me the impression that
it doesn't really work... and I guess many people just enable it since it
sounds important, not due to the fact that they really need it. Of course
I might be wrong so I'd be interested in use cases for
--enable-zend-multibyte stuff. Maybe we can fulfill the needs without
the switch.

If there are good use cases for that switch I wouldn't like to replace some
small engine thingy with a huge external library like ICU.

And I doubt that more than just a few people know what it really does -
Marcus and I just found out while working on that stuff over the
weekend.

Again, while I think the speedup is great and congratulate Marcus, Nuno
and Scott on great work, I think we should keep in mind we have working
parser right now and changing it in an incompatible way is very
high-risk and should not be taken hastily.

Right, it's great work they did there but a broken scanner would be one
of the worst things we might ship. So I'd invite everybody to checkout
that version from SVN (see Marcus's mail) and test it using the worst
stuff you can think of :-)

johannes

18 years ago by Stanislav Malyshev

Hi!

Since there's no documentation about zend-multibyte stuff I spent some
time searching for other resources about it, but except bug reports I
found nothing where it was required. I'm sure there are some but comments
like "TODO: support widechars" in the code give me the impression that
it doesn't really work... and I guess many people just enable it since it

It does work and there are people using it, even though I imagine it can
have some bugs. I guess it would be best to talk to mbstring maintainer
on code details, etc.

If there are good use cases for that switch I wouldn't like to replace some
small engine thingy with a huge external library like ICU.

The use cases are scripts written in encodings like shift-JIS, etc.

And I doubt that more than just a few people know what it really does -
Marcus and I just found out while working on that stuff over the
weekend.

So I guess documentation is important :) Let it be a lesson to us all.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Alan Knowles

Can you clarify the Multibyte issues:

I presume this means that it can handle ASCII/UTF8/16 etc. but will
not handle things like BIG5/GB encoding in source code - this may be a
bit of an issue around here..

Regards
Alan

Marcus Boerger wrote:

RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

Situation:
The current flex-based lexer depends on an outdated and unsupported flex
version. Alternatives include either updating to a newer version of flex or
using re2c, which we already use for a variety of things (serializing, pdo sql
scanning, date/time parsing). While moving towards a newer flex version would
be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner
performance increase. Running the tests gets an overall speedup of 2%. It is
arguable whether this is enough, but re2c has more advantages. First of all,
re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
Secondly, it allows for better integration with Lemon [2], which would be the
next step. And thirdly we can switch to a reentrant scanner.

Current state:
Flex has been fully replaced by re2c in Zend. We have also switched to an
mmap-based lexer approach for now. However, we had to drop multibyte support
as well as the encoding declare. The current state can be checked out from
Scott's subversion repository [3] and you can follow the development on his
Trac setup [4]. When you want to build php with re2c, then you need to grab
re2c from its sourceforge subversion repository [5]. You can also check out
the changes in a patch created Sunday 2nd March against a PHP checkout from
14th February [6].

Further steps:
Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
multibyte support with libintl.

Future steps:
Replace bison with lemon in PHP 5.4 or HEAD.

Time Frame:
Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
After that is done, decide about multibyte support. Along with the commit to
the 5.3 branch there will be a new re2c version available.

Marcus Boerger
Nuno Lopes
Scott MacVicar

[1] http://re2c.org/
[2] http://www.hwaci.com/sw/lemon/
[3] svn://whisky.macvicar.net/php-re2c
[4] http://trac.macvicar.net/php-re2c/
[5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
[6] http://php.net/~helly/php-re2c-20080302.diff.txt

18 years ago by Marcus Boerger

Hello Alan,

be my hero then :-) Could you generate a few tests for the multibyte
support so that we know how it is used right now and what we need to take
care of?

marcus

Monday, March 3, 2008, 12:48:44 AM, you wrote:

Can you clarify the Multibyte issues:

I presume this means that it can handle ASCII/UTF8/16 etc. but will
not handle things like BIG5/GB encoding in source code - this may be a
bit of an issue around here..

Regards
Alan

Best regards,
Marcus

18 years ago by Alan Knowles

A few replaces with this file should be a good test case.

Probably worth testing:

comments with these characters in them, both /* and //
strings with these characters in them

lynx -source \
'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windows&codepage=950' \
| grep test | grep -v testcase

I have definitely seen code with Chinese characters in comments and
strings and a few times function names and variable names with Chinese
characters...
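
One way to draft such test inputs, as a rough sketch rather than anything from the PHP tree: the Python helpers below (`make_php_source` and `risky_chars` are made-up names) build a small script with the sample text in both comment styles and in a string, and flag characters whose encoded form contains a 0x5C byte after the lead byte, which a byte-wise scanner could mistake for a backslash. Python's `cp950` codec is assumed here as a stand-in for the code page 950 data fetched above.

```python
# Illustrative only: helper names are hypothetical, not part of PHP or its test suite.

def make_php_source(sample: str, encoding: str) -> bytes:
    """Return a tiny PHP script, encoded in `encoding`, that carries
    `sample` in a /* */ comment, a // comment and a string literal."""
    script = (
        "<?php\n"
        f"/* {sample} */\n"
        f"// {sample}\n"
        f"echo '{sample}';\n"
        "?>\n"
    )
    return script.encode(encoding)


def risky_chars(sample: str, encoding: str) -> list[str]:
    """Return the characters of `sample` whose encoded bytes contain
    0x5C (ASCII backslash) after the first byte."""
    return [ch for ch in sample if 0x5C in ch.encode(encoding)[1:]]


if __name__ == "__main__":
    sample = "許功蓋"  # sample Chinese text; any string encodable in cp950 works
    source = make_php_source(sample, "cp950")
    print(source)
    print(risky_chars(sample, "cp950"))
```

The resulting bytes could be written to a file and used as the basis of a .phpt test.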

Regards
Alan

Marcus Boerger wrote:

Hello Alan,

be my hero then :-) Could you generate a few tests for the multibyte
support so that we know how it is used right now and what we need to take
care of?

marcus

Best regards,
Marcus

18 years ago by Marcus Boerger

Hello Alan, Andi, Rui,

my impression still is that not a single person uses this crap. I only
hear of people claiming they have heard that people use it. But what I see
is broken code and not a single test. If this is not going to change as in
we are not getting any .phpt files for this feature then there are two
ways. First I implement something that I personally would expect and I
wouldn't care about anything that is there right now or second we simply
get rid of it completely.

So far I have extended re2c to make it easier to deal with other encodings
and even allow multiple char width at the same time. So I did my homework.
Now I expect that somebody writes tests! Then we could provide a scanner
that works on UCS-2 or on UTF-32 and then try to identify the script
encoding. Then work on the extended charset and do a reverse encoding if
necessary for output. Though even thinking about this approach (still like
what we seem to have right now) really hurts me very badly because it is
the wrong approach. What we want is a working HEAD.

marcus

Monday, March 3, 2008, 4:19:24 PM, you wrote:

a few replaces with this file should be a good testcase

probably worth testing

comments with these character in them. both /* and //

string with these characters in them.
lynx -source
'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windows&codepage=950'
| grep test | grep -v testcase

I have definitely seen code with Chinese characters in comments and
strings and a few times function names and variable names with Chinese
characters...

Regards
Alan

Best regards,
Marcus

18 years ago by Stanislav Malyshev

is broken code and not a single test. If this is not going to change as in
we are not getting any .phpt files for this feature then there are two

As I understand the theory of the thing should be pretty simple, you set
input encoding (by config or declare) and internal encoding, and then
when script is being read, you convert it from input to internal.
However, it appears that since flex couldn't stomach certain encodings,
there's also a hack there - script is translated from input to some
"safe" encoding for flex, and then strings are translated back to
"internal" encoding after flex processes them. If re2c can deal with
encodings like SJIS without trouble then some of the hacks might be
unnecessary. I think encodings that need to be checked are those in
zend_multibyte.c that have "compatible" flag off.
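
As a rough illustration of that convert, scan, convert-back flow (written in Python for brevity; `string_literals` and its parameters are hypothetical names, and UTF-8 is only assumed as the "safe" intermediate encoding, which is not necessarily what zend_multibyte.c uses):

```python
import re

# Hypothetical sketch; not the engine's implementation.
# Matches a single-quoted literal, allowing backslash escapes inside it.
_SINGLE_QUOTED = re.compile(rb"'((?:[^'\\]|\\.)*)'", re.DOTALL)


def string_literals(src: bytes, input_enc: str, internal_enc: str,
                    safe_enc: str = "utf-8") -> list[bytes]:
    """Return the bodies of single-quoted literals in `src`,
    re-encoded to `internal_enc`."""
    # 1. Recode the script from its input encoding to an encoding in which
    #    no byte of a multibyte character can be mistaken for ASCII syntax.
    safe = src.decode(input_enc).encode(safe_enc)
    # 2. Scan byte-wise; in UTF-8 every byte below 0x80 is real ASCII.
    bodies = _SINGLE_QUOTED.findall(safe)
    # 3. Translate each literal back to the requested internal encoding.
    return [body.decode(safe_enc).encode(internal_enc) for body in bodies]


if __name__ == "__main__":
    script = "<?php echo 'ソ'; ?>".encode("shift_jis")
    print(string_literals(script, "shift_jis", "shift_jis"))
    print(string_literals(script, "shift_jis", "utf-8"))
```

If the scanner itself understood the input encoding, steps 1 and 3 would collapse into a single optional conversion of the literal.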

Here's a short script example I found that shows what's the problem there:

<?php echo 'ソ'; ?>

The character echoed there is U+30BD "Katakana letter SO". If you run it
in UTF-8, it works fine. However, if you recode it to Shift-JIS, it won't
run, since the script looks to the parser like this:

<?php echo '<83>\'; ?>
(that's a dump of vi output, so replace <83> with an actual 0x83 byte if you
compose it). That's a parse error if parsed "naively", because the trailing
byte 0x5C is a backslash and appears to escape the closing quote. So
somehow the parser needs to know 0x83+\ is actually U+30BD and at the
same time the user still might want it as 0x83+\ in a zval (or maybe as
UTF-8 - it depends on the user).
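
The failure can be reproduced outside PHP. The following Python sketch assumes a deliberately naive byte-wise scanner (`scan_single_quoted` is a hypothetical stand-in, not the flex or re2c scanner) and shows that the Shift-JIS form of U+30BD ends in 0x5C, so the closing quote is read as escaped:

```python
# Hypothetical stand-in for a byte-wise scanner; not Zend engine code.

def scan_single_quoted(src: bytes) -> bytes:
    """Return the body of the first single-quoted literal in `src`.

    Bytes are examined one at a time with no knowledge of the source
    encoding: a backslash (0x5C) followed by a quote or another backslash
    is treated as an escape. Raises ValueError if the literal never closes.
    """
    i = src.index(b"'") + 1
    body = bytearray()
    while i < len(src):
        byte = src[i]
        if byte == 0x5C and i + 1 < len(src) and src[i + 1] in (0x27, 0x5C):
            body.append(src[i + 1])   # escaped quote or backslash
            i += 2
        elif byte == 0x27:            # unescaped quote closes the literal
            return bytes(body)
        else:
            body.append(byte)
            i += 1
    raise ValueError("unterminated string literal")


script = "<?php echo 'ソ'; ?>"

print("ソ".encode("shift_jis").hex())              # 835c: second byte is a backslash
print(scan_single_quoted(script.encode("utf-8")))  # the three UTF-8 bytes of U+30BD

try:
    scan_single_quoted(script.encode("shift_jis"))
except ValueError as exc:
    print("Shift-JIS source:", exc)
```

The UTF-8 source scans cleanly because none of its continuation bytes fall in the ASCII range; the Shift-JIS source does not.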

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Marcus Boerger

Hello Stanislav,

cool, care to change the code snippet into a test as I've done for Rui's
snippet?

marcus

Sunday, March 23, 2008, 5:06:53 AM, you wrote:

is broken code and not a single test. If this is not going to change as in
we are not getting any .phpt files for this feature then there are two

As I understand the theory of the thing should be pretty simple, you set
input encoding (by config or declare) and internal encoding, and then
when script is being read, you convert it from input to internal.
However, it appears that since flex couldn't stomach certain encodings,
there's also a hack there - script is translated from input to some
"safe" encoding for flex, and then strings are translated back to
"internal" encoding after flex processes them. If re2c can deal with
encodings like SJIS without trouble then some of the hacks might be
unnecessary. I think encodings that need to be checked are those in
zend_multibyte.c that have "compatible" flag off.

Here's a short script example I found that shows what's the problem there:

<?php echo 'ソ'; ?>

The character echoed there is U+30BD "Katakana letter SO". If you run it
in UTF-8, it works fine. However, if you recode it to Shift-JIS, it won't
run, since the script looks to the parser like this:

<?php echo '<83>\'; ?>
(that's a dump of vi output, so replace <83> with an actual 0x83 byte if you
compose it). That's a parse error if parsed "naively", because the trailing
byte 0x5C is a backslash and appears to escape the closing quote. So
somehow the parser needs to know 0x83+\ is actually U+30BD and at the
same time the user still might want it as 0x83+\ in a zval (or maybe as
UTF-8 - it depends on the user).

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

Best regards,
Marcus

18 years ago by Lukas Kahwe Smith

Can you clarify the Multibyte issues:

I presume this means that it can handle ASCII/UTF8/16 etc. but
will not handle things like BIG5/GB encoding in source code - this
may be a bit of an issue around here..

At first I also thought that this had something to do with ext/
mbstring, but since then I have learned that this is not the case.
However this confusion is likely what causes many people to enable
zend mb support. So the question to Stas (Alan and the rest of the
world) is if they really have a script in the wild that actually
requires this switch and would break if it were disabled. And if
there is such a script what exactly are the needs and how can these be
filled in 5.3 using re2c.

regards,
Lukas

18 years ago by Derick Rethans

However, we had to drop multibyte support as well as the encoding
declare.

Just wondering, why did you have to drop the "declare(encoding=...)" ?
It's just ignored in PHP 5.x - and it is useful to have for migrating
php 5.3 apps to 6. So can you at least make the new parser just ignore
this statement?

regards,
Derick

--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org

18 years ago by johannes@php.net

Hi Derick,

However, we had to drop multibyte support as well as the encoding
declare.

Just wondering, why did you have to drop the "declare(encoding=...)" ?
It's just ignored in PHP 5.x - and it is useful to have for migrating
php 5.3 apps to 6. So can you at least make the new parser just ignore
this statement?

It is not ignored in PHP 5 as Marcus and I found out while reading the
code :-)
If you compile with --enable-zend-multibyte you can, it seems, change the
encoding using declare, even multiple times per file.

johannes

18 years ago by Marcus Boerger

Hello Derick,

actually you get a message (E_COMPILE_WARNING) that this is not
supported. Maybe we could turn this into an E_NOTICE though.

marcus

Monday, March 3, 2008, 9:28:01 AM, you wrote:

However, we had to drop multibyte support as well as the encoding
declare.

Just wondering, why did you have to drop the "declare(encoding=...)" ?
It's just ignored in PHP 5.x - and it is useful to have for migrating
php 5.3 apps to 6. So can you at least make the new parser just ignore
this statement?

regards,
Derick

--
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org

Best regards,
Marcus

18 years ago by Derick Rethans

actually you get a message (E_COMPILE_WARNING) that this is not
supported. Maybe we could turn this into an E_NOTICE though.

No, I don't get any warning/notice/ whatever with PHP 5.3:

derick@kossu:~$ php-5.3dev -derror_reporting=65535

<?php
declare(encoding="utf-8");
echo "foo\n";
?>

foo

Please don't break this.

regards,
Derick

18 years ago by Marcus Boerger

Hello Derick,

ok, for now I changed to not issue any error at all.

marcus

Monday, March 3, 2008, 11:28:31 AM, you wrote:

actually you get a message (E_COMPILE_WARNING) that this is not
supported. Maybe we could turn this into an E_NOTICE though.

No, I don't get any warning/notice/ whatever with PHP 5.3:

derick@kossu:~$ php-5.3dev -derror_reporting=65535

<?php
declare(encoding="utf-8");
echo "foo\n";
?>

foo

Please don't break this.

regards,
Derick

Best regards,
Marcus

18 years ago by Marcus Boerger

Hello everyone,

sorry for the crosspost. But recent discussions about:
'[RFC] Replace the flex-based scanner with an re2c [1] based lexer'
revealed one big issue. During the development of said RFC we dropped
--enable-zend-multibyte and the interaction between engine and ext/mbstring
using declare(encoding=..). Now neither of the two is documented anywhere,
nor does any of the core developers happen to know how it works, what it is
supposed to do or how to test it.

Since we do not want to drop this feature we need some test code, best in
the form of .PHPTs. You can find information on how to write tests here:
http://qa.php.net/write-test.php and
http://talks.somabo.de/200703_montreal_need_for_testing.pdf
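
A minimal sketch of what such a test could look like for the declare case,
assuming the behaviour reported in this thread for php-5.3dev (the statement
is accepted silently and the script prints "foo"). The file name
declare_encoding_001.phpt is hypothetical, and builds with different
multibyte settings may well need a different expectation:

```php
--TEST--
declare(encoding=...) is accepted without any diagnostics
--INI--
error_reporting=65535
--FILE--
<?php
// Assumption: the encoding declare is simply ignored by this build.
declare(encoding="utf-8");
echo "foo\n";
?>
--EXPECT--
foo
```

The --INI-- section mirrors the -derror_reporting=65535 switch used on the
command line, so any warning or notice would show up as a test failure.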

If you are interested in this further you are of course also more than
welcome to help in any other form. Apart from the proposal below, there
is also my blog entry to help you get started:
http://blog.somabo.de/2008/02/php-on-re2c.html

thanks
marcus

Sunday, March 2, 2008, 11:21:34 PM, you wrote:

RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

Situation:
The current flex-based lexer depends on an outdated and unsupported flex
version. Alternatives include either updating to a newer version of flex or
using re2c, which we already use for a variety of things (serializing, pdo sql
scanning, date/time parsing). While moving towards a newer flex version would
be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner
performance increase. Running the tests gets an overall speedup of 2%. It is
arguable whether this is enough, but re2c has more advantages. First of all,
re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
Secondly, it allows for better integration with Lemon [2], which would be the
next step. And thirdly we can switch to a reentrant scanner.

Current state:
Flex has been fully replaced by re2c in Zend. We have also switched to an
mmap-based lexer approach for now. However, we had to drop multibyte support
as well as the encoding declare. The current state can be checked out from
Scott's subversion repository [3] and you can follow the development on his
Trac setup [4]. When you want to build php with re2c, then you need to grab
re2c from its sourceforge subversion repository [5]. You can also check out
the changes in a patch created Sunday 2nd March against a PHP checkout from
14th February [6].

Further steps:
Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
multibyte support with libintl.

Future steps:
Replace bison with lemon in PHP 5.4 or HEAD.

Time Frame:
Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
of days later. Moving pecl/libintl to ext (depends on the 5.3 RM's decision).
After that is done, decide about multibyte support. Along with the commit to
the 5.3 branch there will be a new re2c version available.

Marcus Boerger
Nuno Lopes
Scott MacVicar

[1] http://re2c.org/
[2] http://www.hwaci.com/sw/lemon/
[3] svn://whisky.macvicar.net/php-re2c
[4] http://trac.macvicar.net/php-re2c/
[5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
[6] http://php.net/~helly/php-re2c-20080302.diff.txt

18 years ago by Andi Gutmans — view source — reply

Hi Marcus, Johannes, and all,

First of all let me say that I have no conceptual problem with replacing
the scanner with re2c. If it's cleaner, performs better and a better
maintained piece of software (let's hope Marcus doesn't get run over)
then we can move to re2c.

There are a few important things to consider though:

There is a huge PHP/MySQL community in the far east especially in
Japan. You may not hear as much from them because they mostly don't post
on our public lists but it's large. They very much depend on multibyte
support and it works well for them (I have talked to several people in
those communities). Shift-JIS is a matter of fact for those communities.
We can't just dump them in PHP 5.3.
We need to make sure that we have a streams story that works and
existing functionality is supported by it (sounds like this is almost
complete so probably not high risk).
We should make sure we can achieve compatibility including supporting
functionality like declare(...) which is used by some including
multibyte guys. I haven't heard of a reason why this couldn't be
possible with RE2C.

I think all the above is doable but we shouldn't ship without
accomplishing that 100% compatibility especially telling the non-Latin
world that we will stop supporting them.

So at the end of the day it all boils down to timing. I have been
expecting Johannes to cut a beta any day now (I realize Sun acquisition
somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a
good & stable release cycle. I think re-engineering a core piece of the
engine at this point adds considerable risk and would definitely prolong
the release cycle.

So while I'm supportive of embracing RE2C if we get commitment to reach
that 100% compatibility including multibyte support, I don't quite
understand the sense of urgency and why we'd want to introduce this risk
so late in the development of PHP 5.3. This is a risk the release
manager shouldn't really be willing to take. Rewriting this multibyte
support will require time and interaction with the communities that are
currently using it to make sure that it meets their needs. It will not
be a trivial project.

We can definitely work towards RE2C in parallel and as Stas said the
engine hasn't really been changing very much recently to make this hard
(we finished our todos for 5.3). We could even branch off PHP 5.4 right
after RC1 for PHP 5.3 and therefore reduce the time where this patch
would need to be maintained separately (although I think it can already
be maintained in a branch).

Let's consider all the angles in addition to wanting to get the code in
the tree asap.
Andi

18 years ago by Marcus Boerger — view source — reply

Hello Andi,

Tuesday, March 4, 2008, 7:51:07 AM, you wrote:

Hi Marcus, Johannes, and all,

First of all let me say that I have no conceptual problem with replacing
the scanner with re2c. If it's cleaner, performs better and a better
maintained piece of software (let's hope Marcus doesn't get run over)
then we can move to re2c.

There are a few important things to consider though:

There is a huge PHP/MySQL community in the far east especially in
Japan. You may not hear as much from them because they mostly don't post
on our public lists but it's large. They very much depend on multibyte
support and it works well for them (I have talked to several people in
those communities). Shift-JIS is a matter of fact for those communities.
We can't just dump them in PHP 5.3.

We need to make sure that we have a streams story that works and
existing functionality is supported by it (sounds like this is almost
complete so probably not high risk).

We should make sure we can achieve compatibility including supporting
functionality like declare(...) which is used by some including
multibyte guys. I haven't heard of a reason why this couldn't be
possible with RE2C.

I think all the above is doable but we shouldn't ship without
accomplishing that 100% compatibility especially telling the non-Latin
world that we will stop supporting them.

So at the end of the day it all boils down to timing. I have been
expecting Johannes to cut a beta any day now (I realize Sun acquisition
somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a
good & stable release cycle. I think re-engineering a core piece of the
engine at this point adds considerable risk and would definitely prolong
the release cycle.

So while I'm supportive of embracing RE2C if we get commitment to reach
that 100% compatibility including multibyte support, I don't quite
understand the sense of urgency and why we'd want to introduce this risk
so late in the development of PHP 5.3. This is a risk the release
manager shouldn't really be willing to take. Rewriting this multibyte
support will require time and interaction with the communities that are
currently using it to make sure that it meets their needs. It will not
be a trivial project.

We can definitely work towards RE2C in parallel and as Stas said the
engine hasn't really been changing very much recently to make this hard
(we finished our todos for 5.3). We could even branch off PHP 5.4 right
after RC1 for PHP 5.3 and therefore reduce the time where this patch
would need to be maintained separately (although I think it can already
be maintained in a branch).

Let's consider all the angles in addition to wanting to get the code in
the tree asap.
Andi

This sounds like we are going to make the same mistake over and over and over
again. Who is forcing a hard timeline on us? Why are we late in the
development? I don't get it at all. We haven't done all steps that were on
our radar for 5.3. Now that we finally found time to address this we should
do it. Otherwise the consequence is just that we have to do a 5.4 version
immediately. What is the reason for that, who is more happy with a 5.3 now?
Are we a company that makes money with selling upgrades?

Best regards,
Marcus

18 years ago by Antony Dovgal — view source — reply

This sounds like we are going to make the same mistake over and over and over
again. Who is forcing a hard timeline on us? Why are we late in the
development? I don't get it at all.

Right.
Please take more time if needed, no need to rush and release something half-working.
If it takes several months to prepare 5.3 release, let it be so.

After all, we're not a commercial company that has to roll out a release every
couple of months under pressure of shareholders and overall competition.

--
Wbr,
Antony Dovgal

18 years ago by Stanislav Malyshev — view source — reply

Hi!

Right.
Please take more time if needed, no need to rush and release something half-working.
If it takes several months to prepare 5.3 release, let it be so.

With this approach we would never release 5.3 - each couple of months
somebody would have a cool idea which would only require initial commit
and 2-3 months work on it on CVS, which delays the release - and then it
goes to the next idea. We should cut it off somewhere - not because
these ideas are bad - they aren't, but because we have to have releases.
The best idea is worth nothing for the users unless it's part of the
release.
5.3 is not the last version of PHP, and we have quite a bunch of stuff
there already - so I think it makes sense to have release of what we
have or will have soon, all while continuing to develop the ideas for
next versions.

After all, we're not a commercial company that has to roll out a release every
couple of months under pressure of shareholders and overall competition.

If you think that because PHP project is not a commercial company it
doesn't have to adhere to the laws of markets, popularity and users'
expectations - you are mistaken. We still have to take into account
millions of PHP users, even though they don't pay us money directly.
And it's open source which was "release often" last time I checked ;)

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Hannes Magnusson — view source — reply

The best idea is worth nothing for the users unless it's part of the
release.

Improving on that statement: The coolest feature ever is worth
absolutely nothing unless it is documented.

Don't care if it's a new language construct, new class, function or
method, optional parameter, new syntax in php.ini, errorlevel, dropped
warnings or an awesome --enable-zend-multibyte configure switch. If it
isn't documented it's totally useless for anyone not reading php-cvs,
zend-engine-cvs and this list daily.

I'll hunt you all down and make you eat 1kg of vegetables each day
after the 5.3 release until proper documentation and upgrade guides
have been written.
Mark my words my friends, mark my words! ;)

-Hannes

18 years ago by Stanislav Malyshev — view source — reply

Hi!

Improving on that statement: The coolest feature ever is worth
absolutely nothing unless it is documented.

I agree with the intent - documentation is very important. Even
so, people use undocumented features too (probably cursing the lazy
developers on the way ;)

BTW, as far as I remember, we have at least 4 undocumented features
right now sitting in 5.3 CVS, so if anybody wants to do something cool,
that's a good place:

Nowdocs aren't documented
.htaccess-like .ini files undocumented
[HOST=] and [PATH=] .ini sections undocumented
new version constants undocumented
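
For anyone picking up the first and last items, a rough sketch of what they
look like in code. It assumes "new version constants" refers to
PHP_MAJOR_VERSION, PHP_MINOR_VERSION, PHP_RELEASE_VERSION and PHP_VERSION_ID;
the variable names are only for illustration:

```php
<?php
// Nowdoc: written like a heredoc, but with a single-quoted identifier.
// The body is taken literally, so nothing is interpolated or unescaped.
$name = 'world';
$text = <<<'EOT'
Hello $name\n
EOT;

echo $text, "\n"; // prints the raw characters, including "$name" and "\n"

// Version constants (assumed to be the ones meant above).
printf("%d.%d.%d\n", PHP_MAJOR_VERSION, PHP_MINOR_VERSION, PHP_RELEASE_VERSION);

// PHP_VERSION_ID packs the same three numbers into a single integer,
// which makes version comparisons a plain integer comparison.
$expected = PHP_MAJOR_VERSION * 10000
          + PHP_MINOR_VERSION * 100
          + PHP_RELEASE_VERSION;
var_dump(PHP_VERSION_ID === $expected);
```

The two .ini items are configuration rather than code, so they are not
covered by this sketch.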
--
Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Stanislav Malyshev — view source — reply

Hi!

Nowdocs aren't documented

.htaccess-like .ini files undocumented

[HOST=] and [PATH=] .ini sections undocumented

new version constants undocumented

BTW, not sure if other things from the top of NEWS file are documented
either...

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Andi Gutmans — view source — reply

-----Original Message-----
From: Hannes Magnusson [mailto:hannes.magnusson@gmail.com]
Sent: Tuesday, March 04, 2008 11:18 AM
To: Stas Malyshev
Cc: Antony Dovgal; Marcus Boerger; Andi Gutmans;
internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an
re2c [1] based lexer

Improving on that statement: The coolest feature ever is worth
absolutely nothing unless it is documented.

Don't care if it's a new language construct, new class, function or
method, optional parameter, new syntax in php.ini, errorlevel, dropped
warnings or an awesome --enable-zend-multibyte configure switch. If it
isn't documented it's totally useless for anyone not reading php-cvs,
zend-engine-cvs and this list daily.

I'll hunt you all down and make you eat 1kg of vegetables each day
after the 5.3 release until proper documentation and upgrade guides
have been written.
Mark my words my friends, mark my words! ;)

Why do you say it's not documented?
http://www.aconus.com/~oyaji/www/apache_linux_php.htm
http://tinyurl.com/2o8pq2

OK just kidding and I agree it would be nice to have it better
documented in the mainstream docs. As it applies mostly to the Asian
users though (Chinese/Japanese) who usually seek localized docs it's
probably not as good as it should be in php.net.

Andi

18 years ago by Hannes Magnusson — view source — reply

OK just kidding and I agree it would be nice to have it better
documented in the mainstream docs. As it applies mostly to the Asian
users though (Chinese/Japanese) who usually seek localized docs it's
probably not as good as it should be in php.net.

The Japanese docs are 100% up-to-date with the English docs so they
shouldn't have any problem reading our docs.
In fact, if you make changes in the en/ tree Takagi Masahiro will have
them translated within 24 hours - even if that change spanned 50 files.
Not kidding.

-Hannes

18 years ago by Hannes Magnusson — view source — reply

Why do you say it's not documented?
http://www.aconus.com/~oyaji/www/apache_linux_php.htm
http://tinyurl.com/2o8pq2

According to the latter link, our Windows binaries don't enable
zend-multibyte. Is this true?

-Hannes

18 years ago by Jani Taskinen — view source — reply

I'll hunt you all down and make you eat 1kg of vegetables each day
after the 5.3 release until proper documentation and upgrade guides
have been written.

I already eat that many vegetables a day... what's my punishment? :-p
(and Pierre promised to handle the php.ini docs.. :D)

--Jani

18 years ago by Antony Dovgal — view source — reply

Hi!

Right.
Please take more time if needed, no need to rush and release something half-working.
If it takes several months to prepare 5.3 release, let it be so.

With this approach we would never release 5.3 - each couple of months
somebody would have a cool idea which would only require initial commit
and 2-3 months work on it on CVS, which delays the release - and then it
goes to the next idea. We should cut it off somewhere - not because
these ideas are bad - they aren't, but because we have to have releases.

Even though I do agree that delaying the release every 2-3 months is bad,
I believe this particular case deserves some special treatment.
And btw this is a major release, not just a bugfix one, so everyone (Zend included)
should spend even more time to make sure there are no regressions whatsoever.

Releasing a half-working version just "because we have to have releases" is total nonsense.
So please instead of arguing with me, help Marcus and the others if
you don't want the release postponed.

The best idea is worth nothing for the users unless it's part of the
release.
5.3 is not the last version of PHP

Making new 5.x releases each year makes no sense to me, so 5.3 seems to be
perfect candidate for the next several years if we want to implement something major.

After all, we're not a commercial company that has to roll out a release every
couple of months under pressure of shareholders and overall competition.

If you think that because PHP project is not a commercial company it
doesn't have to adhere to the laws of markets, popularity and users'
expectations - you are mistaken.

These are the last things I think of.
The most important is to make it as stable as we can.

We still have to take into account
millions of PHP users, even though they don't pay us money directly.

Right, and they want PHP to do its job and to do it good.

And it's open source which was "release often" last time I checked ;)

Wow, that's the most serious argument ever!

--
Wbr,
Antony Dovgal

18 years ago by Stanislav Malyshev — view source — reply

Hi!

Even though I do agree that delaying the release every 2-3 months is bad,
I believe this particular case deserves some special treatment.

Why? We have a perfectly working parser now and no immediate need to
replace it. I agree that the new parser is faster and better, but we are
perfectly capable of living without it for half a year until it's
polished, if that proves to be the situation.

Releasing a half-working version just "because we have to have releases" is total nonsense.

Fully agreed here. That's why I'm against committing new parser without
multibyte support.

So please instead of arguing with me, help Marcus and the others if
you don't want the release postponed.

Unfortunately, I do not know Marcus' code and may not have resources to
help him right now. Please keep in mind that while I am happy to help
whenever I can, I am not under obligation to help on call to any project
as soon as anybody wants me to, just because he wants it.
That said, if somebody can and does fix new parser to support MB in
reasonable time - I'm all for it.

Making new 5.x releases each year makes no sense to me, so 5.3 seems to be
perfect candidate for the next several years if we want to implement something major.

What's wrong with making new 5.x releases each year if needed?

Right, and they want PHP to do its job and to do it good.

Having no multibyte support, which is used by a lot of people, does not
qualify as "do its job and to do it good". What qualifies is either 5.3
with the old parser or 5.3 with the new parser, fully compatible. As I
believe I already explained about delaying the release etc., I won't
repeat myself here.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Scott MacVicar — view source — reply

Marcus Boerger wrote:

This sounds like we are going to make the same mistake over and over and over
again. Who is forcing a hard timeline on us? Why are we late in the
development? I don't get it at all. We haven't done all steps that were on
our radar for 5.3. Now that we finally found time to address this we should
do it. Otherwise the consequence is just that we have to do a 5.4 version
immediately. What is the reason for that, who is more happy with a 5.3 now?
Are we a company that makes money with selling upgrades?

Best regards,
Marcus

Agreed,

Putting something off for no good reason is going to cause the task to
lose momentum. There are three of us more than willing to look at any
issues that come up post-merge; at the moment it is complete sans the
zend multibyte code.

Scott

18 years ago by Andi Gutmans — view source — reply

-----Original Message-----
From: Marcus Boerger [mailto:helly@php.net]
Sent: Tuesday, March 04, 2008 1:39 AM
To: Andi Gutmans
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an
re2c [1] based lexer

This sounds like we are going to make the same mistake over and over and over
again. Who is forcing a hard timeline on us? Why are we late in the
development? I don't get it at all. We haven't done all steps that were on
our radar for 5.3. Now that we finally found time to address this we should
do it. Otherwise the consequence is just that we have to do a 5.4 version
immediately. What is the reason for that, who is more happy with a 5.3 now?
Are we a company that makes money with selling upgrades?

Actually you'd be surprised but for a company it's easier to have fewer
versions than more versions because you don't need to suddenly update &
QA all of your products again (one of the problems with supporting
open-source is too many versions :) Fortunately I can do what I think is
right for PHP disconnected from those kinds of dependencies :)

No one is forcing a hard deadline but let's not behave like this is
something we can just sneak in especially when we don't deal with a huge
audience (multibyte) and make people believe we can do it quickly. I was
going off the premise which I thought everyone was on board with which
meant a Beta in Q1. If we are going to change that then let's be honest
with ourselves and suggest a new schedule which allows addressing the
issues and enough testing in order to make it into a stable PHP 5.3.

Andi

18 years ago by Stanislav Malyshev — view source — reply

Hi!

We can definitely work towards RE2C in parallel and as Stas said the
engine hasn't really been changing very much recently to make this hard
(we finished our todos for 5.3). We could even branch off PHP 5.4 right

Small correction - we still have a couple of todo items. I think we'll
have them done by the end of next week (hopefully ;) But they shouldn't
do changes to the parser, in any case.

Stanislav Malyshev, Zend Software Architect
stas@zend.com http://www.zend.com/
(408)253-8829 MSN: stas@zend.com

18 years ago by Marcus Boerger — view source — reply

Hello Andi,

Tuesday, March 4, 2008, 7:51:07 AM, you wrote:

Hi Marcus, Johannes, and all,

First of all let me say that I have no conceptual problem with replacing
the scanner with re2c. If it's cleaner, performs better and a better
maintained piece of software (let's hope Marcus doesn't get run over)
then we can move to re2c.

There are a few important things to consider though:

There is a huge PHP/MySQL community in the far east especially in
Japan. You may not hear as much from them because they mostly don't post
on our public lists but it's large. They very much depend on multibyte
support and it works well for them (I have talked to several people in
those communities). Shift-JIS is a matter of fact for those communities.
We can't just dump them in PHP 5.3.

We need to make sure that we have a streams story that works and
existing functionality is supported by it (sounds like this is almost
complete so probably not high risk).

We should make sure we can achieve compatibility including supporting
functionality like declare(...) which is used by some including
multibyte guys. I haven't heard of a reason why this couldn't be
possible with RE2C.

I think all the above is doable but we shouldn't ship without
accomplishing that 100% compatibility especially telling the non-Latin
world that we will stop supporting them.

So at the end of the day it all boils down to timing. I have been
expecting Johannes to cut a beta any day now (I realize Sun acquisition
somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a
good & stable release cycle. I think re-engineering a core piece of the
engine at this point adds considerable risk and would definitely prolong
the release cycle.

So while I'm supportive of embracing RE2C if we get commitment to reach
that 100% compatibility including multibyte support, I don't quite
understand the sense of urgency and why we'd want to introduce this risk
so late in the development of PHP 5.3. This is a risk the release
manager shouldn't really be willing to take. Rewriting this multibyte
support will require time and interaction with the communities that are
currently using it to make sure that it meets their needs. It will not
be a trivial project.

We can definitely work towards RE2C in parallel and as Stas said the
engine hasn't really been changing very much recently to make this hard
(we finished our todos for 5.3). We could even branch off PHP 5.4 right
after RC1 for PHP 5.3 and therefore reduce the time where this patch
would need to be maintained separately (although I think it can already
be maintained in a branch).

Let's consider all the angles in addition to wanting to get the code in
the tree asap.
Andi

Give me any reason why we need 5.4 at this point?
Any single one?
Are you having a bet or a deal about 5.3 release date?
And what is the deal, you do whatever you think goes in and that's a law?

Best regards,
Marcus

18 years ago by Marcus Boerger — view source — reply

Hello Marcus,

Tuesday, March 4, 2008, 7:29:28 PM, you wrote:

Hello Andi,

Tuesday, March 4, 2008, 7:51:07 AM, you wrote:

Hi Marcus, Johannes, and all,

First of all let me say that I have no conceptual problem with replacing
the scanner with re2c. If it's cleaner, performs better and a better
maintained piece of software (let's hope Marcus doesn't get run over)
then we can move to re2c.

There are a few important things to consider though:

There is a huge PHP/MySQL community in the far east especially in
Japan. You may not hear as much from them because they mostly don't post
on our public lists but it's large. They very much depend on multibyte
support and it works well for them (I have talked to several people in
those communities). Shift-JIS is a matter of fact for those communities.
We can't just dump them in PHP 5.3.

We need to make sure that we have a streams story that works and
existing functionality is supported by it (sounds like this is almost
complete so probably not high risk).

We should make sure we can achieve compatibility including supporting
functionality like declare(...) which is used by some including
multibyte guys. I haven't heard of a reason why this couldn't be
possible with RE2C.

I think all the above is doable but we shouldn't ship without
accomplishing that 100% compatibility especially telling the non-Latin
world that we will stop supporting them.

So at the end of the day it all boils down to timing. I have been
expecting Johannes to cut a beta any day now (I realize Sun acquisition
somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a
good & stable release cycle. I think re-engineering a core piece of the
engine at this point adds considerable risk and would definitely prolong
the release cycle.

So while I'm supportive of embracing RE2C if we get commitment to reach
that 100% compatibility including multibyte support, I don't quite
understand the sense of urgency and why we'd want to introduce this risk
so late in the development of PHP 5.3. This is a risk the release
manager shouldn't really be willing to take. Rewriting this multibyte
support will require time and interaction with the communities that are
currently using it to make sure that it meets their needs. It will not
be a trivial project.

We can definitely work towards RE2C in parallel and as Stas said the
engine hasn't really been changing very much recently to make this hard
(we finished our todos for 5.3). We could even branch off PHP 5.4 right
after RC1 for PHP 5.3 and therefore reduce the time where this patch
would need to be maintained separately (although I think it can already
be maintained in a branch).

Let's consider all the angles in addition to wanting to get the code in
the tree asap.
Andi

Give me any reason why we need 5.4 at this point?
Any single one?
Are you having a bet or a deal about 5.3 release date?
And what is the deal, you do whatever you think goes in and that's a law?

So luckily no one replied here, and being very provocative resulted in only
one thing: Andi and I had a long phone call to keep our friendship. The
lesson learned here is that we really need to bring down the noise on the
list. We need to read each other's mails before replying. And Germans like
me need to keep themselves from getting too provocative just to cause a
reaction. Sorry for replying to you, Andi.

Best regards,
Marcus

[RFC] Replace the flex-based scanner with an re2c [1] based lexer

Again, while I think the speedup is great and congratulate Marcus, Nuno and Scott on great work, I think we should keep in mind we have working parser right now and changing it in an incompatible way is very high-risk and should not be taken hastily.

As for rewriting the engine - I think that would be just a waste of effort.

pecl/intl is an extension, so it is no surprise that you need an external library when you enable the extension. However, adding a dependency in core that you cannot get rid of has a lot of consequences (think distributions, builds on non-Linux systems, etc., etc.).

But we have to be realistic, namespaces and icu/intl are the really appealing language features (and long awaited).

CVS allows merging. I did it a lot of times. Of course, there could be conflicts, but the engine is quite static now, so I don't foresee a lot of them.

As I said, since engine is mostly static right now, I don't believe there would be too many conflicts - especially taking into account the main part of the replacement - scanner - is "wholesale" - you just remove whole module and put other in.

Cheers,

So I guess documentation is important :) Let it be a lesson to us all.
