[PATCH] Major optimization for heredocs/interpolated strings

18 years ago by Matt Wilmas

Hi all,

I think I first realized that PHP's scanner splits non-constant strings into
many "pieces" after reading Sara's "How long is a piece of string?" blog
entry[1] last summer. At the time I didn't know much about the internals
and didn't know if anything could be done to change it. Then in the fall I
finally took a look at the scanner ;-) and thought it would be possible to
only "split" strings at variables. Finally a few months ago, I began
working out the changes -- it was working almost 2 months ago, but then I
got sidetracked :-/ from doing some more testing and making a few semantic
token changes till now.

So anyway, now heredocs and interpolated strings should be pretty much just
like constant strings and concatenation (except for the extra INIT_STRING
opcode). They scan/parse/compile faster (with less memory), run faster, and
there's less to free when destroying opcodes.
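
To make the "split only at variables" idea concrete, here is a minimal sketch in C. It is not code from the patch: the names starts_variable() and count_pieces() are hypothetical, and it only handles simple $name variables with ASCII names, ignoring {$...}, ${...} and backslash escapes. It counts how many tokens a string body would yield if each run of literal text became a single piece.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: a '$' starts a variable only when it is followed
 * by a label-start character (simplified here to ASCII letters and '_'). */
int starts_variable(const char *s)
{
    return s[0] == '$' && (isalpha((unsigned char)s[1]) || s[1] == '_');
}

/* Count the tokens produced when a string body is split only at simple
 * $name variables: each run of literal text is one piece, and each
 * variable is one piece. */
size_t count_pieces(const char *s)
{
    size_t pieces = 0;

    while (*s) {
        if (starts_variable(s)) {
            s++;                                    /* skip '$' */
            while (isalnum((unsigned char)*s) || *s == '_')
                s++;                                /* consume the name */
        } else {
            /* Consume literal text up to the next variable or the end. */
            do {
                s++;
            } while (*s && !starts_variable(s));
        }
        pieces++;
    }
    return pieces;
}
```

Under these assumptions, count_pieces("This is $var string") gives three pieces (literal, variable, literal), and a body with no real variable in it stays a single piece no matter how long it is.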

With a simple string like "This is $var string" (say $var = 'some'), I found
the compile/cleanup time to be up to 50% faster, and runtime 55% faster!
(Note: To test compile time, I eval()'d about 50 of them in an if (0) {...}
block.) The difference grows with the number of "pieces" the string
would've been split into before (e.g. for longer strings).

The more complex rules increased the size of Flex's tables about 40%.
However, removing the old heredoc end rule, which used the ^
beginning-of-line operator, made the YY_RULE_SETUP macro be empty, saving
some space. The net result was an 8K/12K larger binary in 5.2/HEAD. I was
surprised at the overall performance increase without the ^ rule. Its
saving a few operations per match made just about as much difference as
Flex's -Cfe table compression (was playing with that first :^)) when
compiling the code from Zend/bench.php (5% I think).

This was with a Windows ZTS build. Running ApacheBench on a few different
scripts showed pretty nice overall improvements -- 10-15% was common in my
quick tests.

BTW, removing that ^ rule lifts the requirement that the character before
the closing heredoc label "must be a newline as defined by your operating
system," to quote the manual.

Now some of the other changes:

The ST_SINGLE_QUOTE state was removed from 5.2, like in HEAD.

A string like "$$$" is considered constant now, since that's really what it
is, right?

Previously, CG(zend_lineno) wasn't incremented if a \n or \r newline (not
\r\n) followed a backslash in a non-constant string. Also, \{ returned
T_STRING instead of T_BAD_CHARACTER, unlike every other invalid escape
sequence. (Note: Of course these won't usually match now anyway, but will be
part of a longer string.)

I removed HANDLE_NEWLINES() from the code that scans a string's text,
instead doing the newline check in the escape-checking loop, to prevent
scanning twice. And I removed the additional boundary check in
HANDLE_NEWLINES() and elsewhere since I didn't see the need -- AFAIK in all
cases you'll only hit '\0'.
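
The single-pass idea can be sketched roughly as follows. This is an illustration only, not the patch's code: count_newlines_single_pass() is a hypothetical name, and unlike the scanner (which relies on hitting '\0') it takes an explicit length. The loop that steps over escape sequences also counts line breaks, so the text is walked once, and a newline that follows a backslash is still counted.

```c
#include <stddef.h>

/* Walk a string body once, stepping over the character after each
 * backslash and counting line breaks along the way.
 * "\r\n" counts as a single line break. */
size_t count_newlines_single_pass(const char *s, size_t len)
{
    size_t lines = 0;

    for (size_t i = 0; i < len; i++) {
        if (s[i] == '\\' && i + 1 < len) {
            i++;    /* move onto the escaped character; it may be a newline */
        }
        if (s[i] == '\n' ||
            (s[i] == '\r' && (i + 1 >= len || s[i + 1] != '\n'))) {
            lines++;
        }
    }
    return lines;
}
```

A lone \r is counted on its own, while the \r of a \r\n pair is skipped so the pair is counted once at the \n.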

I removed the one <<EOF>> rule since it was missing some states and it
wasn't doing anything that the default EOF rule doesn't by calling
yyterminate().

In zendlex(), the goto target doesn't need to recheck CG(increment_lineno)
since it hasn't changed, and I simplified the closing tag newline check
(also looked like it would miss \r ones).

Sorry for the long message! I'll send another if I think of something I
forgot to mention. Here are the patches:

http://realplain.com/php/scanner_optimizations.diff
http://realplain.com/php/scanner_optimizations_5_2.diff

Appreciate any feedback, or questions about any of it. :-)

Thanks,
Matt

[1]
http://blog.libssh2.org/index.php?/archives/28-How-long-is-a-piece-of-string.html

18 years ago by Matt Wilmas

Hi again,

Hmm, not a single reply about this patch...? Did anyone try it out? :-)
Think it can be used after 5.2.2?

Matt

----- Original Message -----
From: "Matt Wilmas"
Sent: Thursday, April 12, 2007
Subject: [PHP-DEV] [PATCH] Major optimization for heredocs/interpolated
strings

18 years ago by Marcus Boerger

Hello Matt,

The patch looks interesting. I think we should commit it to HEAD. And
if it works well, we can add it to 5.3 once we have created it. Did you do
any measurements?

best regards
marcus

Thursday, April 26, 2007, 5:52:43 AM, you wrote:

Hi again,

Hmm, not a single reply about this patch...? Did anyone try it out? :-)
Think it can be used after 5.2.2?

Matt

18 years ago by Matt Wilmas

Hi Marcus,

Yeah, I did a lot of benchmarking different things ;-) and mentioned some
statistics in my original message -- such as 10-15% overall improvement
using ApacheBench on a few scripts. The difference for specific strings
depends on what the string looks like (length, etc.), but is always faster
for both scan/compile time and runtime. The difference is huge with, for
example, a longer heredoc without variables, like the one in Sara's blog
entry.

And as I also mentioned, removing the ^ beginning-of-line operator from the
old heredoc end token rule makes YY_RULE_SETUP empty in the generated C
file. It had been setting yy_at_bol for every token matched, just for the
one rule that used it. Eliminating that overhead for each token made a
measurable difference in overall scanner performance -- though I'm sure
there's less difference with non-ZTS builds, since YY_RULE_SETUP was
referencing globals.
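
As a rough model of that per-token cost, the sketch below mimics the kind of bookkeeping a generated scanner does when any rule is anchored with ^. It is a simplified illustration, not Flex's generated code: struct scan_buffer, rule_setup_with_bol() and rule_setup_plain() are hypothetical names standing in for the buffer state and the YY_RULE_SETUP hook.

```c
#include <stddef.h>

/* Hypothetical stand-in for the scanner's current-buffer state. */
struct scan_buffer {
    int at_bol;     /* nonzero if the next token starts at beginning of line */
};

/* With a ^-anchored rule present, every matched token pays for this
 * update, whether or not the rule that matched cares about it. */
void rule_setup_with_bol(struct scan_buffer *b, const char *text, size_t len)
{
    if (len > 0)
        b->at_bol = (text[len - 1] == '\n');
}

/* Without any ^-anchored rule, the hook has nothing left to do. */
void rule_setup_plain(struct scan_buffer *b, const char *text, size_t len)
{
    (void)b;
    (void)text;
    (void)len;
}
```

The work per call is tiny, but it runs once per token, and in a ZTS build the buffer state is reached through thread-local globals rather than a plain pointer as in this sketch.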

BTW, regarding HEAD, for you, Andrei, or anyone, I think I noticed a bug,
unrelated to my changes: CG(literal_type) isn't set for backticks in the
parser, so it winds up being whatever was last set. I would've changed it,
but didn't know if it's supposed to always be IS_STRING or dependent on
UG(unicode).

Thanks,
Matt

----- Original Message -----
From: "Marcus Boerger"
Sent: Thursday, April 26, 2007

Hello Matt,

the patch looks interesting. I think we should commit it to HEAD. And
if it works good we can add it to 5.3 once we created it. Did you do any
measurements?

best regards
marcus

18 years ago by Andrei Zmievski

BTW, regarding HEAD, for you, Andrei, or anyone, I think I noticed a bug,
unrelated to my changes: CG(literal_type) isn't set for backticks in the
parser, so it winds up being whatever was last set. I would've changed it,
but didn't know if it's supposed to always be IS_STRING or dependent on
UG(unicode).

Thanks, I remember leaving that out since we didn't have filesystem
encoding support yet, but it definitely should be there now. Fixed.

Also, I'm for committing your patch to HEAD and 5.3.

-Andrei

18 years ago by Antony Dovgal

BTW, regarding HEAD, for you, Andrei, or anyone, I think I noticed a bug,
unrelated to my changes: CG(literal_type) isn't set for backticks in the
parser, so it winds up being whatever was last set. I would've changed it,
but didn't know if it's supposed to always be IS_STRING or dependent on
UG(unicode).

Thanks, I remember leaving that out since we didn't have filesystem
encoding support yet, but it's definitely should be there now. Fixed.

Also, I'm for committing your patch to HEAD and 5.3.

5.what??

--
Wbr,
Antony Dovgal

18 years ago by Andi Gutmans

Before you commit it, give us a few more days to review it. We need to
be careful before committing major changes to the scanner. If it checks
out I'm definitely fine with committing.

Andi

-----Original Message-----
From: Andrei Zmievski [mailto:andrei@gravitonic.com]
Sent: Thursday, April 26, 2007 8:53 AM
To: Matt Wilmas
Cc: internals@lists.php.net; Marcus Boerger
Subject: Re: [PHP-DEV] [PATCH] Major optimization for
heredocs/interpolated strings

BTW, regarding HEAD, for you, Andrei, or anyone, I think I noticed a bug,
unrelated to my changes: CG(literal_type) isn't set for backticks in the
parser, so it winds up being whatever was last set. I would've changed it,
but didn't know if it's supposed to always be IS_STRING or dependent on
UG(unicode).

Thanks, I remember leaving that out since we didn't have
filesystem encoding support yet, but it's definitely should
be there now. Fixed.

Also, I'm for committing your patch to HEAD and 5.3.

-Andrei

--
To
unsubscribe, visit: http://www.php.net/unsub.php

18 years ago by Marcus Boerger

Hello Andi,

how about you guys review/test it and then commit it if you agree to it?

best regards
marcus

Thursday, April 26, 2007, 7:18:46 PM, you wrote:

Before you commit it, give us a few more days to review it. We need to
be careful before commiting major changes to the scanner. If it checks
out I'm definitely fine with commiting.

Andi

18 years ago by Andi Gutmans

Sure, that makes even more sense.

-----Original Message-----
From: Marcus Boerger [mailto:helly@php.net]
Sent: Thursday, April 26, 2007 10:24 AM
To: Andi Gutmans
Cc: Andrei Zmievski; Matt Wilmas; internals@lists.php.net
Subject: Re: [PHP-DEV] [PATCH] Major optimization for
heredocs/interpolated strings

Hello Andi,

how about you guys review/test it and then commit it if you
agree to it?

best regards
marcus

18 years ago by Matt Wilmas

Hi all,

After an off-list discussion with Dmitry, I've updated the patch, mostly
to remove old rules I had left in (the ones that returned T_CHARACTER, etc.),
since the new "super rules" can match what they did. So now every "piece"
of a non-constant string returns T_ENCAPSED_AND_WHITESPACE. I also added
comments to the new stuff to hopefully help explain the purpose/logic behind
it a bit better. :-)

Dmitry would like to commit the changes on Friday. I don't have the updated
patch for HEAD finished yet, but it's coming soon...

The updated 5.2 patch is at the same address,
http://realplain.com/php/scanner_optimizations_5_2.diff, and the original
was moved to http://realplain.com/php/scanner_optimizations_5_2-v1.diff

Thanks,
Matt

18 years ago by Stanislav Malyshev

The updated 5.2 patch is at the same address,
http://realplain.com/php/scanner_optimizations_5_2.diff, and the original
was moved to http://realplain.com/php/scanner_optimizations_5_2-v1.diff

I've noticed a slight difference between how PHP worked without and with
this patch. Previously, if you had this code:
echo "$arr[123]";

then the generated opcode would have "123" as a string. Now it has it as a
number. Not sure if it's important, just something to be aware of.

Stanislav Malyshev, Zend Products Engineer
stas@zend.com http://www.zend.com/

18 years ago by Stanislav Malyshev

Do we have any good performance testcases for the patch that show how it
improves the parser?

--
Stanislav Malyshev, Zend Products Engineer
stas@zend.com http://www.zend.com/
