[PATCH] New, optimized is_numeric_string, and other number stuff

18 years ago by Matt Wilmas — view source — reply

unread

Hi all,

I rewrote is_numeric_string/unicode to be faster and change a couple things.
The changes being:

Previously, large numbers (very long or "1e500") that became INF were
ignored (Bug #26349), which is not the behavior anywhere else.
Leading whitespace with hex numbers or ones that started with . (" .123")
also caused them to be ignored.
Hex strings were limited to LONG_MAX, and in scripts/parser, ULONG_MAX.
I added a zend_hex_strtod() function to handle numbers > LONG_MAX in both
places. From the previous comments like "strtod() messes up hex numbers,"
it seems there was desire to support them. :-)
Small change, but the string "0x" was considered non-numeric before, but
a partial match of the 0 now (basically to get a more accurate error
level/message with zend_parse_parameters(), for example).

Now the performance... The errno stuff has been removed from is_numeric_*
(and optimized in the parser) to save function calls with thread-safe
libraries (are they used even when ZTS is disabled?). In my tests on
Windows, I saw a 5-15% improvement with longs (less with more digits; on
64-bit systems, it could be slower at 12-15+ digits, but they're not
common). (With HEAD, everything I checked was consistent, but in 5.2, a few
random long tests were slower; must be some compiler weirdness? :-/) So not
much difference there for these changes, BUT doubles are over twice as
fast, and non-numeric string comparisons are up to nearly 3 times faster!
(Slightly less % improvement in Unicode mode.) Yeah, non-numeric strings
are detected very fast, which may be more significant since is_numeric_* is
always used on them (from compare_function(), zendi_smart_strcmp(), etc.).
Also, no number conversion is done if there's no corresponding pointer to
fill -- much faster when code is "just checking."

The larger inline function did increase the binary size by a few K... The
patches:

http://realplain.com/php/is_numeric.diff
http://realplain.com/php/is_numeric_5_2.diff

You can see that I changed MAX_LENGTH_OF_LONG to be accurate on 32-/64-bit,
which my changes rely on. I also fixed a few places where memory
calculations that use it could be too small, in theory.

I wanted to get this in before Ilia's Thursday deadline (if it's still on
:-)), in case it can be applied soon. Finally, don't know if you'd want to
use it as is, but I've attached possible NEWS file updates about this stuff.

Thoughts, questions? Thanks.

Matt

18 years ago by Ilia Alshanetsky — view source — reply

unread

Matt,

The zend_exceptions.c bit seems incorrect. You change +4 to +2 but
there are 4 "text" chars that need to be accounted for there. I am
going over the rest of the patch now.

Ilia Alshanetsky

18 years ago by Matt Wilmas — view source — reply

unread

Hi Ilia,

I actually changed +2 to +4. :-)

In the original message, I didn't clarify that the % improvement with
numbers was obtained using arithmetic operators with one numeric string
operand. And the number changes in the parser just made a few % difference
in my crude way of trying to test it (repeated "<number>; ..." passed to
eval()).

There are also many compiler warnings now in is_numeric_unicode() on
zend_cmp_unicode_and_literal(), because its parameters aren't declared
const...

Matt

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

Matt,

The zend_exceptions.c bit seems incorrect. You change +4 to +2 but
there are 4 "text" chars that need to be accounted for there. I am
going over the rest of the patch now.

Ilia Alshanetsky

18 years ago by Ilia Alshanetsky — view source — reply

unread

Hi Ilia,

I actually changed +2 to +4. :-)

My mistake :)

As far as I can tell all tests still pass with the patch, although we
have a number of is_numeric_string() uses that will need to be
adjusted to accommodate the change to the function.

Ilia Alshanetsky

18 years ago by Matt Wilmas — view source — reply

unread

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

As far as I can tell all tests still pass with the patch, although we
have a number of is_numeric_string() uses that will need to be
adjusted to accommodate the change to the function.

Really, like what? I didn't see anything that stuck out...

Ilia Alshanetsky

Matt

18 years ago by Ilia Alshanetsky — view source — reply

unread

Matt,

Take a look at array.c and wddx.c

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

As far as I can tell all tests still pass with the patch, although we
have a number of is_numeric_string() uses that will need to be
adjusted to accommodate the change to the function.

Really, like what? I didn't see anything that stuck out...

Ilia Alshanetsky

Matt

--

Ilia Alshanetsky

18 years ago by Matt Wilmas — view source — reply

unread

Ilia,

I can't see anything that would be wrong there... Well, I have to go now,
so I won't be able to bother you with another reply for a while. :-)

Matt

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

Matt,

Take a look at array.c and wddx.c

18 years ago by Matt Wilmas — view source — reply

unread

Hi Ilia,

I take it there aren't any issues in array.c or wddx.c?

Anyway, I wanted to point out one more is_numeric_* functionality change
that I skipped in my original message, but I think it's probably a logic bug
in the old code. The end of the old function is:

if (end_ptr_double>end_ptr_long && dval) {
*dval = local_dval;
return IS_DOUBLE;
} else if (end_ptr_long && lval) {
*lval = local_lval;
return IS_LONG;
}
return 0;

It's only reached if a partial numeric match (non-zero allow_errors) is
allowed. I've not tested it, but as far as I can tell, that means if an
is_numeric_* call wanted to know about a partial match and didn't have a
pointer for the corresponding type, 0 would incorrectly be returned (or
possibly IS_LONG in case of a double). I don't know if this is much of a
real issue...

From checking http://lxr.php.net, the first difference (returning 0) exists
in pecl/http/http_functions.c and http_request_api.c.

And either difference exists in ext/curl/streams.c. Again, not tested, but
it looks like the string '1.23foo' (double) would be treated as a long and
incorrectly fill "mr" with 1, but the string '1.23' (full match) would not.

Hope that's correct. I believe the old code was wrong, and it's correct to
treat full and partial matches the same, which my new version does.

Matt

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

Matt,

Take a look at array.c and wddx.c

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

As far as I can tell all tests still pass with the patch, although we
have a number of is_numeric_string() uses that will need to be
adjusted to accommodate the change to the function.

Really, like what? I didn't see anything that stuck out...

Ilia Alshanetsky

Matt

Ilia Alshanetsky

18 years ago by Matt Wilmas — view source — reply

unread

Hi Ilia, all,

I was just looking at zendi_smart_strcmp() and realized something I hadn't
considered (more on that in a sec.). First, and unrelated to the new
is_numeric_*, because _smart_strcmp() uses zend_strtod() if one operand is a
double and the other isn't, it results in ('0.0' == '0x123') being TRUE!
Seems like a bug. I think a (double) cast can be used there instead of
zend_strtod(), since that's what's done in compare_function()? Would be
faster too...

All right, now what's different with my new is_numeric_* functions:
previously is_numeric_* ignored doubles that overflowed (INF), and returned
0 -- think like 500 digits -- and the operands would be compared as strings.
So 2 strings that overflowed wouldn't wrongly be considered equal, I assume.
Now that's different with IS_DOUBLE being returned always, which is
appropriate for most cases like arithmetic. And it wasn't just overflown
numbers that previously returned 0, but underflown also ('1e-1000' or a VERY
small decimal number; they set ERANGE).

I think the simplest way to keep the old behavior with the new is_numeric_*
is to use errno in zendi_smart_strcmp(). That'll make it a little slower,
but no slower on longs than the old version I don't think, and it'll still
be much faster with doubles/non-numeric strings.

Any other thoughts on how it should be handled? That is the correct and
desired behavior to only compare numerically if both values can be
accurately represented? I'll update the patches ASAP once I find out what
to do...

Matt

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

Matt,

Take a look at array.c and wddx.c

----- Original Message -----
From: "Ilia Alshanetsky"
Sent: Tuesday, December 12, 2006

As far as I can tell all tests still pass with the patch, although we
have a number of is_numeric_string() uses that will need to be
adjusted to accommodate the change to the function.

Really, like what? I didn't see anything that stuck out...

Ilia Alshanetsky

Matt

Ilia Alshanetsky

18 years ago by Matt Wilmas — view source — reply

unread

Hi (again) Ilia, all,

After thinking about the zend_smart_strcmp() issue more, I believe a numeric
comparison can be used except when both values overflow and have the
same sign (either INF or -INF). I also think it's OK to say an underflow
(that becomes an even 0.0) of both values can simply be considered a
floating-point precision loss. I've updated the function to reflect this.
Sound like an acceptable/good solution?

The old is_numeric_* function caused a string comparison if either value
overflowed, so an expression like ('123' < str_repeat('1', 500)) was FALSE;
it's now TRUE, which seems correct.

BTW, I noticed another bug that considered '-2000000000' to be greater
than '2000000000' because of overflow subtracting longs. I changed it to
be the same as in compare_function() where this bug was fixed exactly 6
years ago. :-) Finally, speaking of compare_function(), _smart_strcmp() now
uses a (double) cast too instead of _strtod() when dealing with a
long/double combo.

Patches are updated:
http://realplain.com/php/is_numeric.diff
http://realplain.com/php/is_numeric_5_2.diff

Hopefully everything is now in good shape for inclusion. :-?

Matt

----- Original Message -----
From: "Matt Wilmas"
Sent: Monday, December 18, 2006

Hi Ilia, all,

I was just looking at zendi_smart_strcmp() and realized something I hadn't
considered (more on that in a sec.). First, and unrelated to the new
is_numeric_*, because _smart_strcmp() uses zend_strtod() if one operand is
a
double and the other isn't, it results in ('0.0' == '0x123') being TRUE!
Seems like a bug. I think a (double) cast can be used there instead of
zend_strtod(), since that's what's done in compare_function()? Would be
faster too...

All right, now what's different with my new is_numeric_* functions:
previously is_numeric_* ignored doubles that overflowed (INF), and
returned
0 -- think like 500 digits -- and the operands would be compared as
strings.
So 2 strings that overflowed wouldn't wrongly be considered equal, I
assume.
Now that's different with IS_DOUBLE being returned always, which is
appropriate for most cases like arithmetic. And it wasn't just overflown
numbers that previously returned 0, but underflown also ('1e-1000' or a
VERY
small decimal number; they set ERANGE).

I think the simplest way to keep the old behavior with the new
is_numeric_*
is to use errno in zendi_smart_strcmp(). That'll make it a little slower,
but no slower on longs than the old version I don't think, and it'll still
be much faster with doubles/non-numeric strings.

Any other thoughts on how it should be handled? That is the correct and
desired behavior to only compare numerically if both values can be
accurately represented? I'll update the patches ASAP once I find out what
to do...

18 years ago by Andrei Zmievski — view source — reply

unread

Great! Assuming all tests pass, I'm for applying it.

-Andrei

Hi all,

I rewrote is_numeric_string/unicode to be faster and change a couple
things.
The changes being:

Previously, large numbers (very long or "1e500") that became INF
were
ignored (Bug #26349), which is not the behavior anywhere else.

Leading whitespace with hex numbers or ones that started with . ("
.123")
also caused them to be ignored.

Hex strings were limited to LONG_MAX, and in scripts/parser,
ULONG_MAX.
I added a zend_hex_strtod() function to handle numbers > LONG_MAX in
both
places. From the previous comments like "strtod() messes up hex
numbers,"
it seems there was desire to support them. :-)

Small change, but the string "0x" was considered non-numeric
before, but
a partial match of the 0 now (basically to get a more accurate error
level/message with zend_parse_parameters(), for example).

Now the performance... The errno stuff has been removed from
is_numeric_*
(and optimized in the parser) to save function calls with thread-safe
libraries (are they used even when ZTS is disabled?). In my tests on
Windows, I saw a 5-15% improvement with longs (less with more digits;
on
64-bit systems, it could be slower at 12-15+ digits, but they're not
common). (With HEAD, everything I checked was consistent, but in 5.2,
a few
random long tests were slower; must be some compiler weirdness? :-/)
So not
much difference there for these changes, BUT doubles are over twice
as
fast, and non-numeric string comparisons are up to nearly 3 times
faster!
(Slightly less % improvement in Unicode mode.) Yeah, non-numeric
strings
are detected very fast, which may be more significant since
is_numeric_* is
always used on them (from compare_function(), zendi_smart_strcmp(),
etc.).
Also, no number conversion is done if there's no corresponding pointer
to
fill -- much faster when code is "just checking."

The larger inline function did increase the binary size by a few K...
The
patches:

http://realplain.com/php/is_numeric.diff
http://realplain.com/php/is_numeric_5_2.diff

You can see that I changed MAX_LENGTH_OF_LONG to be accurate on
32-/64-bit,
which my changes rely on. I also fixed a few places where memory
calculations that use it could be too small, in theory.

I wanted to get this in before Ilia's Thursday deadline (if it's still
on
:-)), in case it can be applied soon. Finally, don't know if you'd
want to
use it as is, but I've attached possible NEWS file updates about this
stuff.

Thoughts, questions? Thanks.

Matt
<NEWS.diff.txt

18 years ago by Andi Gutmans — view source — reply

unread

I briefly reviewed the patch and it looks interesting. Will take me a bit
longer though and I'd also like some others here to do so.

In any case, I think it's risky for PHP 5.2.1 and would prefer to defer it
to the next release even if I end up being in favor of including the patch.
It changes a lot of subtleties which can be dangerous in such a short time
frame.

Thanks for the patch. Will dig deeper now...

Andi

-----Original Message-----
From: Matt Wilmas [mailto:php_lists@realplain.com]
Sent: Tuesday, December 12, 2006 6:58 AM
To: internals@lists.php.net
Subject: [PHP-DEV] [PATCH] New, optimized is_numeric_string,
and other number stuff

Hi all,

I rewrote is_numeric_string/unicode to be faster and change a
couple things.
The changes being:

Previously, large numbers (very long or "1e500") that
became INF were ignored (Bug #26349), which is not the
behavior anywhere else.

Leading whitespace with hex numbers or ones that started
with . (" .123") also caused them to be ignored.

Hex strings were limited to LONG_MAX, and in
scripts/parser, ULONG_MAX.
I added a zend_hex_strtod() function to handle numbers >
LONG_MAX in both places. From the previous comments like
"strtod() messes up hex numbers,"
it seems there was desire to support them. :-)

Small change, but the string "0x" was considered
non-numeric before, but a partial match of the 0 now
(basically to get a more accurate error level/message with
zend_parse_parameters(), for example).

Now the performance... The errno stuff has been removed from
is_numeric_* (and optimized in the parser) to save function
calls with thread-safe libraries (are they used even when ZTS
is disabled?). In my tests on Windows, I saw a 5-15%
improvement with longs (less with more digits; on 64-bit
systems, it could be slower at 12-15+ digits, but they're not
common). (With HEAD, everything I checked was consistent,
but in 5.2, a few random long tests were slower; must be some
compiler weirdness? :-/) So not much difference there for
these changes, BUT doubles are over twice as fast, and
non-numeric string comparisons are up to nearly 3 times faster!
(Slightly less % improvement in Unicode mode.) Yeah,
non-numeric strings are detected very fast, which may be more
significant since is_numeric_* is always used on them (from
compare_function(), zendi_smart_strcmp(), etc.).
Also, no number conversion is done if there's no
corresponding pointer to fill -- much faster when code is
"just checking."

The larger inline function did increase the binary size by a
few K... The
patches:

http://realplain.com/php/is_numeric.diff
http://realplain.com/php/is_numeric_5_2.diff

You can see that I changed MAX_LENGTH_OF_LONG to be accurate
on 32-/64-bit, which my changes rely on. I also fixed a few
places where memory calculations that use it could be too
small, in theory.

I wanted to get this in before Ilia's Thursday deadline (if
it's still on :-)), in case it can be applied soon. Finally,
don't know if you'd want to use it as is, but I've attached
possible NEWS file updates about this stuff.

Thoughts, questions? Thanks.

Matt