Don't compare zero exponentials in strings as equal

4 years ago by Ben Ramsey

> Hi internals,
>
> PHP's == comparison semantics for strings have a peculiar edge-case, where
> comparisons of the form "0e123" == "0e456" return true, because they are
> interpreted as floating point zero numbers. This is problematic, because
> strings of that form are usually not numbers, but hex-encoded hashes or
> similar.
>
> I'm wondering if it may make sense to special-case the comparison semantics
> to not consider strings of the form "0e[DIGITS]" equal, unless they are
> exactly equal (i.e., fall back to lexicographical if both sides of the
> comparison are zero exponentials).
>
> Here's a possible implementation: https://github.com/php/php-src/pull/6749
>
> Of course, the usual rule that you should always use === still holds, but
> this at least eliminates the most dangerous edge case.
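
For reference, the edge case quoted above can be reproduced in a few lines. This is only a sketch of current behaviour (without the linked pull request applied); the variable names are illustrative.

```php
<?php
// Two distinct strings that both parse as floating-point zero.
$a = '0e123';
$b = '0e456';

var_dump($a == $b);             // bool(true): both are numeric strings equal to 0.0
var_dump($a === $b);            // bool(false): identity compares the strings as strings
var_dump(hash_equals($a, $b));  // bool(false): timing-safe string comparison
var_dump(strcmp($a, $b) === 0); // bool(false): byte-wise comparison
```

For hash-like values, === or hash_equals() avoids the numeric interpretation entirely.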

I encountered a similar situation a few years back.

We were testing whether a value was numeric and, if so, adding 0 to it in order to convert it to an appropriate number type. The code looked something like this:

if (is_numeric($value)) {
    $value += 0;
}

We chose not to do an explicit cast because the string could represent a float or an int, so we wanted the type coercion to do its magic.

We did this before calling json_encode() on the data structure, so that string numbers coming out of a database would be converted to numbers in JSON. For some reason, JSON_NUMERIC_CHECK wasn’t giving us what we wanted, but I can’t recall the issue we were having.

Anyway, we ran into some fun issues with hashes that looked like this:

'131124826899e4096767887418316466'

That value should have remained a string in the JSON output, but is_numeric() returns true for it, so it became INF.
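
A small sketch of what happens to that value, followed by one possible stricter check. The helper name coerceIfPlainNumber() and its regex are assumptions made for illustration; this is not the work-around we actually used.

```php
<?php
$hash = '131124826899e4096767887418316466';

var_dump(is_numeric($hash)); // bool(true): digits, "e", digits reads as a float
var_dump($hash + 0);         // float(INF): the exponent overflows a double

// Hypothetical stricter check: only coerce plain decimal notation, and only
// when the resulting number is finite.
function coerceIfPlainNumber($value)
{
    // The D modifier keeps "$" from matching before a trailing newline.
    if (is_string($value) && preg_match('/^-?\d+(\.\d+)?$/D', $value) === 1) {
        $number = $value + 0;
        if (is_finite((float) $number)) {
            return $number;
        }
    }
    return $value; // anything else is left untouched
}

var_dump(coerceIfPlainNumber('42'));  // int(42)
var_dump(coerceIfPlainNumber('1.5')); // float(1.5)
var_dump(coerceIfPlainNumber($hash)); // the original string, unchanged
```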

We were able to come up with a work-around, but it’s not foolproof.

Cheers,
Ben

4 years ago by Stanislav Malyshev

Hi!

> PHP's == comparison semantics for strings have a peculiar edge-case, where
> comparisons of the form "0e123" == "0e456" return true, because they are
> interpreted as floating point zero numbers. This is problematic, because
> strings of that form are usually not numbers, but hex-encoded hashes or
> similar.

This particular argument makes sense, but in more generic sense I feel
it leads us to a dangerous path of implying it's ok to use "==" to
compare strings, because we'll take care of the corner cases. Which I
think is wrong to imply because there are so many corner cases where it
still doesn't work and probably never will. I mean, "000" == "0000000"
is still true. "010" == "0000010" is still true. "1e23" == "001e023" is
still true. Nobody who applies == to strings and expects it to work out
as string comparison is doing the right thing. If you apply == to
hex-encoded hashes, that code is fubar, and fixing one particular corner
case won't rescue it. So I wonder if fixing one particular corner case
while leaving many others in would do much.
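
The other corner cases mentioned above, as a quick sketch; each pair is a pair of numeric strings, so == compares them as numbers.

```php
<?php
var_dump('000' == '0000000');  // bool(true): both are integer 0
var_dump('010' == '0000010');  // bool(true): both are integer 10
var_dump('1e23' == '001e023'); // bool(true): both are float 1.0E+23

// The identity operator treats all three pairs as different strings.
var_dump('000' === '0000000');  // bool(false)
var_dump('010' === '0000010');  // bool(false)
var_dump('1e23' === '001e023'); // bool(false)
```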

--
Stas Malyshev
smalyshev@gmail.com

4 years ago by Christian Schneider

On 03.03.2021 at 19:21, Stanislav Malyshev smalyshev@gmail.com wrote:

> Nobody who applies == to strings and expects it to work out as string comparison is doing the right thing. If you apply == to hex-encoded hashes, that code is fubar, and fixing one particular corner case won't rescue it. So I wonder if fixing one particular corner case while leaving many others in would do much.

Just brainstorming, but following that logic, if we would like to remedy this in the long run:
a) Should == with two strings where one (or both?) of them is numeric cause an E_NOTICE or E_WARNING that === should be used?
b) Should == with two strings leave them as string? Possibly after a period of notice/warning as described in a)?

I haven't really thought this through but I still wanted to throw it out there.
I'm sure you'll tell me how foolish I am and why this is a very bad idea ;-)

Chris

4 years ago by Kamil Tekiela

Oh, I like Chris's idea. Yes, please.
Let's deprecate numerical comparison when both operands are strings and
remove that behaviour in PHP 9.0.
Type juggling can be useful when one of them is an integer or float but
when both are strings then chances are that this is an error.
Sorry Nikita, but adding special handling for edge-cases is only going to
make things messier.

4 years ago by Christian Schneider

On 03.03.2021 at 21:25, Kamil Tekiela tekiela246@gmail.com wrote:

> Sorry Nikita, but adding special handling for edge-cases is only going to
> make things messier.

I didn't want to say that. Since there is plenty of code out there that might fall into this trap, this intermediate measure might still make sense.
The good old quick-fix - warning - better fix cycle :-)

Chris

4 years ago by Ben Ramsey

> when both are strings then chances are that this is an error.

Except when comparing two values from sources known to provide numbers as strings, such as form input and database results. :-)

Cheers,
Ben

4 years ago by Rowan Tommins

>> when both are strings then chances are that this is an error.
>
> Except when comparing two values from sources known to provide numbers as strings, such as form input and database results. :-)

The juggling only makes a difference if the two sources provide
different representations of the same number - "12345" is equal to
"12345" whether you cast both sides to int or leave both as strings.

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Christian Schneider

On 04.03.2021 at 01:37, Ben Ramsey ben@benramsey.com wrote:

>> when both are strings then chances are that this is an error.
>
> Except when comparing two values from sources known to provide numbers as strings, such as form input and database results. :-)

This would be a problem for leading zeroes and leading/trailing spaces, right?

Leading zeroes theoretically could happen in databases, leading/trailing spaces happen in form input and possibly databases.
Are there other 'common' cases?

Chris

4 years ago by Nikita Popov

On Thu, Mar 4, 2021 at 10:54 AM Christian Schneider cschneid@cschneid.com
wrote:

> On 04.03.2021 at 01:37, Ben Ramsey ben@benramsey.com wrote:
>
>>> when both are strings then chances are that this is an error.
>>
>> Except when comparing two values from sources known to provide numbers
>> as strings, such as form input and database results. :-)
>
> This would be a problem for leading zeroes and leading/trailing spaces,
> right?
>
> Leading zeroes theoretically could happen in databases, leading/trailing
> spaces happen in form input and possibly databases.
> Are there other 'common' cases?

The main one that comes to mind is something like '0' == '0.0'. However,
the real problem is something else: Comparison behavior doesn't affect just
== and !=, but also < and >. And I can see how people would want '2' < '10'
to be true (numeric comparison) rather than false (lexicographical
comparison).

I generally agree that we should remove the special "numeric string"
handling for equality comparisons, and I don't think that removing that
behavior would have a major impact. But we do need to carefully consider
the impact it has on relational operators. There are two ways I can see
this going:

Decouple equality comparison from relational comparison. Don't handle
numeric strings for == and !=, but do handle them for <, >, etc. The
disadvantage is that comparison results may not be trichotomous, e.g. for
"0" op "0.0" all of ==, < and > would return false. (To be fair, this can
already happen in other cases, e.g. non-comparable objects.)

Don't allow relational comparison on strings. If you want to compare
them lexicographically, use strcmp(), otherwise cast to number first.
("Don't allow" here could be a warning to start with.)
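
To make the difference between the two kinds of comparison concrete, here is a short sketch of current behaviour alongside the explicit alternatives mentioned above (strcmp() and casting).

```php
<?php
// Equality: both operands are numeric strings, so they compare as numbers.
var_dump('0' == '0.0');  // bool(true)
var_dump('0' === '0.0'); // bool(false)

// Relational: numeric comparison when both sides are numeric strings...
var_dump('2' < '10');    // bool(true)
// ...but byte-wise comparison as soon as one side is not fully numeric.
var_dump('2' < '10a');   // bool(false)

// Explicit alternatives that do not depend on what the strings contain.
var_dump(strcmp('2', '10') < 0);  // bool(false): lexicographical order
var_dump((int) '2' < (int) '10'); // bool(true): numeric order
```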

Regards,
Nikita

4 years ago by Nikita Popov

Regarding the last point, while I think that lexicographical comparison
with explicit < and > operators is pretty uncommon, sorting an array of
strings and expecting lexicographical order probably isn't unusual. While
SORT_STRING can be passed to enforce that, people probably expect that as
the default behavior. So just not allowing relational comparison is not a
great option either.
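
A sketch of the sorting difference in question, using made-up sample values: with the default flags, numeric strings are ordered numerically, while SORT_STRING gives lexicographical order.

```php
<?php
$values = ['10', '9', '2', '1'];

$default = $values;
sort($default);                 // default flags: standard comparison rules
print_r($default);              // 1, 2, 9, 10

$asStrings = $values;
sort($asStrings, SORT_STRING);  // force string comparison
print_r($asStrings);            // 1, 10, 2, 9
```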

Nikita

4 years ago by Kamil Tekiela

I actually do a lot of lexicographical comparison with ISO8601 date
strings.
I also have numerical string comparisons, but the truth is I didn't think
they would be cast to integers so that is a potential bug in my code. When
I use < on strings I expect that they will both compare as strings. In
other places, I explicitly cast one or both operands to integers or float.
I do see the problem that many people could use < and > on strings
expecting numerical comparison, but I believe most people expect
lexicographical comparison like me.
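
A minimal sketch of the date case, assuming both timestamps use the same fixed-width ISO 8601 format and the same offset (the sample values are made up). Because neither string is numeric, < already compares them as strings; strcmp() makes that intent explicit.

```php
<?php
$earlier = '2021-03-04T10:54:00Z';
$later   = '2021-10-01T09:00:00Z';

var_dump($earlier < $later);            // bool(true): non-numeric strings compare byte-wise
var_dump(strcmp($earlier, $later) < 0); // bool(true): explicitly lexicographical
```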

4 years ago by Rowan Tommins

> The main one that comes to mind is something like '0' == '0.0'. However,
> the real problem is something else: Comparison behavior doesn't affect just
> == and !=, but also < and >. And I can see how people would want '2' < '10'
> to be true (numeric comparison) rather than false (lexicographical
> comparison).

That's a very good point, and I think the existence of the <=> makes
this even more complicated.

Considering your two options:

> Decouple equality comparison from relational comparison. Don't handle
> numeric strings for == and !=, but do handle them for <, >, etc.

What would then be the result of '0' <=> '0.0'? Would the operator need
to special case the fact that they are numerically equal but
lexicographically unequal?

> Don't allow relational comparison on strings. If you want to compare
> them lexicographically, use strcmp(), otherwise cast to number first.

This is easy to implement for the <=> operator, but makes it much less
useful. Part of the appeal of the operator is that you can write code
like $sortCallback = fn($a,$b) => $a[$sortField] <=> $b[$sortField];
without needing different cases for different data types.

Granted, that's not going to use an appropriate sorting collation for
many languages, but nor is strcmp().
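
As a sketch of that use case, the callback can be wrapped in a small helper; sortByField() and the sample rows are hypothetical names and data for illustration only.

```php
<?php
// Hypothetical helper: sort rows by one field using the spaceship operator.
function sortByField(array $rows, string $sortField): array
{
    usort($rows, fn($a, $b) => $a[$sortField] <=> $b[$sortField]);
    return $rows;
}

$rows = [
    ['name' => 'pear',  'qty' => '10'],
    ['name' => 'apple', 'qty' => '9'],
    ['name' => 'fig',   'qty' => '2'],
];

// Numeric strings are ordered numerically: 2, 9, 10.
print_r(array_column(sortByField($rows, 'qty'), 'qty'));
// Non-numeric strings are ordered byte-wise: apple, fig, pear.
print_r(array_column(sortByField($rows, 'name'), 'name'));
```

The same callback handles both columns today only because <=> switches between numeric and string comparison based on the operands.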

I think further narrowing the definition of "numeric string" is a more
useful course. If we were designing from scratch, the straight-forward
definition would be:

all digits: /^\d+$/
or, all digits with leading hyphen-minus: /^-\d+$/
or, at least one digit, a dot, and at least one more digit: /^\d+\.\d+$/
or, as above, but with leading hyphen-minus: /^-\d+\.\d+$/

I think anything beyond that list needs to be carefully justified.

Leading and trailing spaces are probably OK. Other whitespace
(newlines, tabs, etc) probably not.

Alternative notations like hexadecimal and exponentials are easy to
have false positive matches, and how common are they in practice?

Leading and trailing dots (".5", "1.") might be used sometimes, but
I'd probably lean against them.

So, ignoring BC concerns, I would be happy with "numeric string" defined
as "maybe space, maybe hyphen, some digits, maybe a dot and more digits,
maybe space", which I think in regex form looks like /^ *-?\d+(\.\d+)? *$/
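
A rough userland sketch of that definition, for experimenting with what it would accept. The function name isNarrowNumeric() is made up, and the D modifier is my addition: without it, "$" also matches before a trailing newline, which would let "12\n" through.

```php
<?php
// Hypothetical check for the narrower "numeric string" definition:
// optional spaces, optional hyphen, digits, optional dot plus digits, optional spaces.
function isNarrowNumeric(string $value): bool
{
    return preg_match('/^ *-?\d+(\.\d+)? *$/D', $value) === 1;
}

var_dump(isNarrowNumeric('12'));    // bool(true)
var_dump(isNarrowNumeric('-1.5'));  // bool(true)
var_dump(isNarrowNumeric(' 7 '));   // bool(true)
var_dump(isNarrowNumeric('0e123')); // bool(false): exponential notation
var_dump(isNarrowNumeric('0x1A'));  // bool(false): hexadecimal notation
var_dump(isNarrowNumeric('.5'));    // bool(false): leading dot
var_dump(isNarrowNumeric('1.'));    // bool(false): trailing dot
var_dump(isNarrowNumeric("12\n"));  // bool(false): newline is not a plain space
```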

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Nikita Popov

On Thu, Mar 4, 2021 at 1:03 PM Rowan Tommins rowan.collins@gmail.com
wrote:

>> The main one that comes to mind is something like '0' == '0.0'. However,
>> the real problem is something else: Comparison behavior doesn't affect
>> just == and !=, but also < and >. And I can see how people would want
>> '2' < '10' to be true (numeric comparison) rather than false
>> (lexicographical comparison).
>
> That's a very good point, and I think the existence of the <=> makes
> this even more complicated.
>
> Considering your two options:
>
>> Decouple equality comparison from relational comparison. Don't handle
>> numeric strings for == and !=, but do handle them for <, >, etc.
>
> What would then be the result of '0' <=> '0.0'? Would the operator need
> to special case the fact that they are numerically equal but
> lexicographically unequal?

Both '0' <=> '0.0' and '0.0' <=> '0' would return 1 in that case, which is
PHP's indication that values are non-comparable. It's definitely not a good
option.
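
For comparison, this is how the existing "non-comparable" result shows up today. The sketch uses two arrays with the same number of elements but different keys, which, as far as I'm aware, PHP treats as uncomparable.

```php
<?php
$x = ['a' => 1];
$y = ['b' => 1];

var_dump($x <=> $y); // int(1)
var_dump($y <=> $x); // int(1): the same result in both directions

// None of the three relations holds, so the result is not trichotomous.
var_dump($x == $y);  // bool(false)
var_dump($x < $y);   // bool(false)
var_dump($x > $y);   // bool(false)
```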

>> Don't allow relational comparison on strings. If you want to compare
>> them lexicographically, use strcmp(), otherwise cast to number first.
>
> This is easy to implement for the <=> operator, but makes it much less
> useful. Part of the appeal of the operator is that you can write code
> like $sortCallback = fn($a,$b) => $a[$sortField] <=> $b[$sortField];
> without needing different cases for different data types.
>
> Granted, that's not going to use an appropriate sorting collation for
> many languages, but nor is strcmp().
>
> I think further narrowing the definition of "numeric string" is a more
> useful course. If we were designing from scratch, the straight-forward
> definition would be:
>
> all digits: /^\d+$/
>
> or, all digits with leading hyphen-minus: /^-\d+$/
>
> or, at least one digit, a dot, and at least one more digit: /^\d+\.\d+$/
>
> or, as above, but with leading hyphen-minus: /^-\d+\.\d+$/
>
> I think anything beyond that list needs to be carefully justified.
>
> Leading and trailing spaces are probably OK. Other whitespace
> (newlines, tabs, etc) probably not.
>
> Alternative notations like hexadecimal and exponentials are easy to
> have false positive matches, and how common are they in practice?
>
> Leading and trailing dots (".5", "1.") might be used sometimes, but
> I'd probably lean against
>
> So, ignoring BC concerns, I would be happy with "numeric string" defined
> as "maybe space, maybe hyphen, some digits, maybe a dot and more digits,
> maybe space", which I think in regex form looks like /^ *-?\d+(\.\d+)? *$/

A disadvantage of narrowing the definition in such a fashion is that it
introduces a discrepancy with (float) casts. I believe these currently
recognize the same values, with the exception that (float) discards
trailing garbage.

Another disadvantage is that exponential notation is commonly returned for
large numbers by various data sources -- e.g. if you stored a large float in
a database, I'd expect you'd get it back in exponential notation (if you
get it back as a string). This means that your code could suddenly break
because the range of a value passes some heuristic threshold for how it
gets printed.
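
PHP's own float-to-string conversion shows the same effect, which makes for an easy sketch. The exact threshold depends on configuration, so the example only uses a value that is clearly past it; the regex is the narrowed definition from the quoted message, with the D modifier added by me.

```php
<?php
$large = 1.0e25;
$asString = (string) $large;

var_dump($asString);             // exponential notation, e.g. "1.0E+25"
var_dump(is_numeric($asString)); // bool(true): accepted today

// Under the narrowed definition the same value would no longer be numeric.
var_dump(preg_match('/^ *-?\d+(\.\d+)? *$/D', $asString)); // int(0)

// A smaller float still prints in plain notation and would keep matching.
var_dump(preg_match('/^ *-?\d+(\.\d+)? *$/D', (string) 1234.5)); // int(1)
```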

Regards,
Nikita

4 years ago by Rowan Tommins

> A disadvantage of narrowing the definition in such a fashion is that
> it introduces a discrepancy with (float) casts. I believe these
> currently recognize the same values, with the exception that (float)
> discards trailing garbage.

I don't think that's a big problem; as you say, explicit casts are
already more lax than implicit ones, and that's always going to be the
case in some sense, because they never "fail". Opinions may vary, though.

> Another disadvantage is that exponential notation is commonly returned
> for large numbers by various data sources -- e.g. if you stored a large
> float in a database, I'd expect you'd get it back in exponential
> notation (if you get it back as a string). This means that your code
> could suddenly break because the range of a value passes some
> heuristic threshold for how it gets printed.

That may be a more compelling reason, at least given backwards
compatibility requirements. I don't know how common that is, but it
certainly sounds plausible.

Regards,

--
Rowan Tommins
[IMSoP]

4 years ago by Nikita Popov

> Hi internals,
>
> PHP's == comparison semantics for strings have a peculiar edge-case, where
> comparisons of the form "0e123" == "0e456" return true, because they are
> interpreted as floating point zero numbers. This is problematic, because
> strings of that form are usually not numbers, but hex-encoded hashes or
> similar.
>
> I'm wondering if it may make sense to special-case the comparison
> semantics to not consider strings of the form "0e[DIGITS]" equal, unless
> they are exactly equal (i.e., fall back to lexicographical if both sides of
> the comparison are zero exponentials).
>
> Here's a possible implementation: https://github.com/php/php-src/pull/6749
>
> Of course, the usual rule that you should always use === still holds, but
> this at least eliminates the most dangerous edge case.

So, I gather the consensus here is that we should leave this alone as a
lost cause?

Regards,
Nikita