Hello internals,
I noticed that array functions like array_diff(), array_intersect()
etc. use weak comparison.
E.g.
array_diff([0, '', false, null], [null])
only leaves [0].
This makes these functions useless for a number of applications.
Also it can lead to unpleasant surprises, if a developer is not aware
of the silent type casting.
For BC reasons, the existing functions have to remain as they are.
Also, most of them don't really have a place for an extra parameter to
tell them to use strict comparison.
So, as a solution, we would have to introduce new functions.
Has anything like this been proposed in the past?
How would we name the new functions?
Should they be functions or static methods like StrictArray::diff(..)?
I could post an RFC but I want to get some feedback first.
Kind regards
Andreas
On Sat, Nov 11, 2023 at 6:05 PM Andreas Hennings andreas@dqxtech.net
wrote:
Hello internals,
I noticed that array functions like array_diff(), array_intersect()
etc. use weak comparison.
That's not quite correct. Using the example of array_diff, the comparison
is a strict equality check on a string cast of the values. So
array_diff([""], [false]) will indeed be empty
but array_diff(["0"],[false]) will return ["0"].
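To make the rule concrete, here is a small sketch of the cases mentioned so far. The helper name string_cast_equal() is hypothetical and only mirrors the documented "(string) $a === (string) $b" rule; it is not how the engine is implemented.

```php
<?php
// Hypothetical helper mirroring the comparison rule array_diff() is documented to use.
function string_cast_equal(mixed $a, mixed $b): bool
{
    return (string) $a === (string) $b;
}

// null, false and '' all cast to '', so they are removed together; 0 casts to '0'.
$result1 = array_diff([0, '', false, null], [null]);

// '' and false both cast to '', so nothing is left.
$result2 = array_diff([''], [false]);

// '0' casts to '0' while false casts to '', so '0' is kept.
$result3 = array_diff(['0'], [false]);
```

Note that array_diff() preserves the keys of the first array, which matters when comparing results with ===.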
Tbh any use case for whatever array function but with strict comparison is
such an easy thing to implement in userland[1] I'm not bothered about
supporting it in core. But that's just me. I don't generally like the idea
of adding new array_* or str_* functions to the global namespace without
very good cause. There is a precedent for it though, in terms of changes
which have gone through in PHP 8, such as array_is_list or str_starts_with.
[1] Example:
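function array_diff_strict(array $array1, array ...$arrays): array
{
    // Note: unlike array_diff(), this sketch re-indexes the result
    // instead of preserving the keys of $array1.
    $diff = [];
    foreach ($array1 as $value) {
        $found = false;
        foreach ($arrays as $array) {
            // The third argument makes in_array() compare with ===.
            if (in_array($value, $array, true)) {
                $found = true;
                break;
            }
        }
        if (!$found) {
            $diff[] = $value;
        }
    }
    return $diff;
}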
Hello David,
That's not quite correct. Using the example of array_diff, the comparison
is a strict equality check on a string cast of the values. So
array_diff([""], [false]) will indeed be empty
but array_diff(["0"],[false]) will return ["0"].
Thanks, good to know!
So in other words, it is still some kind of weak comparison, but with
different casting rules than '=='.
Still this is not desirable in many cases.
Tbh any use case for whatever array function but with strict comparison is
such an easy thing to implement in userland[1] I'm not bothered about
supporting it in core. But that's just me. I don't generally like the idea
of adding new array_* or str_* functions to the global namespace without
very good cause. There is a precedent for it though, in terms of changes
which have gone through in PHP 8, such as array_is_list or str_starts_with.
I would argue that the strict variants of these functions would be
about as useful as the non-strict ones.
Or in my opinion, they would become preferable over the old functions
for most use cases.
In other words, we could say the old/existing functions should not
have been added to the language.
(of course this does not mean we can or should remove them now)
Regarding performance, I measure roughly a factor of 2 for a diff of
range(0, 500) minus [5], comparing array_diff() vs array_diff_strict()
as proposed here.
So for large arrays or repeated calls it does make a difference.
Regarding the cost of more native functions:
Is the concern more about polluting the global namespace, or about
adding more functions that need to be maintained?
I can see both arguments, but I don't have a clear opinion how these
costs should be weighed.
Cheers
Andreas
Some more results on this.
With the right array having only one element, I can actually optimize
the userland function to be almost as fast as the native function.
However, if I pump up the right array, the difference becomes quite bad.
function array_diff_userland(array $array1, array $array2 = [], array ...$arrays): array {
    if ($arrays) {
        // Process additional arrays only when they exist.
        $arrays = array_map('array_values', $arrays);
        $array2 = array_merge($array2, ...$arrays);
    }
    // This is actually slower, it seems.
    #return array_filter($array1, fn ($value) => !in_array($value, $array2, TRUE));
    $diff = [];
    foreach ($array1 as $k => $value) {
        // Use non-strict `in_array()`, to get a fair comparison with the native function.
        if (!in_array($value, $array2)) {
            $diff[$k] = $value;
        }
    }
    return $diff;
}
$arr = range(0, 500);
$arr2 = range(0, 1500, 2);
$dts = [];
// After each measurement, `$t += $dt` moves $t forward to "now".
$t = microtime(TRUE);
$diff_userland = array_diff_userland($arr, $arr2);
$t += $dts['userland'] = (microtime(TRUE) - $t);
$diff_native = array_diff($arr, $arr2);
$t += $dts['native'] = (microtime(TRUE) - $t);
assert($diff_userland === $diff_native);
// Run both again to detect differences due to warm-up.
$t = microtime(TRUE);
$diff_userland = array_diff_userland($arr, $arr2);
$t += $dts['userland.1'] = (microtime(TRUE) - $t);
$diff_native = array_diff($arr, $arr2);
$t += $dts['native.1'] = (microtime(TRUE) - $t);
assert($diff_userland === $diff_native);
// Now use a right array that has no overlap with the left array.
// Build it before starting the timer, so range() is not measured.
$arr2 = range(501, 1500, 2);
$t = microtime(TRUE);
$diff_userland = array_diff_userland($arr, $arr2);
$t += $dts['userland.2'] = (microtime(TRUE) - $t);
$diff_native = array_diff($arr, $arr2);
$t += $dts['native.2'] = (microtime(TRUE) - $t);
assert($diff_userland === $diff_native);
// Seconds * 1000 * 1000 gives microseconds.
var_export(array_map(fn ($dt) => $dt * 1000 * 1000 . ' µs', $dts));
I see differences of factor 5 up to factor 10.
So to me, this alone is an argument to implement this natively.
The other argument is that it is kind of sad how the current functions
don't behave as one would expect.
Regarding the cost of more native functions:
Is the concern more about polluting the global namespace, or about
adding more functions that need to be maintained?
I can see both arguments, but I don't have a clear opinion how these
costs should be weighed.
The most straightforward option seems to just name the new functions
like array_diff_strict() etc.
But I am happy for other proposals.
Cheers
Andreas
On Sun, Nov 12, 2023 at 8:20 PM Andreas Hennings andreas@dqxtech.net
wrote:
So to me, this alone is an argument to implement this natively.
The other argument is that it is kind of sad how the current functions
don't behave as one would expect.
I'd expect there to be a larger and proportionately increasing performance
difference between array_diff versus array_udiff with callback or a
userland array_diff_strict function the larger the datasets you feed in.
But I'm not sure how common either the use case of diffing arrays of 25,000
or 250,000 elements might be, or needing this comparison to be strict
equality. I suspect the use case where both these conditions apply is very
rare.
But if you want to create an RFC, please go for it. You could add an extra
parameter to these functions after the input arrays, which was a flag for
strict comparison. Whether such a thing with a default value of non-strict
(so not BC breaking) would be considered preferable to new global
functions, I'm not sure. I'd probably go with new functions but maybe
someone else will weigh in with their thoughts.
Andreas,
Just out of curiosity, what is the use case for this? I can't really
think of a practical case where strict checking is needed for these
functions. Usually, you have a really good idea of what is in the
arrays when writing the code and can handle any edge cases (like
nulls, empty strings, etc) long before you reach for these functions.
Robert Landers
Software Engineer
Utrecht NL
Hello Robert,
Andreas,
Just out of curiosity, what is the use case for this? I can't really
think of a practical case where strict checking is needed for these
functions. Usually, you have a really good idea of what is in the
arrays when writing the code and can handle any edge cases (like
nulls, empty strings, etc) long before you reach for these functions.
I could ask the reverse question: When do you ever need a non-strict comparison?
I think in most modern PHP development, you would prefer the strict
comparison version simply because it is simpler and more predictable.
But for real examples.
One thing I remember is array_diff($arr, [null]) to remove NULL
values, without removing empty strings.
Perhaps we could say this is a special case that could be solved in
other ways, because we only remove one value.
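A minimal sketch of that special case, assuming a small mixed array as input: array_diff() removes more than just null, while a strict array_filter() callback is one of the "other ways" that removes only null.

```php
<?php
$arr = ['a', '', null, 0, false];

// array_diff() compares string casts, so '' and false disappear together with null.
$viaDiff = array_diff($arr, [null]);

// A strict filter removes only null. Like array_diff(), it preserves keys.
$viaFilter = array_filter($arr, static fn (mixed $value): bool => $value !== null);
```

This only covers the single-value case; it does not generalize to diffing against an arbitrary second array.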
Another thing is when writing reusable general-purpose functions that
should work for all arrays.
The caller might know the types of the array values, but the developer
of the reusable function does not.
Another problem is if your arrays contain anything that is not
stringable, like objects and arrays.
Maybe I will remember other examples that are more practical.
Btw, as a general note on strict vs non-strict:
In some cases you want a "half strict" comparison, where '5' equals 5,
but true does NOT equal '1'.
But for now I am happy to focus on pure strict comparison.
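For illustration only, one possible reading of "half strict" could look like the sketch below. The function name half_strict_equals() and the exact rule (numbers and numeric strings compare by value, everything else by ===) are my assumptions, not an existing PHP function or a concrete proposal.

```php
<?php
// Hypothetical "half strict" comparison: ints, floats and numeric strings
// are compared by numeric value, all other values must be identical.
function half_strict_equals(mixed $a, mixed $b): bool
{
    $isNumberLike = static fn (mixed $v): bool =>
        is_int($v) || is_float($v) || (is_string($v) && is_numeric($v));

    if ($isNumberLike($a) && $isNumberLike($b)) {
        // Both sides are number-like, so == compares them numerically.
        return $a == $b;
    }
    // Booleans, null, non-numeric strings etc. fall back to strict identity.
    return $a === $b;
}
```

With this rule '5' equals 5, but true does not equal '1' and null does not equal 0.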
Andreas
To unsubscribe, visit: https://www.php.net/unsub.php
Hello Andreas,
Another problem is if your arrays contain anything that is not
stringable. like objects and arrays.
array_udiff() comes to mind here, is there a reason that doesn't work?
It would allow you to 'half-equals' things as well, if you want.
I could ask the reverse question: When do you ever need a non-strict comparison?
You actually answer your own question :)
where '5' equals 5,
One of the most beautiful things about PHP is that null == 0 == false,
or '5' == 5 == 5.0, or 1 == true == 'hello world', which is so
incredibly handy in web-dev that to ignore it is inviting bugs.
Headers may or may not be set, form values may or may not be set, JSON
documents may or may not be missing keys/values, etc. How everything
is coerced is extremely well documented and very obvious after working
with PHP for a while.
I'm reminded of this principle in PHP, quite often:
Be conservative in what you do, be liberal in what you accept from others.
Robert Landers
Software Engineer
Utrecht NL
One of the most beautiful things about PHP is that null == 0 == false, or '5' == 5 == 5.0, or 1 == true == 'hello world', which is so incredibly handy in web-dev that to ignore it is inviting bugs. Headers may or may not be set, form values may or may not be set, JSON documents may or may not be missing keys/values, etc. How everything is coerced is extremely well documented and very obvious after working with PHP for a while.
Sorry, can't help myself... unless you dare to pass NULL to trim(),
or urlencode(), or htmlspecialchars(), or preg_match(), etc, etc.
:-)
Craig
Hello Andreas,
Another problem is if your arrays contain anything that is not
stringable. like objects and arrays.
array_udiff() comes to mind here, is there a reason that doesn't work?
It would allow you to 'half-equals' things as well, if you want.
Yes this would work, with a callback that does strict diff.
But this is going to be much slower than a native array_diff() for
large arrays.
Also, array_udiff() is weird in that it expects a sorting
comparator function rather than a boolean comparator function.
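A rough sketch of what such a callback has to look like, assuming the arrays only contain scalars and null. The name strict_compare() is hypothetical. Because array_udiff() wants an ordering and not just "equal or not", the callback has to invent a consistent order for values of different types:

```php
<?php
// Hypothetical sorting comparator that returns 0 only for identical scalars.
// Assumption: values are int, float, string, bool or null (no arrays, objects, NAN).
function strict_compare(mixed $a, mixed $b): int
{
    // Values of different types are never equal; order them by type name.
    $byType = strcmp(gettype($a), gettype($b));
    if ($byType !== 0) {
        return $byType;
    }
    // Same type: strcmp() avoids numeric-string juggling such as '10' == '1e1'.
    return is_string($a) ? strcmp($a, $b) : $a <=> $b;
}

// Only null is removed; 0, '', false and '0' are kept with their keys.
$strictDiff = array_udiff([0, '', false, null, '0'], [null], 'strict_compare');
```

The ordering by type name is arbitrary; it exists only to satisfy the sorting contract.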
I could ask the reverse question: When do you ever need a non-strict comparison?
You actually answer your own question :)
where '5' equals 5,
One of the most beautiful things about PHP is that null == 0 == false,
or '5' == 5 == 5.0, or 1 == true == 'hello world', which is so
incredibly handy in web-dev that to ignore it is inviting bugs.
One problem of == in PHP and in JavaScript is that it is not transitive.
And I would argue in most cases it does a lot more casting than we
need, in a way that can be perceived as surprising.
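Two small examples of the missing transitivity, as a sketch with arbitrarily chosen values:

```php
<?php
// '0' == false and false == null, but '0' != null.
$ab = ('0' == false);   // '0' is converted to bool false
$bc = (false == null);  // null is converted to bool false
$ac = ('0' == null);    // null is compared as '', and '' is not '0'

// 'a' == true and true == 'b', but 'a' != 'b'.
$xy = ('a' == true);    // non-empty string is truthy
$yz = (true == 'b');
$xz = ('a' == 'b');     // plain string comparison
```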
Headers may or may not be set, form values may or may not be set, JSON
documents may or may not be missing keys/values, etc. How everything
is coerced is extremely well documented and very obvious after working
with PHP for a while.
I imagine a user study would come to a different conclusion than "very obvious".
But ok.
Anyway, time for an RFC.
(I was going to write more, but I should protect this list from my
preaching and arguing)
Andreas
On Sun, Nov 12, 2023 at 8:20 PM Andreas Hennings andreas@dqxtech.net
wrote:
So to me, this alone is an argument to implement this natively.
The other argument is that it is kind of sad how the current functions
don't behave as one would expect.
I'd expect there to be a larger and proportionately increasing performance
difference between array_diff versus array_udiff with callback or a
userland array_diff_strict function the larger the datasets you feed in.
But I'm not sure how common either the use case of diffing arrays of 25,000
or 250,000 elements might be, or needing this comparison to be strict
equality. I suspect the use case where both these conditions apply is very
rare.
The idea is to use the new functions in general-purpose algorithms
without worrying about type coercion or scaling.
But if you want to create an RFC, please go for it. You could add an extra
parameter to these functions after the input arrays, which was a flag for
strict comparison. Whether such a thing with a default value of non-strict
(so not BC breaking) would be considered preferable to new global
functions, I'm not sure. I'd probably go with new functions but maybe
someone else will weigh in with their thoughts.
The extra parameter does not really work for array_diff() or
array_intersect(), because these have variadic parameters.
Or at least it would not be "clean".
Technically we could just check if the last parameter is an array or
not, and if not, we use it as a control flag.
But I don't really like it that much, I prefer clean signatures.
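To illustrate the "last parameter as control flag" idea, here is a userland sketch. The function name array_diff_with_flag() and its signature are hypothetical, and this is not a proposal for how the native function should look:

```php
<?php
// Hypothetical signature: a trailing bool switches on strict comparison.
function array_diff_with_flag(array $array, array|bool ...$args): array
{
    $strict = false;
    // Treat the last argument as the flag if it is not an array.
    if ($args !== [] && is_bool(end($args))) {
        $strict = array_pop($args);
    }
    // The type system cannot express "bool only in last position",
    // so this has to be checked by hand.
    foreach ($args as $arg) {
        if (!is_array($arg)) {
            throw new TypeError('Only the last argument may be a bool flag.');
        }
    }
    if (!$strict) {
        return array_diff($array, ...$args);
    }
    // Strict mode: keep values that are not identical (===) to any value
    // in the other arrays. array_filter() preserves keys.
    return array_filter($array, static function (mixed $value) use ($args): bool {
        foreach ($args as $other) {
            if (in_array($value, $other, true)) {
                return false;
            }
        }
        return true;
    });
}
```

The manual type check in the middle is the part that makes the signature feel unclean to me.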