Hello everybody!
I'd like to open a discussion regarding the behavior of array_unique()
with the SORT_REGULAR flag when used on arrays containing mixed types.
Currently, SORT_REGULAR uses non-strict comparisons, which can lead to
unintentional data loss when values like 100 and "100" are treated as
duplicates. This forces developers to implement user-land workarounds.
Here is a common scenario where this behavior is problematic:
$events = [
    ['id' => 100, 'type' => 'user.login'],         // User event (int)
    ['id' => "100", 'type' => 'system.migration'], // System event (string)
    ['id' => 100, 'type' => 'user.login'],         // Duplicate user event
];
$event_ids = array_column($events, 'id'); // [100, "100", 100]
// Current behavior with `SORT_REGULAR`
$unique_ids = array_unique($event_ids, SORT_REGULAR); // Result: [100]
// The string "100" is lost due to type coercion.
To address this, I propose adding a new flag, SORT_STRICT, which would
use strict (===) comparisons to differentiate between values of different
types.
With the new flag, the result would be:
// Proposed behavior with SORT_STRICT
$unique_ids = array_unique($event_ids, SORT_STRICT); // Result: [100, "100"]
// Both integer and string values are preserved.
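Until such a flag exists, the proposed result can be approximated in userland. A minimal sketch, assuming a hypothetical helper named strict_unique() (not part of PHP) that keeps the first occurrence and preserves keys the way array_unique() does:

```php
<?php
// Hypothetical helper, not part of PHP: deduplicate using strict (===) comparison.
function strict_unique(array $values): array
{
    $result = [];
    foreach ($values as $key => $value) {
        // in_array() with its third argument set to true compares with ===,
        // so int 100 and string "100" are treated as different values.
        if (!in_array($value, $result, true)) {
            $result[$key] = $value;
        }
    }
    return $result;
}

$event_ids = [100, "100", 100];
$loose  = array_unique($event_ids, SORT_REGULAR); // only key 0 (int 100) survives
$strict = strict_unique($event_ids);              // keys 0 (int 100) and 1 (string "100")
```

The scan is quadratic in the number of distinct values, so this only suits small arrays.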
I've already submitted a PR to correct the bug I just highlighted:
PR: https://github.com/php/php-src/pull/20273
The potential for a SORT_NATURAL flag also came to mind as another useful
addition, but I believe SORT_STRICT is the more critical feature to
discuss first.
I look forward to your feedback.
Thanks,
- Jason
I know I find array_unique generally useless due to its insistence on
stringifying everything for comparison.
$uniques = [];
foreach ($source_array as $a) {
    if (!in_array($a, $uniques, true)) {
        $uniques[] = $a;
    }
}
I seem to recall part of the issue is that array_unique works by sorting
its elements so that "equal" values are adjacent. I know this would be
done on O(n log(n)) vs. O(n^2) grounds, but that could be addressed at
least in part by a smarter sort criterion that sorts by type/class (in
some arbitrary order) before sorting by value. For uncomparable types
(i.e., instances of most classes) this would be by object ID, because we
don't actually care about ordering.
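The type-then-value ordering described above could look roughly like this in userland. This is only a sketch under assumptions: both function names are hypothetical, it handles ints, floats (other than NAN), strings, bools, null and objects but not nested arrays, and unlike array_unique() it returns the values in sorted order with fresh keys.

```php
<?php
// Hypothetical comparator: order by type name first, then by value within a type.
function type_then_value_compare(mixed $a, mixed $b): int
{
    // get_debug_type() yields "int", "float", "string", "bool", "null" or a class name.
    $byType = strcmp(get_debug_type($a), get_debug_type($b));
    if ($byType !== 0) {
        return $byType;
    }
    if (is_object($a)) {
        // The order of objects is arbitrary; only adjacency of identical ones matters.
        return spl_object_id($a) <=> spl_object_id($b);
    }
    if (is_string($a)) {
        // strcmp() avoids the numeric-string rules of <=> ("1e1" vs "10").
        return strcmp($a, $b);
    }
    return $a <=> $b; // same-type int, float, bool or null
}

// Hypothetical helper: sort, then drop values identical (===) to their predecessor.
function sorted_strict_unique(array $values): array
{
    $sorted = array_values($values);
    usort($sorted, 'type_then_value_compare');
    $out = [];
    foreach ($sorted as $i => $value) {
        if ($i === 0 || $value !== $sorted[$i - 1]) {
            $out[] = $value;
        }
    }
    return $out;
}

$result = sorted_strict_unique([100, "100", 100]); // one int 100 and one string "100"
```

Restoring the original order and keys would need the original position carried through the sort, which is left out here to keep the idea visible.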
On Sat, Oct 25, 2025 at 01:18, Morgan Weedpacket@varteg.nz wrote:
I know I find array_unique generally useless due to its insistence on
stringifying everything for comparison.
$uniques = [];
foreach ($source_array as $a) {
    if (!in_array($a, $uniques, true)) {
        $uniques[] = $a;
    }
}
I seem to recall part of the issue is that array_unique works by sorting
its elements so that "equal" values are adjacent. I know this would be
done on O(n log(n)) vs. O(n^2) grounds, but that could be addressed at
least in part by a smarter sort criterion that sorts by type/class (in
some arbitrary order) before sorting by value. For uncomparable types
(i.e., instances of most classes) this would be by object ID, because we
don't actually care about ordering.
I would rather propose something like usort: array_uunique(array<T>, callable(T, T): -1|0|1): array<T>.
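To make the suggested signature concrete, here is a rough userland sketch. array_uunique() does not exist in PHP; the implementation below is an assumption about how it might behave (keep the first occurrence, preserve keys), not a description of any proposed patch.

```php
<?php
// Hypothetical userland version of the suggested array_uunique().
// $compare follows the usort() contract: negative, zero or positive int.
function array_uunique(array $array, callable $compare): array
{
    $keys = array_keys($array);
    $pos  = array_flip($keys); // key => original position

    // Sort the keys by value; ties fall back to original position so the
    // earliest element of each run of equal values comes first.
    usort($keys, function ($k1, $k2) use ($array, $compare, $pos) {
        return $compare($array[$k1], $array[$k2]) ?: $pos[$k1] <=> $pos[$k2];
    });

    $keep = [];
    $prev = null;
    foreach ($keys as $i => $k) {
        if ($i === 0 || $compare($array[$prev], $array[$k]) !== 0) {
            $keep[$k] = true; // first element of a new run
        }
        $prev = $k;
    }

    // Filter the original array so order and keys are preserved.
    return array_filter($array, fn($k) => isset($keep[$k]), ARRAY_FILTER_USE_KEY);
}

$caseless = array_uunique(['a', 'A', 'b'], 'strcasecmp');          // keys 0 and 2 remain
$loose    = array_uunique([3, "3", 3.0, 4], fn($a, $b) => $a <=> $b); // keys 0 and 3 remain
```

As with usort(), the result is only well defined when the callback describes a consistent ordering.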
--
Valentin
Correct! Basically:
- SORT_STRING: reliable and predictable when you understand the value will
  be converted to a string
- SORT_NUMERIC: same but risky, you should be certain you're working with
  numbers
- SORT_REGULAR: the sort is unstable and will inevitably cause a bug that
  no one will understand LOL
With the proposed SORT_STRICT, we will get super fast, reliable and
predictable deduplication.
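A small illustration of how the three existing flags treat the same input. The sample values are chosen for this example only; they are not taken from the PR.

```php
<?php
$values = [100, "100", "1e2", 100.0];

// SORT_STRING (the default): every value is compared as a string.
// The string forms are "100", "100", "1e2", "100", so "1e2" is distinct.
$as_string = array_unique($values, SORT_STRING);   // [0 => 100, 2 => "1e2"]

// SORT_NUMERIC: every value is compared as a number, and all four equal 100.
$as_numeric = array_unique($values, SORT_NUMERIC); // [0 => 100]

// SORT_REGULAR: loose (==) comparison; numeric strings compare numerically,
// so "1e2" is also treated as a duplicate of 100.
$as_regular = array_unique($values, SORT_REGULAR); // [0 => 100]
```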
Quick POC:
https://github.com/jmarble/php-src/tree/feature/array-unique-sort-strict
~1.4x faster than this simple userland implementation on my local machine.
I purposefully avoided implementing a hash-bucket because I had already
tried that and encountered too many edge cases LOL:
https://gist.github.com/jmarble/1e08eb15274cd434e867baf96ffa301d
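For readers unfamiliar with the hash-bucket idea, a userland sketch of it follows. This is not the code from the linked gist or the POC branch; the function name is hypothetical, and floats and arrays are rejected outright because they are the kind of edge case such an approach has to settle (for example 0.0 versus -0.0, NAN, nested values).

```php
<?php
// Hypothetical helper: single pass, one hash-map lookup per element.
function hashed_strict_unique(array $values): array
{
    $seen = [];
    $out  = [];
    foreach ($values as $key => $value) {
        // Prefix the bucket key with a type tag so 100 and "100" never collide.
        $bucket = match (true) {
            is_int($value)    => 'i:' . $value,
            is_string($value) => 's:' . $value,
            is_bool($value)   => 'b:' . (int) $value,
            is_null($value)   => 'n',
            is_object($value) => 'o:' . spl_object_id($value),
            default           => throw new InvalidArgumentException(
                'unsupported type: ' . get_debug_type($value)
            ),
        };
        if (!isset($seen[$bucket])) {
            $seen[$bucket] = true;
            $out[$key] = $value; // keep the first occurrence and its key
        }
    }
    return $out;
}

$unique_ids = hashed_strict_unique([100, "100", 100]); // keys 0 and 1 remain
```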
On Fri, Oct 24, 2025 at 4:51 PM Jason Marble <
jmarble@intuitivetechnology.com> wrote:
Correct! Basically:
- SORT_STRING: reliable and predictable when you understand the value
  will be converted to a string
- SORT_NUMERIC: same but risky, you should be certain you're working
  with numbers
- SORT_REGULAR: the sort is unstable and will inevitably cause a bug that
  no one will understand LOL
With the proposed SORT_STRICT, we will get super fast, reliable and
predictable deduplication.
Hi Jason,
Other than the bytes in memory and how they’re laid out, I fail to see how 100 is different from 100. They’re conceptually identical, and array_* functions generally behave by value, not by identity. I think it’s probably wise to take a step back here and evaluate the knock-on effects of something like this:
SORT_REGULAR has some warts, it isn’t perfect. Having a SORT_STRICT sounds kinda nice until you start thinking about it a bit. This parameter has traditionally been used to indicate a "comparison mode" that describes how to compare values. Strict identity is on a completely different axis (they can’t be less/greater than; objects aren’t strictly comparable, but they’re loosely comparable, 1.0 is strictly comparable to 1 or "1"). Further, it begs the question: "can I get a SORT_STRICT_NUMERIC" or "can I get a SORT_STRICT_STRING", which further indicates this is a completely different axis altogether than "just" a different comparison mode.
As to your example, it conflates two namespaces of Ids — user ids and system ids — into a single untyped bag, then asks array_unique() to preserve that boundary. This is a domain distinction, not a language problem. Simply removing your array_column() step in your example arrives at your desired solution.
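The last point can be checked directly: running array_unique() on the event rows themselves, rather than on the extracted ids, keeps both distinct events. The snippet reuses the $events array from the original post.

```php
<?php
$events = [
    ['id' => 100,   'type' => 'user.login'],       // User event (int)
    ['id' => "100", 'type' => 'system.migration'], // System event (string)
    ['id' => 100,   'type' => 'user.login'],       // Duplicate user event
];

// Whole rows are compared loosely: rows 0 and 2 are equal, while row 1
// differs in its 'type' element and therefore survives.
$unique_events = array_unique($events, SORT_REGULAR);
// Keys 0 and 1 remain; the duplicate at key 2 is removed.
```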
— Rob
I mis-typed this:
they can’t be less/greater than; objects aren’t strictly comparable, but they’re loosely comparable, 1.0 is strictly comparable to 1 or "1"
It should have read:
they can’t be less/greater than; objects aren’t strictly comparable, but they’re loosely comparable, 1.0 is not strictly comparable to 1 or "1"
PS. Speaking of "bytes in memory", it might be better to propose a SORT_BINARY. It has the same effect you’re looking for, but arrays of bytes have a lexicographical ordering.
— Rob
Other than the bytes in memory and how they’re laid out, I fail to see
how 100 is different from 100. They’re conceptually identical, and
array_* functions generally behave by value, not by identity.
In the case of objects, "value" and "identity" are the same thing;
without a __toString() method that always produces different strings for
different objects, array_unique() can't be used to deduplicate an array
of objects - which I find myself wanting to do on a fairly regular basis.
$uniques = array_values(array_combine(array_map(spl_object_id(...),
$source_array), $source_array));
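A runnable version of that one-liner, next to what array_unique() does with the same input. The Node class is invented for this example.

```php
<?php
// Hypothetical class used only for illustration.
final class Node
{
    public function __construct(public string $name) {}
}

$a = new Node('a');
$b = new Node('a'); // same property values, different instance
$source_array = [$a, $b, $a];

// Identity-based dedup: key each object by spl_object_id(), then drop the keys.
$uniques = array_values(array_combine(
    array_map(spl_object_id(...), $source_array),
    $source_array
));
// $uniques holds $a once and $b once.

// SORT_REGULAR compares objects loosely (same class, equal properties),
// so $a and $b collapse into a single entry.
$loose = array_unique($source_array, SORT_REGULAR);
// $loose holds only $a.
```

With the default SORT_STRING flag the same call throws an Error, because Node has no __toString() method.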
Object identity and value are different things... https://3v4l.org/uZTsN
That’s literally the entire point of my original Records RFC: https://wiki.php.net/rfc/records — and a userland implementation here: https://github.com/withinboredom/records along with a few nice-to-haves https://github.com/withinboredom/common-records
— Rob
Object identity and value are different things... https://3v4l.org/uZTsN
$white == new Color("white")
That's comparing the values of the objects' properties (which may or may
not be relevant to its "effective value" - the comparison applies to
private properties as well) and considering the aggregate to be the
"value of the object".
Regardless, the comparison is certainly not useful to me (where
recursively grovelling around in the objects' properties would be
prohibitively expensive if not fatal), and doesn't make array_unique()
any more helpful in deduplicating.
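A short sketch of the distinction being discussed. The Color class here is an assumption modeled on the quoted expression; the code behind the 3v4l link is not reproduced.

```php
<?php
// Hypothetical class: the only state is a private property.
final class Color
{
    public function __construct(private string $name) {}
}

$white = new Color('white');
$other = new Color('white');

// == compares class and property values, private ones included.
$same_value = ($white == $other);     // true

// === is true only when both operands are the same instance.
$same_identity = ($white === $other); // false
```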
Rob has convinced me SORT_STRICT is semantically incorrect. I agree
SORT_BINARY has merit, though I'm having difficulty with the implementation.
I think I got too focused on aligning the name with the existing SORT_*
flags. But a perfectly acceptable alternative exists: ARRAY_UNIQUE_STRICT.
I'm aware of the previous effort (https://externals.io/message/118952) made
regarding the flag ARRAY_UNIQUE_IDENTICAL. While this is technically
correct and follows existing convention (e.g. ARRAY_FILTER_USE_*), I
personally feel it's a bit awkward.
ARRAY_UNIQUE_STRICT is, I think, a bit more intuitive. Especially today, as
declare(strict_types=1) has become more common and even encouraged,
particularly for those who love PHPStan level max haha.
Pull it, test it, break it. Let's do this!
https://github.com/php/php-src/compare/master...jmarble:php-src:feature/array-unique-sort-strict
Here's a nice example inspired by Rob's comparison of object identity and
value:
https://gist.github.com/jmarble/c86b5b0b3373498c889bc9c5579105a8
On Sat, Oct 25, 2025 at 2:01 PM Jason Marble <
jmarble@intuitivetechnology.com> wrote:
Rob has convinced me SORT_STRICT is semantically incorrect. I agree
SORT_BINARY has merit, though I'm having difficulty with the implementation.
I think I got too focused on aligning the name with the existing SORT_*
flags. But a perfectly acceptable alternative exists: ARRAY_UNIQUE_STRICT.
I'm aware of the previous effort (https://externals.io/message/118952)
made regarding the flag ARRAY_UNIQUE_IDENTICAL. While this is technically
correct and follows existing convention (e.g. ARRAY_FILTER_USE_*), I
personally feel it's a bit awkward.
ARRAY_UNIQUE_STRICT is, I think, a bit more intuitive. Especially today,
as declare(strict_types=1) has become more common and even encouraged,
particularly for those who love PHPStan level max haha.
Pull it, test it, break it. Let's do this!
https://github.com/php/php-src/compare/master...jmarble:php-src:feature/array-unique-sort-strict
If someone can grant me Karma (username jmarble), I'm happy to start the
process of submitting an RFC for an ARRAY_UNIQUE_STRICT flag.
Thank you!
On Sun, Oct 26, 2025 at 10:50 AM Jason Marble <
jmarble@intuitivetechnology.com> wrote:
Here's a nice example inspired by Rob's comparison of object identity and
value:
https://gist.github.com/jmarble/c86b5b0b3373498c889bc9c5579105a8
If someone can grant me Karma (username jmarble), I'm happy to start the
process of submitting an RFC for an ARRAY_UNIQUE_STRICT flag.
Thank you!
RFC karma granted. Good luck with the RFC!
Christoph
Thank you sir!
On Wed, Oct 29, 2025 at 3:39 AM Christoph M. Becker cmbecker69@gmx.de
wrote:
If someone can grant me Karma (username jmarble), I'm happy to start the
process of submitting an RFC for an ARRAY_UNIQUE_STRICT flag.
Thank you!
RFC karma granted. Good luck with the RFC!
Christoph