Hi internals
There's an open bug report that array_unique doesn't work for enums:
https://github.com/php/php-src/issues/9775
This comes down to the fact that array_unique internally sorts the
array before iterating over it to remove duplicates, and that enums
are intentionally incomparable.
Foo::Bar < Foo::Baz // false
Foo::Baz < Foo::Bar // false
Unfortunately, this means that array_unique might coincidentally work
fine if the array is already sorted, or gets correctly sorted by
chance while breaking otherwise.
To solve this, I propose adding an ARRAY_UNIQUE_IDENTICAL option that
can be passed to array_uniques $flags which uses identical operator
(===) semantics. Internally it uses a new hashmap that allows using
arbitrary PHP values as keys to efficiently remove duplicates. This is
slightly over-engineered for this use case. However, this data
structure will be required for implementing ADTs to deduplicate
instances with the same values. This hashmap is a heavily minimized
version of the teds extensions StrictHashMap [1].
Time complexity of this function is O(n). With the exception of
SORT_STRING
(which uses PHPs existing hashmap in a very similar
fashion and also has O(n)) it should scale better than the other sort
options which are O(n log n).
Here's a link to the implementation:
https://github.com/php/php-src/pull/9882/files
If there are no concerns or complaints I'd like to merge this into PHP
8.3. Otherwise I will create an RFC. Looking forward to your feedback.
Ilija
[1| https://github.com/TysonAndre/pecl-teds/blob/main/teds_stricthashmap.c
To solve this, I propose adding an ARRAY_UNIQUE_IDENTICAL option that
can be passed to array_uniques $flags which uses identical operator
(===) semantics. Internally it uses a new hashmap that allows using
arbitrary PHP values as keys to efficiently remove duplicates. This is
slightly over-engineered for this use case. However, this data
structure will be required for implementing ADTs to deduplicate
instances with the same values. This hashmap is a heavily minimized
version of the teds extensions StrictHashMap [1].
As a regular developer, I look from the usage side, not from the
implementation side.
From such perspective, remembering that filtering unique enums
requires this exact flag
is somewhat "dirty": on the surface, nothing tells that enums are
non-comparable.
You will have to memorize yet another PHP quirk, or be able to build a
logical chain:
- enums are non-comparable by default
- enums have no default string value (if not baked by a string)
- array_unique internally sorts an array
- default flag for array_unique compares the string representations of its items
- thus it won't work for enums in a general case
This requires deeper knowledge of PHP, which is for sure a valuable
skill, but not as wide-spread
as we would like to. Most of the devs will first try to use
array_unique()
with the default flag, perhaps
not even knowing about flags existence. Each usage of such a flag will
not reveal intent by itself
and at least will require a comment for readable codebases.
Perhaps an alternative idea is to provide a default string value for
enums which are not baked,
Nikolas had already brought up this idea earlier.
You will have to memorize yet another PHP quirk, or be able to build a
logical chain:
- enums are non-comparable by default
- enums have no default string value (if not baked by a string)
- array_unique internally sorts an array
- default flag for array_unique compares the string representations of its items
- thus it won't work for enums in a general case
Actually, I think this is already the case for "normal" objects - I had no idea that array_unique used a string cast to compare objects, so am very surprised that it will not consider objects of completely different classes unique, if they happen to have the same string value: https://3v4l.org/UGCvB
Making backed enums work with their backing value would be equally confusing to me - Day::MONDAY and Month::JANUARY might both be backed by a 1, but they are certainly distinct values. I'd much rather get an error that made me check the manual and find a flag than have one of them silently discarded.
Regards,
--
Rowan Tommins
[IMSoP]
You will have to memorize yet another PHP quirk, or be able to build a
logical chain:
- enums are non-comparable by default
- enums have no default string value (if not baked by a string)
- array_unique internally sorts an array
- default flag for array_unique compares the string representations of its items
- thus it won't work for enums in a general case
Actually, I think this is already the case for "normal" objects - I had no idea that array_unique used a string cast to compare objects, so am very surprised that it will not consider objects of completely different classes unique, if they happen to have the same string value: https://3v4l.org/UGCvB
Making backed enums work with their backing value would be equally confusing to me - Day::MONDAY and Month::JANUARY might both be backed by a 1, but they are certainly distinct values. I'd much rather get an error that made me check the manual and find a flag than have one of them silently discarded.
I agree. In my opinion, we should consider to always raise a
warning on attempts to compare incomparable values. As it is now,
silently returning false looks like a footgun to me.
--
Christoph M. Becker
A bit off topic, but not entirely:
In my opinion, adding another flag isn't the real fix. Any function
which does comparisons should take a callable for users to provide any
comparison they wish. An iteratively better API would be:
function array_unique<T>(list<T> $array, callable(T $a, T $b): int
$comparator);
Of course, there are other things like instead of using int for 0
,
-1
, 1
, we could have used an enum but we don't have one today. I
just mean the core idea of taking callable is better than mucking
around with flags while also allowing for custom comparison. Note that
it doesn't necessarily prevent optimizations either. For instance, if
they had passed php_compare
or some function which represents $a <=> $b
, we could identify that just as we identify a specific flag
and take an optimized pass.
Note that as enums aren't "comparable" directly, we could have
provided a custom comparator in this case, no "fix" necessary in core.
Of course, this complaining doesn't fix the situation we are in. My
first impression is that might be better to provide one or more
alternative functions and to deprecate array_unique
.
To avoid noise I will respond to all e-mails at once.
Hi someniatko
Perhaps an alternative idea is to provide a default string value for
enums which are not baked,
Nikolas had already brought up this idea earlier.
As others have mentioned, this opens the door for type coercion issues
and only really works for string enums. Int enums don't have an
obvious string representation. Using the name as a string value is not
intuitive (and would be really confusing when passing int enums to
string parameters) and using the stringified backed int value would
lead to collisions for most enums, as they are most likely to be
continuous sequences starting from 0.
Hi Rowan
Actually, I think this is already the case for "normal" objects - I had no idea that array_unique used a string cast to compare objects, so am very surprised that it will not consider objects of completely different classes unique, if they happen to have the same string value: https://3v4l.org/UGCvB
The default string strategy is not great. Unfortunately, changing this
is likely not an option as there is no clear migration path.
Making backed enums work with their backing value would be equally confusing to me - Day::MONDAY and Month::JANUARY might both be backed by a 1, but they are certainly distinct values. I'd much rather get an error that made me check the manual and find a flag than have one of them silently discarded.
I agree that a warning/error when comparing incompatible values would
be optimal. This is however off-topic and would requrie an RFC, which
I was originally hoping to avoid here.
HI Levi
In my opinion, adding another flag isn't the real fix. Any function
which does comparisons should take a callable for users to provide any
comparison they wish. An iteratively better API would be:function array_unique<T>(list<T> $array, callable(T $a, T $b): int
$comparator);
IMO the fact that the array is sorted (in some cases, not all cases)
is an implementation detail. The identity implementation proposed here
does not sort the array but uses a hashmap internally. A php_compare
function most likely isn't going to be useful outside of array_unique
in which case it likely shouldn't be generally available, especially
if it isn't even used internally but optimized out.
Of course, this complaining doesn't fix the situation we are in. My
first impression is that might be better to provide one or more
alternative functions and to deprecatearray_unique
.
The question is if adding this option makes array_unique worse. I
don't think so. I agree that a dedicated new array_unique with saner
default values might be a good idea, but these two options are
mutually exclusive.
I'm not convinced that providing a version that accepts a closure is
necessary. Nowadays the vast majority of cases should use identity
semantics. Removing duplicates with different comparison strategies
leads to randomness. For ['1e3', 1000] both '1e3' and 1000 are valid
candidates for removal. It would make more sense to canonicalize the
array values, like converting them to integers in this case, before
then using identity semantics.
Anyway, as the feedback wasn't completely unanimous I will create an
RFC for this proposal.
Ilija
Hi Levi Morrison,
A bit off topic, but not entirely:
In my opinion, adding another flag isn't the real fix. Any function
which does comparisons should take a callable for users to provide any
comparison they wish. An iteratively better API would be:function array_unique<T>(list<T> $array, callable(T $a, T $b): int $comparator);
Of course, there are other things like instead of using int for
0
,
-1
,1
, we could have used an enum but we don't have one today. I
just mean the core idea of taking callable is better than mucking
around with flags while also allowing for custom comparison. Note that
it doesn't necessarily prevent optimizations either. For instance, if
they had passedphp_compare
or some function which represents$a <=> $b
, we could identify that just as we identify a specific flag
and take an optimized pass.
- Calling php functions from C is fairly slow. Sorting compared to hash maps is also slow
If we did want a specialized implementation for user-provided
equality criteria, ?callable(T $a): U $iteratee = null
would seems more practical to me
(Calling a function n
times rather than n log n
times would be faster,
and this would work even for cases such as
[$obj->nonObjectField1, $obj->field2->toNormalizedRepresentation()]
instead of using a comparator in most cases
(by putting both arrays into the internal hash map)
(same approach as https://lodash.com/docs/#uniqBy)
-
<=>
isn't a stable order for int/string/float (etc) in various ways,
so some comparators would have issues and return duplicates.It's possible to implement stable comparisons, but I don't really expect enthusiasm for that
https://github.com/TysonAndre/pecl-teds/#stable-comparison -
I expect a majority of cases could use the
ARRAY_UNIQUE_IDENTICAL
directly.
(or use that onarray_map()
ped values to find the keys of the original array to use)
So requiring the use of $comparator would result in longer code and more possible sources of bugs
(e.g. a comparator returning$a - $b
might overflow/underflow int if users don't realize<=>
should be used)
Thanks,
Tyson