Related to the optimisation made by Sara Golemon here:
https://github.com/php/php-src/commit/c74bc87c74f48bc55541b3bf2fc67d595f58a3b5
I often define a function like this, which checks if an array is "vector
like" i.e. has keys 0,1,2..N:
function is_vectorlike(array $a): bool {
$i = 0;
foreach ($a as $k => $v) {
if ($k !== $i++) {
return false;
}
}
return true;
}
The problem is that this function is O(n), but in PHP7 an array that is
vector-like is likely to be packed and without holes (HT_IS_PACKED(x) &&
HT_IS_WITHOUT_HOLES(x)), in which case it is known to be vector-like
without needing to iterate over it, but PHP code can't check for this (nor
should it be able to, since it's really an implementation detail).
Would it be a good idea to define this is_vectorlike() function in the PHP
runtime, so it can short circuit to return true on packed arrays without
holes? The above code would be a suitable polyfill.
Related to the optimisation made by Sara Golemon here:
https://github.com/php/php-src/commit/c74bc87c74f48bc55541b3bf2fc67d595f58a3b5I often define a function like this, which checks if an array is "vector
like" i.e. has keys 0,1,2..N:function is_vectorlike(array $a): bool {
$i = 0;
foreach ($a as $k => $v) {
if ($k !== $i++) {
return false;
}
}
return true;
}The problem is that this function is O(n), but in PHP7 an array that is
vector-like is likely to be packed and without holes (HT_IS_PACKED(x) &&
HT_IS_WITHOUT_HOLES(x)), in which case it is known to be vector-like
without needing to iterate over it, but PHP code can't check for this (nor
should it be able to, since it's really an implementation detail).Would it be a good idea to define this is_vectorlike() function in the PHP
runtime, so it can short circuit to return true on packed arrays without
holes? The above code would be a suitable polyfill.
+1, I've been thinking of making a similar suggestion. We can bikeshed
the name (it should certainly start with "array_"), but I think there's
a very good case for having an optimised implementation built in, given
the opportunities for short-cutting based on representation details.
As an example use case, serialization formats often dynamically switch
between an "array"/"vector"/"list", and a "hash"/"dictionary"/"table". I
came upon this example recently:
https://github.com/php-amqplib/php-amqplib/blob/master/PhpAmqpLib/Wire/AMQPAbstractCollection.php#L218
Regards,
--
Rowan Collins
[IMSoP]
Related to the optimisation made by Sara Golemon here:
https://github.com/php/php-src/commit/c74bc87c74f48bc55541b3
bf2fc67d595f58a3b5I often define a function like this, which checks if an array is "vector
like" i.e. has keys 0,1,2..N:function is_vectorlike(array $a): bool {
$i = 0;
foreach ($a as $k => $v) {
if ($k !== $i++) {
return false;
}
}
return true;
}+1, I've been thinking of making a similar suggestion. We can bikeshed the
name (it should certainly start with "array_"), but I think there's a very
good case for having an optimised implementation built in, given the
opportunities for short-cutting based on representation details.
Wouldn't be better to introduce a new native type "vector"? I know it
sounds obnoxious introducing a new basic data type, but I think having a
workaround like is_vectorlike() wouldn't completely address the problem.
Other than serializations I also felt, many many times, the need to also
type-hint the input parameters with vector:
function do_something(array $a) {
if (!is_vectorlike($a))
throw new Exception("I don't like the input");
// do something with $a
}
I feel like going back to the dark ages when I had to manually check for
is_string()
, is_int()
on the input parameters. Think how much better it
would be to be able to:
function do_something_ex(vector $a) {
// do something with $a
}
Also it would be perfectly acceptable to implicitly cast a vector to an
array, since a vector is basically a packed array.
How does it sound? Complete madness?
$a = [ "a", "b", "c" ];
is_vector($a); // true
is_array($a); // true
unset($a[1]);
is_vector($a); // false
is_array($a); // true
Wouldn't be better to introduce a new native type "vector"? I know it
sounds obnoxious introducing a new basic data type, but I think having
a workaround like is_vectorlike() wouldn't completely address the problem.
...
function do_something_ex(vector $a) {
// do something with $a
}
I can certainly sympathise with the desire to have this built into the
type hint, but I think it leads to awkward questions about what kind of
"type" this would be.
If it was a "real" type, that would imply users could explicitly
construct it in some way, presumably with a new keyword. Indeed, in
strict_types mode, they would presumably be obliged to do so, or cast
their values, leading to noise like do_something_ex( (vector)$some_array );
It would be better in that case to have generics and a built-in List<T>
type, so we could type hint for List<int> without the performance hit of
checking the member types on every function call.
If, on the other hand, it was a pseudo-type only available in type
hints, then it raises the question of why this particular check gets a
keyword but others don't. For instance, positiveInt, nonNegativeInt,
nonEmptyString, finiteFloat, etc, would all be similar in
implementation: "assert type is X and condition Y is true". If not a
real type, then "vector" is fundamentally a constrained domain of
"array" values.
I also wouldn't call is_vectorlike() a workaround, as it serves a
different use case and would be necessary even if we had a vector type
hint. You can't choose a code path based on a type hint, since we don't
have function overloading (which is probably sensible in a dynamic type
system). We can always add the type hint later, as happened with the
"callable" type hint.
$a = [ "a", "b", "c" ];
is_vector($a); // true
is_array($a); // true
unset($a[1]);
is_vector($a); // false
is_array($a); // true
Other than the name of the function, this is exactly how the proposed
function would behave. Note that in this example, "vector" doesn't
behave like a type, since is_array()
returns true, and gettype()
would
still return 'array'.
Regards,
--
Rowan Collins
[IMSoP]
On Tue, May 2, 2017 at 7:55 PM, Rowan Collins rowan.collins@gmail.com
wrote:
+1, I've been thinking of making a similar suggestion. We can bikeshed the
name (it should certainly start with "array_"), but I think there's a very
good case for having an optimised implementation built in, given the
opportunities for short-cutting based on representation details.
Something like array_is_vectorlike(), array_is_vector() or
array_is_sequential() sound good to me. You could also do the opposite,
array_is_assoc().
As an example use case, serialization formats often dynamically switch
between an "array"/"vector"/"list", and a "hash"/"dictionary"/"table". I
came upon this example recently: https://github.com/php-amqplib
/php-amqplib/blob/master/PhpAmqpLib/Wire/AMQPAbstractCollection.php#L218
Another reason this function would be of utility is that it would
discourage even more inefficient implementations that allocate new arrays,
like the one in that code ("array_keys($val) === range(0, count($val) -
1)"), and nearly every single answer to this StackOverflow question (except
mine):
https://stackoverflow.com/questions/173400/how-to-check-if-php-array-is-associative-or-sequential/
For what it's worth, the Drupal assertion inspector calls these "Strict
Arrays" since that's what they are - arrays in the true sense of the term
found in all other languages. What PHP is calling an "array" is more
accurately a "map" or "hash"
There is value in this validation - https://www.drupal.org/SA-CORE-2014-005
relied on the attacker providing an sql injection where the code was
expecting numeric keys.
There's no easy solution - this is just one of those things where PHP is
painted into a corner by backwards compatibility. Calling them vectors just
seems weird though. Calling them strict arrays at a language level has its
own problems.
On Tue, May 2, 2017 at 7:55 PM, Rowan Collins rowan.collins@gmail.com
wrote:+1, I've been thinking of making a similar suggestion. We can bikeshed
the
name (it should certainly start with "array_"), but I think there's a
very
good case for having an optimised implementation built in, given the
opportunities for short-cutting based on representation details.Something like array_is_vectorlike(), array_is_vector() or
array_is_sequential() sound good to me. You could also do the opposite,
array_is_assoc().As an example use case, serialization formats often dynamically switch
between an "array"/"vector"/"list", and a "hash"/"dictionary"/"table". I
came upon this example recently: https://github.com/php-amqplib
/php-amqplib/blob/master/PhpAmqpLib/Wire/AMQPAbstractCollection.php#L218Another reason this function would be of utility is that it would
discourage even more inefficient implementations that allocate new arrays,
like the one in that code ("array_keys($val) === range(0, count($val) -
1)"), and nearly every single answer to this StackOverflow question (except
mine):
https://stackoverflow.com/questions/173400/how-to-check-
if-php-array-is-associative-or-sequential/
For what it's worth, the Drupal assertion inspector calls these "Strict
Arrays" since that's what they are - arrays in the true sense of the term
found in all other languages. What PHP is calling an "array" is more
accurately a "map" or "hash"
I'm not sure on that terminology. Many languages / environments have
"sparse" arrays, arrays with negative indexes, and other variants. It's
true that PHP's "array" type goes well beyond those, but we're talking
about more than "non-associative" here (or I hope we are).
I think possibly the best term would be "list" - a collection where
order is preserved, but there is no separate notion of keys: on a
"list-like array", $foo[42] can be seen as "access the 43rd value", not
"access the value with key 42".
Regards,
--
Rowan Collins
[IMSoP]
For what it's worth, the Drupal assertion inspector calls these "Strict
Arrays" since that's what they are - arrays in the true sense of the term
found in all other languages. What PHP is calling an "array" is more
accurately a "map" or "hash"I'm not sure on that terminology. Many languages / environments have
"sparse" arrays, arrays with negative indexes, and other variants. It's
true that PHP's "array" type goes well beyond those, but we're talking
about more than "non-associative" here (or I hope we are).I think possibly the best term would be "list" - a collection where
order is preserved, but there is no separate notion of keys: on a
"list-like array", $foo[42] can be seen as "access the 43rd value", not
"access the value with key 42".Regards,
The terminology here is, as is often the case, very blurry. Many
programming languages call resizable arrays vectors. Most often this
stems from the actual implementation, or simply because they require a
different name for the keyword. In Java it would be ArrayList
, whereas
Vector
is the thread-safe counter-part to it.
However, I think that vector is the best name.
--
Richard "Fleshgrinder" Fussenegger
The problem is that this function is O(n), but in PHP7 an array that is
vector-like is likely to be packed and without holes (HT_IS_PACKED(x) &&
HT_IS_WITHOUT_HOLES(x)), in which case it is known to be vector-like
without needing to iterate over it,
Just want to underline the "is likely to be" part of this sentence.
The implementation can't assume an array is not vector-like just
because those checks fail. It's possible to manipulate an array into
being vector-like, but not packed, or having holes.
Not really commenting on the proposal yet, which I'm a big
shoulder-shruggy about. Just highlighting the caveat for any eventual
implementation. We can short-circuit the O(n) loop if it is
packed/no-holes, but if not then we need to walk it to be certain.
This is probably fine since this check is likely used in places where
the array is expected to be vector like and non-vector should be a
slow path anyway.
-Sara
Just want to underline the "is likely to be" part of this sentence.
The implementation can't assume an array is not vector-like just
because those checks fail. It's possible to manipulate an array into
being vector-like, but not packed, or having holes.Not really commenting on the proposal yet, which I'm a big
shoulder-shruggy about. Just highlighting the caveat for any eventual
implementation. We can short-circuit the O(n) loop if it is
packed/no-holes, but if not then we need to walk it to be certain.
This is probably fine since this check is likely used in places where
the array is expected to be vector like and non-vector should be a
slow path anyway.
Maybe I'm misunderstanding what "with holes" means in this case, but I
would have expected any array with holes to trivially return false,
here, because we're looking for arrays with a complete set of
consecutive keys.
That aside, there are 3 cases here:
A) arrays that are not vector / list like
B) arrays that have been constructed and manipulated in various ways,
such that they are now vector / list like, but aren't packed in memory
C) arrays that have been consistently constructed and handled as a
"list" / "vector", and will be packed and hole-less in memory
- A can be determined (in userland or internally) by walking the array,
and short-cutting to false when you find the first non-integer or
non-sequential key. - B can only be determined by walking the whole array, right to the end.
- In userland, C is indistinguishable from B, so you have to walk right
to the end. This is where the big win comes in for an internal function,
because we can go from examining all the keys to examining none of them.
I don't know the exact caveats of when the engine packs arrays, but I
would expect C to be a common case, because as you say the function will
often be called when expecting vector-like arrays, and those
vector-like arrays will often have been constructed using only basic
operations like ['a', 'b', 'c'] or $foo[] = 'd'
So the worst case behaviour of an internal function is exactly the same
as an optimised userland implementation; but a fairly common best case
goes from a guaranteed O(n) to constant.
Regards,
--
Rowan Collins
[IMSoP]
On Fri, May 5, 2017 at 8:18 AM, Rowan Collins rowan.collins@gmail.com
wrote:
Maybe I'm misunderstanding what "with holes" means in this case, but I
would have expected any array with holes to trivially return false, here,
because we're looking for arrays with a complete set of consecutive keys.That aside, there are 3 cases here:
A) arrays that are not vector / list like
B) arrays that have been constructed and manipulated in various ways, such
that they are now vector / list like, but aren't packed in memory
C) arrays that have been consistently constructed and handled as a "list"
/ "vector", and will be packed and hole-less in memory
- A can be determined (in userland or internally) by walking the array,
and short-cutting to false when you find the first non-integer or
non-sequential key.- B can only be determined by walking the whole array, right to the end.
- In userland, C is indistinguishable from B, so you have to walk right to
the end. This is where the big win comes in for an internal function,
because we can go from examining all the keys to examining none of them.I don't know the exact caveats of when the engine packs arrays, but I
would expect C to be a common case, because as you say the function will
often be called when expecting vector-like arrays, and those vector-like
arrays will often have been constructed using only basic operations like
['a', 'b', 'c'] or $foo[] = 'd'So the worst case behaviour of an internal function is exactly the same as
an optimised userland implementation; but a fairly common best case goes
from a guaranteed O(n) to constant.
Exactly, and I would add that in the case of a truly associative array (A)
it will return false on the first key that doesn't match 0,1,2…N, which is
likely to be the first one, so even that is likely to be fast.
Putting it in a table:
keys checked keys checked
keys (PHP version) (C version)
0,1,2…N (packed) N 0
0,1,2…N N N
⋮ ⋮ ⋮
0,1,?… 3 3
0,?,?… 2 2
?,?,?… 1 1
The top and bottom rows are the common cases.
A) arrays that are not vector / list like
B) arrays that have been constructed and manipulated in various ways, such
that they are now vector / list like, but aren't packed in memory
C) arrays that have been consistently constructed and handled as a "list" /
"vector", and will be packed and hole-less in memory
- A can be determined (in userland or internally) by walking the array, and
short-cutting to false when you find the first non-integer or non-sequential
key.- B can only be determined by walking the whole array, right to the end.
- In userland, C is indistinguishable from B, so you have to walk right to
the end. This is where the big win comes in for an internal function,
because we can go from examining all the keys to examining none of them.
Correct. The "B" class is the primary case I was referencing. I was
thinking there might be another class: Array manipulated such that it
is now vector like even without holes, but that the flag hadn't been
reset. Looking into zend_hash, I can now see that my assumption about
that was wrong, so nevermind. :)
I don't know the exact caveats of when the engine packs arrays, but I would
expect C to be a common case, because as you say the function will often be
called when expecting vector-like arrays, and those vector-like arrays
will often have been constructed using only basic operations like ['a', 'b',
'c'] or $foo[] = 'd'
Absolutely. Were calls to this written in C, they would be wrapped in
an EXPECTED() macro. I'm not super worried about the perf (which is
certainly no worse than doing it in userland), I just wanted to point
out that not all invocations of is_vectorlike() would be O(1).
That's really all.
So the worst case behaviour of an internal function is exactly the same as
an optimised userland implementation; but a fairly common best case goes
from a guaranteed O(n) to constant.
Yes, which is why at worst, this is just a tiny no-harm
micro-optimization API. And thus my unemotional shrug later in the
message. :)
-Sara
Related to the optimisation made by Sara Golemon here:
https://github.com/php/php-src/commit/c74bc87c74f48bc55541b3bf2fc67d
595f58a3b5I often define a function like this, which checks if an array is "vector
like" i.e. has keys 0,1,2..N:function is_vectorlike(array $a): bool {
$i = 0;
foreach ($a as $k => $v) {
if ($k !== $i++) {
return false;
}
}
return true;
}The problem is that this function is O(n), but in PHP7 an array that is
vector-like is likely to be packed and without holes (HT_IS_PACKED(x) &&
HT_IS_WITHOUT_HOLES(x)), in which case it is known to be vector-like
without needing to iterate over it, but PHP code can't check for this (nor
should it be able to, since it's really an implementation detail).Would it be a good idea to define this is_vectorlike() function in the PHP
runtime, so it can short circuit to return true on packed arrays without
holes? The above code would be a suitable polyfill.
I just read this thread and am wondering what exactly is the use case? Like
are you going to do something if it is vector-like, and do something
different (or not do anything at all) if it's not vector-like? I mean, if
you have to crawl it, and need a vector, why not just call array_values and
guarantee you have a vector?
Also, given the implementation above, why does the array have to be ksorted
correctly to count? Specifically, why isn't this array considered vector
like?
$arr = [
1=> 1,
0=> 0,
];
Wouldn't the following be a better test?
function is_vectorlike(array $a): bool {
$l = count($a);
for (i=0; $i<$l, $i++) {
if (!array_key_exists($i, $a)) {
return false;
}
}
return true;
}
Also, given the implementation above, why does the array have to be
ksorted
correctly to count? Specifically, why isn't this array considered vector
like?
$arr = [
1=> 1,
0=> 0,
];
I suppose the question would be, given that array (or a similar
out-of-order array), should is_vectorlike afford you the expectation that
in a foreach loop:
foreach ($arr as $i => $v) {
// do something
}
that (within the length of the array) the value $arr[$i + 1] has not yet
been processed by the code in the loop, and is also the next item to be
processed.
Or maybe more succinctly, should is_vectorlike guarantee that items are
iterated over in order?
On Tue, May 2, 2017 at 3:13 AM, Jesse Schalken me@jesseschalken.com
wrote:Related to the optimisation made by Sara Golemon here:
https://github.com/php/php-src/commit/c74bc87c74f48bc55541b3bf2fc67d
595f58a3b5I often define a function like this, which checks if an array is "vector
like" i.e. has keys 0,1,2..N:function is_vectorlike(array $a): bool {
$i = 0;
foreach ($a as $k => $v) {
if ($k !== $i++) {
return false;
}
}
return true;
}The problem is that this function is O(n), but in PHP7 an array that is
vector-like is likely to be packed and without holes (HT_IS_PACKED(x) &&
HT_IS_WITHOUT_HOLES(x)), in which case it is known to be vector-like
without needing to iterate over it, but PHP code can't check for this
(nor
should it be able to, since it's really an implementation detail).Would it be a good idea to define this is_vectorlike() function in the
PHP
runtime, so it can short circuit to return true on packed arrays without
holes? The above code would be a suitable polyfill.I just read this thread and am wondering what exactly is the use case? Like
are you going to do something if it is vector-like, and do something
different (or not do anything at all) if it's not vector-like? I mean, if
you have to crawl it, and need a vector, why not just call array_values and
guarantee you have a vector?Also, given the implementation above, why does the array have to be ksorted
correctly to count? Specifically, why isn't this array considered vector
like?
$arr = [
1=> 1,
0=> 0,
];Wouldn't the following be a better test?
function is_vectorlike(array $a): bool {
$l = count($a);
for (i=0; $i<$l, $i++) {
if (!array_key_exists($i, $a)) {
return false;
}
}
return true;
}
I just read this thread and am wondering what exactly is the use case?
Like
are you going to do something if it is vector-like, and do something
different (or not do anything at all) if it's not vector-like? I mean,
if
you have to crawl it, and need a vector, why not just call array_values
and
guarantee you have a vector?
I gave one use case in an earlier message: many serialisation formats have a different form for ordered lists vs key-value pairs. As an obvious example, look at JSON arrays and objects, and many other formats have similar types.
It can also be quite intuitive to have a function accept either a list of prepared items or a set of key-value pairs representing structured data. For instance, HTTP headers might be prepared like ['User-Agent: example'] or structured like ['User-Agent' => 'example']. Turning the second into the first requires more than just running array_values, but can be skipped if the input is known to be vector-like.
Arrays that have integer keys, but not all sequential, can be a bit of an edge case, but treating them as a list throws away information, so it's usually safer to treat them as key-value pairs.
Regards,
--
Rowan Collins
[IMSoP]
On Sat, May 6, 2017 at 9:42 AM, Rowan Collins rowan.collins@gmail.com
wrote:
I just read this thread and am wondering what exactly is the use case?
Like
are you going to do something if it is vector-like, and do something
different (or not do anything at all) if it's not vector-like? I mean,
if
you have to crawl it, and need a vector, why not just call array_values
and
guarantee you have a vector?I gave one use case in an earlier message: many serialisation formats have
a different form for ordered lists vs key-value pairs. As an obvious
example, look at JSON arrays and objects, and many other formats have
similar types.
Ah, so you would do something like
private serializeArray(array $data) {
return is_vector($data) ? $this->serializeVector($data) :
$this->serializeAssoc($data);
}
is that right?
It can also be quite intuitive to have a function accept either a list of
prepared items or a set of key-value pairs representing structured data.
For instance, HTTP headers might be prepared like ['User-Agent: example']
or structured like ['User-Agent' => 'example']. Turning the second into the
first requires more than just running array_values, but can be skipped if
the input is known to be vector-like.
Ah, that's a great example and makes perfect sense now! Thanks :)
Arrays that have integer keys, but not all sequential, can be a bit of an
edge case, but treating them as a list throws away information, so it's
usually safer to treat them as key-value pairs.Regards,
--
Rowan Collins
[IMSoP]
I gave one use case in an earlier message: many serialisation formats
have
a different form for ordered lists vs key-value pairs. As an obvious
example, look at JSON arrays and objects, and many other formats have
similar types.Ah, so you would do something like
private serializeArray(array $data) {
return is_vector($data) ? $this->serializeVector($data) :
$this->serializeAssoc($data);
}is that right?
I do this to pack php arrays into MessagePack's arrays/maps:
https://github.com/rybakit/msgpack.php/blob/master/src/Packer.php#L112-L114
--
Thank you and best regards,
Eugene Leonovich