[RFC] Reflection for references

6 years ago by Nikita Popov — view source

unread

Hi internals,

I'd like to propose the addition of a ReflectionReference class, as
described in the following RFC:
https://wiki.php.net/rfc/reference_reflection

This topic was previously discussed at https://externals.io/message/102638.
The TL;DR is that some libraries need a way to detect references and
determine whether two references are the same. They previously used an ugly
hack to achieve this, but this hack will no longer (reliably) work due to
the introduction of typed properties, so we need to do something about this
for the 7.4 release.

Regards,
Nikita

6 years ago by Marco Pivetta — view source

unread

A few issues with the RFC so far:

ReflectionReference seems to be designed around arrays only: maybe
ReflectionArrayKeyReference or such?
Can the getId() return type be restricted to either int or string?
Why is it a union type right now? Technical limitation?
ReflectionReference#__construct() should be private, since it is
unusable anyway
fromArrayElem => fromArrayElement
is this open for inheritance? If so, what scenarios would fit inheriting
from ReflectionReference?
what happens when ReflectionReference::fromArrayElement() is given
invalid data, such as non-existing keys? I'd expect it to throw.
instead of $ref1->getId() === $ref2->getId(), $ref1->matches($ref2)
or such.
what are possible scenarios for getting the identifier as a primitive,
and then storing it somewhere (like an array of reference identifiers)?
There seems to be a lot of design around ReflectionReference#getId()
to avoid leaking internal pointer information: can it be completely
dropped, if we have $ref1->matches($ref2) instead?

Marco Pivetta

http://twitter.com/Ocramius

http://ocramius.github.com/

Hi internals,

I'd like to propose the addition of a ReflectionReference class, as
described in the following RFC:
https://wiki.php.net/rfc/reference_reflection

This topic was previously discussed at https://externals.io/message/102638
.
The TL;DR is that some libraries need a way to detect references and
determine whether two references are the same. They previously used an ugly
hack to achieve this, but this hack will no longer (reliably) work due to
the introduction of typed properties, so we need to do something about this
for the 7.4 release.

Regards,
Nikita

6 years ago by Nicolas Grekas — view source

unread

Thank you Nikita, I will definitely need that RFC!

Answering to Marco:

ReflectionReference seems to be designed around arrays only: maybe

ReflectionArrayKeyReference or such?

I agree with the wording in the RFC: the array is the primitive data
structure in PHP that holds references.
ALL other data structures can be turned to arrays. That includes objects
with (array) and scopes with get_defined_vars().
Adding eg support for objects directly would open several questions, like:
what about magic methods?, etc.

So to me, we really need the simplest yet most powerful primite, and here
it is, anything else can be built on top.

Can the getId() return type be restricted to either int or string?

A string would be enough I suppose, but I'll let Nikita anwer :)

what are possible scenarios for getting the identifier as a primitive,

Indexing. Eg VarCloner needs an index of already discovered references to
do its job.

There seems to be a lot of design around ReflectionReference#getId()
to avoid leaking internal pointer information: can it be completely
dropped, if we have $ref1->matches($ref2) instead?

Nope, see previous answer. As such, I think matches() would be added
complexity for core with no extra benefit/power for users.

In short, I really like the RFC as is :)
(I removed your other comments Marco because I don't have anything special
to say on them)

Cheers,
Nicolas

6 years ago by Nikita Popov — view source

unread

A few issues with the RFC so far:

ReflectionReference seems to be designed around arrays only: maybe
ReflectionArrayKeyReference or such?

ReflectionReference works with arbitrary references. Arrays only come in as
a detail of construction. Constructors such as fromLocalVariable() or
fromObjectProperty() could be added in the future under this design.

Can the getId() return type be restricted to either int or
string? Why is it a union type right now? Technical limitation?

To allow changes in the future. The real return value here is an int, but
the size increase due to hashing makes it a string. If we find a way to
avoid that (or decided that we don't care about leaking addresses), we
could change this to a (faster) int-based API.

I'd be okay with renaming this to getHash() and restricting the return
value to string. We could add getId() returning int in the future, same as
we did for objects.

ReflectionReference#__construct() should be private, since it is
unusable anyway

Done.

fromArrayElem => fromArrayElement

Done.

is this open for inheritance? If so, what scenarios would fit
inheriting from ReflectionReference?

Marked as final.

what happens when ReflectionReference::fromArrayElement() is given
invalid data, such as non-existing keys? I'd expect it to throw.

It will throw, yes. Added to the RFC:

If $array is not an array or $key is not an integer or string, a
TypeError is thrown. If $array[$key] does not exist, a ReflectionException
is thrown.

instead of $ref1->getId() === $ref2->getId(), $ref1->matches($ref2)
or such.

what are possible scenarios for getting the identifier as a primitive,

and then storing it somewhere (like an array of reference identifiers)?

There seems to be a lot of design around ReflectionReference#getId()

to avoid leaking internal pointer information: can it be completely
dropped, if we have $ref1->matches($ref2) instead?

As Nicolas already wrote, the ID / Hash functionality is needed to allow
hashtable lookups. Having just matches() would require iterating over the
whole set of known references and comparison each and every one. With the
getId() API, it's just a single HT lookup.

We could add $ref1->equals($ref2) as an additional convenience method,
but it would not replace the getId() API.

Thanks for the feedback!
Nikita

6 years ago by Marco Pivetta — view source

unread

instead of $ref1->getId() === $ref2->getId(), $ref1->matches($ref2)

or such.

what are possible scenarios for getting the identifier as a primitive,

and then storing it somewhere (like an array of reference identifiers)?

There seems to be a lot of design around ReflectionReference#getId()

to avoid leaking internal pointer information: can it be completely
dropped, if we have $ref1->matches($ref2) instead?

As Nicolas already wrote, the ID / Hash functionality is needed to allow
hashtable lookups. Having just matches() would require iterating over the
whole set of known references and comparison each and every one. With the
getId() API, it's just a single HT lookup.

We could add $ref1->equals($ref2) as an additional convenience method,
but it would not replace the getId() API.

I'd say that if the indexing is a fairly common scenario (seems like it,
coming from Nicolas, who is the most direct API consumer here), the
->matches() isn't needed at all.

Marco Pivetta

http://twitter.com/Ocramius

http://ocramius.github.com/

6 years ago by Nicolas Grekas — view source

unread

About using "int" as return type for getId():

Can the getId() return type be restricted to either int or

string? Why is it a union type right now? Technical limitation?

To allow changes in the future. The real return value here is an int, but
the size increase due to hashing makes it a string. If we find a way to
avoid that (or decided that we don't care about leaking addresses), we
could change this to a (faster) int-based API.

spl_object_hash() uses a XOR to hide internal addresses - so they're
already "leaking".
we also added spl_object_id() which proved much better in term of memory
usage - not only CPU.

Mixing both arguments/approaches, what about returning a XORed int? Are we
OK with that? It'd be a welcome storage/CPU optimization from my pov.

6 years ago by Nikita Popov — view source

unread

On Tue, Jan 15, 2019 at 1:09 PM Nicolas Grekas nicolas.grekas+php@gmail.com
wrote:

About using "int" as return type for getId():

Can the getId() return type be restricted to either int or

string? Why is it a union type right now? Technical limitation?

To allow changes in the future. The real return value here is an int, but
the size increase due to hashing makes it a string. If we find a way to
avoid that (or decided that we don't care about leaking addresses), we
could change this to a (faster) int-based API.

spl_object_hash() uses a XOR to hide internal addresses - so they're
already "leaking".
we also added spl_object_id() which proved much better in term of memory
usage - not only CPU.

Mixing both arguments/approaches, what about returning a XORed int? Are we
OK with that? It'd be a welcome storage/CPU optimization from my pov.

A XORed int is not much better than returning the address directly -- it
will scramble the base address, but for example not the offsets between the
references (if the XOR key is the same).

If people think that this is good enough for us, then of course it's better
to go with the int for performance reasons. I've picked the sha1 here to be
conservative. How performance critical is this API?

Nikita

6 years ago by Nikita Popov — view source

unread

On Tue, Jan 15, 2019 at 1:09 PM Nicolas Grekas <
nicolas.grekas+php@gmail.com> wrote:

About using "int" as return type for getId():

Can the getId() return type be restricted to either int or

string? Why is it a union type right now? Technical limitation?

To allow changes in the future. The real return value here is an int, but
the size increase due to hashing makes it a string. If we find a way to
avoid that (or decided that we don't care about leaking addresses), we
could change this to a (faster) int-based API.

spl_object_hash() uses a XOR to hide internal addresses - so they're
already "leaking".
we also added spl_object_id() which proved much better in term of memory
usage - not only CPU.

Mixing both arguments/approaches, what about returning a XORed int? Are
we OK with that? It'd be a welcome storage/CPU optimization from my pov.

A XORed int is not much better than returning the address directly -- it
will scramble the base address, but for example not the offsets between the
references (if the XOR key is the same).

If people think that this is good enough for us, then of course it's
better to go with the int for performance reasons. I've picked the sha1
here to be conservative. How performance critical is this API?

Nikita

I'd like to move forward with the RFC as is -- I'm uncomfortable with
switching this to a simple XOR for security reasons. I think we should
start off with the current implementation, with the option of switching to
an integer return value at a later time.

Nikita

6 years ago by Nicolas Grekas — view source

unread

I'd like to move forward with the RFC as is -- I'm uncomfortable with
switching this to a simple XOR for security reasons. I think we should
start off with the current implementation, with the option of switching to
an integer return value at a later time.

Works for me, thank you!

Nicolas

6 years ago by Stanislav Malyshev — view source

unread

Hi!

I'd like to propose the addition of a ReflectionReference class, as
described in the following RFC:
https://wiki.php.net/rfc/reference_reflection

Do I understand correctly that the main use case here is to know if two
variables (treating this term expansively) point, by reference, to the
same zval/location? If so, I'd probably prefer a function that gives
answer to exactly this question.

Currently proposed API sounds a bit weird to me - why it's only
constructed from array element? It looks like a strange limitation.

Additionally, I'm not sure I understand whether there's a case for
"knowing whether array element is a reference to something" without it
being the use case above (i.e. comparing two things). I have no idea
what VarCloner does so it would be useful to have some clarity there.

The TL;DR is that some libraries need a way to detect references and
determine whether two references are the same. They previously used an ugly

Is this the same case or two different cases? Also, why do they need to
do it - what they are doing with this information afterwards?

--
Stas Malyshev
smalyshev@gmail.com

6 years ago by Nikita Popov — view source

unread

On Thu, Jan 17, 2019 at 6:48 AM Stanislav Malyshev smalyshev@gmail.com
wrote:

Hi!

I'd like to propose the addition of a ReflectionReference class, as
described in the following RFC:
https://wiki.php.net/rfc/reference_reflection

Do I understand correctly that the main use case here is to know if two
variables (treating this term expansively) point, by reference, to the
same zval/location? If so, I'd probably prefer a function that gives
answer to exactly this question.

Nearly. The main use case is to compare an arbitrary number of references
efficiently, rather than only two. As an example, consider something like
this:

$refs = [];
foreach ($array as $key => $value) {
$ref = ReflectionReference::fromArrayElement($array, $key);
if (null !== $ref) {
$refs[$ref->getId()][] = $key;
}
}

This will partition the keys into groups that all point to the same
reference. For $array = ['foo' => &$ref1, 'bar' => &$ref1, 'baz' => &$ref2,
'abc' => &$ref2] this will result in $refs = [$ref1Id => ['foo', 'bar'],
$ref2Id => ['baz', 'abc']].

This is not exactly what we want to do, but it illustrates the general type
of code that this API allows you to write.

Currently proposed API sounds a bit weird to me - why it's only

constructed from array element? It looks like a strange limitation.

This is explained in the RFC:

Construction of ReflectionReference instances faces the following
problem: A PHP function can either accept an argument by value or by
reference. This needs to be declared in the function signature, making it
impossible to distinguish whether the passed value was a reference
originally or not. To determine whether or not something is a reference,
access to the parent structure is necessary (which might be an array,
object property table or symbol table).

There is simply no other way to do this PHP short of introducing a new
language construct specifically for this purpose, which I think is overkill.

It's the same issue as debug_zval_dump() has (which strips references) and
why XDebug has a xdebug_debug_zval() function which works on a
pseudo-variable-name syntax instead. If you want to inspect whether
something is a reference, you need access to the parent structure. The most
general parent structure in PHP is an array, because this covers all of
arrays (trivially), objects (via array cast) and even variables (via
get_defined_vars() or $GLOBALS, depending on the kind of variable you're
interested in.)

Additionally, I'm not sure I understand whether there's a case for

"knowing whether array element is a reference to something" without it
being the use case above (i.e. comparing two things). I have no idea
what VarCloner does so it would be useful to have some clarity there.

The TL;DR is that some libraries need a way to detect references and
determine whether two references are the same. They previously used an
ugly

Is this the same case or two different cases? Also, why do they need to
do it - what they are doing with this information afterwards?

Nicolas can explain that better than me, but from what I understand
VarCloner converts a PHP value (of arbitrary complexity) into an object
representation that's more amenable to further processing. E.g. instead of
having PHP references, there will be a common object that's used everywhere
a certain PHP reference was used. So the first time we see a reference we
want to create a stub object for it, and then if we see a reference again
(stored in a different place) we want to replace it with the existing stub
object.

Nikita