Hi internals,
As you are surely aware, serialization in PHP is a big mess. Said mess is
caused by some fundamental issues in the serialization format, and
exacerbated by the existence of the Serializable interface. Fixing the
serialization format is likely not possible at this point, but we can
replace Serializable with a better alternative and I'd like to start a
discussion on that.
The problem is essentially that Serializable::serialize() is expected to
return a string, which is generally obtained by recursively calling
serialize()
in the Serializable::serialize() implementation. This
serialize()
call shares state information with the outer serialize()
, to
ensure that two references to the same object (or the same reference) will
continue referring to a single object/reference after serialization.
This causes two big issues:
First, the implementation is highly order-dependent. If
Serializable::serialize() contains multiple calls to serialize()
, then
calls to unserialize()
have to be repeated in the same order in
Serializable::unserialize(), otherwise unserialization may fail or be
corrupted. In particular this means that using parent::serialize() and
parent::unserialize() is unsafe. (See also
https://bugs.php.net/bug.php?id=66052 and linked bugs.)
Second, the existence of Serializable introduces security issues that we
cannot fix. Allowing the execution of PHP code during unserialization is
unsafe, and even innocuous looking code is easily exploited. We have
recently mitigated __wakeup() based attacks by delaying __wakeup() calls
until the end of the unserialization. We cannot do the same for
Serializable::unserialize() calls, as their design strictly requires the
unserialization context to still be active during the call. Similarly,
Serializable prevents an up-front validation pass of the serialized string,
as the format used for Serializable objects is user-defined.
The delayed __wakeup() mitigation mentioned in the previous point also
interacts badly with Serializable, because we have to delay __wakeup()
calls to the end of the unserialization, which in particular also implies
that Serializable::unserialize() sees objects prior to wakeup. (See also
https://bugs.php.net/bug.php?id=74436.)
In the end, everything comes down to the fact that Serializable requires
nested serialization calls with context sharing.
The alternative mechanism (__sleep + __wakeup) does not have these issues
(anymore), but it is not sufficiently flexible for general use: Notably,
__sleep() allows you to limit which properties are serialized, but the
properties still have to actually exist on the object.
I'd like to propose the addition of a new mechanism which essentially works
the same way as Serializable, but uses arrays instead of strings and does
not share context. I'm not sure about the naming (RealSerializable,
anyone?), so I'll just go with magic methods __serialize() and
__unserialize() for now:
public function __serialize() : array;
public function __unserialize(array $data) : void;
From a userland perspective the implementation should be the same as for
Serializable methods, but with interior serialize()
/unserialize() calls
stripped out. Right now Serializable implementations already usually work
by doing something like "return serialize([ ... ])", this would change it
to just "return [ ... ]" and move the serialize()
/unserialize() call into
the engine, where we can perform it safely and robustly.
The new methods should reuse the "O" serialization format, rather than
introducing a new one. This allows a measure of interoperability with
previous PHP versions, which can still decode serialized strings from newer
versions using __wakeup().
If an object has both __wakeup() and __unserialize(), then __unserialize()
should be called. If an object implements both Serializable::unserialize()
and __unserialize(), then we should invoke one or the other based on
whether "C" or "O" serialization is used.
Thoughts?
Nikita
I know my voice is doesn't mean anything but IMHO interface with magic
methods could bring more inconsistency.
I know PHP is consistently inconsistent but I would prefer if it is
posdible to fix an issue with present method naming.
Cheers
21.04.2017 11:39 "Nikita Popov" nikita.ppv@gmail.com napisał(a):
Hi internals,
As you are surely aware, serialization in PHP is a big mess. Said mess is
caused by some fundamental issues in the serialization format, and
exacerbated by the existence of the Serializable interface. Fixing the
serialization format is likely not possible at this point, but we can
replace Serializable with a better alternative and I'd like to start a
discussion on that.The problem is essentially that Serializable::serialize() is expected to
return a string, which is generally obtained by recursively calling
serialize()
in the Serializable::serialize() implementation. This
serialize()
call shares state information with the outerserialize()
, to
ensure that two references to the same object (or the same reference) will
continue referring to a single object/reference after serialization.This causes two big issues:
First, the implementation is highly order-dependent. If
Serializable::serialize() contains multiple calls toserialize()
, then
calls tounserialize()
have to be repeated in the same order in
Serializable::unserialize(), otherwise unserialization may fail or be
corrupted. In particular this means that using parent::serialize() and
parent::unserialize() is unsafe. (See also
https://bugs.php.net/bug.php?id=66052 and linked bugs.)Second, the existence of Serializable introduces security issues that we
cannot fix. Allowing the execution of PHP code during unserialization is
unsafe, and even innocuous looking code is easily exploited. We have
recently mitigated __wakeup() based attacks by delaying __wakeup() calls
until the end of the unserialization. We cannot do the same for
Serializable::unserialize() calls, as their design strictly requires the
unserialization context to still be active during the call. Similarly,
Serializable prevents an up-front validation pass of the serialized string,
as the format used for Serializable objects is user-defined.The delayed __wakeup() mitigation mentioned in the previous point also
interacts badly with Serializable, because we have to delay __wakeup()
calls to the end of the unserialization, which in particular also implies
that Serializable::unserialize() sees objects prior to wakeup. (See also
https://bugs.php.net/bug.php?id=74436.)In the end, everything comes down to the fact that Serializable requires
nested serialization calls with context sharing.The alternative mechanism (__sleep + __wakeup) does not have these issues
(anymore), but it is not sufficiently flexible for general use: Notably,
__sleep() allows you to limit which properties are serialized, but the
properties still have to actually exist on the object.I'd like to propose the addition of a new mechanism which essentially works
the same way as Serializable, but uses arrays instead of strings and does
not share context. I'm not sure about the naming (RealSerializable,
anyone?), so I'll just go with magic methods __serialize() and
__unserialize() for now:public function __serialize() : array; public function __unserialize(array $data) : void;
From a userland perspective the implementation should be the same as for
Serializable methods, but with interiorserialize()
/unserialize() calls
stripped out. Right now Serializable implementations already usually work
by doing something like "return serialize([ ... ])", this would change it
to just "return [ ... ]" and move theserialize()
/unserialize() call into
the engine, where we can perform it safely and robustly.The new methods should reuse the "O" serialization format, rather than
introducing a new one. This allows a measure of interoperability with
previous PHP versions, which can still decode serialized strings from newer
versions using __wakeup().If an object has both __wakeup() and __unserialize(), then __unserialize()
should be called. If an object implements both Serializable::unserialize()
and __unserialize(), then we should invoke one or the other based on
whether "C" or "O" serialization is used.Thoughts?
Nikita
On Fri, Apr 21, 2017 at 2:47 PM, Michał Brzuchalski <
michal.brzuchalski@gmail.com> wrote:
I know my voice is doesn't mean anything but IMHO interface with magic
methods could bring more inconsistency.I know PHP is consistently inconsistent but I would prefer if it is
posdible to fix an issue with present method naming.Cheers
Magic methods have a distinct backwards compatibility advantage. They allow
you to add __serialize/__unserialize to an existing class that currently
uses Serializable. Older PHP versions will then use the Serializable
implementation, while never versions will use __serialize/__unserialize. An
interface makes this a lot more complicated, because you either have to
bump your PHP version requirement (unlikely), or you have to provide a shim
interface for older PHP versions. This shim interface would then be part of
any library currently implementing Serializable, which seems sub-optimal to
me. That's why I think magic methods are better for this case (though I
don't strongly care).
Also, to answer an OTR question: We cannot simply reuse the Serializable
interface by allowing an array return value from
Serializable::unserialize(). The array return value is only a means to an
end: the important part is that the new serialization mechanism does not
share serialization state -- using arrays instead of strings is just a
convenient way to achieve this. However, Serializable::unserialize()
currently shares the state and we cannot change this without breaking BC --
so we cannot reuse this interface.
Nikita
21.04.2017 11:39 "Nikita Popov" nikita.ppv@gmail.com napisał(a):
Hi internals,
As you are surely aware, serialization in PHP is a big mess. Said mess is
caused by some fundamental issues in the serialization format, and
exacerbated by the existence of the Serializable interface. Fixing the
serialization format is likely not possible at this point, but we can
replace Serializable with a better alternative and I'd like to start a
discussion on that.The problem is essentially that Serializable::serialize() is expected to
return a string, which is generally obtained by recursively calling
serialize()
in the Serializable::serialize() implementation. This
serialize()
call shares state information with the outerserialize()
, to
ensure that two references to the same object (or the same reference) will
continue referring to a single object/reference after serialization.This causes two big issues:
First, the implementation is highly order-dependent. If
Serializable::serialize() contains multiple calls toserialize()
, then
calls tounserialize()
have to be repeated in the same order in
Serializable::unserialize(), otherwise unserialization may fail or be
corrupted. In particular this means that using parent::serialize() and
parent::unserialize() is unsafe. (See also
https://bugs.php.net/bug.php?id=66052 and linked bugs.)Second, the existence of Serializable introduces security issues that we
cannot fix. Allowing the execution of PHP code during unserialization is
unsafe, and even innocuous looking code is easily exploited. We have
recently mitigated __wakeup() based attacks by delaying __wakeup() calls
until the end of the unserialization. We cannot do the same for
Serializable::unserialize() calls, as their design strictly requires the
unserialization context to still be active during the call. Similarly,
Serializable prevents an up-front validation pass of the serialized
string,
as the format used for Serializable objects is user-defined.The delayed __wakeup() mitigation mentioned in the previous point also
interacts badly with Serializable, because we have to delay __wakeup()
calls to the end of the unserialization, which in particular also implies
that Serializable::unserialize() sees objects prior to wakeup. (See also
https://bugs.php.net/bug.php?id=74436.)In the end, everything comes down to the fact that Serializable requires
nested serialization calls with context sharing.The alternative mechanism (__sleep + __wakeup) does not have these issues
(anymore), but it is not sufficiently flexible for general use: Notably,
__sleep() allows you to limit which properties are serialized, but the
properties still have to actually exist on the object.I'd like to propose the addition of a new mechanism which essentially
works
the same way as Serializable, but uses arrays instead of strings and does
not share context. I'm not sure about the naming (RealSerializable,
anyone?), so I'll just go with magic methods __serialize() and
__unserialize() for now:public function __serialize() : array; public function __unserialize(array $data) : void;
From a userland perspective the implementation should be the same as for
Serializable methods, but with interiorserialize()
/unserialize() calls
stripped out. Right now Serializable implementations already usually work
by doing something like "return serialize([ ... ])", this would change it
to just "return [ ... ]" and move theserialize()
/unserialize() call into
the engine, where we can perform it safely and robustly.The new methods should reuse the "O" serialization format, rather than
introducing a new one. This allows a measure of interoperability with
previous PHP versions, which can still decode serialized strings from
newer
versions using __wakeup().If an object has both __wakeup() and __unserialize(), then __unserialize()
should be called. If an object implements both Serializable::unserialize()
and __unserialize(), then we should invoke one or the other based on
whether "C" or "O" serialization is used.Thoughts?
Nikita
Please allow me to up vote for this proposal.
I'm working on some serialization logic these days, and Serializable
is
totally broken: it breaks internal references in serialized data
structures, and breaks custom serializers (e.g. igbinaryà from inspecting
nested structures.
The proposal here would just fix all these issues.
+1000 from me for what it matters :)
2017-04-21 15:50 GMT+02:00 Nikita Popov nikita.ppv@gmail.com:
On Fri, Apr 21, 2017 at 2:47 PM, Michał Brzuchalski <
michal.brzuchalski@gmail.com> wrote:I know my voice is doesn't mean anything but IMHO interface with magic
methods could bring more inconsistency.I know PHP is consistently inconsistent but I would prefer if it is
posdible to fix an issue with present method naming.Cheers
Magic methods have a distinct backwards compatibility advantage. They allow
you to add __serialize/__unserialize to an existing class that currently
uses Serializable. Older PHP versions will then use the Serializable
implementation, while never versions will use __serialize/__unserialize. An
interface makes this a lot more complicated, because you either have to
bump your PHP version requirement (unlikely), or you have to provide a shim
interface for older PHP versions. This shim interface would then be part of
any library currently implementing Serializable, which seems sub-optimal to
me. That's why I think magic methods are better for this case (though I
don't strongly care).Also, to answer an OTR question: We cannot simply reuse the Serializable
interface by allowing an array return value from
Serializable::unserialize(). The array return value is only a means to an
end: the important part is that the new serialization mechanism does not
share serialization state -- using arrays instead of strings is just a
convenient way to achieve this. However, Serializable::unserialize()
currently shares the state and we cannot change this without breaking BC --
so we cannot reuse this interface.Nikita
21.04.2017 11:39 "Nikita Popov" nikita.ppv@gmail.com napisał(a):
Hi internals,
As you are surely aware, serialization in PHP is a big mess. Said mess
is
caused by some fundamental issues in the serialization format, and
exacerbated by the existence of the Serializable interface. Fixing the
serialization format is likely not possible at this point, but we can
replace Serializable with a better alternative and I'd like to start a
discussion on that.The problem is essentially that Serializable::serialize() is expected to
return a string, which is generally obtained by recursively calling
serialize()
in the Serializable::serialize() implementation. This
serialize()
call shares state information with the outerserialize()
, to
ensure that two references to the same object (or the same reference)
will
continue referring to a single object/reference after serialization.This causes two big issues:
First, the implementation is highly order-dependent. If
Serializable::serialize() contains multiple calls toserialize()
, then
calls tounserialize()
have to be repeated in the same order in
Serializable::unserialize(), otherwise unserialization may fail or be
corrupted. In particular this means that using parent::serialize() and
parent::unserialize() is unsafe. (See also
https://bugs.php.net/bug.php?id=66052 and linked bugs.)Second, the existence of Serializable introduces security issues that we
cannot fix. Allowing the execution of PHP code during unserialization is
unsafe, and even innocuous looking code is easily exploited. We have
recently mitigated __wakeup() based attacks by delaying __wakeup() calls
until the end of the unserialization. We cannot do the same for
Serializable::unserialize() calls, as their design strictly requires the
unserialization context to still be active during the call. Similarly,
Serializable prevents an up-front validation pass of the serialized
string,
as the format used for Serializable objects is user-defined.The delayed __wakeup() mitigation mentioned in the previous point also
interacts badly with Serializable, because we have to delay __wakeup()
calls to the end of the unserialization, which in particular also
implies
that Serializable::unserialize() sees objects prior to wakeup. (See also
https://bugs.php.net/bug.php?id=74436.)In the end, everything comes down to the fact that Serializable requires
nested serialization calls with context sharing.The alternative mechanism (__sleep + __wakeup) does not have these
issues
(anymore), but it is not sufficiently flexible for general use: Notably,
__sleep() allows you to limit which properties are serialized, but the
properties still have to actually exist on the object.I'd like to propose the addition of a new mechanism which essentially
works
the same way as Serializable, but uses arrays instead of strings and
does
not share context. I'm not sure about the naming (RealSerializable,
anyone?), so I'll just go with magic methods __serialize() and
__unserialize() for now:public function __serialize() : array; public function __unserialize(array $data) : void;
From a userland perspective the implementation should be the same as for
Serializable methods, but with interiorserialize()
/unserialize() calls
stripped out. Right now Serializable implementations already usually
work
by doing something like "return serialize([ ... ])", this would change
it
to just "return [ ... ]" and move theserialize()
/unserialize() call
into
the engine, where we can perform it safely and robustly.The new methods should reuse the "O" serialization format, rather than
introducing a new one. This allows a measure of interoperability with
previous PHP versions, which can still decode serialized strings from
newer
versions using __wakeup().If an object has both __wakeup() and __unserialize(), then
__unserialize()
should be called. If an object implements both
Serializable::unserialize()
and __unserialize(), then we should invoke one or the other based on
whether "C" or "O" serialization is used.Thoughts?
Nikita
-----Original Message-----
From: Nikita Popov [mailto:nikita.ppv@gmail.com]
Sent: Friday, April 21, 2017 12:39 PM
To: PHP internals internals@lists.php.net
Subject: [PHP-DEV] A replacement for the Serializable interfaceHi internals,
I'd like to propose the addition of a new mechanism which essentially works the
same way as Serializable, but uses arrays instead of strings and does not share
context. I'm not sure about the naming (RealSerializable, anyone?), so I'll just go
with magic methods __serialize() and
__unserialize() for now:public function __serialize() : array; public function __unserialize(array $data) : void;
I think this is a very interesting idea that's definitely worth exploring.
One thing I think is worth doing is reviewing some of the countless issues we've had with unserialize()
over the last year, and try to estimate how many of those would not be relevant with this new approach (which you may have already done). If most of them become irrelevant then this is probably the right direction to go towards. Would you go as far as saying that you think that would be safe to run on untrusted input?
Thoughts?
Generally, thumbs up from me.
Thanks!
Zeev