unserialize() & unicode issues

19 years ago by Antony Dovgal

Hello all.

I'm currently working on Unicode support in serialize()/unserialize() and am stuck on some issues.
Here they are:

What should happen when unserializing serialized Unicode strings while unicode_semantics is Off?
I presume it's safe to create and return IS_UNICODE in this case?
Class names are serialized without a U: or s: prefix, but I can detect a Unicode string by its leading "".
It looks kind of tricky, but on the other hand a forward slash can't appear there if it's not Unicode.
Or should I change it to use U:/s: prefixes? (I haven't tried it yet, so I can't say how difficult it would be.)

The other problem here is that we can't use Unicode class names when unicode_semantics is Off, because in this case class_table stores them as IS_STRING and we won't be able to find the class entry by its Unicode name (thanks to Val for noticing this).

Currently serialize() produces valid \u0000 sequences, which can be parsed and restored perfectly fine when reading them from a file or taking them straight from serialize().
But specifying them in a constant string won't work, as these sequences get parsed at compile time.

Short example:
<?php
var_dump(unserialize('U:2:"\u0061\u0061";')); // won't work
var_dump(unserialize(serialize("aa"))); // works
var_dump('U:2:"\u0061\u0061";'); //produces unicode(9) "U:2:"aa";"
?>
IMO the best way here is to change serialize() output to produce something else (for example \pu0000 instead of \u0000) - in this case it works just fine.

Comments?

--
Wbr,
Antony Dovgal
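
Editorial sketch, not part of the original thread: the compile-time escape problem described in this message belongs to the Unicode development branch and cannot be reproduced literally on released PHP. A rough analogue, assuming PHP 7 or later and using the double-quoted "\u{...}" escape as a stand-in for the \u0061 sequences above, might look like this. The point is only that a payload written as a source literal can be altered by the compiler before unserialize() ever sees it.

```php
<?php
// Single quotes keep the escape literal: the string body is the
// 6 bytes \u{61}, matching the declared length.
$literal = 's:6:"\u{61}";';

// Double quotes expand \u{61} to "a" at compile time, so the payload
// still declares 6 bytes but now carries only 1.
$expanded = "s:6:\"\u{61}\";";

$fromLiteral  = unserialize($literal);
$fromExpanded = @unserialize($expanded); // @ hides the offset notice/warning

var_dump($fromLiteral);  // string(6) "\u{61}"
var_dump($fromExpanded); // bool(false)
```

Reading the same payload from a file or taking it directly from serialize() avoids the problem, because no literal parsing is involved.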

19 years ago by Andrei Zmievski

Yes, serialization is a problem. I would actually advocate putting a
marker in the serialized file that indicates what the value of
unicode_semantics switch was during the serialization, and if the
value is different during deserialization, refuse to load it or start
a new session. One really should not be changing that switch on a
whim in-between sessions.

-Andrei
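
Editorial sketch, not part of the original thread: one way to picture the marker idea is an envelope that stores the mode next to the data and refuses to load on a mismatch. The function names and the array layout below are hypothetical, and the mode is passed in as a plain boolean because released PHP versions have no unicode_semantics setting to query.

```php
<?php
// Hypothetical helper: wrap the value together with the mode that was
// in effect when it was serialized.
function serializeWithMode($value, bool $unicodeSemantics): string
{
    return serialize(['unicode_semantics' => $unicodeSemantics, 'data' => $value]);
}

// Hypothetical helper: refuse to load when the marker is missing or
// differs from the mode the reader is running in.
function unserializeWithMode(string $payload, bool $unicodeSemantics)
{
    $envelope = @unserialize($payload);
    if (!is_array($envelope)
        || !array_key_exists('unicode_semantics', $envelope)
        || !array_key_exists('data', $envelope)) {
        throw new RuntimeException('Payload carries no unicode_semantics marker');
    }
    if ($envelope['unicode_semantics'] !== $unicodeSemantics) {
        throw new RuntimeException('Payload was serialized under a different unicode_semantics value');
    }
    return $envelope['data'];
}

$payload  = serializeWithMode(['name' => 'aa'], false);
$sameMode = unserializeWithMode($payload, false); // loads normally

try {
    unserializeWithMode($payload, true);          // mode changed
    $refused = false;
} catch (RuntimeException $e) {
    $refused = true;
}
```

Whether to refuse outright or to start a fresh session on a mismatch is a policy choice that this sketch leaves to the caller.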

19 years ago by Antony Dovgal

> Yes, serialization is a problem. I would actually advocate putting a
> marker in the serialized file that indicates what the value of
> unicode_semantics switch was during the serialization, and if the
> value is different during deserialization, refuse to load it or start
> a new session. One really should not be changing that switch on a
> whim in-between sessions.

Why? It loads and works perfectly fine except for the problems I've mentioned.
Also, you can't put markers into the serialized text (at least it sounds very bad to me), so it won't help you in this case.

--
Wbr,
Antony Dovgal

19 years ago by Andrei Zmievski

The problems you encountered are fairly big, I wouldn't just dismiss
them.

-Andrei

19 years ago by Marcus Boerger

Hello Antony,

Why can't we put a marker there? Shouldn't we be able to add a flag
and recognize that flag in older PHP versions, refusing to load the
serialized data when it indicates Unicode semantics? Right now we'd
simply fail on any Unicode usage in serialized data with old
PHP versions. Adding those types now would at least show a more
specific and thus helpful error message. On the other hand, making the
semantics flag optional would allow HEAD to unserialize data
from older versions without any problem.

regards
marcus

Friday, September 9, 2005, 3:58:15 PM, you wrote:

>> Yes, serialization is a problem. I would actually advocate putting a
>> marker in the serialized file that indicates what the value of
>> unicode_semantics switch was during the serialization, and if the
>> value is different during deserialization, refuse to load it or start
>> a new session. One really should not be changing that switch on a
>> whim in-between sessions.

> Why? It loads and works perfectly fine except for the problems I've mentioned.
> Also, you can't put markers into the serialized text (at least it
> sounds very bad to me), so it won't help you in this case.

Best regards,
Marcus

19 years ago by Antony Dovgal

> Hello Antony,
>
> Why can't we put a marker there? Shouldn't we be able to add a flag
> and recognize that flag in older PHP versions, refusing to load the
> serialized data when it indicates Unicode semantics?

You mean adding a marker to be able to fail with a nice error message?
I don't think the marker is needed for that (see below).

> Right now we'd
> simply fail on any Unicode usage in serialized data with old
> PHP versions.

Yes.
And I think that's reasonable, as nobody promised that the old versions would be forward compatible.

> Adding those types now would at least show a more
> specific and thus helpful error message.

I suppose it's better to change 4.4.1/5.0.6/5.1 to fail gracefully when they find an unknown prefix ("U:").

> On the other hand, making the
> semantics flag optional would allow HEAD to unserialize data
> from older versions without any problem.

I didn't get that paragraph... =|
Do we have any problems with serialized data from older versions?
I don't see any.

--
Wbr,
Antony Dovgal
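
Editorial sketch, not part of the original thread: the "fail gracefully on an unknown prefix" idea can be approximated in userland on a released PHP version, where "U:" is not a recognized type tag and unserialize() returns false for it. The wrapper name and the messages are hypothetical, and the check only looks at the leading tag, so a "U:" entry nested inside an array would still fall through to the generic error.

```php
<?php
// Hypothetical wrapper: unserialize() itself only reports a byte offset,
// so inspect the leading type tag to produce a more specific message.
function unserializeOrExplain(string $payload)
{
    $value = @unserialize($payload);
    if ($value !== false || $payload === 'b:0;') {
        return $value; // a genuine value, including a serialized false
    }
    if (substr($payload, 0, 2) === 'U:') {
        throw new UnexpectedValueException(
            'Payload starts with a "U:" (Unicode string) entry that this PHP version cannot read'
        );
    }
    throw new UnexpectedValueException('Payload is not valid serialized data');
}

try {
    unserializeOrExplain('U:2:"aa";');
    $message = '';
} catch (UnexpectedValueException $e) {
    $message = $e->getMessage();
}

echo $message, "\n";
```

An in-engine fix, as suggested for 4.4.1/5.0.6/5.1, would report the unknown tag from the parser itself instead of relying on a wrapper like this.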

19 years ago by Marcus Boerger

Hello Antony,

Wednesday, September 14, 2005, 8:58:31 PM, you wrote:

>> Hello Antony,
>>
>> Why can't we put a marker there? Shouldn't we be able to add a flag
>> and recognize that flag in older PHP versions, refusing to load the
>> serialized data when it indicates Unicode semantics?

> You mean adding a marker to be able to fail with a nice error message?
> I don't think the marker is needed for that (see below).

>> Right now we'd
>> simply fail on any Unicode usage in serialized data with old
>> PHP versions.

> Yes.
> And I think that's reasonable, as nobody promised that the old versions would be forward compatible.

>> Adding those types now would at least show a more
>> specific and thus helpful error message.

> I suppose it's better to change 4.4.1/5.0.6/5.1 to fail gracefully when they find an unknown prefix ("U:").

>> On the other hand, making the
>> semantics flag optional would allow HEAD to unserialize data
>> from older versions without any problem.

> I didn't get that paragraph... =|
> Do we have any problems with serialized data from older versions?
> I don't see any.

Well, right now we don't fail gracefully, and I don't think we should unless
we are dealing with something introduced in later versions that doesn't hurt,
or in order to generate helpful error messages that explicitly tell you what new
stuff the serialized data contains that the old version being run cannot deal
with (e.g. Unicode data).

Best regards,
Marcus

19 years ago by Antony Dovgal

> Well, right now we don't fail gracefully

Right, but it could be done easily.

> and I don't think we should unless
> we are dealing with something introduced in later versions that doesn't hurt,
> or in order to generate helpful error messages that explicitly tell you what new
> stuff the serialized data contains that the old version being run cannot deal
> with (e.g. Unicode data).

Yes, that's exactly what I'm talking about.

--
Wbr,
Antony Dovgal

19 years ago by Andi Gutmans

I don't have a solution, but I believe this would be a bad idea. I
do think some people will be using IS_UNICODE strings when
unicode_semantics=off, mainly for existing applications. They may
want to serialize Unicode strings even though their classes are
IS_STRING. It might make sense to raise an error though if a "class"
is used, but if it's just a value or a hash key, then those are valid
in unicode_semantics=off.

Andi

At 06:44 AM 9/9/2005, Andrei Zmievski wrote:

> Yes, serialization is a problem. I would actually advocate putting a
> marker in the serialized file that indicates what the value of
> unicode_semantics switch was during the serialization, and if the
> value is different during deserialization, refuse to load it or start
> a new session. One really should not be changing that switch on a
> whim in-between sessions.
>
> -Andrei

19 years ago by Antony Dovgal

Even if the class name is in Unicode, we can try to convert it to ASCII
and fail only when we can't find its class entry in the list.

So I don't see any need for markers and other fairly major changes.

--
Wbr,
Antony Dovgal
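
Editorial sketch, not part of the original thread: the "convert to ASCII and fail only if the class is not found" approach can be pictured as below. The function name is hypothetical, and on a released PHP version with byte strings the "conversion" reduces to checking that the name contains only printable ASCII before looking it up among the loaded classes.

```php
<?php
class Invoice {}

// Hypothetical lookup: accept a class name only if it is pure printable
// ASCII and names a class that is already loaded; otherwise return null.
function findClassByAsciiName(string $name): ?string
{
    if (preg_match('/^[\x20-\x7E]+$/', $name) !== 1) {
        return null; // cannot be represented in ASCII
    }
    return class_exists($name, false) ? $name : null;
}

var_dump(findClassByAsciiName('Invoice'));        // loaded class
var_dump(findClassByAsciiName("Invoic\u{00E9}")); // non-ASCII name
var_dump(findClassByAsciiName('Missing'));        // ASCII, but no such class
```

Only the first call returns a class name; the other two return null, which is where a caller would raise the error.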

19 years ago by val khokhlov

Hello Antony,

Tuesday, September 13, 2005, 11:21:21 AM, you wrote:

AD> Even if the class name is in Unicode, we can try to convert it to ASCII
AD> and fail only in the case when we can't find its class entry in the list.

I think it's not the only way.
If we don't care about being compatible with previous PHP versions'
serialize(), a more portable way is to store class/property names in
Unicode (if unicode_semantics=off when serializing, convert hash keys to
Unicode). Since we do know the script encoding, we can always convert
Unicode names down to the local encoding.

--
Best regards,
val mailto:val@vk.kiev.ua

19 years ago by Antony Dovgal

So you propose to store strings/hash keys/class names in Unicode even if unicode_semantics is Off?
That looks like unnecessary overhead to me.

--
Wbr,
Antony Dovgal

19 years ago by Derick Rethans

> So you propose to store strings/hash keys/class names in Unicode even if
> unicode_semantics is Off?
> That looks like unnecessary overhead to me.

But needed, as even with the semantics off, you can get unicode strings.
Which can end up as array keys.

Derick

--
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org

19 years ago by Antony Dovgal

>> So you propose to store strings/hash keys/class names in Unicode even if
>> unicode_semantics is Off?
>> That looks like unnecessary overhead to me.

> But needed, as even with the semantics off, you can get unicode strings.
> Which can end up as array keys.

Yes, in this case there is no way to avoid converting (while doing unserialize()),
but I don't see any point in creating Unicode strings when serializing with unicode_semantics Off.

--
Wbr,
Antony Dovgal

19 years ago by Pierre Joye

> Yes, in this case there is no way to avoid converting (while doing unserialize()),
> but I don't see any point in creating Unicode strings when serializing with unicode_semantics Off.

If I use serialized data on different hosts with different PHP, I can
see a need for having Unicode strings in serialized data even if
unicode_semantics is off.

--Pierre

19 years ago by Derick Rethans

>> But needed, as even with the semantics off, you can get unicode strings.
>> Which can end up as array keys.

> Yes, in this case there is no way to avoid converting (while doing
> unserialize()), but I don't see any point in creating Unicode strings when
> serializing with unicode_semantics Off.

Why not? An array can have a unicode string element, you want to
properly serialize that, as you can't always downconvert.

Derick

--
Derick Rethans
http://derickrethans.nl | http://ez.no | http://xdebug.org

19 years ago by val khokhlov

Hello Antony,

Tuesday, September 13, 2005, 12:42:57 PM, you wrote:

AD> So you propose to store strings/hash keys/class names in Unicode
AD> even if unicode_semantics is Off ?
Yes - those items that are encoded into Unicode when
unicode_semantics is on (AFAIR, class names and property names for serialize).

AD> It looks like adding unnecessary overhead to me.
It's an overhead when you serialize and unserialize data on the same
system with the same php.ini Unicode settings; but when transferring data to
other systems or changing something, you need to either use Unicode or
specify the encoding of the serialized data (so you can convert it to the
actual encoding when unserializing).

--
Best regards,
val mailto:val@vk.kiev.ua
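
Editorial sketch, not part of the original thread: the alternative mentioned in the last message, specifying the encoding alongside the serialized data so the reader can convert it, could look roughly like this. The function names and the envelope layout are hypothetical, and the example assumes the iconv extension is available.

```php
<?php
// Hypothetical helper: record the byte encoding next to the text.
function serializeWithEncoding(string $text, string $encoding): string
{
    return serialize(['encoding' => $encoding, 'text' => $text]);
}

// Hypothetical helper: convert the stored text to the encoding the
// reading system actually uses.
function unserializeToEncoding(string $payload, string $targetEncoding): string
{
    $envelope = @unserialize($payload);
    if (!is_array($envelope) || !isset($envelope['encoding'], $envelope['text'])) {
        throw new RuntimeException('Payload carries no encoding information');
    }
    $converted = iconv($envelope['encoding'], $targetEncoding, $envelope['text']);
    if ($converted === false) {
        throw new RuntimeException('Cannot convert text to ' . $targetEncoding);
    }
    return $converted;
}

$latin1  = "caf\xE9"; // "café" encoded as ISO-8859-1
$payload = serializeWithEncoding($latin1, 'ISO-8859-1');
$utf8    = unserializeToEncoding($payload, 'UTF-8');

echo bin2hex($utf8), "\n"; // 636166c3a9
```

Storing everything as Unicode instead removes the need for the tag, at the cost of the conversion overhead discussed above.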