Why are serialized strings wrapped in double quotes? (s:<len>:"<string>")

1 year ago by Sanford Whiteman

Howdy all, haven't posted in ages but good to see the list going strong.

I'd like a little background on something we've long accepted: why
does the serialization format need double quotes around a string, even
though the byte length is explicit?

Example:

s:5:"hello";

All else being equal I would think we could have just

s:5:hello;

and skip forward 5 bytes. Instead we need to be aware of the leading
and trailing " in our state machine but I'm not sure what the
advantage is.
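
To make the "state machine" point concrete, here is a minimal sketch in Python of reading one string token. The helper name read_string is hypothetical (it is not part of PHP or of any library), and the input is assumed to be a bytes object. The length prefix lets the reader jump over the payload without scanning it, while the quotes and the trailing ; are fixed delimiters that still have to be checked:

```python
def read_string(data: bytes, pos: int = 0) -> tuple[bytes, int]:
    """Parse one s:<len>:"<bytes>"; token starting at pos.

    Returns the payload and the offset just past the trailing ';'.
    Hypothetical helper, for illustration only.
    """
    if data[pos:pos + 2] != b"s:":
        raise ValueError("not a string token")
    colon = data.index(b":", pos + 2)      # end of the length digits
    length = int(data[pos + 2:colon])
    start = colon + 2                      # skip ':' and the opening quote
    end = start + length                   # jump over the payload unscanned
    if data[colon + 1:start] != b'"' or data[end:end + 2] != b'";':
        raise ValueError("length prefix does not line up with the quotes")
    return data[start:end], end + 2


payload, next_pos = read_string(b's:5:"hello";')
```

Because the payload is never scanned, a string that itself contains a double quote needs no escaping: read_string(b's:3:"a"b";') yields the three bytes a"b.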

Was this just to make strings look more 'stringy', even though the
format isn't meant to be human-readable?

I read (the archive of) Kris's blog post:

https://web.archive.org/web/20170813190508/http://blog.koehntopp.info/index.php/2407-php-understanding-unserialize/

but that didn't shed any light. Zigzagging through the source wasn't
getting me there as fast as someone who was there from the beginning.

The reason for my question is I'm writing a blog post about a SaaS app
that (don't gasp/laugh) returns serialize() format from one of its
APIs. In discussing why the PHP format can make sense vs. JSON, I
wanted to point to the faster parsing you get with length-prefixed
strings. Then I started wondering about why we have both the length
prefix and the extra quotes.

Thanks,

Sandy

1 year ago by Ilija Tovilo

Hi Sandy

I'd like a little background on something we've long accepted: why
does the serialization format need double quotes around a string, even
though the byte length is explicit?

Example:

s:5:"hello";

All else being equal I would think we could have just

s:5:hello;

Was this just to make strings look more 'stringy', even though the
format isn't meant to be human-readable?

I don't have the historical context, but I'm assuming that's it. PHP's
serialization format is not efficient, and I don't think that was ever
the primary focus. If you need something more efficient, you can try
https://github.com/igbinary/igbinary, which aims to be a drop-in
replacement.

Ilija

1 year ago by Sanford Whiteman

I don't have the historical context, but I'm assuming that's it. PHP's
serialization format is not efficient, and I don't think that was ever
the primary focus.

Thanks Ilija. That'll have to suffice unless someone remembers a specific
decision (I searched all the old Internals posts and nothing came up). Most of my
readers are pretty junior, but I hate to say something that conflicts with
their intuition.

— S.

1 year ago by Jim Winstead

Howdy all, haven't posted in ages but good to see the list going strong.

I'd like a little background on something we've long accepted: why
does the serialization format need double quotes around a string, even
though the byte length is explicit?

I enjoy spelunking in the history of the project, so I did some digging. It looks to me like Kris didn't quite get the history correct. Boris did propose a form of serialization first, but it looks like what became serialize() and unserialize() came into the project another way.

https://marc.info/?l=php-general&m=90222513234434&w=2

The serialize() and unserialize() functions were first added in PHP 3.0.5 with that same encoding for strings that you're asking about. Here is the original proposal for adding the functions from Jani Lehtimäki:

https://news-web.php.net/php.dev/1444

They were originally conceived as var_save() and var_load() and operated on files, but you can see the file format uses the same string encoding, although it used single quotes.

It was committed to CVS by Stig here, but unfortunately the emails to the list didn't include newly-added files.

https://news-web.php.net/php.dev/1540

I'm not sure if the old CVS history is preserved somewhere, but based on what appeared in 3.0.5, that format probably goes back to the beginning and it doesn't look like there was any on-list discussion about it.

Jim

1 year ago by Sanford Whiteman

Nice work, Jim.

I enjoy spelunking in the history of the project, so I did some digging. It
looks to me like Kris didn't quite get the history correct. Boris did propose
a form of serialization first, but it looks like what became serialize() and
unserialize() came into the project another way.

https://marc.info/?l=php-general&m=90222513234434&w=2

The serialize() and unserialize() functions were first added in PHP 3.0.5
with that same encoding for strings that you're asking about. Here is the
original proposal for adding the functions from Jani Lehtimäki:

https://news-web.php.net/php.dev/1444

They were originally conceived as var_save() and var_load() and operated on
files, but you can see the file format uses the same string encoding, although it used single quotes.

It was committed to CVS by Stig here, but unfortunately the emails to the list didn't include newly-added files.

https://news-web.php.net/php.dev/1540

Huh. So the quotes may have just stuck around from eval()-related approaches
without ever being officially discussed. In the grand scheme, even if you're
wasting 2 bytes on every string, that's probably a tiny % on average.

The format's fascinating because it unmistakably works, and binary
igbinary/msgpack aside, it's a pretty good byte-stream encoding. If you take
Sergey's results it's way faster than JSON, at least when it's PHP doing the
unserialization:
https://grechin.org/2021/04/06/php-json-encode-vs-serialize-performance-comparison.html

— S.

1 year ago by michal.brzuchalski@gmail.com

Hi Sandy,

On Tue, 6 Feb 2024 at 21:19, Sanford Whiteman figureonecpr@gmail.com
wrote:

Howdy all, haven't posted in ages but good to see the list going strong.

I'd like a little background on something we've long accepted: why
does the serialization format need double quotes around a string, even
though the byte length is explicit?

Example:

s:5:"hello";

All else being equal I would think we could have just

s:5:hello;

and skip forward 5 bytes. Instead we need to be aware of the leading
and trailing " in our state machine but I'm not sure what the
advantage is.

You inspired me to play with the serialization format to spot even more
unnecessary chars: https://3v4l.org/DLh1U
From my PoV there are more candidates for removal that still keep the format
safe, e.g. dropping the ':' before the opening '{' of an array/object and the
trailing ';' before the closing '}' saves 2 bytes:

a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}

Could be simply

a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}

This example saves 4 bytes: double-quotes, one ; and :
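
The byte count can be checked directly. This is only an arithmetic sketch in Python using the two payloads quoted above; the compact form is the hypothetical format proposed in this message, not something unserialize() accepts:

```python
# Current serialize() output and the proposed compact form, as quoted above.
current = b'a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}'
compact = b'a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}'

# Dropped: the ':' before '{', the two quotes, and the ';' before '}'.
saved = len(current) - len(compact)
print(saved, f"{saved / len(current):.1%}")
```

For this small array the saving is 4 of 48 bytes, roughly 8%.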

If you go further, none of the types that carry a size/length need the extra
colon after the type letter, meaning:
a:4 could become a4
s:3 could become s3

The same could apply to O: and E:

O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08
08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
bar
baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}

This is still readable by humans and keeps the size/length in all places
where needed.
My attached example is a poor one, but it shows up to a ~20% size reduction.

Interestingly, when an array is serialized as an object property, it is not
followed by ; in the field list: https://3v4l.org/4p6ve

O:3:"Foo":2:{s:3:"foo";a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";s:3:"baz";}

The missing ; between } and s was a surprise to me.
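
One way to read the output above is that each value carries its own terminator: scalars end with ;, strings with ";, and arrays with the closing } alone, so nothing is left to separate } from the next key. A minimal sketch in Python of a reader built on that reading follows. The function name parse is hypothetical, it covers only the i, b, d, s and a tags, and it is not PHP's actual implementation:

```python
def parse(data: bytes, pos: int = 0):
    """Parse one value of a small subset (i, b, d, s, a).

    Returns (value, next_pos). Hypothetical helper, for illustration only.
    """
    tag = data[pos:pos + 1]
    if tag in (b"i", b"b", b"d"):
        end = data.index(b";", pos)            # scalars end with ';'
        raw = data[pos + 2:end].decode("ascii")
        if tag == b"i":
            return int(raw), end + 1
        if tag == b"b":
            return raw == "1", end + 1
        return float(raw), end + 1
    if tag == b"s":
        colon = data.index(b":", pos + 2)
        length = int(data[pos + 2:colon])
        start = colon + 2                      # past ':' and the opening quote
        end = start + length
        if data[end:end + 2] != b'";':
            raise ValueError("bad string terminator")
        return data[start:end], end + 2        # strings end with '";'
    if tag == b"a":
        colon = data.index(b":", pos + 2)
        count = int(data[pos + 2:colon])
        pos = colon + 2                        # past ':' and '{'
        items = {}
        for _ in range(count):
            key, pos = parse(data, pos)
            value, pos = parse(data, pos)
            items[key] = value
        if data[pos:pos + 1] != b"}":
            raise ValueError("expected closing brace")
        return items, pos + 1                  # arrays end with '}' alone
    raise ValueError(f"unsupported tag {tag!r}")


fields = b'a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";'
array, pos = parse(fields)       # stops right after '}'
name, pos = parse(fields, pos)   # the next token starts immediately
```

Under this reading, the s that follows } is simply the start of the next property name.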

Best regards,
Michał Marcin Brzuchalski

1 year ago by Sanford Whiteman

Hi Michał,

Thursday, February 8, 2024, 2:58:52 AM, you wrote:

You inspired me to play with serialization format to spot even more
unnecessary chars https://3v4l.org/DLh1U
From my PoV there are more candidates to reduce and still keep the safety,
for eg:
removing leading ':' before array/object and trailing ';' inside brackets,
you reduce by 2 bytes

a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}

Could be simply

a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}

This example saves 4 bytes: double-quotes, one ; and :

If you go further all types that require size/length also don't need extra
double-colon meaning:
a:4 could become a4
s:3 could become s3

The same could apply to O: and E:

O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08
08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
bar
baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}

This is still readable by humans and keep the size/length in all places
where needed.

Amazing. To my eyes it's more readable too.

Here's another one: leading numeral implies Integer 'i' (so only
'd', 'b' and 's<len>' are necessary). Or maybe that goes too far.

Interestingly when an array is serialized as object property it is not
followed by ; in field list https://3v4l.org/4p6ve

O:3:"Foo":2:{s:3:"foo";a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";s:3:"baz";}

Missing ; between }s was a surprise to me.

Yeah, that almost seems like a bug that unserialize() tolerates.

— S.

1 year ago by michal.brzuchalski@gmail.com

On Thu, 8 Feb 2024 at 20:10, Sanford Whiteman figureonecpr@gmail.com
wrote:

Hi Michał,

Thursday, February 8, 2024, 2:58:52 AM, you wrote:
...

O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08
08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
bar
baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}

This is still readable by humans and keep the size/length in all places
where needed.

Amazing. To my eyes it's more readable too.

Just wondering: while null is encoded as just N, the booleans are encoded
as b:0 or b:1.
I can imagine these could also be just T and F.

Here's another one: leading numeral implies Integer 'i' (so only
'd', 'b' and 's<len>' are necessary). Or maybe that goes too far.

I went there too, as you can spot in the very first link, but I also believe
this goes too far.

All of the above already goes far beyond what you initially asked, and I know that.
I just like to share what I find.

Cheers,
Michał Marcin Brzuchalski

1 year ago by Robert Landers

If I recall correctly, there is also a \0 (null character) hiding in
the serialized string. It's incredibly annoying when
copy/pasting, as it sometimes gets stripped out. It might be worth
removing as well.
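
For context, the NUL bytes come from how PHP mangles non-public property names: protected names are stored as \0*\0name and private ones as \0Class\0name. That is why the compact example earlier in the thread shows s6:*foo and s8:Foobar, with only four and six visible characters. A minimal sketch in Python, where mangled_name and serialize_name are hypothetical helpers that mimic the naming rule rather than call PHP:

```python
def mangled_name(prop: str, visibility: str = "public", cls: str = "") -> bytes:
    """Build the property name the way PHP stores it (hypothetical helper)."""
    if visibility == "protected":
        return b"\x00*\x00" + prop.encode()
    if visibility == "private":
        return b"\x00" + cls.encode() + b"\x00" + prop.encode()
    return prop.encode()


def serialize_name(name: bytes) -> bytes:
    """Wrap a name as an s:<len>:"..."; token; the length counts the NULs."""
    return b's:%d:"%s";' % (len(name), name)


protected = serialize_name(mangled_name("foo", "protected"))
private = serialize_name(mangled_name("bar", "private", "Foo"))

# What a copy/paste that drops NUL bytes leaves behind: the declared
# length (6) no longer matches the 4 bytes between the quotes.
stripped = protected.replace(b"\x00", b"")
```

Once the NULs are stripped, the length prefix no longer matches the payload, which would explain why pasted payloads stop unserializing.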

Robert Landers
Software Engineer
Utrecht NL

1 year ago by Casper Langemeijer

I'd like a little background on something we've long accepted: why
does the serialization format need double quotes around a string, even
though the byte length is explicit?

Instead we need to be aware of the leading and trailing " in our state
machine but I'm not sure what the advantage is.

Dunno why, but it has made my life much easier. I've seen many situations where serialized data was converted from CP1252 to UTF-8. The string's byte length then changes and unserialization fails with an error. Without the quotes, many such cases would possibly go undetected.
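
The failure mode described here can be reproduced in a few lines. This is a sketch in Python, assuming a payload that was serialized while the data was CP1252 and later re-encoded wholesale to UTF-8; the helper name quotes_line_up is hypothetical:

```python
def quotes_line_up(data: bytes) -> bool:
    """Check that s:<len>:"..."; has its quotes where the length says."""
    colon = data.index(b":", 2)            # end of the length digits
    length = int(data[2:colon])
    end = colon + 2 + length               # offset of the expected closing quote
    return data[colon + 1:colon + 2] == b'"' and data[end:end + 2] == b'";'


# "café" is 4 bytes in CP1252, so the length prefix says 4.
original = 's:4:"caf\u00e9";'.encode("cp1252")

# Re-encoding the whole payload turns the last character into 2 bytes,
# but the prefix still says 4.
converted = original.decode("cp1252").encode("utf-8")

print(quotes_line_up(original), quotes_line_up(converted))
```

After the conversion, the byte at the position where the closing quote should be is the second byte of the accented character, so the mismatch is caught immediately.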

Was this just to make strings look more 'stringy', even though the
format isn't meant to be human-readable?

In my mind the format is a nice pragmatic middle between reasonably efficient, reasonably robust, too feature-complete (too many allowed_classes) and somewhat human readable. At least enough for incidental debugging or manual tinkering.