Hi Internals,
I think this will be my last proposal for quite some while :)
But this time, I'd like to propose bundling the
https://github.com/crazyxman/simdjson_php extension
with some major modifications.
The proposed OO API is included in the description of the
PR that I've just created: https://github.com/php/php-src/pull/6551
The main motivation behind this RFC is two-fold:
- the underlying simdjson library (https://github.com/simdjson/simdjson)
which is used by ext/simdjson provides huge performance gains
compared to ext/json (see some benchmark results in the PR) - we can support new use-cases, most notably the so called "on-demand"
parsing: https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md
(This is not implemented currently)
Originally, I planned to include the new API in ext/json, but unfortunately,
simdjson is written is C++, so it would make C++ as a hard dependency,
which was not the case so far. That's why I opted for creating ext/simdjson.
Please let me know if you have any feedback.
Regards:
Máté
Le 29/12/2020 à 17:57, Máté Kocsis a écrit :
Hi Internals,
I think this will be my last proposal for quite some while :)
But this time, I'd like to propose bundling the
https://github.com/crazyxman/simdjson_php extension
with some major modifications.The proposed OO API is included in the description of the
PR that I've just created: https://github.com/php/php-src/pull/6551The main motivation behind this RFC is two-fold:
- the underlying simdjson library (https://github.com/simdjson/simdjson)
which is used by ext/simdjson provides huge performance gains
compared to ext/json (see some benchmark results in the PR)- we can support new use-cases, most notably the so called "on-demand"
parsing: https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md
(This is not implemented currently)Originally, I planned to include the new API in ext/json, but unfortunately,
simdjson is written is C++, so it would make C++ as a hard dependency,
which was not the case so far. That's why I opted for creating ext/simdjson.Please let me know if you have any feedback.
the library seems young and very active, and so probably not available
in most distributions.
I would prefer to use the common way
- propose as pecl extension
- wait for maturity (extension AND library)
- propose for merge in php-src
IMHO: too early for this one.
Remi
Regards:
Máté
Le 29/12/2020 à 17:57, Máté Kocsis a écrit :
Hi Internals,
I think this will be my last proposal for quite some while :)
But this time, I'd like to propose bundling the
https://github.com/crazyxman/simdjson_php extension
with some major modifications.The proposed OO API is included in the description of the
PR that I've just created: https://github.com/php/php-src/pull/6551The main motivation behind this RFC is two-fold:
- the underlying simdjson library (https://github.com/simdjson/simdjson)
which is used by ext/simdjson provides huge performance gains
compared to ext/json (see some benchmark results in the PR)- we can support new use-cases, most notably the so called "on-demand"
parsing:
https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md
(This is not implemented currently)Originally, I planned to include the new API in ext/json, but
unfortunately,
simdjson is written is C++, so it would make C++ as a hard dependency,
which was not the case so far. That's why I opted for creating
ext/simdjson.Please let me know if you have any feedback.
the library seems young and very active, and so probably not available
in most distributions.I would prefer to use the common way
- propose as pecl extension
- wait for maturity (extension AND library)
- propose for merge in php-src
IMHO: too early for this one.
I agree it's too early as the library is too young and won't be available
in many distros. The PECL path is better in this case IMO as it will allow
some time .
That said the simdjson lib looks pretty impressive. Definitely looks more
advanced than the ext json parser that I wrote more than 6 years ago. In my
defese, I had no grant for that. :) Anyway really cool to see what can be
done and how much it's possible to speed it up.
Was thinking that it would be good to consider some kind of plugable
decoder where another extension could register a parsing callback.
Something similar to what we have for parser but instead for the whole
decoding. That would allow to still use current parser in json_decode but
if simdjson available / configured in ini, then it would used instead and
would be just faster. Not sure if all options are supported though - for
example don't see any note about UTF8 substitution
(JSON_INVALID_UTF8_SUBSTITUTE).
Cheers
Jakub
Hi Remi and Jakub,
I agree it's too early as the library is young and won't be available in
many distros. The PECL path is better in this case IMO as it will allow
some time .
In my opinion, this is a case where making an
exception is worth considering.
Should the simdjson library be written in C,
I'd propose to add the new API + parser to
ext/json directly, since ext/simdjson is just a
very small wrapper around the parser, and
not a complex piece of code in itself (compared to other parts of php-src).
Also, I think the performance benefit of using
the simdjson parser is so major that it would
be a pity if people had to wait for years until
the functionality becomes generally available
as a core extension. As json_encode()
and
json_decode()
are very easy to use, my guess
is that a 3rd party JSON-related extension
would never get an adoption large enough,
because only those people would install it
who have really reached the limitations of ext/json.
By the way, it has just come to my mind that
our company is also affected by these
limitations. Sometimes we have to parse
very large JSON documents, and in some
cases these can end up being truncated.
Fortunately we only need a specific part of
the data, so someone wrote a partial "parser"
(this is euphemism) tailored for the schema
in question. Rather than having to use
custom hackery, it would be so much better
if PHP would offer partial parsing out of the
box, like what the proposed
JsonParser::getKeyValue() does.
That said, the cost-benefit ratio of having
simdjson in core seems advantageous for me.
Was thinking that it would be good to consider some kind of plugable
decoder where another extension could register a parsing callback.
Something similar to what we have for parser but instead for the whole
decoding. That would allow to still use current parser in json_decode but
if simdjson available / configured in ini, then it would used instead and
would be just faster. Not sure if all options are supported though - for
example don't see any note about UTF8 substitution
(JSON_INVALID_UTF8_SUBSTITUTE).
This is a very interesting approach, and it reminds me about the hashing
registry RFC
in certain extent.
And you are right, as far as I saw, these flags
are not supported by ext/simdjson. But to be
honest, I haven't analyzed yet how difficult it
would be to have a reasonably full compatibility with ext/json.
Cheers:
Máté
Sorry, I need to add a small correction for my previous email:
JsonParser::getKeyValue() wouldn't allow parsing a truncated JSON
document. The on-demand parsing API I mentioned in my opening
message is what would make such use-case possible.
Besides, I accept if you feel it's too early for a new JSON API to be added
to the core, so if the majority of stakeholders also share this opinion,
I'll
try to make an improved version of ext/simdjson available via PECL.
Máté
Hi,
Hi Remi and Jakub,
I agree it's too early as the library is young and won't be available in
many distros. The PECL path is better in this case IMO as it will allow
some time .In my opinion, this is a case where making an
exception is worth considering.
The PECL path doesn't mean that the extension won't be used. From the user
point of view, it doesn't really matter that much as distro packages are
available even for PECL extensions and it can easily added to Docker image
as well. The advantage is that it will give some time for the library to be
available in distros and possibly more stable. I think it would actually
help with stabilizing the API and introducing new features quickly. It
means all the mentioned features could be provided to users in the
extension release cycle and not waiting a year for new PHP version.
Should the simdjson library be written in C,
I'd propose to add the new API + parser to
ext/json directly, since ext/simdjson is just a
very small wrapper around the parser, and
not a complex piece of code in itself (compared to other parts of php-src).
I think this wouldn't be really an option even if it was in C because
ext/json is enabled by default so you couldn't have external dependency on
such a new library. Otherwise it would mean that you couldn't install PHP
if that library is not available. Of course unless the library is bundled
but that is not a good idea for the maintanance.
Also, I think the performance benefit of using
the simdjson parser is so major that it would
be a pity if people had to wait for years until
the functionality becomes generally available
as a core extension. Asjson_encode()
and
json_decode()
are very easy to use, my guess
is that a 3rd party JSON-related extension
would never get an adoption large enough,
because only those people would install it
who have really reached the limitations of ext/json.
But this proposal is not about changing json_decode. It introduces a new
API that can be in the same way introduce in the PECL extension. As I said
above, having that in PECL doesn't mean that it's not available and people
have to wait for it.
It would be great to actually see what the performance benefit is in the
real applications. The benchmarks relies on repeated calls which is not
always the way how it's in the application (e.g. due to processor cache).
Also it might not be such a huge perf increase for the most apps as the
actual parsing is not usually the app bottleneck. That said I think it will
bring some considerable improvements anyway in the apps especially in those
doing lots of parsing but would be great to see how much it is in reality.
By the way, it has just come to my mind that
our company is also affected by these
limitations. Sometimes we have to parse
very large JSON documents, and in some
cases these can end up being truncated.
Fortunately we only need a specific part of
the data, so someone wrote a partial "parser"
(this is euphemism) tailored for the schema
in question. Rather than having to use
custom hackery, it would be so much better
if PHP would offer partial parsing out of the
box, like what the proposed
JsonParser::getKeyValue() does.
As you mentiened, this could be possible in ondemand API which looks really
useful indeed. There are more things that are pretty useful like JSON
Pointer, better error reporting and UTF8 validation that could be
potentially also re-used in encoder. I think it would be great to have at
least some of the features in the extension before it gets to the core.
Especially thinking about the error reporting which should no longer depend
on global state.
One note about the proposed API. As it's not part of the ext/json, it
shouldn't be called JsonParser but rather SimdJsonParser to reflect that
it's part of the simdjson extension. That's the convention that is used for
other exts and it's also less confusing for users because that class won't
be available for many users initially - at least until the library is
available or extension installed / enabled. The methods also shouldn't be
all static but rather instance should be provided that would allow getting
errors or using the ondemand mode.
That said, the cost-benefit ratio of having
simdjson in core seems advantageous for me.Was thinking that it would be good to consider some kind of plugable
decoder where another extension could register a parsing callback.
Something similar to what we have for parser but instead for the whole
decoding. That would allow to still use current parser in json_decode but
if simdjson available / configured in ini, then it would used instead and
would be just faster. Not sure if all options are supported though - for
example don't see any note about UTF8 substitution
(JSON_INVALID_UTF8_SUBSTITUTE).This is a very interesting approach, and it reminds me about the hashing
registry RFC
in certain extent.And you are right, as far as I saw, these flags
are not supported by ext/simdjson. But to be
honest, I haven't analyzed yet how difficult it
would be to have a reasonably full compatibility with ext/json.
Yeah it would be really cool to be able to natively speed json_decode to
get instant improvement for the apps as the new API usually takes time to
get used. This pluggable API is in my eyes the way how to do it because we
might still need to have the current parser available for the cases where
simdjson lib is not available. But as you say it requires some analysis
first.
I think for now it would be great to create a repo for that extension under
php org in Github, get it to PECL and then add more features into it. If I
manage to find some time, I would be happy to contribute some bits or at
least help with reviews. Once it gets a bit more stable, we could consider
it for inclusion to the core.
Cheers
Jakub
Hi Internals,
I think this will be my last proposal for quite some while :)
But this time, I'd like to propose bundling the
https://github.com/crazyxman/simdjson_php extension
with some major modifications.The proposed OO API is included in the description of the
PR that I've just created: https://github.com/php/php-src/pull/6551The main motivation behind this RFC is two-fold:
- the underlying simdjson library (https://github.com/simdjson/simdjson)
which is used by ext/simdjson provides huge performance gains
compared to ext/json (see some benchmark results in the PR)- we can support new use-cases, most notably the so called "on-demand"
parsing: https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md
(This is not implemented currently)Originally, I planned to include the new API in ext/json, but
unfortunately,
simdjson is written is C++, so it would make C++ as a hard dependency,
which was not the case so far. That's why I opted for creating
ext/simdjson.Please let me know if you have any feedback.
Regards:
Máté
Same as the others, I don't think it makes sense to bundle this at this
point in time -- though I'm also pretty skeptical about bundling it at all.
The end result would be that we have two JSON APIs, one that is always
available but slower, and another that is optional but faster. That is
really not great. Would make a bit more sense if simdjson was just a
backend that could be optionally enabled to speed up the normal JSON API.
I'd also be concerned about compatibility -- JSON parsing is a bit of a
minefield and many parsers have subtle differences. I remember this
comparison page http://seriot.ch/json/pruned_results.png which showed that
PHP 7 (which is when we switched to jsond) had one of the only fully
conforming JSON implementations. It doesn't seem unlikely that simdjson
will have differences somewhere, and it would be unfortunate if there were
two JSON APIs with subtly different behavior.
I think the main value proposition (even though it is not part of your
initial proposal) here would be a streaming JSON API, that does not require
parsing or even loading the whole JSON document at once. I think the
addition of such an API would be valuable -- but wouldn't it be possible to
introduce this based on our existing JSON parser?
Nikita
Hi Jakub and Nikita,
Thank you very much for your elaborate answers, you made good points.
In the meanwhile I also realized that there are just too many unknowns
currently with this extension, so I cancelled my initial plans of adding
ext/simdjson as a core extension.
I also agree that using the simdjson library as a backend would be a very
interesting thing, so I'll try to experiment with this approach first.
Slight incompatibilities between the backends would be very problematic for
sure, so if we go this way then the implementations should be exercised by
a definitive test suite like the one Nikita linked.
Máté
Hey Máté,
In the meanwhile I also realized that there are just too many unknowns
currently with this extension, so I cancelled my initial plans of adding
ext/simdjson as a core extension.
Since the majority of issues revolve around uncertainty/compatibility, is
there perhaps a way for using the extension
with an "overload json_encode()
and json_decode()
mode" (remember
mbstring?), so that brave bleeding edge
users can experiment with it as an opt-in replacement?
That would certainly allow for some more precise field testing over the
next few years.
Marco Pivetta