Newsgroups: php.internals
Xref: news.php.net php.internals:118473
Message-ID: <59c258e7-9410-8572-b904-e8421aae9867@bastelstu.be>
Date: Thu, 25 Aug 2022 22:22:02 +0200
MIME-Version: 1.0
Content-Language: en-US
Content-Type: multipart/mixed; boundary="------------F8QhGn0Bo47s5idxFVM0OVBU"
To: David Gebler, juan carlos morales
Cc: PHP Internals List
Subject: Re: [PHP-DEV]
 RFC json_validate() - status: Under Discussion
From: tim@bastelstu.be (Tim Düsterhus)

--------------F8QhGn0Bo47s5idxFVM0OVBU
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Hi

On 8/25/22 21:11, David Gebler wrote:
> There are many examples of userland code which could be faster and more
> memory efficient if they were written in C and compiled in, so the mere
> fact this proposal may introduce a somewhat faster way of validating a JSON
> string over decoding it is not necessarily a sufficient reason to include
> it.
>
> Are there examples of raising issues for frameworks or systems saying
> they need to validate some JSON but the only existing solutions available
> to them are causing memory limit errors, or taking too long? The Stack
> Overflow question linked on the RFC says "I need a really, really fast
> method of checking if a string is JSON or not."

The proposed function is meant to be used for validation. Validation
processes by definition need to deal with untrusted data. So the input
data might even be actively malicious in order to tie up resources on
the server (DoS attack - single D there).

> In most real world use cases [that I've encountered over the years] JSON
> blobs tend to be quite small. I have dealt with much, much larger JSON

Yes, well-formed JSON from a trusted source tends to be small-ish. But a
validation function also needs to deal with non-well-formed JSON,
otherwise you would not need to validate it.

I was able to use up an extra 100 MB of RAM with a 3 MB input that is
invalid JSON when using json_decode(), just for it to reject the input.
For json_validate() the extra memory (as per memory_get_peak_usage())
required for the same operation was effectively zero. It was able to
deal with 60 MB of input just fine.

I've attached the script I used for the test.
I left out the actual JSON string to not give script kiddies a loaded
weapon, but you should likely be able to craft some input yourself.

> blobs, up to a few hundred MB, and in those cases I've used a streaming
> parser. If you're talking about JSON that size, a streaming parser is the
> only realistic answer - you probably don't want to drop a 300MB string in
> to this RFC's new function either, if performance and memory efficiency is
> your concern.
>
> So I'm curious as to whether a real world example can be given where the
> efficiency difference between json_decode and a new json_validate function
> would be important to the system, whether anyone's encountered a scenario
> where this would have made a real difference to them.
>

While my example is not a real world example, I don't believe it's a
stretch to say it can be applied as-is to the real world.

So IMO:

- The proposed function does exactly what it promises to do, not more,
  not less.
- If it's introduced, then it is going to be the obvious choice for JSON
  validation and at the same time it is going to be the best choice for
  JSON validation. I strongly believe it is a good thing if users are
  steered to make the correct choice by default without needing to
  invest brain cycles.
- The patch is pretty small, because the hard work of JSON parsing is
  already implemented.
- Userland implementations are non-obvious and non-trivial, as evidenced
  by the examples in the RFC: they are all slightly different and one of
  them even mishandles a plain `false` input, because it does not check
  json_last_error().
- Userland implementations are also either less efficient (relying on
  json_decode()) or potentially inconsistent (hand-rolling a validating
  parser).
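
To make the point about json_last_error() concrete, here is a minimal sketch of how a userland validator can mishandle a plain `false` input. The function names naive_is_json(), checked_is_json() and is_json() are hypothetical and not taken from the RFC. The json_validate() call assumes the function is available with the shape the RFC proposes (a string in, a bool out); where it is not, the sketch falls back to json_decode().

```php
<?php

declare(strict_types=1);

// Hypothetical naive validator: judges validity by the decoded value alone.
// Valid documents that decode to a falsy value (false, null, 0, "", [])
// are wrongly reported as invalid.
function naive_is_json(string $json): bool
{
    return (bool) json_decode($json);
}

// Hypothetical stricter validator: ignores the decoded value and asks the
// decoder whether it reported an error.
function checked_is_json(string $json): bool
{
    json_decode($json);

    return json_last_error() === JSON_ERROR_NONE;
}

// Hypothetical wrapper: use the native function when it exists (assumption:
// json_validate(string): bool as proposed), otherwise fall back to decoding.
function is_json(string $json): bool
{
    if (function_exists('json_validate')) {
        return json_validate($json);
    }

    return checked_is_json($json);
}

var_dump(naive_is_json('false'));   // bool(false): valid JSON is rejected
var_dump(checked_is_json('false')); // bool(true)
var_dump(is_json('{"a":'));         // bool(false): truncated object
```

The fallback path still builds the full decoded value before throwing it away, so it keeps the memory cost of json_decode() described above; only a native validating function would avoid that.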

Best regards
Tim Düsterhus

--------------F8QhGn0Bo47s5idxFVM0OVBU
Content-Type: application/x-php; name="json-test.php"
Content-Disposition: attachment; filename="json-test.php"
Content-Transfer-Encoding: base64

PD9waHAKCiRzdHIgPSAnZXhlcmNpc2UgbGVmdCBmb3IgdGhlIHJlYWRlcic7CnZhcl9kdW1w
KHN0cmxlbigkc3RyKSk7Cgp2YXJfZHVtcChtZW1vcnlfZ2V0X3BlYWtfdXNhZ2UoKSAvIDEw
MjQgLyAxMDI0KTsKCnRyeSB7Cgl2YXJfZHVtcChqc29uX2RlY29kZSgkc3RyLCBmbGFnczog
SlNPTl9USFJPV19PTl9FUlJPUikpOwp9IGNhdGNoIChcRXhjZXB0aW9uICRlKSB7CgllY2hv
ICRlLT5nZXRNZXNzYWdlKCksIFBIUF9FT0w7Cn0KCnZhcl9kdW1wKG1lbW9yeV9nZXRfcGVh
a191c2FnZSgpIC8gMTAyNCAvIDEwMjQpOwo=

--------------F8QhGn0Bo47s5idxFVM0OVBU--