Newsgroups: php.internals
Xref: news.php.net php.internals:119554
Message-ID: <7e86a2d2-b971-592c-64e3-e86c13b5be80@cubiclesoft.com>
Date: Tue, 14 Feb 2023 19:35:42 -0700
To: Rowan Tommins, internals@lists.php.net
References: <92c4514f-70e3-75c9-7084-9e29641e25e7@gmail.com>
In-Reply-To: <92c4514f-70e3-75c9-7084-9e29641e25e7@gmail.com>
Subject: Re: [PHP-DEV] [RFC] Working With Substrings
From: thruska@cubiclesoft.com (Thomas Hruska)

On 2/14/2023 2:02 PM, Rowan Tommins wrote:
> On 14/02/2023 15:32, Thomas Hruska wrote:
>> Hello Internals,
>>
>> I would like to start the discussion on adding several functions and
>> parameters to existing functions for improved substring handling in PHP:
>>
>> https://wiki.php.net/rfc/working_with_substrings
>
>
> Hi Thomas,
>
> Thanks for your effort on this, I think efficient string handling
> functions would be a major help for the ecosystem, allowing library
> authors to do things in plain PHP code which currently defer to C
> extensions just for performance.
>
>
> My first thought opening the RFC was to see a function signature with 9
> arguments and immediately wonder how to refactor it into something more
> manageable. Just writing *tests* for all the combinations sounds like a
> nightmare, let alone understanding code that uses them all.

I agree with and understand this sentiment.

> As I read through, I had a similar feeling about the need to
> copy-and-paste the same two parameters onto so many functions.
>
>
> Luckily, I think the RFC contains the seed of the solution to both
> problems: what you refer to as "virtual buffers". These seem to be
> crying out to be a new data type, with their own API - probably using OO
> style, given general fashions.

I thought about that but didn't know how well it would be received nor,
perhaps more importantly, the direction it should take (i.e.
a formal Zend type in the engine, extending the existing zend_string
type, a class, some combination, or something else entirely). All of
the more advanced options I came up with would have required some code
changes to the PHP source itself, with a new data type being the most
involved and probably the most controversial.

As a result, I ended up deciding to go the "simple" function(s) route in
the qolfuncs extension and then, after that, use the RFC process to
kickstart the conversation while also showing a proof-of-concept that
demonstrates performance can be notably improved in certain areas that
have traditionally not done well. I figured it would be kind of
difficult to get folks excited about strings/buffers (yawn!) if there
weren't also some sort of ballpark in-context metrics/benchmarks to show
the potential gains that make the effort worthwhile.

> Framed around that, I think we can split out a few different concerns:
>
> * Methods to take a string, and make a new, writeable buffer pointing at
>   all or part of it
> * Methods to access parts of a buffer, as a string or another buffer
> * Methods to efficiently write to, delete from, or overwrite, parts of a
>   buffer
> * Methods to explicitly manage the memory used by the buffer
> * Finally, support for writing to, or reading from, a buffer instead of
>   a string in a number of existing functions

Those sound fine. Just a couple of thoughts:

Being able to pass a new buffer type around to many of the same
functions as zend_strings could introduce its own can of worms.
Something to keep in mind for sure.

Calling any function/method in PHP is an "expensive" operation. Only
once the code gets into the function body and past the input validation
phase can the performant C routine calls finally happen (e.g. native
memcpy/memmove/memset calls that, in turn, use SIMD instructions). It's
all the prior setup that takes the longest amount of time.
I don't think there's a way around that without losing buffer overflow
protections, which means there will ultimately always be a hard upper
limit on what can be done in PHP userland. But we at least now know
that the userland performance ceiling for inline buffer manipulation
is, on average, roughly 2-3 times higher than current userland options.
Based on the benchmarks I've run, that gain largely negates the
function/method call overhead problem for the immediate future.

> Thinking about exactly what those methods should look like leads me to
> my next thought: we should be learning from prior art here. Are there
> other languages which already do this well, which PHP could emulate? Are
> there other languages which already do this *badly*, whose mistakes PHP
> could explicitly learn from?
>
> What comes to my mind immediately is that both Java and C# have
> "StringBuilder" classes, which cover at least some of these use cases.
> C#, in particular, had a lot of very smart people paid to design it,
> able to learn from mistakes Java had already made.

Okay. I'm not entirely sure what the next step here should be. Should
I go research the above, or go back and develop/test and then propose
something concrete in an OO direction and gather feedback at that point,
or should we hash it out a bit more here on the list to get a more
specific direction to go in?

Regardless, I've updated the RFC to reflect your response as Open Issue 9.

Thank you for taking the time to look at the RFC and respond.

-- 
Thomas Hruska
CubicleSoft President

CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.

What software are you looking to build?