Newsgroups: php.internals
Date: Wed, 15 Feb 2023 12:35:10 -0700
From: thruska@cubiclesoft.com (Thomas Hruska)
To: Lydia de Jongh, internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Working With Substrings

On 2/15/2023 6:03 AM, Lydia de Jongh wrote:
> Hi,
> Very interesting topic! On which I have NO experience 🙈
>
>
> In some other languages every variable IS an object..... by default.
>
> As far as I understand, the code above is meant as internal.
> But what if any variable is a small object.
> Has this been ever considered? Or would it use too much performance?
>
> $oString = 'my text';
>
> $oString->toUpper();
>
> echo $oString; // 'MY TEXT'

The above represents a significant amount of scope creep but it's certainly interesting. So let's explore it a bit and gauge the response.

The above code will currently throw an error. Significant global adoption of such a change will take a fairly long time - probably a decade, maybe longer.

AFAIK, there is nothing technically preventing the core Zend engine from accepting a -> token after a string variable and calling a function that performs an inline modification of the string. As a brief test, I just ran the example code through PHP and got:

"PHP Fatal error: Uncaught Error: Call to a member function toUpper() on string in test.php:4"

The error message shows that Zend engine clearly already recognizes toUpper() as an attempted function/method call on a string...it just doesn't know what to do with it. So the logic for supporting -> method calls on strings appears, at least from my very brief test, to already be mostly in place. Nice!

Supporting this would likely result in two distinct internal functions that would have to be maintained.
One would be an inline string-object method variant that can avoid copy-on-write (e.g. $var->toUpper()) and the other a variant that always does copy-on-write (e.g. strtoupper()). Repeat that for all of the existing string functions. Alternatively, the main function body for each function could move into its own function that has a parameter for distinguishing between "function (copy) vs. method (possibly inline)" calls, which would create some additional overhead for the existing ext/standard/string functions. The average performance loss for regular function calls would need to be benchmarked. Nobody likes seeing performance losses even if they end up being a less than 1% reduction. C function calls are way faster than PHP userland but they still have some overhead. This is just a thought exploration of how it could be implemented.

With this approach, a $var->repeat("\x00", 4096, 50) could work to start at position 50 and write 4,096 zero bytes. But that again adds a parameter for an offset. But maybe $var[50...4096 + 50]->repeat("\x00", 4096) could solve that? That's a bit awkward to look at, requires adding range support to strings (and maybe arrays too because you know someone will want that as well), and probably breaks a lot of things.

However, I'm not sure this idea can be used with virtual buffers that expressly set their size. zend_string (how strings are stored) simply doesn't have support for it. There's a length member but no size member. Internally, the zend_string implementation assumes length + 1 = size.

If you got this far and know how PHP, C, and CPU hardware work, you can skip ahead to the last two paragraphs. The next few paragraphs delve into some details to try to explain to Lydia (and others who are following along) what's going on under the hood with why I focused on substrings. Apologies in advance for my rambling.

Avoiding copy-on-write requires the internal reference count total (refcount) to effectively be 1.
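
To make the "function (copy) vs. method (possibly inline)" idea and the refcount-of-1 condition a bit more concrete, here is a minimal C sketch. Everything in it is hypothetical: rc_string and its functions are invented for illustration and are not the real zend_string layout or any Zend engine API. It only shows one shared function body with a flag, where the inline path is taken only when the flag allows it and the refcount is exactly 1.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted string; NOT the real zend_string layout. */
typedef struct {
    unsigned refcount; /* number of variables sharing this buffer */
    size_t len;        /* bytes in use; the buffer holds len + 1 (NUL) */
    char *val;
} rc_string;

/* Create a new string with a refcount of 1. Returns NULL on failure. */
rc_string *rc_string_new(const char *src)
{
    rc_string *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->len = strlen(src);
    s->val = malloc(s->len + 1);
    if (s->val == NULL) { free(s); return NULL; }
    memcpy(s->val, src, s->len + 1);
    s->refcount = 1;
    return s;
}

/* Drop one reference; free the buffer once nothing refers to it. */
void rc_string_release(rc_string *s)
{
    if (s != NULL && --s->refcount == 0) {
        free(s->val);
        free(s);
    }
}

/*
 * Shared body for both call styles (an assumption, not Zend code).
 * allow_inline = 0 models strtoupper(): always work on a copy.
 * allow_inline = 1 models $var->toUpper(): reuse the buffer, but only
 * when refcount is exactly 1, otherwise fall back to copying.
 * Returns s itself when modified in place, else a new string that the
 * caller must release. Returns NULL if the copy could not be allocated.
 */
rc_string *rc_string_toupper(rc_string *s, int allow_inline)
{
    rc_string *out = s;
    if (!allow_inline || s->refcount != 1) {
        out = rc_string_new(s->val); /* copy-on-write path */
        if (out == NULL) return NULL;
    }
    for (size_t i = 0; i < out->len; i++) {
        out->val[i] = (char)toupper((unsigned char)out->val[i]);
    }
    return out;
}
```

The extra branch on allow_inline and refcount is the kind of per-call overhead that would need benchmarking. The sketch also uses strlen(), so unlike PHP strings it is not binary safe.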
Reference counting helps reduce the number of times a copy is made. Fewer copies generally result in faster performance. A refcount of 1 does happen more frequently when inside a loop. In real world code, depending on what is being done, the first loop iteration might have many references to a string while the second loop iteration that is operating on the same data might have a refcount of just one. This situation happens frequently enough to consider inline options.

Memory allocation is one of the slower operations in computer programs. Ideally, a program makes as few allocation requests to the system as possible. PHP avoids making system calls to allocate memory by pooling reclaimed memory into multiple memory pools for reuse. Copying strings from one buffer to another buffer is also avoided by leveraging reference counting. However, this creates the scenario where every modified string has its buffer copied from one buffer to the next.

Let's take this fairly common but simple code to see what happens in Zend engine:

$pos = strrpos($str, "/");
$str = substr($str, 0, $pos + 1);

The above substr() results in one "logical" memory allocation and one logical free operation (whether it actually makes system calls to allocate/free memory is way beyond the scope of this paragraph) and one memory copy operation. We say we want the substring of a certain size, which allocates space to create a temporary copy that can hold that string. Then the data is copied from one buffer to another buffer. Then we assign the temporary copy to the original input variable. That assignment releases the original value, which, assuming nothing else is referencing it (aka a refcount of 0), eventually re-enters the memory pool for future allocations. All of that is done transparently to the user so the user generally doesn't have to worry about memory allocation strategies.
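
As a rough illustration of the difference, here is a C sketch of the same "keep everything up to the last slash" operation done two ways. The function names are made up for this example and plain malloc() stands in for PHP's pooled allocator, so this is only a picture of the logical steps, not what substr() actually does internally.

```c
#include <stdlib.h>
#include <string.h>

/* Length of the prefix ending just past the last '/' in buf[0..len),
 * or 0 when there is no '/' (a simplification of the PHP snippet). */
size_t dir_prefix_len(const char *buf, size_t len)
{
    while (len > 0 && buf[len - 1] != '/') len--;
    return len;
}

/* Copying variant: one allocation plus one memory copy, like the
 * logical steps described for substr(). The caller frees the result. */
char *prefix_copy(const char *buf, size_t len, size_t keep)
{
    if (keep > len) keep = len;
    char *out = malloc(keep + 1);
    if (out == NULL) return NULL;
    memcpy(out, buf, keep);
    out[keep] = '\0';
    return out;
}

/* Inline variant: shorten the existing buffer. No allocation and no
 * copy, but only safe when nothing else shares the buffer.
 * buf must hold at least len + 1 bytes. Returns the new length. */
size_t prefix_inline(char *buf, size_t len, size_t keep)
{
    if (keep > len) keep = len;
    buf[keep] = '\0';
    return keep;
}
```

For "/var/www/index.php", both variants end up with "/var/www/". The copying one leaves the original buffer alone and hands back a second buffer, while the inline one writes a single terminator byte into the original.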
There's no good way to detect this situation to optimize it, although I'm sure the JIT does try to do so on some level when it is enabled. As a side-effect, there are also no built-in tools currently available to care about memory allocation strategies for individual allocations when the need does arise. There are some controls for managing garbage collection but those have global impact.

Doing that operation one time is fast enough and not really a problem. Doing it 1,000,000 times in a loop is where we end up constantly copying memory around when we could potentially work on the same memory buffer the entire time. We still might end up using the same memory buffers over and over due to recycling them through the PHP memory pool, which means the buffers might get to sit in the L1 or L2 cache in the CPU, but it does leave some performance on the table because copying a buffer or portions of it repeatedly can be an unnecessary operation. Buffers that are larger than the CPU's cache line sizes are going to suffer the most because there will be constant requests to main memory for the information that the CPU needs to modify and will constantly flush the cache lines and stall out while waiting for more data to arrive. That's not exactly optimal/ideal. Modifying the same buffer inline will more likely stay in the L1 and L2 cache lines and therefore be much closer to the CPU core, resulting in notably faster performance.

Pointers in C are much faster than copying memory. The problem is exposing pointers to userland, especially in Internet-facing software. Pointers are notoriously unsafe - just look at the zillion buffer overflow vulnerabilities (CVEs) that are reported annually across all software products. Copy-on-write, by comparison, is a much safer operation at the cost of performance.
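
One way to picture the trade-off is a bounds-checked "view": a pointer into an existing buffer plus a length, where the range is validated once when the view is made. The C sketch below is purely illustrative. str_view and its functions are hypothetical names and nothing here is proposed RFC syntax or Zend engine code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view: points into someone else's buffer. */
typedef struct {
    const char *ptr;
    size_t len;
} str_view;

/* Make a view of buf[offset .. offset + count) without copying.
 * Returns 0 on success or -1 if the range falls outside the buffer.
 * The subtraction form avoids overflow in offset + count. */
int str_view_make(const char *buf, size_t buf_len,
                  size_t offset, size_t count, str_view *out)
{
    if (offset > buf_len || count > buf_len - offset) return -1;
    out->ptr = buf + offset;
    out->len = count;
    return 0;
}

/* Compare a view against a NUL-terminated string, still no copy. */
int str_view_equals(str_view v, const char *s)
{
    size_t n = strlen(s);
    return v.len == n && memcmp(v.ptr, s, n) == 0;
}
```

Making a view costs two comparisons and two assignments regardless of how long the substring is. The catch is lifetime: the view is only valid while the underlying buffer stays alive and is not reallocated, which is exactly the kind of hazard that makes raw pointers risky to expose to userland.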
However, pointers let us just point at a substring or general chunk of memory instead of copying it, which significantly reduces the overhead since pointers are simple integer values that contain a memory address. And those values are small enough to sit in CPU registers, which are blazing fast. CPUs only have a handful of registers though because each register dramatically increases the cost of the CPU die. So if we can just point at the memory we want to "extract" instead of actually copying the data into its own string object, we can potentially save a ton of CPU cycles, especially when working with data inside a loop.

Overall, I think substrings offer the most obvious/apparent area for performance gains and probably have, implementation details aside, the least amount of friction. But maybe we should consider the larger ecosystem of string functions as well? Or should this just be a possible longer term idea that requires more thought and research and thus the scope should be limited and we put Lydia's idea under Future Scope in the RFC? Other thoughts/comments?

Added as Open Issue 10 to the RFC. Thank you for your input.

-- 
Thomas Hruska
CubicleSoft President

CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.

What software are you looking to build?