Newsgroups: php.internals
Date: Wed, 15 Feb 2023 12:35:10 -0700
To: Lydia de Jongh <flexjoly@gmail.com>, internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Working With Substrings
From: thruska@cubiclesoft.com (Thomas Hruska)

On 2/15/2023 6:03 AM, Lydia de Jongh wrote:
> Hi,
> Very interesting topic! On which I have NO experience 🙈
> 
> 
> In some other languages every variable IS an object..... by default.
> 
> As far as I understand, the code above is meant as internal.
> But what if any variable is a small object.
> Has this been ever considered? Or would it use too much performance?
> 
> $oString = 'my text';
> 
> $oString->toUpper();
> 
> echo $oString;  // 'MY TEXT'

The above represents a significant amount of scope creep but it's 
certainly interesting.  So let's explore it a bit and gauge the response.

The above code will currently throw an error.  Significant global 
adoption of such a change will take a fairly long time - probably a 
decade, maybe longer.

AFAIK, there is nothing technically preventing the core Zend engine from 
accepting a -> token after a string variable and calling a function that 
performs an inline modification of the string.

As a brief test, I just ran the example code through PHP and got:  "PHP 
Fatal error:  Uncaught Error: Call to a member function toUpper() on 
string in test.php:4"  The error message shows that the Zend engine clearly
already recognizes toUpper() as an attempted function/method call on a 
string...it just doesn't know what to do with it.  So the logic for 
supporting -> method calls on strings appears, at least from my very 
brief test, to already be mostly in place.  Nice!

Supporting this would likely result in two distinct internal functions 
that would have to be maintained.  One inline string-object method 
variant that can avoid copy-on-write (e.g. $var->toUpper()) and one that 
only does copy-on-write (e.g. strtoupper()).  Repeat that for all of the 
existing string functions.  Alternatively, the main function body for 
each function could move into its own function that has a parameter for 
distinguishing the difference between "function (copy) vs. method 
(possibly inline)" calls, which would create some additional overhead 
for the existing ext/standard/string functions.  The average performance 
loss for regular function calls would need to be benchmarked.  Nobody 
likes seeing performance losses even if they end up being a less than 1% 
reduction.  C function calls are way faster than PHP userland but they 
still have some overhead.  This is just a thought exploration of how it 
could be implemented.

With this approach, $var->repeat("\x00", 4096, 50) could start at
position 50 and write 4,096 zero bytes, but that again adds a parameter
for an offset.  Maybe $var[50...4096 + 50]->repeat("\x00", 4096) could
solve that?  That's a bit awkward to look at, requires
adding range support to strings (and maybe arrays too because you know 
someone will want that as well), and probably breaks a lot of things.

However, I'm not sure this idea can be used with virtual buffers that 
expressly set their size.  zend_string (how strings are stored) simply 
doesn't have support for it.  There's a length member but no size 
member.  Internally, the zend_string implementation assumes length + 1 = 
size.
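
To make the length vs. size distinction concrete, here is a rough C
sketch.  These structs are simplified stand-ins that I made up for this
example (sketch_string, sketch_buffer and the helper functions are not
the real zend_string definition or API).  The first only tracks a
length, so storage is always implied to be length + 1.  The second adds
a hypothetical size member so a buffer can be larger than what is in
use:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a length-only string header (NOT the real
 * zend_string definition): the allocation is always sized from len. */
typedef struct {
    uint32_t refcount;
    size_t   len;      /* bytes in use, excluding the trailing NUL */
    char     val[];    /* len + 1 bytes of storage */
} sketch_string;

/* Hypothetical variant with a separate size member, so the buffer can
 * be larger than the bytes currently in use. */
typedef struct {
    uint32_t refcount;
    size_t   len;      /* bytes in use, excluding the trailing NUL */
    size_t   size;     /* bytes allocated for val, including the NUL */
    char     val[];
} sketch_buffer;

sketch_string *sketch_string_new(const char *src, size_t len)
{
    assert(src != NULL);
    sketch_string *s = malloc(sizeof(*s) + len + 1);
    if (s == NULL) return NULL;
    s->refcount = 1;
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';
    return s;
}

/* Storage implied by a length-only header: always len + 1. */
size_t sketch_string_storage(const sketch_string *s)
{
    return s->len + 1;
}

/* Allocate a zero-filled buffer of `size` bytes with nothing in use. */
sketch_buffer *sketch_buffer_new(size_t size)
{
    assert(size > 0);
    sketch_buffer *b = calloc(1, sizeof(*b) + size);
    if (b == NULL) return NULL;
    b->refcount = 1;
    b->size = size;
    return b;
}

/* Change the used length without reallocating; returns 0 if the new
 * length plus its NUL terminator would not fit in the buffer. */
int sketch_buffer_set_len(sketch_buffer *b, size_t len)
{
    if (len + 1 > b->size) return 0;
    b->len = len;
    b->val[len] = '\0';
    return 1;
}
```

With the second layout, a 4,096 byte buffer could report a length of 0
and grow to 4,095 bytes in use without another allocation.  The first
layout has nowhere to record that.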


If you got this far and know how PHP, C, and CPU hardware works, you can 
skip ahead to the last two paragraphs.  The next few paragraphs delve
into some details to try to explain to Lydia (and others who are 
following along) what's going on under the hood with why I focused on 
substrings.  Apologies in advance for my rambling.


Avoiding copy-on-write requires the internal reference count total 
(refcount) to effectively be 1.  Reference counting helps reduce the 
number of times a copy is made.  Fewer copies generally result in
faster performance.  A refcount of 1 does happen more frequently when 
inside a loop.  In real world code, depending on what is being done, the 
first loop iteration might have many references to a string while the 
second loop iteration that is operating on the same data might have a 
refcount of just one.  This situation happens frequently enough to 
consider inline options.
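
As a rough illustration of the refcount check, here is a minimal C
sketch.  The rc_string type and its functions are hypothetical names
for this example only and are much simpler than what the engine
actually does.  The point is just the branch: modify in place when the
refcount is 1, otherwise make a private copy first:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified refcounted string; a stand-in, not the real zend_string. */
typedef struct {
    unsigned refcount;
    size_t   len;
    char     val[];
} rc_string;

rc_string *rc_string_new(const char *src, size_t len)
{
    assert(src != NULL);
    rc_string *s = malloc(sizeof(*s) + len + 1);
    if (s == NULL) return NULL;
    s->refcount = 1;
    s->len = len;
    memcpy(s->val, src, len);
    s->val[len] = '\0';
    return s;
}

/* Drop one reference; free the buffer when nobody holds it anymore. */
void rc_string_release(rc_string *s)
{
    if (--s->refcount == 0) free(s);
}

/* Uppercase the string held in *slot.  With a refcount of 1 the bytes
 * are modified in place.  Otherwise this holder gets a private copy
 * first so that other holders never observe the change. */
int rc_string_toupper(rc_string **slot)
{
    assert(slot != NULL && *slot != NULL);
    rc_string *s = *slot;
    if (s->refcount > 1) {
        rc_string *copy = rc_string_new(s->val, s->len);
        if (copy == NULL) return 0;
        s->refcount--;      /* this holder lets go of the shared buffer */
        *slot = s = copy;
    }
    for (size_t i = 0; i < s->len; i++)
        s->val[i] = (char)toupper((unsigned char)s->val[i]);
    return 1;
}
```

Only the refcount > 1 branch pays for an allocation and a copy.  When
the refcount is already 1, the existing buffer is reused as-is.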

Memory allocation is one of the slower operations in computer programs. 
Ideally, a program makes as few allocation requests to the system as 
possible.  PHP avoids making system calls to allocate memory by pooling 
reclaimed memory into multiple memory pools for reuse.  Copying strings 
from one buffer to another buffer is also avoided by leveraging 
reference counting.  However, this creates the scenario where every 
modified string has its buffer copied from one buffer to the next. 
Let's take this fairly common but simple code to see what happens in
the Zend engine:

$pos = strrpos($str, "/");
$str = substr($str, 0, $pos + 1);

The above substr() results in one "logical" memory allocation and one 
logical free operation (whether it actually makes system calls to 
allocate/free memory is way beyond the scope of this paragraph) and one 
memory copy operation.  We say we want the substring of a certain size, 
which allocates space to create a temporary copy that can hold that 
string.  Then the data is copied from one buffer to another buffer. 
Then we assign the temporary copy to the original input variable.  That
causes the original value, assuming nothing else is referencing it (i.e.
its refcount drops to 0), to be released back to the memory pool for
future allocations.  All of that is
done transparently to the user so the user generally doesn't have to 
worry about memory allocation strategies.  There's no good way to detect 
this situation to optimize it, although I'm sure the JIT does try to do 
so on some level when it is enabled.  As a side-effect, there are also
no built-in tools currently available to control memory allocation
strategies for individual allocations when the need does arise.  There
are some controls for managing garbage collection but those have global 
impact.
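
The allocate/copy/free sequence described above looks roughly like this
in plain C.  This is a sketch using malloc/free directly instead of
PHP's memory pools, and copy_substr, keep_through_last_slash and the
sketch_* counters are names invented for this example.  The counters
exist only to make the cost of one call visible:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Counters so the cost of each call is visible; illustration only. */
size_t sketch_allocs = 0, sketch_frees = 0, sketch_bytes_copied = 0;

/* Copying substring: allocate a new buffer and copy `len` bytes into
 * it.  The caller must ensure start + len stays within src. */
char *copy_substr(const char *src, size_t start, size_t len)
{
    assert(src != NULL);
    char *out = malloc(len + 1);
    if (out == NULL) return NULL;
    sketch_allocs++;
    memcpy(out, src + start, len);
    sketch_bytes_copied += len;
    out[len] = '\0';
    return out;
}

/* Rough equivalent of:
 *   $pos = strrpos($str, "/");
 *   $str = substr($str, 0, $pos + 1);
 * The caller's pointer is replaced and the old buffer is freed.
 * Returns 0 and leaves *str untouched if there is no slash or the
 * allocation fails (a simplification compared to PHP's behavior). */
int keep_through_last_slash(char **str)
{
    const char *slash = strrchr(*str, '/');
    if (slash == NULL) return 0;
    size_t keep = (size_t)(slash - *str) + 1;
    char *tmp = copy_substr(*str, 0, keep);   /* allocate + copy */
    if (tmp == NULL) return 0;
    free(*str);                               /* release the original */
    sketch_frees++;
    *str = tmp;                               /* assign the temporary */
    return 1;
}
```

For "/usr/local/bin", one call performs one allocation, one free and an
11 byte copy just to drop the last three characters.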

Doing that operation one time is fast enough and not really a problem. 
Doing it 1,000,000 times in a loop is where we end up constantly copying 
memory around when we could potentially work on the same memory buffer 
the entire time.  We still might end up using the same memory buffers 
over and over due to recycling them through the PHP memory pool, which 
means the buffers might get to sit in the L1 or L2 cache in the CPU, but 
it does leave some performance on the table because copying a buffer or 
portions of it repeatedly can be an unnecessary operation.  Buffers that
are larger than the CPU's caches are going to suffer the most because
there will be constant requests to main memory for the data that the CPU
needs to modify, which repeatedly evicts cache lines and stalls the core
while it waits for more data to arrive.  That's not exactly
optimal/ideal.  A buffer that is modified inline is more likely to stay
in the L1 and L2 caches and therefore be much closer to the CPU core,
resulting in notably faster performance.

Passing a pointer in C is much faster than copying memory.  The problem is
exposing pointers to userland, especially in Internet-facing software. 
Pointers are notoriously unsafe - just look at the zillion buffer 
overflow vulnerabilities (CVEs) that are reported annually across all 
software products.  Copy-on-write, by comparison, is a much safer 
operation at the cost of performance.  However, pointers let us just 
point at a substring or general chunk of memory instead of copying it, 
which significantly reduces the overhead since pointers are simple 
integer values that contain a memory address.  And those values are 
small enough to sit in CPU registers, which are blazing fast.  CPUs only 
have a handful of registers though because each register dramatically 
increases the cost of the CPU die.  So if we can just point at the 
memory we want to "extract" instead of actually copying the data into 
its own string object, we can potentially save a ton of CPU cycles, 
especially when working with data inside a loop.
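
For comparison, "pointing at" a substring instead of copying it can be
sketched in C as a pointer plus a length.  The str_view type and both
functions are hypothetical and exist only for this example.  Note that
the bounds are clamped, which is the kind of safety check that would
have to exist before anything like this could be exposed to userland:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A non-owning view: a pointer into someone else's buffer plus a
 * length.  Hypothetical type for illustration.  It is only valid while
 * the underlying buffer is alive and has not been reallocated. */
typedef struct {
    const char *ptr;
    size_t      len;
} str_view;

/* Create a view of `len` bytes starting at `start`, clamped to the
 * source length so the view never points past the end of the buffer.
 * No memory is allocated and no bytes are copied. */
str_view view_substr(const char *src, size_t src_len,
                     size_t start, size_t len)
{
    assert(src != NULL);
    str_view v;
    if (start > src_len) start = src_len;
    if (len > src_len - start) len = src_len - start;
    v.ptr = src + start;
    v.len = len;
    return v;
}

/* Compare a view against a NUL-terminated string without copying. */
int view_equals(str_view v, const char *s)
{
    return strlen(s) == v.len && memcmp(v.ptr, s, v.len) == 0;
}
```

Taking the first 11 bytes of "/usr/local/bin" this way costs two
assignments, no matter how large the source string is.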


Overall, I think substrings offer the most obvious/apparent area for 
performance gains and probably have, implementation details aside, the 
least amount of friction.  But maybe we should consider the larger 
ecosystem of string functions as well?  Or should this just be a 
possible longer term idea that requires more thought and research and 
thus the scope should be limited and we put Lydia's idea under Future 
Scope in the RFC?  Other thoughts/comments?

Added as Open Issue 10 to the RFC.  Thank you for your input.

-- 
Thomas Hruska
CubicleSoft President
