[RFC] Working With Substrings

2 years ago by Thomas Hruska

Hello Internals,

I would like to start the discussion on adding several functions and
parameters to existing functions for improved substring handling in PHP:

https://wiki.php.net/rfc/working_with_substrings

Please see the Open Issues section for a series of possible
issues/questions that I anticipated could come up, but I'm sure there
will be others. Some are technical questions related to the source code
while others deal with bikeshedding (function names, etc).

--
Thomas Hruska
CubicleSoft President

CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.

What software are you looking to build?

2 years ago by Rowan Tommins

Hi Thomas,

Thanks for your effort on this, I think efficient string handling
functions would be a major help for the ecosystem, allowing library
authors to do things in plain PHP code which currently defer to C
extensions just for performance.

My first thought opening the RFC was to see a function signature with 9
arguments and immediately wonder how to refactor it into something more
manageable. Just writing tests for all the combinations sounds like a
nightmare, let alone understanding code that uses them all.

As I read through, I had a similar feeling about the need to
copy-and-paste the same two parameters onto so many functions.

Luckily, I think the RFC contains the seed of the solution to both
problems: what you refer to as "virtual buffers". These seem to be
crying out to be a new data type, with their own API - probably using OO
style, given general fashions.

Framed around that, I think we can split out a few different concerns:

- Methods to take a string, and make a new, writeable buffer pointing at
  all or part of it
- Methods to access parts of a buffer, as a string or another buffer
- Methods to efficiently write to, delete from, or overwrite, parts of a
  buffer
- Methods to explicitly manage the memory used by the buffer
- Finally, support for writing to, or reading from, a buffer instead of
  a string in a number of existing functions

Thinking about exactly what those methods should look like leads me to
my next thought: we should be learning from prior art here. Are there
other languages which already do this well, which PHP could emulate? Are
there other languages which already do this badly, whose mistakes PHP
could explicitly learn from?

What comes to my mind immediately is that both Java and C# have
"StringBuilder" classes, which cover at least some of these use cases.
C#, in particular, had a lot of very smart people paid to design it,
able to learn from mistakes Java had already made.

Regards,

--
Rowan Tommins
[IMSoP]

2 years ago by Thomas Hruska

> Hi Thomas,
>
> Thanks for your effort on this, I think efficient string handling
> functions would be a major help for the ecosystem, allowing library
> authors to do things in plain PHP code which currently defer to C
> extensions just for performance.
>
> My first thought opening the RFC was to see a function signature with 9
> arguments and immediately wonder how to refactor it into something more
> manageable. Just writing tests for all the combinations sounds like a
> nightmare, let alone understanding code that uses them all.

I agree with and understand this sentiment.

> As I read through, I had a similar feeling about the need to
> copy-and-paste the same two parameters onto so many functions.
>
> Luckily, I think the RFC contains the seed of the solution to both
> problems: what you refer to as "virtual buffers". These seem to be
> crying out to be a new data type, with their own API - probably using OO
> style, given general fashions.

I thought about that but didn't know how well it would be received nor,
perhaps more importantly, the direction it should take (i.e. a formal
Zend type in the engine, extending the existing zend_string type, a
class, some combination, or something else entirely). All of the more
advanced options I came up with would have required some code changes to
the PHP source itself with a new data type being the most involved and
probably the most controversial.

As a result, I ended up deciding to go the "simple" function(s) route in
the qolfuncs extension and then, after that, use the RFC process to
kickstart the conversation while also showing a proof-of-concept that
demonstrates performance can be notably improved in certain areas that
have traditionally not done well. I figured it would be kind of
difficult to get folks excited about strings/buffers (yawn!) if there
weren't also some sort of ballpark in-context metrics/benchmarks to show
the potential gains to make the effort worthwhile.

> Framed around that, I think we can split out a few different concerns:
>
> - Methods to take a string, and make a new, writeable buffer pointing at
>   all or part of it
> - Methods to access parts of a buffer, as a string or another buffer
> - Methods to efficiently write to, delete from, or overwrite, parts of a
>   buffer
> - Methods to explicitly manage the memory used by the buffer
> - Finally, support for writing to, or reading from, a buffer instead of
>   a string in a number of existing functions

Those sound fine. Just a couple thoughts:

Being able to pass a new buffer type around to many of the same
functions as zend_strings could introduce its own can of worms.
Something to keep in mind for sure.

Calling any function/method in PHP is an "expensive" operation. Once
the code finally gets into the function body and past the input
validation phase is when performant C routine calls can finally happen
(e.g. native memcpy/memmove/memset calls that, in turn, use SIMD
instructions). It's all the prior setup that takes the longest amount
of time. I don't think there's a way around that without losing buffer
overflow protections, which means there will ultimately always be a hard
upper limit on what can be done in PHP userland. But we at least now
know the userland performance ceiling for inline buffer manipulation is
somewhere roughly around 2-3 times higher than current userland options
on average. Based on the benchmarks I've run, that gain largely negates
the function/method call overhead problem for the immediate future.

> Thinking about exactly what those methods should look like leads me to
> my next thought: we should be learning from prior art here. Are there
> other languages which already do this well, which PHP could emulate? Are
> there other languages which already do this badly, whose mistakes PHP
> could explicitly learn from?
>
> What comes to my mind immediately is that both Java and C# have
> "StringBuilder" classes, which cover at least some of these use cases.
> C#, in particular, had a lot of very smart people paid to design it,
> able to learn from mistakes Java had already made.

Okay.

I'm not entirely sure what the next step here should be. Should I go
research the above, or go back and develop/test and then propose
something concrete in an OO direction and gather feedback at that point,
or should we hash it out a bit more here on the list to get a more
specific direction to go in?

Regardless, I've updated the RFC to reflect your response as Open Issue
9. Thank you for taking the time to look at the RFC and responding.

--
Thomas Hruska

2 years ago by Rowan Tommins

> I thought about that but didn't know how well it would be received nor, perhaps more importantly, the direction it should take (i.e. a formal Zend type in the engine, extending the existing zend_string type, a class, some combination, or something else entirely). All of the more advanced options I came up with would have required some code changes to the PHP source itself with a new data type being the most involved and probably the most controversial.

My instinct was that it could just be a built-in class, with an internal pointer to a zend_string that's completely invisible to userland. Something like how the SimpleXML and DOM objects just point into a libxml parse result.

Then to add to existing functions requires changing an argument type from string to string|Buffer, rather than adding new arguments.

No change to the type system needed, internally or externally, just some code to unwrap the pointer. But perhaps I'm being naive and oversimplifying, as I don't have a deep understanding of the engine.

> I'm not entirely sure what the next step here should be. Should I go research the above, or go back and develop/test and then propose something concrete in an OO direction and gather feedback at that point, or should we hash it out a bit more here on the list to get a more specific direction to go in?

Well, those were just my thoughts; maybe someone else will come along shortly with a very different take.

Regards,

--
Rowan Tommins
[IMSoP]

2 years ago by Rowan Tommins

> My instinct was that it could just be a built-in class, with an internal pointer to a zend_string that's completely invisible to userland. Something like how the SimpleXML and DOM objects just point into a libxml parse result.

To make this a bit more concrete, what I was picturing was that instead of this example:

str_splice($this->pagemap[$pagepos][0], $x2, $size2, $data, $x, $size);

You would have something like this:

// Wrap an existing zend_string in an object
$destBuffer = Buffer::fromString($this->pagemap[$pagepos][0]);
// Similar, but also track start and end offsets
$sourceBuffer = Buffer::fromSubString($data, $x, $size);
// Now do the actual memory copy
$destBuffer->splice($x2, $size2, $sourceBuffer);

The explicit size handling parameters of str_splice, and the str_realloc function, would be replaced with methods to get and set the allocated length of a Buffer object. The buffer would only be shrunk when requested, or when cast to string.

The $src_repeat argument feels somewhat out of place in a "splice" operation anyway, and perhaps should be part of a different method.
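
To give a rough idea of the shape of that API, here is a userland sketch. It is not a proposed implementation: a real version would presumably live in C and avoid the copies that plain PHP strings force here. The `Buffer` class and its `fromString()`, `fromSubString()`, `splice()` and `__toString()` methods are hypothetical, with names and signatures simply taken from the example above.

```php
<?php
declare(strict_types=1);

// Hypothetical userland stand-in for the Buffer idea; illustration only.
final class Buffer
{
    private function __construct(
        private string $data,
        private int $offset,
        private int $length
    ) {}

    // Wrap a whole string.
    public static function fromString(string $s): self
    {
        return new self($s, 0, strlen($s));
    }

    // Track a window into a string instead of extracting it up front.
    public static function fromSubString(string $s, int $offset, int $length): self
    {
        if ($offset < 0 || $length < 0 || $offset + $length > strlen($s)) {
            throw new OutOfRangeException('Window falls outside the string');
        }
        return new self($s, $offset, $length);
    }

    // Replace $length bytes at $offset with the source buffer's window.
    public function splice(int $offset, int $length, Buffer $src): void
    {
        if ($offset < 0 || $length < 0 || $offset + $length > $this->length) {
            throw new OutOfRangeException('Splice range falls outside the buffer');
        }
        $chunk = substr($src->data, $src->offset, $src->length);
        $this->data = substr_replace($this->data, $chunk, $this->offset + $offset, $length);
        $this->length += strlen($chunk) - $length;
    }

    // Casting to string yields only the tracked window.
    public function __toString(): string
    {
        return substr($this->data, $this->offset, $this->length);
    }
}

$dest = Buffer::fromString('Hello, World!');
$src  = Buffer::fromSubString('xxPHPxx', 2, 3);
$dest->splice(7, 5, $src);
echo $dest, "\n"; // Hello, PHP!
```

The point of the sketch is only the call pattern: offsets and lengths travel with the object, so `splice()` needs three arguments rather than six or more.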

On a different note, don't forget that we have named parameters now, which is a big help with signatures like this; this example:

$vsize = str_splice($str, $pos, $pos2 - $pos + 1, $embed, 0, null, 1, false, $vsize);

Looks slightly less scary written like this:

$vsize = str_splice($str, $pos, $pos2 - $pos + 1, $embed, shrink: false, dst_lastsize: $vsize);
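
For anyone who has not used named arguments yet (PHP 8.0+), here is a small runnable sketch. `demo_splice()` is a made-up stand-in with a simplified, assumed parameter list; it is not the RFC's `str_splice()` and only mimics the shape of a long signature with trailing defaults.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in: parameter names and defaults are assumptions,
// chosen only to resemble a long signature with trailing options.
function demo_splice(
    string $dst,
    int $dst_offset,
    int $dst_length,
    string $src,
    int $src_offset = 0,
    ?int $src_length = null,
    int $src_repeat = 1,
    bool $shrink = true
): string {
    $chunk = str_repeat(substr($src, $src_offset, $src_length), $src_repeat);
    $out = substr_replace($dst, $chunk, $dst_offset, $dst_length);
    // With $shrink off, pad back up to the original size instead of shrinking.
    return $shrink ? $out : str_pad($out, strlen($dst), "\0");
}

// Positional: every default before the one we care about must be spelled out.
$a = demo_splice('abcdef', 1, 4, 'XY', 0, null, 1, false);

// Named: skip the defaults and name only what differs.
$b = demo_splice('abcdef', 1, 4, 'XY', shrink: false);

var_dump($a === $b); // bool(true)
```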

Regards,

--
Rowan Tommins
[IMSoP]

2 years ago by Lydia de Jongh

Hi,
Very interesting topic! On which I have NO experience 🙈

On Wed, 15 Feb 2023 at 08:02, Rowan Tommins rowan.collins@gmail.com wrote:

> On 15 February 2023 05:18:50 GMT, Rowan Tommins rowan.collins@gmail.com
> wrote:
>
>> My instinct was that it could just be a built-in class, with an internal
>> pointer to a zend_string that's completely invisible to userland. Something
>> like how the SimpleXML and DOM objects just point into a libxml parse
>> result.
>
> To make this a bit more concrete, what I was picturing was that instead of
> this example:
>
> str_splice($this->pagemap[$pagepos][0], $x2, $size2, $data, $x, $size);
>
> You would have something like this:
>
> // Wrap an existing zend_string in an object
> $destBuffer = Buffer::fromString($this->pagemap[$pagepos][0]);
> // Similar, but also track start and end offsets
> $sourceBuffer = Buffer::fromSubString($data, $x, $size);
> // Now do the actual memory copy
> $destBuffer->splice($x2, $size2, $sourceBuffer);

In some other languages every variable IS an object..... by default.

As far as I understand, the code above is meant as internal.
But what if any variable is a small object.
Has this been ever considered? Or would it use too much performance?

$oString = 'my text';

$oString->toUpper();

echo $oString; // 'MY TEXT'

Greetz, Lydia

2 years ago by Thomas Hruska

> Hi,
> Very interesting topic! On which I have NO experience 🙈
>
> In some other languages every variable IS an object..... by default.
>
> As far as I understand, the code above is meant as internal.
> But what if any variable is a small object.
> Has this been ever considered? Or would it use too much performance?
>
> $oString = 'my text';
>
> $oString->toUpper();
>
> echo $oString; // 'MY TEXT'

The above represents a significant amount of scope creep but it's
certainly interesting. So let's explore it a bit and gauge the response.

The above code will currently throw an error. Significant global
adoption of such a change will take a fairly long time - probably a
decade, maybe longer.

AFAIK, there is nothing technically preventing the core Zend engine from
accepting a -> token after a string variable and calling a function that
performs an inline modification of the string.

As a brief test, I just ran the example code through PHP and got: "PHP
Fatal error: Uncaught Error: Call to a member function toUpper() on
string in test.php:4" The error message shows that Zend engine clearly
already recognizes toUpper() as an attempted function/method call on a
string...it just doesn't know what to do with it. So the logic for
supporting -> method calls on strings appears, at least from my very
brief test, to already be mostly in place. Nice!
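
The test is easy to reproduce. Since PHP 7 this failure is thrown as a catchable `Error`, so a quick sketch like the following shows the message without a fatal exit:

```php
<?php
$oString = 'my text';

try {
    $oString->toUpper(); // strings have no methods today
} catch (Error $e) {
    $message = $e->getMessage();
    echo $message, "\n"; // Call to a member function toUpper() on string
}
```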

Supporting this would likely result in two distinct internal functions
that would have to be maintained. One inline string-object method
variant that can avoid copy-on-write (e.g. $var->toUpper()) and one that
only does copy-on-write (e.g. strtoupper()). Repeat that for all of the
existing string functions. Alternatively, the main function body for
each function could move into its own function that has a parameter for
distinguishing the difference between "function (copy) vs. method
(possibly inline)" calls, which would create some additional overhead
for the existing ext/standard/string functions. The average performance
loss for regular function calls would need to be benchmarked. Nobody
likes seeing performance losses even if they end up being a less than 1%
reduction. C function calls are way faster than PHP userland but they
still have some overhead. This is just a thought exploration of how it
could be implemented.

With this approach, a $var->repeat("\x00", 4096, 50) could work to start
at position 50 and write 4,096 zero bytes. But that again adds a
parameter for an offset. But maybe $var[50...4096 + 50]->repeat("\x00",
4096) could solve that? That's a bit awkward to look at, requires
adding range support to strings (and maybe arrays too because you know
someone will want that as well), and probably breaks a lot of things.

However, I'm not sure this idea can be used with virtual buffers that
expressly set their size. zend_string (how strings are stored) simply
doesn't have support for it. There's a length member but no size
member. Internally, the zend_string implementation assumes length + 1 =
size.

If you got this far and know how PHP, C, and CPU hardware works, you can
skip ahead to the last two paragraphs. The next few paragraphs delve
into some details to try to explain to Lydia (and others who are
following along) what's going on under the hood with why I focused on
substrings. Apologies in advance for my rambling.

Avoiding copy-on-write requires the internal reference count total
(refcount) to effectively be 1. Reference counting helps reduce the
number of times a copy is made. Fewer copies generally results in
faster performance. A refcount of 1 does happen more frequently when
inside a loop. In real world code, depending on what is being done, the
first loop iteration might have many references to a string while the
second loop iteration that is operating on the same data might have a
refcount of just one. This situation happens frequently enough to
consider inline options.

Memory allocation is one of the slower operations in computer programs.
Ideally, a program makes as few allocation requests to the system as
possible. PHP avoids making system calls to allocate memory by pooling
reclaimed memory into multiple memory pools for reuse. Copying strings
from one buffer to another buffer is also avoided by leveraging
reference counting. However, this creates the scenario where every
modified string has its buffer copied from one buffer to the next.
Let's take this fairly common but simple code to see what happens in
Zend engine:

$pos = strrpos($str, "/");
$str = substr($str, 0, $pos + 1);

The above substr() results in one "logical" memory allocation and one
logical free operation (whether it actually makes system calls to
allocate/free memory is way beyond the scope of this paragraph) and one
memory copy operation. We say we want the substring of a certain size,
which allocates space to create a temporary copy that can hold that
string. Then the data is copied from one buffer to another buffer.
Then we assign the temporary copy to the original input variable. That
causes the original value, assuming nothing else is referencing it (aka
a refcount of 0), to eventually re-enter the memory pool for future
allocations and assigns the temporary to the variable. All of that is
done transparently to the user so the user generally doesn't have to
worry about memory allocation strategies. There's no good way to detect
this situation to optimize it, although I'm sure the JIT does try to do
so on some level when it is enabled. As a side-effect, there are also
no built-in tools currently available to care about memory allocation
strategies for individual allocations when the need does arise. There
are some controls for managing garbage collection but those have global
impact.

Doing that operation one time is fast enough and not really a problem.
Doing it 1,000,000 times in a loop is where we end up constantly copying
memory around when we could potentially work on the same memory buffer
the entire time. We still might end up using the same memory buffers
over and over due to recycling them through the PHP memory pool, which
means the buffers might get to sit in the L1 or L2 cache in the CPU, but
it does leave some performance on the table because copying a buffer or
portions of it repeatedly can be an unnecessary operation. Buffers that
are larger than the CPU's cache line sizes are going to suffer the most
because there will be constant requests to main memory for the
information that the CPU needs to modify and will constantly flush the
cache lines and stall out while waiting for more data to arrive. That's
not exactly optimal/ideal. A buffer that is modified inline is more
likely to stay in the L1 and L2 caches and therefore be much
closer to the CPU core, resulting in notably faster performance.
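
A small sketch of the loop case, using only functions that exist today. Both helpers below (hypothetical names) count comma-separated fields. The first re-copies the remainder of the string with substr() on every pass, while the second leaves the string alone and only advances an integer offset through the third parameter of strpos(). No timings are claimed here; measure on your own data.

```php
<?php
declare(strict_types=1);

// Copy-heavy: each pass allocates a new string holding the remainder.
function count_fields_copying(string $str): int
{
    $count = 1;
    while (($pos = strpos($str, ',')) !== false) {
        $str = substr($str, $pos + 1); // allocation + memory copy
        $count++;
    }
    return $count;
}

// Offset-based: the original buffer is never copied or modified.
function count_fields_offset(string $str): int
{
    $count = 1;
    $offset = 0;
    while (($pos = strpos($str, ',', $offset)) !== false) {
        $offset = $pos + 1; // just move an integer forward
        $count++;
    }
    return $count;
}

$line = 'a,b,c,d';
var_dump(count_fields_copying($line) === count_fields_offset($line)); // bool(true)
```

The offset variant is the userland equivalent of "pointing at" a substring: the position is tracked as an integer and the bytes stay where they are.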

Pointers in C are much faster than copying memory. The problem is
exposing pointers to userland, especially in Internet-facing software.
Pointers are notoriously unsafe - just look at the zillion buffer
overflow vulnerabilities (CVEs) that are reported annually across all
software products. Copy-on-write, by comparison, is a much safer
operation at the cost of performance. However, pointers let us just
point at a substring or general chunk of memory instead of copying it,
which significantly reduces the overhead since pointers are simple
integer values that contain a memory address. And those values are
small enough to sit in CPU registers, which are blazing fast. CPUs only
have a handful of registers though because each register dramatically
increases the cost of the CPU die. So if we can just point at the
memory we want to "extract" instead of actually copying the data into
its own string object, we can potentially save a ton of CPU cycles,
especially when working with data inside a loop.

Overall, I think substrings offer the most obvious/apparent area for
performance gains and probably have, implementation details aside, the
least amount of friction. But maybe we should consider the larger
ecosystem of string functions as well? Or should this just be a
possible longer term idea that requires more thought and research and
thus the scope should be limited and we put Lydia's idea under Future
Scope in the RFC? Other thoughts/comments?

Added as Open Issue 10 to the RFC. Thank you for your input.

--
Thomas Hruska

2 years ago by Larry Garfield

>> Hi,
>> Very interesting topic! On which I have NO experience 🙈
>>
>> In some other languages every variable IS an object..... by default.
>>
>> As far as I understand, the code above is meant as internal.
>> But what if any variable is a small object.
>> Has this been ever considered? Or would it use too much performance?
>>
>> $oString = 'my text';
>>
>> $oString->toUpper();
>>
>> echo $oString; // 'MY TEXT'
>
> The above represents a significant amount of scope creep but it's
> certainly interesting. So let's explore it a bit and gauge the response.
>
> The above code will currently throw an error. Significant global
> adoption of such a change will take a fairly long time - probably a
> decade, maybe longer.
>
> AFAIK, there is nothing technically preventing the core Zend engine from
> accepting a -> token after a string variable and calling a function that
> performs an inline modification of the string.
>
> As a brief test, I just ran the example code through PHP and got: "PHP
> Fatal error: Uncaught Error: Call to a member function toUpper() on
> string in test.php:4" The error message shows that Zend engine clearly
> already recognizes toUpper() as an attempted function/method call on a
> string...it just doesn't know what to do with it. So the logic for
> supporting -> method calls on strings appears, at least from my very
> brief test, to already be mostly in place. Nice!

snip

What you're describing here is "scalar methods", which has been discussed on and off for many years. The idea has its proponents, but also its detractors. (I'm in the detractor camp, personally, as I think there are better, more flexible options.)

I would strongly recommend not allowing "faster string manipulation" to scope creep into scalar methods, as that will almost guarantee that it never comes to fruition. :-) IF scalar methods were to happen, they should happen on their own.

--Larry Garfield

2 years ago by Derick Rethans

> Hi,
> Very interesting topic! On which I have NO experience 🙈
>
> On Wed, 15 Feb 2023 at 08:02, Rowan Tommins rowan.collins@gmail.com wrote:
>
>> On 15 February 2023 05:18:50 GMT, Rowan Tommins rowan.collins@gmail.com
>> wrote:
>>
>>> My instinct was that it could just be a built-in class, with an internal
>>> pointer to a zend_string that's completely invisible to userland. Something
>>> like how the SimpleXML and DOM objects just point into a libxml parse
>>> result.
>>
>> To make this a bit more concrete, what I was picturing was that instead of
>> this example:
>>
>> str_splice($this->pagemap[$pagepos][0], $x2, $size2, $data, $x, $size);
>>
>> You would have something like this:
>>
>> // Wrap an existing zend_string in an object
>> $destBuffer = Buffer::fromString($this->pagemap[$pagepos][0]);
>> // Similar, but also track start and end offsets
>> $sourceBuffer = Buffer::fromSubString($data, $x, $size);
>> // Now do the actual memory copy
>> $destBuffer->splice($x2, $size2, $sourceBuffer);
>
> In some other languages every variable IS an object..... by default.
>
> As far as I understand, the code above is meant as internal.
> But what if any variable is a small object.
> Has this been ever considered? Or would it use too much performance?
>
> $oString = 'my text';
>
> $oString->toUpper();
>
> echo $oString; // 'MY TEXT'
>
> Greetz, Lydia

https://wiki.php.net/rfc/unicode_text_processing

And yes, that won't be as fast as just calling strtoupper.

cheers
Derick

2 years ago by Lydia de Jongh

Hi Derick, Thomas,

On Thu, 16 Feb 2023 at 08:57, Derick Rethans derick@php.net wrote:

> https://wiki.php.net/rfc/unicode_text_processing
>
> And yes, that won't be as fast as just calling strtoupper.
>
> cheers
> Derick

Looks great!!!

Complex string manipulation inside an object will be faster than all that
copying of variables around in memory,
as Thomas kindly explained in his post. If I understand correctly....

And it would make php even more mature, gaining from more OOP.

On Wed, 15 Feb 2023 at 20:35, Thomas Hruska thruska@cubiclesoft.com wrote:

> <......>
>
> Doing that operation one time is fast enough and not really a problem.
> Doing it 1,000,000 times in a loop is where we end up constantly copying
> memory around when we could potentially work on the same memory buffer
> the entire time. We still might end up using the same memory buffers
> over and over due to recycling them through the PHP memory pool, which
> means the buffers might get to sit in the L1 or L2 cache in the CPU, but
> it does leave some performance on the table because copying a buffer or
> portions of it repeatedly can be an unnecessary operation. Buffers that
> are larger than the CPU's cache line sizes are going to suffer the most
> because there will be constant requests to main memory for the
> information that the CPU needs to modify and will constantly flush the
> cache lines and stall out while waiting for more data to arrive. That's
> not exactly optimal/ideal. Modifying the same buffer inline will be
> more likely stay in the L1 and L2 cache lines and therefore be much
> closer to the CPU core, resulting in notably faster performance.
>
> Pointers in C are much faster than copying memory. The problem is
> exposing pointers to userland, especially in Internet-facing software.
> Pointers are notoriously unsafe - just look at the zillion buffer
> overflow vulnerabilities (CVEs) that are reported annually across all
> software products. Copy-on-write, by comparison, is a much safer
> operation at the cost of performance. However, pointers let us just
> point at a substring or general chunk of memory instead of copying it,
> which significantly reduces the overhead since pointers are simple
> integer values that contain a memory address. And those values are
> small enough to sit in CPU registers, which are blazing fast. CPUs only
> have a handful of registers though because each register dramatically
> increases the cost of the CPU die. So if we can just point at the
> memory we want to "extract" instead of actually copying the data into
> its own string object, we can potentially save a ton of CPU cycles,
> especially when working with data inside a loop.
>
> Overall, I think substrings offer the most obvious/apparent area for
> performance gains and probably have, implementation details aside, the
> least amount of friction. But maybe we should consider the larger
> ecosystem of string functions as well? Or should this just be a
> possible longer term idea that requires more thought and research and
> thus the scope should be limited and we put Lydia's idea under Future
> Scope in the RFC? Other thoughts/comments?
>
> Added as Open Issue 10 to the RFC. Thank you for your input.
>
> Thomas Hruska

Thanks for your kind and extended explanation.
I know a little about the memory allocations.

But I am not sure about what to conclude from your explanation. If an
object would take less copying around or not.

This memory conversation brings up other old memories ☺... peek, poke,
assembly etc 😍

Greetz, flexJoly (aka Lydia)

11 months ago by Christoph M. Becker

>> I thought about that but didn't know how well it would be received nor, perhaps more importantly, the direction it should take (i.e. a formal Zend type in the engine, extending the existing zend_string type, a class, some combination, or something else entirely). All of the more advanced options I came up with would have required some code changes to the PHP source itself with a new data type being the most involved and probably the most controversial.
>
> My instinct was that it could just be a built-in class, with an internal pointer to a zend_string that's completely invisible to userland. Something like how the SimpleXML and DOM objects just point into a libxml parse result.
>
> Then to add to existing functions requires changing an argument type from string to string|Buffer, rather than adding new arguments.
>
> No change to the type system needed, internally or externally, just some code to unwrap the pointer. But perhaps I'm being naive and oversimplifying, as I don't have a deep understanding of the engine.
>
>> I'm not entirely sure what the next step here should be. Should I go research the above, or go back and develop/test and then propose something concrete in an OO direction and gather feedback at that point, or should we hash it out a bit more here on the list to get a more specific direction to go in?
>
> Well, those were just my thoughts; maybe someone else will come along shortly with a very different take.

I'm very late on this discussion, but I think it is an interesting
topic, and maybe https://github.com/cmb69/php-stringbuilder, which I
had written long ago just to check some assumptions, can serve as POC.
It is certainly possible to have such a string buffer class without
having to patch the engine; it could even be made available as PECL
extension (first).

Note that this StringBuilder uses smart_strs[1], which may or may not be
a good idea. But certainly you could use some other internal handling;
interoperability with zend_strings[2] requires copying the char arrays
in most cases anyway, since these have a fixed length, and if these
copies are reduced to a minimum (i.e. the new class has enough
flexibility to work without casting to and from string), that should be
bearable.

Not sure if that would work for the "gd imageexportpixels() and
imageimportpixels()" RFC[3], but it might be worth investigating.

[1] https://www.phpinternalsbook.com/php7/internal_types/strings/smart_str.html
[2] https://www.phpinternalsbook.com/php7/internal_types/strings/zend_strings.html
[3] https://wiki.php.net/rfc/gd_image_export_import_pixels

Cheers,
Christoph

11 months ago by Rob Landers

>>> I thought about that but didn't know how well it would be received nor, perhaps more importantly, the direction it should take (i.e. a formal Zend type in the engine, extending the existing zend_string type, a class, some combination, or something else entirely). All of the more advanced options I came up with would have required some code changes to the PHP source itself with a new data type being the most involved and probably the most controversial.
>>
>> My instinct was that it could just be a built-in class, with an internal pointer to a zend_string that's completely invisible to userland. Something like how the SimpleXML and DOM objects just point into a libxml parse result.
>>
>> Then to add to existing functions requires changing an argument type from string to string|Buffer, rather than adding new arguments.
>>
>> No change to the type system needed, internally or externally, just some code to unwrap the pointer. But perhaps I'm being naive and oversimplifying, as I don't have a deep understanding of the engine.
>>
>>> I'm not entirely sure what the next step here should be. Should I go research the above, or go back and develop/test and then propose something concrete in an OO direction and gather feedback at that point, or should we hash it out a bit more here on the list to get a more specific direction to go in?
>>
>> Well, those were just my thoughts; maybe someone else will come along shortly with a very different take.
>
> I'm very late on this discussion, but I think it is an interesting
> topic, and maybe https://github.com/cmb69/php-stringbuilder, which I
> had written long ago just to check some assumptions, can serve as POC.
> It is certainly possible to have such a string buffer class without
> having to patch the engine; it could even be made available as PECL
> extension (first).
>
> Note that this StringBuilder uses smart_strs[1] what might be a good
> idea or not. But certainly you could use some other internal handling;
> interoperability with zend_strings[2] requires to copy the char arrays
> in most cases anyway, since these have a fixed length, and if these
> copies are reduced to a minimum (i.e. the new class has enough
> flexibility to work without casting to and from string), that should be
> bearable.
>
> Not sure if that would work for the "gd imageexportpixels() and
> imageimportpixels()" RFC[3], but it might be worth investigating.
>
> [1] https://www.phpinternalsbook.com/php7/internal_types/strings/smart_str.html
> [2] https://www.phpinternalsbook.com/php7/internal_types/strings/zend_strings.html
> [3] https://wiki.php.net/rfc/gd_image_export_import_pixels
>
> Cheers,
> Christoph

Huh, I am also very late, and the timing is somewhat poignant: last weekend I managed to refactor all zend_strings to contain a char* instead of char[1], with the char* pointing to the memory just after the pointer. It increased zend_string by a few bytes on a 64-bit machine, but would allow for some nice optimizations, such as zend_strings sharing memory (effectively removing the need for the current interned strings implementation). I ended up ditching it because it would break literally every extension that does its own allocations instead of calling zend_string_alloc|init(), and it was also hard to manage when copying strings, which some core extensions also do themselves instead of calling the core zend_string_* functions. Needless to say, "vanilla php" worked fine and all tests passed.

I did submit a small part of my refactoring here: https://github.com/php/php-src/pull/15054 but even something that simple didn't seem well received. So, I won't continue this approach.

But, fwiw, I wouldn't advise changing zend_strings too much; many extensions appear to do one or more of three things: their own allocations, their own copying, and/or their own freeing.

— Rob