Hi Andrea,
I have been thinking about this a bit more. Here are a few thoughts.
Considering the added complexity, the effort would only be worth it if we could come up with a solution that covers more cases. Strings of at most 6-7 characters don't really justify the implied overhead, because they are not ubiquitous enough.
One idea that springs to mind is a form of compression by limiting it to a subset of ASCII characters, perhaps those generally used by identifiers and number-like strings ([a-zA-Z0-9_.]). Those strings are most common (array-keys, JSON-keys, file names, function names, numbers out of DB, etc.). The more characters we can pack into a single datum the less visible the overhead will be performance-wise.
Limiting ourselves to these common 64 characters ([a-zA-Z0-9_.]) would allow us to effectively store (256 / 64) * 7 = 28 characters in those available 7-bytes plus 1 byte (minus pointer tag bit) for the length. Of course, unpacking these kinds of strings entails CPU and memory reallocation overhead. We can avoid allocating and deallocating memory over and over again by using a stack-like buffer pool for unpacked small strings, with fixed bucket sizes of <length> + <28 bytes> + \x00 to unpack into.
I actually like your idea of using pointer tagging to distinguish between packed and regular strings, so that we can apply this to all zend_strings, not just ZVAL strings. I don't think it's a real issue that we'd be practically limiting this optimization to 64-bit systems (most systems that run PHP are 64-bit nowadays anyway). We can either simply deactivate it for 32-bit or use it with the available (256 / 64 * 3) = 12 characters for 32-bit (we'd still have room for the pointer tag bit on 32-bit machines).
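To make the pointer tagging idea concrete, here is a minimal sketch in C. It assumes that heap strings are at least 2-byte aligned, so the lowest bit of a pointer is always zero and can serve as the tag. All names (sketch_str, sketch_heap_string, and so on) are hypothetical and are not existing Zend APIs; the heap struct is only loosely modeled on zend_string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical heap string, loosely modeled on zend_string (not the real layout). */
typedef struct {
    size_t len;
    char   val[1];
} sketch_heap_string;

/* A string handle is one machine word: either an aligned pointer (low bit 0)
 * or a packed payload with the low bit set as the tag. */
typedef uintptr_t sketch_str;

#define SKETCH_TAG_PACKED ((uintptr_t)1)

static int sketch_is_packed(sketch_str s) {
    return (s & SKETCH_TAG_PACKED) != 0;
}

/* Allocate a regular heap string; returns NULL on allocation failure. */
static sketch_heap_string *sketch_heap_new(const char *s, size_t len) {
    sketch_heap_string *p = malloc(sizeof(*p) + len);
    if (p == NULL) return NULL;
    p->len = len;
    memcpy(p->val, s, len);
    p->val[len] = '\0';
    return p;
}

/* Wrap a heap pointer; the assumption is that bit 0 of the address is free. */
static sketch_str sketch_from_heap(sketch_heap_string *p) {
    assert(((uintptr_t)p & SKETCH_TAG_PACKED) == 0);
    return (sketch_str)p;
}

/* Wrap a payload, shifted left by one to make room for the tag bit. */
static sketch_str sketch_from_payload(uintptr_t payload) {
    return (payload << 1) | SKETCH_TAG_PACKED;
}

static uintptr_t sketch_payload(sketch_str s) {
    assert(sketch_is_packed(s));
    return s >> 1;
}

static sketch_heap_string *sketch_heap(sketch_str s) {
    assert(!sketch_is_packed(s));
    return (sketch_heap_string *)s;
}
```

Every consumer has to call something like sketch_is_packed() before touching the value, which is exactly the additional branch discussed in the trade-off.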
Now, I realize that this would be a massive undertaking (PHP8!). Without proper abstraction we'd be converting strings all over the place. Therefore, one would have to build a separate zend_string abstraction layer for all common string functions (zend_strcat(), zend_strcmp(), zend_strlen(), zend_strcpy(), zend_strncat(), etc.) that expect zend_string parameters and can operate on both internal types of strings. And one would actually have to use them throughout the code base. Such an abstraction might be interesting and useful in and of itself, though.
Let's summarize the trade-off.
The negatives of packed strings are:
- Additional branching (hence occasional branch mispredictions) for distinction of two types of strings
- Memory allocation for unpacking (can be mitigated by using global pre-allocated stack-like buffer pool)
- Extra CPU cycles for decompression/unpacking (can be minimized with proper abstraction of zend_string functions)
The positives of packed strings are:
- No initial separate heap allocation
- One less indirection because no pointer has to be chased
- Implicitly interned, no need for extra interning (if we can guarantee that all eligible strings are converted to packed format before usage)
- Value equals hash key, no need to generate extra hash key (that's a huge plus considering that array-keys will be very likely eligible for packed strings)
- Smaller memory footprint because of compression
- Less CPU usage for comparison (also, no need to ever unpack if we can guarantee that all eligible strings are converted to packed format before usage)
Let me know if you (or anyone else) are interested in discussing this approach further.
Cheers,
Benjamin Coutu
========== Original ==========
From: Benjamin Coutu ben.coutu@zeyos.com
To: Nikita Popov nikita.ppv@gmail.com, Dmitry Stogov dmitry@zend.com, Xinchen Hui xinchen.h@zend.com
Date: Tue, 13 Sep 2016 18:29:10 +0200
Subject: [PHP-DEV] Directly embed small strings in zvals
Hello everyone,
I was wondering if it would make sense to store small strings (length <= 7) directly inside the zval struct, thereby avoiding the need to allocate a separate zend_string, and with it the costly indirection and refcounting for such strings.
The idea would be to add a new struct
struct { uint8_t len; char val[7]; } sval;
to the _zend_value union type in order to embed it directly into the zval struct and use a type flag (zval.u1.v.type_flags) such as IS_SMALL_STRING to distinguish between a regular heap allocated zend_string and the directly embedded compact representation.
Small strings are quite common IMHO. In fact, quickly sampling my company's PHP code base I found well over 50% of the strings to be of length <= 7. It would save a lot of memory allocations as well as pointer indirection, and could also bypass refcounting logic. Also, comparing small strings for equality would become a trivial operation (just comparing two pre-aligned 64-bit integers) - no more need to keep small strings interned.
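A rough sketch of what the embedded representation and the integer comparison could look like in C. The union small_value is a hypothetical stand-in for the relevant part of _zend_value, and the helper names are made up for illustration. The sketch assumes the unused bytes are always zeroed, otherwise two equal strings could differ in their padding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the proposed member of the _zend_value union. */
typedef union {
    struct { uint8_t len; char val[7]; } sval;
    uint64_t raw;   /* the same 8 bytes viewed as one integer */
} small_value;

static_assert(sizeof(small_value) == 8, "small string must fit in 8 bytes");

/* Returns 1 and fills *out if the string fits into 7 bytes, 0 otherwise. */
static int small_value_init(small_value *out, const char *s, size_t len) {
    if (len > 7) return 0;
    out->raw = 0;                   /* zero the padding so equal strings have equal bits */
    out->sval.len = (uint8_t)len;
    memcpy(out->sval.val, s, len);
    return 1;
}

/* Equality is a single 64-bit integer comparison, no loop over characters. */
static int small_value_equals(const small_value *a, const small_value *b) {
    return a->raw == b->raw;
}
```

Because the length byte is part of the compared word, strings of different length never compare equal, even if one contains trailing NUL bytes.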
Of course, it would no longer be possible to also persistently store the hash value of a small string, though calculating the hash value for small strings is less costly anyway because fewer characters mean fewer iterations, so that might not be an issue in practice.
I don't see such an idea in https://wiki.php.net/php-7.1-ideas and I was wondering: Has anybody experimented with that approach yet? Is it worth discussing?
Please let me know your thoughts,
Ben
> Limiting ourselves to these common 64 characters ([a-zA-Z0-9_.]) would
> allow us to effectively store (256 / 64) * 7 = 28 characters in those
> available 7-bytes plus 1 byte (minus pointer tag bit) for the length.
That's wrong. 256 = 2^8, and 64 = 2^6, so you get 8/6*7 = 9 chars.
Not really better than 7 chars, especially considering that all
operations on single characters would be slower than usual.
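The 9-character limit can be checked with a small C sketch that packs 6-bit codes into the 56 bits of the 7 payload bytes and keeps the length in the remaining byte. The alphabet order, the bit layout and all function names are assumptions made for illustration, and the sketch ignores the pointer tag bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PACK_MAX_CHARS 9   /* floor(56 bits / 6 bits per character) */

/* The 64-character alphabet [a-zA-Z0-9_.]; the order is an arbitrary choice. */
static const char PACK_ALPHABET[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789_.";

/* Map a character to its 6-bit code, or -1 if it is not in the alphabet. */
static int pack_code(char c) {
    const char *p = (c != '\0') ? strchr(PACK_ALPHABET, c) : NULL;
    return p ? (int)(p - PACK_ALPHABET) : -1;
}

/* Pack up to 9 characters into the low 54 bits and the length into the
 * top byte. Returns 1 on success, 0 if the string is not eligible. */
static int pack_string(const char *s, size_t len, uint64_t *out) {
    uint64_t bits = 0;
    if (len > PACK_MAX_CHARS) return 0;
    for (size_t i = 0; i < len; i++) {
        int code = pack_code(s[i]);
        if (code < 0) return 0;
        bits |= (uint64_t)code << (6 * i);
    }
    *out = bits | ((uint64_t)len << 56);
    return 1;
}

/* Unpack into buf (at least PACK_MAX_CHARS + 1 bytes); returns the length. */
static size_t unpack_string(uint64_t packed, char *buf) {
    size_t len = (size_t)(packed >> 56);
    assert(len <= PACK_MAX_CHARS);
    for (size_t i = 0; i < len; i++) {
        buf[i] = PACK_ALPHABET[(packed >> (6 * i)) & 0x3F];
    }
    buf[len] = '\0';
    return len;
}
```

A 9-character name such as "file_name" round-trips, while a 10-character string or one containing a character outside the alphabet (for example '-') is rejected. The per-character shift and table lookup in both directions is the extra work referred to above.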
--
Lauri Kenttä