Hello everyone,
I was wondering if it would make sense to store small strings (length <= 7) directly inside the zval struct, thereby avoiding the need to extra allocate a zend_string, which would also not entail any costly indirection and refcounting for such strings.
The idea would be to add a new sruct struct { uint8_t len; char val[7]; } sval
to the _zend_value union type in order to embed it directly into the zval struct and use a type flag (zval.u1.v.type_flags) such as IS_SMALL_STRING to destinguish between a regular heap allocated zend_string and the directly embedded compact representation.
Small strings are quite common IMHO. In fact quickly sampling my company's PHP code base I found well over 50% of the strings to be of length <= 7. It would save a lot of memory allocations as well as pointer indirection, and could also bypass refcounting logic. Also, comparing small strings for equality would become a trivial operation (just comparing two pre-aligned 64bit integers) - no more need to keep small strings interned.
Of course it wouldn't longer be possible to also persistently store the hash value of a small string, though calculating the hash value for small strings is less costly anyways because less characters equals less iterations, so that might not be an issue in practice.
I don't see such an idea in https://wiki.php.net/php-7.1-ideas and I was wondering: Has anybody experimented with that approach yet? Is it worth discussing?
Please let me know your thoughts,
Ben
--
Bejamin Coutu
ben.coutu@zeyos.com
ZeyOS, Inc.
http://www.zeyos.com
Hi Ben!
Benjamin Coutu wrote:
I was wondering if it would make sense to store small strings (length <= 7) directly inside the zval struct, thereby avoiding the need to extra allocate a zend_string, which would also not entail any costly indirection and refcounting for such strings.
The idea would be to add a new sruct
struct { uint8_t len; char val[7]; } sval
to the _zend_value union type in order to embed it directly into the zval struct and use a type flag (zval.u1.v.type_flags) such as IS_SMALL_STRING to destinguish between a regular heap allocated zend_string and the directly embedded compact representation.Small strings are quite common IMHO. In fact quickly sampling my company's PHP code base I found well over 50% of the strings to be of length <= 7. It would save a lot of memory allocations as well as pointer indirection, and could also bypass refcounting logic. Also, comparing small strings for equality would become a trivial operation (just comparing two pre-aligned 64bit integers) - no more need to keep small strings interned.
Of course it wouldn't longer be possible to also persistently store the hash value of a small string, though calculating the hash value for small strings is less costly anyways because less characters equals less iterations, so that might not be an issue in practice.
I don't see such an idea in https://wiki.php.net/php-7.1-ideas and I was wondering: Has anybody experimented with that approach yet? Is it worth discussing?
Funnily enough, I was thinking of trying to implement this recently.
It's an interesting idea.
I've previously tried implementing this with a slightly different
approach, whereby the string is embedded within a tagged zend_string
pointer, rather than as a separate zval type. This would mean you only
need to change the zend_string functions and the string macros, and you
can use such strings anywhere a zend_string pointer is used, not just in
zvals.
My quick-and-dirty implementation can be found here:
https://github.com/php/php-src/compare/master...TazeTSchnitzel:packedStrings
However, I don't think it benchmarked that well, and it broke some
things (when you contain the string data inside the pointer itself, it
breaks certain assumptions existing code makes). I don't know how
performant such an approach could be, considering it adds an extra
branch at the site of every index into a string (though some sort of
caching, or compiler optimisations, might improve this). And I don't
know quite how much stuff it breaks.
My inspiration was a blog post about how Apple's Objective-C packs
strings inside pointers, where possible. I think it was this one:
https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html
My approach was less sophisticated than Apple's. Whereas Apple can store
strings as 8-bit, 6-bit or 5-bit, I took the easier approach of just
having 8-bit strings, for direct indexing. For a 64-bit pointer, my
implemention let you have up to 6 bytes for character data, 1 byte for a
null terminator, and 1 byte for containing the pointer tag bit.
Your suggested approach (packing the string inside the zval as another
kind of zval) is also possible, but I haven't tried it. It would have
the advantage of letting you use strings that are a byte longer, but you
could only use it for strings in zvals, and it would possibly require
more code changes overall (adding support for a new zval type
everywhere, vs. only updating code making the wrong assumptions about
zend_string pointers). Also, a potential problem is that possibly a lot
of PHP functions deal with strings only as zend_strings, so PHP might
end up converting back and forth between packed/small string zvals and
zend_strings just to satisfy such functions. In particular,
zend_parse_parameters might have to do that. Your approach is less hacky
and more portable, though. Notably, it works on 32-bit systems (whose
pointers are too small for useful string packing, but which still give
you 8 bytes to play with in the zval).
Regarding hash values, there's actually a simple solution to that, at
least on 64-bit platforms. I changed the hash function to check the
string length, and if it's small enough to be a packed string, use the
packed string as the hash itself. (After all, they're both 64-bit
numbers!) This costs nothing if the string is already a packed string,
and if it isn't, the conversion is extremely simple. Of course, this
comes at the cost of adding an extra branch to the hash function for
non-packed strings.
Anyway, generally, I have no idea if either approach is all that good an
idea or not. More research might be needed. It's certainly fun to
discuss! :)
--
Andrea Faulds
https://ajf.me/