Newsgroups: php.internals Path: news.php.net Xref: news.php.net php.internals:95990 Return-Path: Mailing-List: contact internals-help@lists.php.net; run by ezmlm Delivered-To: mailing list internals@lists.php.net Received: (qmail 71488 invoked from network); 13 Sep 2016 18:26:12 -0000 Received: from unknown (HELO lists.php.net) (127.0.0.1) by localhost with SMTP; 13 Sep 2016 18:26:12 -0000 X-Host-Fingerprint: 137.50.185.177 oa-edu-185-177.wireless.abdn.ac.uk Received: from [137.50.185.177] ([137.50.185.177:12249] helo=localhost.localdomain) by pb1.pair.com (ecelerity 2.1.1.9-wez r(12769M)) with ESMTP id A1/A2-60695-EB448D75 for ; Tue, 13 Sep 2016 14:26:06 -0400 Message-ID: To: internals@lists.php.net References: <20160913163000.D68A15FA84@mx.zeyos.com> X-Mozilla-News-Host: news://news.php.net Date: Tue, 13 Sep 2016 19:26:03 +0100 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0 SeaMonkey/2.40 MIME-Version: 1.0 In-Reply-To: <20160913163000.D68A15FA84@mx.zeyos.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Posted-By: 137.50.185.177 Subject: Re: [PHP-DEV] Directly embed small strings in zvals From: ajf@ajf.me (Andrea Faulds) Hi Ben! Benjamin Coutu wrote: > I was wondering if it would make sense to store small strings (length <= 7) directly inside the zval struct, thereby avoiding the need to extra allocate a zend_string, which would also not entail any costly indirection and refcounting for such strings. > > The idea would be to add a new sruct ``struct { uint8_t len; char val[7]; } sval`` to the _zend_value union type in order to embed it directly into the zval struct and use a type flag (zval.u1.v.type_flags) such as IS_SMALL_STRING to destinguish between a regular heap allocated zend_string and the directly embedded compact representation. > > Small strings are quite common IMHO. In fact quickly sampling my company's PHP code base I found well over 50% of the strings to be of length <= 7. It would save a lot of memory allocations as well as pointer indirection, and could also bypass refcounting logic. Also, comparing small strings for equality would become a trivial operation (just comparing two pre-aligned 64bit integers) - no more need to keep small strings interned. > > Of course it wouldn't longer be possible to also persistently store the hash value of a small string, though calculating the hash value for small strings is less costly anyways because less characters equals less iterations, so that might not be an issue in practice. > > I don't see such an idea in https://wiki.php.net/php-7.1-ideas and I was wondering: Has anybody experimented with that approach yet? Is it worth discussing? Funnily enough, I was thinking of trying to implement this recently. It's an interesting idea. I've previously tried implementing this with a slightly different approach, whereby the string is embedded within a tagged zend_string pointer, rather than as a separate zval type. This would mean you only need to change the zend_string functions and the string macros, and you can use such strings anywhere a zend_string pointer is used, not just in zvals. My quick-and-dirty implementation can be found here: https://github.com/php/php-src/compare/master...TazeTSchnitzel:packedStrings However, I don't think it benchmarked that well, and it broke some things (when you contain the string data inside the pointer itself, it breaks certain assumptions existing code makes). I don't know how performant such an approach could be, considering it adds an extra branch at the site of every index into a string (though some sort of caching, or compiler optimisations, might improve this). And I don't know quite how much stuff it breaks. My inspiration was a blog post about how Apple's Objective-C packs strings inside pointers, where possible. I think it was this one: https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html My approach was less sophisticated than Apple's. Whereas Apple can store strings as 8-bit, 6-bit or 5-bit, I took the easier approach of just having 8-bit strings, for direct indexing. For a 64-bit pointer, my implemention let you have up to 6 bytes for character data, 1 byte for a null terminator, and 1 byte for containing the pointer tag bit. Your suggested approach (packing the string inside the zval as another kind of zval) is also possible, but I haven't tried it. It would have the advantage of letting you use strings that are a byte longer, but you could only use it for strings in zvals, and it would possibly require more code changes overall (adding support for a new zval type everywhere, vs. only updating code making the wrong assumptions about zend_string pointers). Also, a potential problem is that possibly a lot of PHP functions deal with strings only as zend_strings, so PHP might end up converting back and forth between packed/small string zvals and zend_strings just to satisfy such functions. In particular, zend_parse_parameters might have to do that. Your approach is less hacky and more portable, though. Notably, it works on 32-bit systems (whose pointers are too small for useful string packing, but which still give you 8 bytes to play with in the zval). Regarding hash values, there's actually a simple solution to that, at least on 64-bit platforms. I changed the hash function to check the string length, and if it's small enough to be a packed string, use the packed string as the hash itself. (After all, they're both 64-bit numbers!) This costs nothing if the string is already a packed string, and if it isn't, the conversion is extremely simple. Of course, this comes at the cost of adding an extra branch to the hash function for non-packed strings. Anyway, generally, I have no idea if either approach is all that good an idea or not. More research might be needed. It's certainly fun to discuss! :) -- Andrea Faulds https://ajf.me/