From: ajf@ajf.me (Andrea Faulds)
To: Andrey Hristov
Cc: internals@lists.php.net
Date: Wed, 14 May 2014 18:13:34 +0100
Subject: Re: [PHP-DEV] [VOTE] [RFC] 64 bit platform improvements for string length and integer
Message-ID: <4D169A0A-0A59-402B-8479-C63E5C212A63@ajf.me>
In-Reply-To: <5373A39E.8050302@hristov.com>
References: <53732673.3080106@lsces.co.uk> <5373A39E.8050302@hristov.com>
Newsgroups: php.internals

On 14 May 2014, at 18:10, Andrey Hristov wrote:

> This is purely academical.
> And the standard library has to support everything, it's the standard
> library. PHP is on its own, and if an addition is of little use to most
> of the developers/scripts, why the heck should it be in/the default?
> A good solution is to typedef a php_size_t, leave it as uint32_t, and
> those who need more than 4GB in strings and elements can just build with
> size_t as the definition. Offer the choice, don't force.

It is not just “purely academic”. Here, let me quote Pierre (in ‘Re:
[PHP-DEV] [VOTE] [RFC] 64 bit platform improvements for string length
and integer’, just now) quoting Anthony:

> This thread has been pointed out to me by a few people. As the
> originator of this patch and concept I feel that I should clarify a
> few points.
>
> # Rationale
>
> The reason that I originally started this patch was to clean up and
> standardize the underlying types. This is to introduce predictability,
> portability and type sanity into the engine and entire cphp
> implementation.
>
> ## Rationale for Int64:
>
> Without this patch, the size of integers (longs) varies based on which
> compiler you use. This means that even for identical target
> architectures behavior can change with respect to userland code.
> Refactoring this allows for consistent sizes that can be relied upon
> by the programmer. This is an effort to make it a bit easier to rely
> on integer width as a developer.
>
> And ideally this comes at no cost for most implementations, since ints
> are already 64 bits wide, so there is no memory overhead. And
> performance stays the same as well.
>
> ## Rationale for size_t (string lengths):
>
> This has significant advantages. There are some costs to doing it, but
> they are not as significant as they may appear on the surface.
> Let's dive into it:
>
> ### It's The Correct Data Type
>
> The C89 spec indicates in 3.3.3.4 (
> http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
> the size_t type was created specifically for usage in this context. It
> is always, 100% guaranteed to be able to hold the bounds of every
> possible array element. Strings in C are simply char arrays.
> Therefore, the correct data type to use for string sizes (which really
> are just an offset qualifier) is size_t.
>
> Additionally, calloc, malloc, etc. all expect parameters of type size_t
> for exactly this reason.
>
> Another good reference on it: http://www.viva64.com/en/a/0050/
>
> ### It's The Secure Data Type
>
> size_t (and ptrdiff_t) are the only C89 types that are 100% guaranteed
> to be able to hold the size of any possible object that the compiler
> will support. Other types will vary depending on the data model that
> the compiler supports, as the spec only defines minimum widths.
>
> This is so important that CERT issued a coding standard for it:
> INT01-C (
> https://www.securecoding.cert.org/confluence/display/seccode/INT01-C.+Use+rsize_t+or+size_t+for+all+integer+values+representing+the+size+of+an+object
> ).
>
> One of the reasons is that it's difficult to do overflow checks in a
> portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
> In there, they recommend using the C99 uintptr_t type, but suggest
> using size_t for platforms that don't have uintptr_t support (and
> since we target C89 for the engine, that's out).
>
> Apple's Secure Coding Guide's section on Avoiding Integer Overflows
> and Underflows says the same thing:
> https://developer.apple.com/library/mac/documentation/security/conceptual/securecodingguide/Articles/BufferOverflows.html
>
> ### About Long Strings
>
> The fact that changing to size_t allows strings (and arrays) to be > 4GB
> is a side-effect. A welcome one, but a side effect nonetheless.
> The primary reason to use it is that it's the correct data type, and
> gives you the most safety and security.
>
> # Response To Concerns Mentioned
>
> I'll respond here to some of the concerns mentioned in this thread:
>
> ## size_t uses more memory and will result in more CPU cache misses,
> which will result in worse performance
>
> Well, size_t will use more memory. No doubt about that.
>
> But the performance side is more nuanced. And as several benchmarks in
> this thread indicate, there isn't a practical difference. Heck, the
> benchmarks on Windows show an improvement in some cases.
>
> And there is a reason for that. Since a pointer is a 64 bit data type,
> and an int is a 32 bit data type, any time you add the two, extra CPU
> cycles are needed for the cast. This can be clearly seen by
> analyzing a simple malloc call with an int vs a size_t param. Here's
> the diff:
>
> < movl $5, -12(%rbp)
> < movl -12(%rbp), %eax
> < cltq
> ---
> > movq $5, -16(%rbp)
> > movq -16(%rbp), %rax
>
> Now, a cache miss is much more expensive than a cast, but we don't
> have proof that cache misses will actually occur.
>
> In fact, in the benchmarks, the worst difference is 2%. Which is
> hardly significant (as indicated by several people here). But also
> notice that in both benchmarks (those done by Microsoft, and those
> done by Dmitry), some specific tests actually executed **faster** with
> the size_t transforms (namely Hello World, Wordpress, etc). So to say
> even 2% is not really the full story.
>
> We'll come back to the memory thing in a bit.
>
> ## Macro Renames and ZPP changes
>
> This was my idea, and I don't think it's been properly justified.
>
> ### ZPP Changes
>
> The ZPP changes are critical. The reason is that varargs is casting an
> arbitrary block of memory to a type, and then writing to it. So
> existing code that does zpp("s", str, &int_len) would wind up with a
> buffer overflow.
> Because zpp would be trying to write a 64 bit value to a 32 bit
> container. The other 32 bits would fall off the end, into
> who knows what. At BEST this can result in a segfault. At worst,
> memory corruption and MASSIVE security vulnerabilities.
>
> Also note that the compiler *can't* and actively doesn't catch these
> types of errors. That means that it's largely luck and testing that
> will lead to catching them.
>
> So, I chose to break BC and rename the ZPP symbols, because that WILL
> error, and provide the developer with a meaningful indication that an
> improper data type was provided. I considered a fatal error saying that
> an invalid type was supplied a better way of telling the
> developer "HEY, THIS NEEDS TO BE CHANGED ASAP" than just letting
> them hit random segfaults at runtime.
>
> If there is a way to get around this by giving the compiler more
> information, then do it. But to just leave the types there, and leave
> it to chance if a buffer overflow occurs, is dangerous. Which is why I
> made the call that the ZPP types **needed** to be changed.
>
> ### Macro Renames
>
> The reason for the rename is largely the same as with the ZPP changes.
> The severity of not changing is less (since the compiler will warn and
> do an implicit cast for you). But it's still there. Which is why I
> chose to change it. This is less critical, but was done to better
> indicate to the developer what needs to change to properly support the
> new system.
>
> ## Memory Overhead
>
> This is definitely a concern. There is a potential to double the
> amount of memory that PHP takes. Which on the surface looks enormous.
> And if we stop at the surface, we definitely shouldn't do it!
>
> But as we look deeper, we see that in actuality, the difference is not
> double. In fact, most data structures, as identified by Dmitry
> himself, only increase by between 6% (zend_op_array) and 50%
> (zend_string's size). So that "double" figure quickly drops.
>
> But that's at the structure level. Let's look at what actually happens
> in practice. Dmitry himself also provides these answers. The average
> memory increase is 8% for Wordpress, and 6% for ZF1.
>
> Let's put that 8% in context. Wordpress used 12MB, and now it uses
> 13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it
> uses 31MB. Still not overly significant.
>
> Don't get me wrong, it's still more. And more is bad. But it's not
> nearly as bad as it's being made out to be.
>
> To put this into context, 5.4 saved up to 50% memory from 5.3
> (depending on benchmark). 8 << 50.
>
> Now, I'm not saying that memory should be thrown around willy-nilly.
> But given the rationale that I gave above, I think the benefits of
> sanity, portability and security clearly are significant enough for
> the relatively small cost in memory.

--
Andrea Faulds
http://ajf.me/