Newsgroups: php.internals
Date: Wed, 14 May 2014 18:39:32 +0200
To: PHP internals
Subject: on memory usage with the 64bit patch,
 and interpretation of various numbers
From: pierre.php@gmail.com (Pierre Joye)

hi,

First, we are working on providing updated numbers using 5.6 vs 5.6+patch, to have an actual base reference. Wordpress, Symfony and Drupal will be used.

Also, Anthony posted something on IRC which summarizes very well what has been said in the other threads, whether recently or during the previous vote to merge the 64bit patch in 5.x. I will simply copy it here.

I seriously hope everyone will read it carefully and vote with all the facts in mind, keeping the long-term big picture in view.

Thanks.

+++ Anthony's post +++

This thread has been pointed out to me by a few people. As the originator of this patch and concept I feel that I should clarify a few points.

# Rationale

The reason that I originally started this patch was to clean up and standardize the underlying types. This is to introduce predictability, portability and type sanity into the engine and entire cphp implementation.

## Rationale for Int64:

Without this patch, the size of integers (longs) varies based on which compiler you use. This means that even for identical target architectures, behavior can change with respect to userland code. Refactoring this allows for consistent sizes that can be relied upon by the programmer. This is an effort to make it a bit easier to rely on integer width as a developer. And ideally this comes at no cost for most implementations, since ints are already 64 bits wide there, so there is no memory overhead. And performance stays the same as well.

## Rationale for size_t (string lengths):

This has significant advantages. There are some costs to doing it, but they are not as significant as they may appear on the surface. Let's dive into it:

### It's The Correct Data Type

The C89 Rationale indicates in 3.3.3.4 ( http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that the size_t type was created specifically for usage in this context.
It is always, 100% guaranteed to be able to hold the bounds of every possible array element. Strings in C are simply char arrays. Therefore, the correct data type to use for string sizes (which really are just an offset qualifier) is size_t.

Additionally, calloc, malloc, etc. all expect parameters of type size_t for exactly this reason.

Another good reference on it: http://www.viva64.com/en/a/0050/

### It's The Secure Data Type

size_t (and ptrdiff_t) are the only C89 types that are 100% guaranteed to be able to hold the size of any possible object that the compiler will support. Other types will vary depending on the data model that the compiler supports, as the spec only defines minimum widths.

This is so important that CERT issued a coding standard for it: INT01-C ( https://www.securecoding.cert.org/confluence/display/seccode/INT01-C.+Use+rsize_t+or+size_t+for+all+integer+values+representing+the+size+of+an+object ).

One of the reasons is that it's difficult to do overflow checks in a portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 . In there, they recommend using the C99 uintptr_t type, but suggest using size_t for platforms that don't have uintptr_t support (and since we target C89 for the engine, that's out).

Apple's Secure Coding Guide's section on Avoiding Integer Overflows and Underflows says the same thing: https://developer.apple.com/library/mac/documentation/security/conceptual/securecodingguide/Articles/BufferOverflows.html

### About Long Strings

The fact that changing to size_t allows strings (and arrays) to be > 4 GB is a side effect. A welcome one, but a side effect nonetheless. The primary reason to use it is that it's the correct data type, and gives you the most safety and security.

# Response To Concerns Mentioned

I'll respond here to some of the concerns mentioned in this thread:

## size_t uses more memory and will result in more CPU cache misses, which will result in worse performance

Well, size_t will use more memory.
No doubt about that. But the performance side is more nuanced. And as several benchmarks in this thread indicate, there isn't a practical difference. Heck, the benchmarks on Windows show an improvement in some cases. And there is a reason for that.

Since a pointer is a 64 bit data type, and an int is a 32 bit data type, any time you add the two, extra CPU cycles are needed for the cast. This can be clearly seen by analyzing a simple malloc call with an int vs a size_t param. Here's the diff:

    < movl $5, -12(%rbp)
    < movl -12(%rbp), %eax
    < cltq
    ---
    > movq $5, -16(%rbp)
    > movq -16(%rbp), %rax

Now, a cache miss is much more expensive than a cast, but we don't have proof that cache misses will actually occur.

In fact, in the benchmarks, the worst difference is 2%. Which is hardly significant (as indicated by several people here). But also notice that in both benchmarks (those done by Microsoft, and those done by Dmitry), some specific tests actually executed **faster** with the size_t transforms (namely Hello World, Wordpress, etc). So to say even 2% is not really the full story.

We'll come back to the memory thing in a bit.

## Macro Renames and ZPP changes

This was my idea, and I don't think it's been properly justified.

### ZPP Changes

The ZPP changes are critical. The reason is that varargs is casting an arbitrary block of memory to a type, and then writing to it. So existing code that does zpp("s", str, &int_len) would wind up with a buffer overflow, because zpp would be trying to write a 64 bit value to a 32 bit container. The other 32 bits would fall off the end, into who knows what. At BEST this can result in a segfault. At worst, memory corruption and MASSIVE security vulnerabilities.

Also note that the compiler *can't* and actively doesn't catch these types of errors. That means that it's largely luck and testing that will uncover them.

So, I chose to break BC and rename the ZPP symbols.
Because that WILL error, and provide the developer with a meaningful indication that an improper data type was provided. I considered a fatal error saying that an invalid type was supplied to be a better way of telling the developer "HEY, THIS NEEDS TO BE CHANGED ASAP" than just letting them hit random segfaults at runtime.

If there is a way to get around this by giving the compiler more information, then do it. But to just leave the types there, and leave it to chance whether a buffer overflow occurs, is dangerous. Which is why I made the call that the ZPP types **needed** to be changed.

### Macro Renames

The reason for the rename is largely the same as with the ZPP changes. The severity of not changing is less (since the compiler will warn and do an implicit cast for you). But it's still there. Which is why I chose to change it.

This is less critical, but was done to better indicate to the developer what needs to change to properly support the new system.

## Memory Overhead

This is definitely a concern. There is a potential to double the amount of memory that PHP takes. Which on the surface looks enormous. And if we stop at the surface, we definitely shouldn't do it!

But as we look deeper, we see that in actuality, the difference is not double. In fact, most data structures, as identified by Dmitry himself, only increase by between 6% (zend_op_array) and 50% (zend_string's size). So that "double" figure quickly drops.

But that's at the structure level. Let's look at what actually happens in practice. Dmitry himself also provides these answers. The average memory increase is 8% for Wordpress, and 6% for ZF1.

Let's put that 8% in context. Wordpress used 12MB, and now it uses 13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it uses 31MB. Still not overly significant. Don't get me wrong, it's still more. And more is bad. But it's not nearly as bad as it's being played out to be.
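
To make the difference between structure-level growth and whole-process growth concrete, here is a minimal sketch in C. The two structs are hypothetical string headers invented for illustration; they are NOT the real zend_string layout, and the names `str_int_len`, `str_size_len`, `growth_percent` and `report_growth` are assumptions of this sketch. The MB figures are the rounded numbers quoted above, so the percentages are only approximate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical string headers for illustration only;
 * these are NOT the real zend_string layout. */
struct str_int_len {
    unsigned int refcount;
    int          len;   /* length stored as int */
    char        *val;
};

struct str_size_len {
    unsigned int refcount;
    size_t       len;   /* length stored as size_t */
    char        *val;
};

/* Relative growth from old_size to new_size, in whole percent (rounded down). */
unsigned long growth_percent(size_t old_size, size_t new_size)
{
    assert(old_size > 0 && new_size >= old_size);
    return (unsigned long)((new_size - old_size) * 100 / old_size);
}

/* Print the per-structure growth on the current platform next to the
 * whole-process growth computed from the rounded MB figures in the post. */
void report_growth(void)
{
    size_t a = sizeof(struct str_int_len);
    size_t b = sizeof(struct str_size_len);

    printf("header:    %lu -> %lu bytes (+%lu%%)\n",
           (unsigned long)a, (unsigned long)b, growth_percent(a, b));
    printf("Wordpress: 12MB -> 13MB (+%lu%%)\n", growth_percent(12, 13));
    printf("ZF1:       29MB -> 31MB (+%lu%%)\n", growth_percent(29, 31));
}
```

Calling `report_growth()` from a `main` function prints the sizes for whatever platform it is compiled on. On a typical LP64 platform with a 4-byte int, the hypothetical header grows from 16 to 24 bytes because of alignment padding, while other ABIs may show no growth at all. The point of the sketch is only that a large per-structure percentage does not translate into the same percentage for the whole process.
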
To put this into context, 5.4 saved up to 50% memory from 5.3 (depending on benchmark). 8 << 50.

Now, I'm not saying that memory should be thrown around willy-nilly. But given the rationale that I gave above, I think the benefits of sanity, portability and security clearly are significant enough for the relatively small cost in memory.

Cheers,
--
Pierre

@pierrejoye | http://www.libgd.org