Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:74196
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Received-SPF: pass (pb1.pair.com: domain ajf.me designates 198.187.29.240 as permitted sender)
Content-Type: multipart/alternative; boundary="Apple-Mail=_17EDBFEB-5701-45BC-83B9-38AF9637F4EC"
Mime-Version: 1.0 (Mac OS X Mail 7.2 \(1874\))
In-Reply-To: <5373A39E.8050302@hristov.com>
Date: Wed, 14 May 2014 18:13:34 +0100
Cc: internals@lists.php.net
Message-ID: <4D169A0A-0A59-402B-8479-C63E5C212A63@ajf.me>
References: <e5f19ac0089d24adc695fce93ad9bd9c.squirrel@webmail.klapt.com> <CA+9eiLuoJM5bwA46M_2YddTXYOhAwhii=9Ref7-NNeuRuBfW5A@mail.gmail.com> <CAEZPtU7zz=npjUhacwSBA6cz+kb9iBmn3fjkvYuAjqtiPQgJhw@mail.gmail.com> <CAF+90c9J5mtTzqFxvQHka5w5zVoGOzXNaUeYec62X_LYLOPcxA@mail.gmail.com> <CAEZPtU6L0TCyR+U-eEJnwfYbxfMRibpKB70_Ue5fTOHD4y_VAQ@mail.gmail.com> <53732673.3080106@lsces.co.uk> <CA+9eiLuYCNcXB4o_GZd48v+5X4hfswOuzD7DZCokU96rvZ_iWA@mail.gmail.com> <A730B294-18C0-4886-9955-28C1CA59AF00@ajf.me> <5373A39E.8050302@hristov.com>
To: Andrey Hristov <php@hristov.com>
Subject: Re: [PHP-DEV] [VOTE] [RFC] 64 bit platform improvements for string length and integer
From: ajf@ajf.me (Andrea Faulds)

--Apple-Mail=_17EDBFEB-5701-45BC-83B9-38AF9637F4EC
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252


On 14 May 2014, at 18:10, Andrey Hristov <php@hristov.com> wrote:

> This is purely academical. And the standard library has to support
> everything, it's the standard library. PHP is on its own, and if an
> addition is of little use to the most of the developers/scripts, why the
> heck it should be in/the default.
> A good solution is to typedef a php_size_t, leave it to uint32_t and
> for those, who need more than 4GB in strings and elements they can just
> build with size_t as definition. Offer the choice, don't force.

It is not just "purely academic". Here, let me quote Pierre (in 'Re:
[PHP-DEV] [VOTE] [RFC] 64 bit platform improvements for string length
and integer', just now) quoting Anthony:

> This thread has been pointed out to me by a few people. As the
> originator of this patch and concept I feel that I should clarify a
> few points.
>
> # Rationale
>
> The reason that I originally started this patch was to clean up and
> standardize the underlying types. This is to introduce predictability,
> portability and type sanity into the engine and entire cphp
> implementation.
>
> ## Rationale for Int64:
>
> Without this patch, the size of integers (longs) varies based on which
> compiler you use. This means that even for identical target
> architectures behavior can change with respect to userland code.
> Refactoring this allows for consistent sizes that can be relied upon
> by the programmer. This is an effort to make it a bit easier to rely
> on integer width as a developer.
>
> And ideally this costs most implementations nothing: ints are already
> 64 bits wide there, so there is no memory overhead, and performance
> stays the same as well.
>
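To make the width point concrete, here is a minimal sketch (not code from the patch). It assumes C99's <stdint.h> for brevity even though the engine targets C89, and sketch_int_t is a made-up name, not one the RFC defines. On LP64 compilers (typical Linux and OS X toolchains) long is 64 bits wide, on LLP64 compilers (64-bit Windows) it is 32 bits wide, while a fixed-width typedef is the same on both.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width integer type; the name is an assumption
 * made for this sketch only. */
typedef int64_t sketch_int_t;

/* Width of the platform's long in bits: depends on the data model. */
size_t long_width_bits(void)
{
    return sizeof(long) * CHAR_BIT;
}

/* Width of the fixed-width typedef in bits: the same everywhere. */
size_t sketch_int_width_bits(void)
{
    return sizeof(sketch_int_t) * CHAR_BIT;
}
```

Comparing the two return values on different 64-bit toolchains shows which of them userland code could actually rely on.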
> ## Rationale for size_t (string lengths):
>
> This has significant advantages. There are some costs to doing it, but
> they are not as significant as they may appear on the surface. Let's
> dive into it:
>
> ### It's The Correct Data Type
>
> The C89 Rationale indicates in 3.3.3.4 (
> http://port70.net/~nsz/c/c89/rationale/c3.html#size-95t-3-3-3-4 ) that
> the size_t type was created specifically for usage in this context. It
> is always, 100% guaranteed to be able to hold the size of any object,
> and therefore any array index. Strings in C are simply char arrays.
> Therefore, the correct data type to use for string sizes (which really
> are just an offset qualifier) is size_t.
>
> Additionally, calloc, malloc, etc all expect parameters of type size_t
> for exactly this reason.
>
> Another good reference on it: http://www.viva64.com/en/a/0050/
>
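A small sketch of why size_t is the natural fit here: strlen already returns size_t and malloc already takes it, so a length kept as size_t flows through without any conversion. struct sketch_str and sketch_str_init are hypothetical names for illustration, not PHP's string type.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying string, for illustration only. */
struct sketch_str {
    size_t len;   /* same type strlen returns and malloc expects */
    char  *val;
};

/* Copies src into a freshly allocated buffer.
 * Returns 0 on success, -1 if the allocation fails. */
int sketch_str_init(struct sketch_str *s, const char *src)
{
    size_t len = strlen(src);      /* no narrowing on the way in */
    char *buf = malloc(len + 1);   /* no widening on the way out */

    if (buf == NULL) {
        return -1;
    }
    memcpy(buf, src, len + 1);     /* includes the terminating NUL */
    s->len = len;
    s->val = buf;
    return 0;
}
```

The caller owns s->val afterwards and releases it with free().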
> ### It's The Secure Data Type
>
> size_t is the only C89 type that is 100% guaranteed to be able to hold
> the size of any possible object that the compiler will support
> (ptrdiff_t plays the matching role for pointer differences). Other
> types will vary depending on the data model that the compiler
> supports, as the spec only defines minimum widths.
>
> This is so important that CERT issued a coding standard for it:
> INT01-C (
> https://www.securecoding.cert.org/confluence/display/seccode/INT01-C.+Use+rsize_t+or+size_t+for+all+integer+values+representing+the+size+of+an+object
> ).
>
> One of the reasons is that it's difficult to do overflow checks in a
> portable way. See VU#162289: https://www.kb.cert.org/vuls/id/162289 .
> In there, they recommend using the C99 uintptr_t type, but suggest
> using size_t for platforms that don't have uintptr_t support (and
> since we target C89 for the engine, uintptr_t is out).
>
> Apple's Secure Coding Guide's section on Avoiding Integer Overflows
> and Underflows says the same thing:
> https://developer.apple.com/library/mac/documentation/security/conceptual/securecodingguide/Articles/BufferOverflows.html
>
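For what a portable overflow check on sizes can look like, here is a hedged sketch. It is not taken from the CERT or Apple documents; checked_mul_size is a hypothetical helper. It relies only on the fact that size_t is unsigned, so (size_t)-1 is its maximum value and the test is done by division before anything is multiplied.

```c
#include <stddef.h>

/* Stores count * elem_size in *out and returns 1, or returns 0
 * (leaving *out untouched) if the product would not fit in size_t. */
int checked_mul_size(size_t count, size_t elem_size, size_t *out)
{
    /* (size_t)-1 is the largest size_t value; this is C89-safe. */
    if (elem_size != 0 && count > (size_t)-1 / elem_size) {
        return 0;
    }
    *out = count * elem_size;
    return 1;
}
```

A caller would run this before handing the product to malloc, and treat a 0 return as an allocation failure.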
> ### About Long Strings
>
> The fact that changing to size_t allows strings (and arrays) to be
> larger than 4 GB is a side effect. A welcome one, but a side effect
> nonetheless. The primary reason to use it is that it's the correct
> data type, and gives you the most safety and security.
>
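The 4 GB ceiling comes straight from the width of the length field. A minimal sketch, assuming unsigned int is 32 bits as on the usual 64-bit targets; length_survives_uint is a hypothetical helper, not anything in PHP.

```c
#include <stddef.h>

/* Returns 1 if len is unchanged after being stored in an unsigned
 * int, 0 if the narrowing lost the high bits. */
int length_survives_uint(size_t len)
{
    /* Unsigned narrowing is well defined: the value is reduced
     * modulo UINT_MAX + 1, so high bits are silently dropped. */
    unsigned int narrow = (unsigned int)len;

    return (size_t)narrow == len;
}
```

With a 64-bit size_t, any length above UINT_MAX fails this round trip, which is exactly the case a narrower length field cannot represent.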
> # Response To Concerns Mentioned
>
> I'll respond here to some of the concerns mentioned in this thread:
>
> ## size_t uses more memory and will result in more CPU cache misses,
> which will result in worse performance
>
> Well, size_t will use more memory. No doubt about that.
>
> But the performance side is more nuanced. And as several benchmarks in
> this thread indicate, there isn't a practical difference. Heck, the
> benchmarks on Windows show an improvement in some cases.
>
> And there is a reason for that. Since a pointer is a 64 bit data type,
> and an int is a 32 bit data type, any time you combine the two the
> int has to be widened first, which costs extra CPU cycles for the
> cast. This can be clearly seen by analyzing a simple malloc call with
> an int vs a size_t param. Here's the diff:
>
>    < movl $5, -12(%rbp)
>    < movl -12(%rbp), %eax
>    < cltq
>    ---
>    > movq $5, -16(%rbp)
>    > movq -16(%rbp), %rax
>
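The C source behind that diff is not included in the message. A guess at its shape, with hypothetical function names, would be something like the following: the int version needs a sign-extending conversion before the call because malloc's parameter is size_t, while the size_t version passes the value as it is.

```c
#include <stdlib.h>

/* The argument is an int, so it must be converted to size_t
 * (sign-extended on x86-64) before malloc can receive it. */
void *alloc_with_int(void)
{
    int n = 5;
    return malloc(n);
}

/* The argument is already size_t, so no conversion is needed. */
void *alloc_with_size_t(void)
{
    size_t n = 5;
    return malloc(n);
}
```

The difference should only be visible in an unoptimised build; an optimising compiler will usually fold the constant in both functions, so this illustrates the conversion rather than measuring its cost.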
> Now, a cache miss is much more expensive than a cast, but we don't
> have proof that cache misses will actually occur.
>
> In fact, in the benchmarks, the worst difference is 2%. Which is
> hardly significant (as indicated by several people here). But also
> notice that in both benchmarks (those done by Microsoft, and those
> done by Dmitry), some specific tests actually executed **faster** with
> the size_t transforms (namely Hello World, WordPress, etc). So even
> the 2% figure is not really the full story.
>
> We'll come back to the memory thing in a bit.
>
> ## Macro Renames and ZPP changes
>
> This was my idea, and I don't think it's been properly justified.
>
> ### ZPP Changes
>
> The ZPP changes are critical. The reason is that varargs is casting an
> arbitrary block of memory to a type, and then writing to it. So
> existing code that does zpp("s", &str, &int_len) would wind up with a
> buffer overflow. Because zpp would be trying to write a 64 bit value
> to a 32 bit container. The other 32 bits would fall off the end, into
> who knows what. At BEST this can result in a segfault. At worst,
> memory corruption and MASSIVE security vulnerabilities.
>
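A rough, self-contained simulation of that failure mode. To stay within defined behaviour, the "neighbouring" memory is placed in the same struct as the 32 bit length, so the oversized write lands somewhere observable instead of in who knows what. fake_parse_len and struct frame are hypothetical stand-ins, not the PHP API.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical zpp-style parser: it trusts the format string and
 * stores a size_t-wide length through whatever pointer it is given. */
static void fake_parse_len(const char *spec, ...)
{
    va_list ap;
    void *dest;
    size_t len = 5;

    (void)spec;
    va_start(ap, spec);
    dest = va_arg(ap, void *);       /* no type information survives */
    va_end(ap);
    memcpy(dest, &len, sizeof len);  /* writes sizeof(size_t) bytes */
}

struct frame {
    int int_len;    /* what old calling code expects to be filled in */
    int neighbour;  /* stands in for whatever sits next in memory */
};

/* Returns 1 if the write spilled past int_len into neighbour. */
int clobber_demo(void)
{
    struct frame f;

    assert(sizeof(size_t) <= sizeof f);  /* keep the write in bounds */
    f.int_len = 0;
    f.neighbour = 0x1234;
    fake_parse_len("s", (void *)&f);
    return f.neighbour != 0x1234;
}
```

With a 64 bit size_t the neighbour is overwritten on both little-endian and big-endian targets; with a 32 bit size_t it is left alone. In real code the neighbour is not a field you chose, which is where the segfaults and corruption come from.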
> Also note that the compiler *can't* and actively doesn't catch these
> types of errors. That means that it's largely luck and testing that
> will uncover them.
>
> So, I chose to break BC and rename the ZPP symbols. Because that WILL
> error, and provide the developer with a meaningful indication that an
> improper data type was provided. I considered a fatal error saying
> that an invalid type was supplied a better way of telling the
> developer "HEY, THIS NEEDS TO BE CHANGED ASAP" than just letting
> them hit random segfaults at runtime.
>
> If there is a way to get around this by giving the compiler more
> information, then do it. But to just leave the types there, and leave
> it to chance if a buffer overflow occurs, is dangerous. Which is why I
> made the call that the ZPP types **needed** to be changed.
>
> ### Macro Renames
>
> The reason for the rename is largely the same as with the ZPP changes.
> The severity of not changing is less (since the compiler will warn and
> do an implicit cast for you). But it's still there. Which is why I
> chose to change it. This is less critical, but was done to better
> indicate to the developer what needs to change to properly support the
> new system.
>
> ## Memory Overhead
>
> This is definitely a concern. There is a potential to double the
> amount of memory that PHP takes. Which on the surface looks enormous.
> And if we stop at the surface, we definitely shouldn't do it!
>
> But as we look deeper, we see that in actuality, the difference is not
> double. In fact, most data structures, as identified by Dmitry
> himself, only increase by between 6% (zend_op_array) and 50%
> (zend_string's size). So that "double" figure quickly drops.
>
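One reason a structure need not double is that only the length field widens, and alignment padding may already occupy the space it grows into. A sketch with two made-up layouts (these are not PHP's actual structs, and the real growth figures are the ones Dmitry measured):

```c
#include <stddef.h>

/* Hypothetical header with a 32 bit length. On a 64 bit target the
 * compiler will normally pad after len to keep the pointer aligned. */
struct hdr_int_len {
    char *val;
    int   len;
};

/* The same header with a size_t length. */
struct hdr_size_len {
    char  *val;
    size_t len;
};

size_t hdr_int_len_size(void)
{
    return sizeof(struct hdr_int_len);
}

size_t hdr_size_len_size(void)
{
    return sizeof(struct hdr_size_len);
}
```

On a typical LP64 target I would expect both to come out at 16 bytes, but that is an expectation about padding, not a measurement. How much a real structure grows depends on what else it contains.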
> But that's at the structure level. Let's look at what actually happens
> in practice. Dmitry himself also provides these answers. The average
> memory increase is 8% for WordPress, and 6% for ZF1.
>
> Let's put that 8% in context. WordPress used 12MB, and now it uses
> 13MB. 1MB more. That's not overly significant. ZF used 29MB. Now it
> uses 31MB. Still not overly significant.
>
> Don't get me wrong, it's still more. And more is bad. But it's not
> nearly as bad as it's being played out to be.
>
> To put this into context, 5.4 saved up to 50% memory from 5.3
> (depending on benchmark). 8 << 50.
>
> Now, I'm not saying that memory should be thrown around willy-nilly.
> But given the rationale that I gave above, I think the benefits of
> sanity, portability and security clearly are significant enough for
> the relatively small cost in memory.

--
Andrea Faulds
http://ajf.me/





--Apple-Mail=_17EDBFEB-5701-45BC-83B9-38AF9637F4EC--