[DRAFT][RFC] Big Integer Support

11 years ago by Andrea Faulds — view source

unread

Good evening again,

I’ve been working on a patch to add bigints to PHP for more than a fortnight now. It’d based on the phpng branch and very much unfinished, but is somewhat working.

A draft RFC is available here: https://wiki.php.net/rfc/bigint

It’s quite far from done, but I wanted to start some discussion now as finishing the patch will take some time.

Thanks!

--
Andrea Faulds
http://ajf.me/

11 years ago by Nikita Popov — view source

unread

Good evening again,

I’ve been working on a patch to add bigints to PHP for more than a
fortnight now. It’d based on the phpng branch and very much unfinished, but
is somewhat working.

A draft RFC is available here: https://wiki.php.net/rfc/bigint

It’s quite far from done, but I wanted to start some discussion now as
finishing the patch will take some time.

Thanks!

Hey Andrea, this looks really interesting!

Could you please submit a PR from your branch, so it's possible to review
the code?

Regarding the RFC text, I think you mixed up some left and right shifts.
"<< 65 will result in zero" is probably supposed to be >> 65? And "right
shifts will promote to bigints" probably is talking about left shifts?

Btw, I wouldn't worry about the change in shifts larger than the integer
width. Those have always been UB inherited from using UB of the C language.
The cyclic behavior is only an idiosyncrasy of the x86 architecture - other
ISAs like ARM (iirc) will not exhibit the behavior.

Nikita

11 years ago by Andrea Faulds — view source

unread

Hey Andrea, this looks really interesting!

Glad you think so. :)

Could you please submit a PR from your branch, so it's possible to review the code?

Oh, good idea. I’ll do so soon. I should note the patch is very much unfinished (it works, but there are quite a few problems and I think I broke the math bit of ext/standard ;) and there are some things I’m yet to implement. It also needs some tests of its own; at the moment I’ve just fixed and removed the skip condition from a few 64-bit ones, with the reasoning that if this is implemented right, they should act identically on 32-bit CPUs.

However, you can currently create bigints, they’re reference counted and copied-on-write correctly, they don’t leak memory or segfault, and operations overflow correctly.

Regarding the RFC text, I think you mixed up some left and right shifts. "<< 65 will result in zero" is probably supposed to be >> 65? And "right shifts will promote to bigints" probably is talking about left shifts?

Oops, yes. I’ll fix that right away.

Btw, I wouldn't worry about the change in shifts larger than the integer width. Those have always been UB inherited from using UB of the C language. The cyclic behavior is only an idiosyncrasy of the x86 architecture - other ISAs like ARM (iirc) will not exhibit the behaviour.

Right. I’ll need to fix up the code so that it only does the necessary checks for cyclic behaviour on the platforms which do it. Currently there’s no such check.

--
Andrea Faulds
http://ajf.me/

11 years ago by Tjerk Meesters — view source

unread

Good evening again,

I’ve been working on a patch to add bigints to PHP for more than a
fortnight now. It’d based on the phpng branch and very much unfinished, but
is somewhat working.

A draft RFC is available here: https://wiki.php.net/rfc/bigint

Cool! :)

Regarding the pow() operator, if both operands are IS_LONG it does
multiplications if the exponent is positive; I don't see that logic
repeated in the IS_BIGINT cases. Was that causing issues?

Also, I noticed a couple of merge commits on your branch that you may wish
to iron out :)

It’s quite far from done, but I wanted to start some discussion now as
finishing the patch will take some time.

Thanks!

--
Andrea Faulds
http://ajf.me/

--

--

Tjerk

11 years ago by Andrea Faulds — view source

unread

Regarding the pow() operator, if both operands are IS_LONG it does multiplications if the exponent is positive; I don't see that logic repeated in the IS_BIGINT cases. Was that causing issues?

That’s because I modified the existing code which manually computed pow() for two ints and overflowed to float, by simply making it overflow to bigint instead. The IS_BIGINT cases use the zend_bigint pow functions instead.

Also, I noticed a couple of merge commits on your branch that you may wish to iron out :)

I ought to, but that would mean rewriting history, and I’d rather not do that in case it breaks some fork or what have you. I actually have an abandoned fork of my own bigint branch which does bitwise ops differently, though I don’t care for it. Maybe, if this eventually is merged into phpng, I could rebase them out then.

Andrea Faulds
http://ajf.me/

11 years ago by Dan Ackroyd — view source

unread

Hi,

An interesting RFC, that could do with some fleshing out. In particular:

What is the behaviour when you add 0.5 to a value that is a bigint?

What is the behaviour when a bigint zval is used in an extension that
is then going to then pass it to an underlying library that is
expecting a 64bit value? Will all the extension authors need to start
casting and/or checking the values?

There are quite a few functions in the GMP library, which presumably
are useful in handling bigints. Are these going to be exposed to
users? If so, how?

Most of all though, it could do with a stronger argument for why
handling bigints 'automagically' like this is preferable to handling
them explicitly. Without great care for handling them, dealing with
very large numbers is still going to fail at some point. Although
handling integers as bigints explicitly is more code to write, it's
also results in more understandable behaviour, which seems better to
me than having more internal converting of types which is not visible
in userland.

Finally, the RFC should probably address the licensing issues that
would be involved in making GMP an integral part of PHP, not just for
how PHP is distributed, but also for how people making and
distributing PHP applications would be affected.

For reference the GMP library is dual-licensed under GNU LGPL v3 and
GNU GPL v2.0. As I understand it, these are not compatible with the
PHP license.

cheers
Dan

Good evening again,

I’ve been working on a patch to add bigints to PHP for more than a fortnight now. It’d based on the phpng branch and very much unfinished, but is somewhat working.

A draft RFC is available here: https://wiki.php.net/rfc/bigint

It’s quite far from done, but I wanted to start some discussion now as finishing the patch will take some time.

Thanks!

--
Andrea Faulds
http://ajf.me/

11 years ago by Andrea Faulds — view source

unread

An interesting RFC, that could do with some fleshing out.

Don’t worry, I’ll be sure to flesh it out further.

What is the behaviour when you add 0.5 to a value that is a bigint?

That would come under TYPE_PAIR(IS_BIGINT, IS_DOUBLE) (or its opposite). It would cast the bigint to a float then add the two. The results would be similar to adding a large 64-bit long (over 53 bits) and a double. I’m not sure what the exact result would be. For large values, 0.5 is probably just rounded away.

What is the behaviour when a bigint zval is used in an extension that
is then going to then pass it to an underlying library that is
expecting a 64bit value? Will all the extension authors need to start
casting and/or checking the values?

Functions expecting zvals will obviously need to start handling IS_BIGINT, so yes, you’d need to check and cast if appropriate. Casting from a bigint to a long is easy, though, it’s just zend_bigint_to_long(Z_LVAL_P(some_zval)). Obviously you could also export a double or a string in the same fashion. As I think I noted in the RFC, PHP functions which take longs already will continue to get longs, as bigints passed them will be casted.

There are quite a few functions in the GMP library, which presumably
are useful in handling bigints. Are these going to be exposed to
users? If so, how?

I’m exposing absolutely none of GMP directly and would rather not change that. Instead, I have the zend_bigint_* family, which are largely thin wrappers over GMP so you can deal with the zend_bigint type directly, and also so PHP is theoretically not GMP-dependent and could switch to another library. While I don’t have the functions inlined at the moment (debugging convenience; make clean isn’t fun!), they will be inlined eventually for performance. I agree that GMP’s functions are useful, but I really don’t want to directly expose the internal gmp type to anything outside of zend_bigint.c, so I guess any functions I haven’t already covered will need their own wrappers created.

Most of all though, it could do with a stronger argument for why
handling bigints 'automagically' like this is preferable to handling
them explicitly. Without great care for handling them, dealing with
very large numbers is still going to fail at some point. Although
handling integers as bigints explicitly is more code to write, it's
also results in more understandable behaviour, which seems better to
me than having more internal converting of types which is not visible
in userland.

I don’t think the new behaviour will be confusing, as integers will simply be unbounded now so far as users need to be concerned. A benefit of doing it implicitly is that PHP’s integer arithmetic would be completely consistent across platforms. Users don’t have to worry about whether the machine their code is run on is 32-bit or 64-bit, and operations are consistent for different sizes of numbers. It also makes writing code easier, as users don’t have to worry about if integers will suddenly overflow to float. Obviously, if you’re using ridiculously big numbers you’ll start hitting memory limits, but most of the time it means integers are effectively limitless.

Finally, the RFC should probably address the licensing issues that
would be involved in making GMP an integral part of PHP, not just for
how PHP is distributed, but also for how people making and
distributing PHP applications would be affected.

For reference the GMP library is dual-licensed under GNU LGPL v3 and
GNU GPL v2.0. As I understand it, these are not compatible with the
PHP license.

Right, the LGPL thing will need to be addressed. As far as I can see, there wouldn’t really be any impact on PHP users, though I’m not an open-source license expert. I can’t really see it causing a particular problem unless people are distributing proprietary PHP versions, but I need to look into it more, so don’t quote me on that.

Andrea Faulds
http://ajf.me/

11 years ago by Andrea Faulds — view source

unread

Good afternoon,

I’ve made some improvements to the RFC.

First, to address Dan’s questions:

Finally, the RFC should probably address the licensing issues that
would be involved in making GMP an integral part of PHP, not just for
how PHP is distributed, but also for how people making and
distributing PHP applications would be affected.

For reference the GMP library is dual-licensed under GNU LGPL v3 and
GNU GPL v2.0. As I understand it, these are not compatible with the
PHP license.

I’ve now addressed the licensing issues more thoroughly:

https://wiki.php.net/rfc/bigint#licensing_and_dependency_issues

Most of all though, it could do with a stronger argument for why
handling bigints 'automagically' like this is preferable to handling
them explicitly. Without great care for handling them, dealing with
very large numbers is still going to fail at some point. Although
handling integers as bigints explicitly is more code to write, it's
also results in more understandable behaviour, which seems better to
me than having more internal converting of types which is not visible
in userland.

The opening section now elaborates a little more on why I think bigints are beneficial:

https://wiki.php.net/rfc/bigint#introduction

I also think that having integers be effectively boundless would be more understandable than how they overflow to floats at the moment, especially for newbie programmers. More generally, it’s less bugs and less worrying about platform differences.

Now onto Nikita’s:

Could you please submit a PR from your branch, so it's possible to review
the code?

This has now been done: https://github.com/php/php-src/pull/700

I also lengthened my TODO list for things that are unfinished in the patch: https://wiki.php.net/rfc/bigint#todo

Regarding the RFC text, I think you mixed up some left and right shifts.
"<< 65 will result in zero" is probably supposed to be >> 65? And "right
shifts will promote to bigints" probably is talking about left shifts?

Btw, I wouldn't worry about the change in shifts larger than the integer
width. Those have always been UB inherited from using UB of the C language.
The cyclic behavior is only an idiosyncrasy of the x86 architecture - other
ISAs like ARM (iirc) will not exhibit the behavior.

I’ve fixed those errors and clarified this a bit, and I also mention that I removed negative shifts, which I had forgotten earlier:

https://wiki.php.net/rfc/bigint#changes_to_operators_for_the_sake_of_consistency

Thanks for your comments!

Andrea Faulds
http://ajf.me/

11 years ago by Andrea Faulds — view source

unread

Good evening,

The RFC and patch are still incomplete, but there is a certain open question I would like to start a discussion about, which has wider impact beyond merely bigints, namely dealing with floats as array keys. To quote the RFC:

A problem arising from allowing integers to be arbitrarily large is that array keys using strings for numeric keys beyond the maximum size of a long would probably seem weird. At present, bigints are just dealt with as if they were numeric strings when using them as array keys and indices, but this may not be optimal. This RFC aims for integer consistency across platforms, and this would be a remaining inconsistency. It also doesn't make sense from a user perspective to have integers over a certain value suddenly become string keys, though whether this matters much in practise with PHP's type casting and juggling is a different question.

This also presents a further issue: inconsistency between longs, bigints and doubles, which must be avoided, as integer consistency cross-platform is a key goal of this RFC. Currently in PHP, doubles used as indexes are simply casted to longs, without any regard for size. This means that they overflow if they are larger than the platform's long size, either 32-bit or 64-bit. However, bigints as implemented, will be treated as strings if they are outside of the bounds of a long on the platform. While bigints are likely to break existing code anyway, this would be a particularly bad breakage, as code relying on very large numbers being floats and wrapping when used as indices would break. Hence some sort of solution must be found. Either we cast bigints to longs and let them overflow (not terribly desirable), we don't change the current behaviour (inconsistent), or we change the handling of doubles. Personally, I don't like what PHP does here and would to go for this last option.

I think the current behaviour for floats as array keys is bad and we should change it. In my opinion, it should convert to string if fractional or if outside the range of a long, and otherwise convert to long. I think this is the most intuitive behaviour and the least confusing for new users, and probably also the least likely to cause bugs. Failing that, I’d like to see warnings when fractional floats are truncated to be used as array keys. We already warn about casts for string offsets, so I don’t see why we couldn’t for array offsets.

I’m less concerned about whether bigints become string keys or preserve their int-ness, as the difference to the end-user is much lesser. If I did add support for bigint keys, I’d probably just do so by storing them as strings and converting to bigint on output.

What are your thoughts? Obviously, as this RFC targets PHP 6/NEXT, breaking BC is acceptable to a degree. Also, if this becomes too controversial an issue (I hope not), I could always call a parallel vote on it if this RFC eventually goes to a vote.

Thanks!

Andrea Faulds
http://ajf.me/

11 years ago by Alexey Zakhlestin — view source

unread

I’m less concerned about whether bigints become string keys or preserve their int-ness,
as the difference to the end-user is much lesser. If I did add support for bigint keys,
I’d probably just do so by storing them as strings and converting to bigint on output.

The difference is quite noticeable actually.

for example, http://docs.php.net/array_merge

“If the input arrays have the same string keys, then the later value for that key will overwrite the previous one. If, however, the arrays contain numeric keys, the later value will not overwrite the original value, but will be appended."

--
Alexey Zakhlestin
CTO at Grids.by/you
https://github.com/indeyets
PGP key: http://indeyets.ru/alexey.zakhlestin.pgp.asc

11 years ago by Andrea Faulds — view source

unread

I’m less concerned about whether bigints become string keys or preserve their int-ness,
as the difference to the end-user is much lesser. If I did add support for bigint keys,
I’d probably just do so by storing them as strings and converting to bigint on output.

The difference is quite noticeable actually.

for example, http://docs.php.net/array_merge

“If the input arrays have the same string keys, then the later value for that key will overwrite the previous one. If, however, the arrays contain numeric keys, the later value will not overwrite the original value, but will be appended.”

Ooh, good catch. I guess I’ll have to look into whether bigint keys are possible, then.

--
Andrea Faulds
http://ajf.me/

11 years ago by Andrea Faulds — view source

unread

Hello,

I’ve made a change to the RFC and patch, whereby NaN will cast to integer zero
instead of LONG_MIN. I think that makes a lot more sense and I haven’t put this
under “Open Questions” as I don’t think anyone will really object.

Also, I’m reconsidering my earlier position on how to handle bigint to long
casting. (Userland can’t cast to long, but zend_parse_parameters, bitwise
shifts etc. need to use zend_dval_to_lval sometimes.) While I have currently
made the patch cap the value at the maximum/minimum value if too large/small,
this is inconsistent with double to long casting which instead truncates and
preserves the lowest bits. For this reason, and because I think it’s better to
keep as much information as possible rather than lose everything but the sign,
I will change it to truncate.

Either way, I think there should be some sort of warning (probably an E_NOTICE
or E_WARNING?) when this cast happens implicitly and the number is truncated,
such as in function calls. I’m tempted to remove this from Open Questions and
instead just Do The Right Thing and if someone objects later, the patch and RFC
can always be changed.

Of course, it’s worth noting we currently don’t do any sort of warning when
casting a float to a long causes a loss of information. Anthony Ferrara’s sadly
withdrawn RFC for scalar type hints would have done this for user land
functions, but that hasn’t happened yet, so adding a warning here would be
inconsistent with floats.

Any thoughts?

Thanks!

Andrea Faulds
http://ajf.me/

11 years ago by Andrea Faulds — view source

unread

Either way, I think there should be some sort of warning (probably an E_NOTICE
or E_WARNING?) when this cast happens implicitly and the number is truncated,
such as in function calls. I’m tempted to remove this from Open Questions and
instead just Do The Right Thing and if someone objects later, the patch and RFC
can always be changed.

Of course, it’s worth noting we currently don’t do any sort of warning when
casting a float to a long causes a loss of information. Anthony Ferrara’s sadly
withdrawn RFC for scalar type hints would have done this for user land
functions, but that hasn’t happened yet, so adding a warning here would be
inconsistent with floats.

The patch now raises an E_RECOVERABLE_ERROR if a bigint that is too large is passed as an argument to a function expecting a long. I think this is best as it would hopefully prevent bugs. I was inspired by Anthony Ferrara’s scalar type hinting RFC and Python’s OverflowErrors here.

However, this does cause the previously mentioned issue that floats don’t do the same thing and simply truncate without warning here. Should we change the behaviour for them as well? It would be inconsistent not to. Should I make a separate RFC which would make number casting behaviour strict and then make the bigint patch and RFC consistent with it, on the hope that separate RFC could be passed first? That way this wouldn’t become an issue that might block the bigint RFC.

Andrea Faulds
http://ajf.me/

11 years ago by Rowan Collins — view source

unread

Chris Wright wrote (on 26/06/2014):

I'm assuming if bigint keys were available, PHP_INT_MAX would still have the
same value, but PHP_INT_MAX + 1 would become a valid key, making this work
transparently.
PHP_INT_MAX would probably need to be renamed (i.e. deprecated and
replaced) because it would become a misnomer - the maxPHP int is no
longer known because it is limited only by memory.

An interesting thought. Is there really much value in renaming it
though? It would cause a compatibility break (once the deprecation
period expired) for something which could easily be covered by
documentation. If done right, the BigInt support would simply make the
constant unnecessary in the majority of cases.

It's also not entirely clear to me what it could be renamed to which
was any more appropriate.

P.S. You hit "reply" instead of "reply list" or "reply to all", so your
message missed the list.

Rowan Collins
[IMSoP]

11 years ago by Andrea Faulds — view source

unread

Chris Wright wrote (on 26/06/2014):

I'm assuming if bigint keys were available, PHP_INT_MAX would still have the
same value, but PHP_INT_MAX + 1 would become a valid key, making this work
transparently.
PHP_INT_MAX would probably need to be renamed (i.e. deprecated and
replaced) because it would become a misnomer - the maxPHP int is no
longer known because it is limited only by memory.

An interesting thought. Is there really much value in renaming it though? It would cause a compatibility break (once the deprecation period expired) for something which could easily be covered by documentation. If done right, the BigInt support would simply make the constant unnecessary in the majority of cases.

Right. Myself, I’m in favour of keeping PHP_INT_MAX around. The name might be misleading, but it’s still important in a few places. So long as the documentation makes this clear, it’s not a problem.

Actually, I’d like to add a new constant perhaps, PHP_INT_MIN. It seems silly that we have a _MAX yet no _MIN despite having signed integers, especially since your minimum can’t be typed as a literal (longstanding bug, one that bigints happen to “fix” as it would be a bigint, but it should really be a long), and it’s -(PHP_INT_MAX - 1) not -PHP_INT_MAX.

It's also not entirely clear to me what it could be renamed to which was any more appropriate.

PHP_LONG_MIN perhaps, but so far as user land cares, long and int are synonyms. I don’t want to rename it, anyway.

--
Andrea Faulds
http://ajf.me/

[DRAFT][RFC] Big Integer Support

--

Thanks for your comments!

Thanks!

Thanks!

P.S. You hit "reply" instead of "reply list" or "reply to all", so your message missed the list.

P.S. You hit "reply" instead of "reply list" or "reply to all", so your
message missed the list.