RFC [Discussion]: Randomizer Additions

2 years ago by drealecs@gmail.com — view source

unread

On Thu, Oct 13, 2022, 22:37 Joshua Rüsweg via internals <
internals@lists.php.net> wrote:

Hi

Tim Düsterhus and I have created an RFC to add new methods that solve
commonly encountered use cases to \Random\Randomizer. Specifically
creating a random string consisting of specific bytes and generating
random floating point values.

You can find the RFC at:

https://wiki.php.net/rfc/randomizer_additions

Proof of concept implementation is in:

https://github.com/php/php-src/pull/9664

https://github.com/php/php-src/pull/9679

Some open questions to start the discussion:

Are you missing other commonly useful operations that are also useful
to have in core?

For completeness, it would be good to have nextBool() as well.

Do you agree with the method names? Within the PR we received comments

that "alphabet" might not be an appropriate term.

Yes, alphabet is fine by me but I can suggest also: character set,
character list, character dictionary, byte set, byte list, byte dictionary
or some shortcuts of it, like charset.

Shall an option be added to getFloat() that changes the logic to

select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?

No, IMO. Mathematically it doesn't really make sense and talking about
floats, it will also be a very corner case not reached in tests that might
happen in production rarely and break things.

We're looking forward to your feedback.

Cheers

Joshua Rüsweg

--

To unsubscribe, visit: https://www.php.net/unsub.php

I am having another small issue.
As the Randomizer class is final, I guess this will not be perfectly
polyfillable in userland.
So... , if accepted, would it be completely wrong to have these new methods
in PHP 8.2? What can it break?

2 years ago by Joshua Rüsweg via internals — view source

unread

Hi

For completeness, it would be good to have nextBool() as well.

I'm just wondering if that's really necessary. Generating a boolean is
trivial with the nextFloat method (see example 1. Simulate a coinflip
from the RFC).

No, IMO. Mathematically it doesn't really make sense and talking about
floats, it will also be a very corner case not reached in tests that
might happen in production rarely and break things.

Why does it not make mathematical sense? With the nextFloat method, I
can understand the argument, because that is otherwise opaque,
especially when you work with probabilities. With the getFloat method,
however, I can imagine a few cases in which this makes sense. In our
example, for example, we calculate a random longitude and latitude with
the method. However, the value of Lat: +90.0 Lng: +180.0 cannot be
generated, although it is a valid value. However, the value of Lat:
-90.0 Lng: -180.0 is included. Am I missing something, or is there
currently no simple mathematical way to implement this with an
open-right interval?

I am having another small issue.
As the Randomizer class is final, I guess this will not be perfectly
polyfillable in userland.
So... , if accepted, would it be completely wrong to have these new
methods in PHP 8.2? What can it break?

I am relatively new and have little to no experience with contributing
PHP functions, but I don't think this should be done. There is a good
reason why this is not done and it only causes confusion if any
functions are introduced after the feature freeze. In the end, however,
I think it is the release manager who decides.

Cheers

Joshua Rüsweg

2 years ago by Go Kudo — view source

unread

2022年10月14日(金) 13:33 Alexandru Pătrănescu drealecs@gmail.com:

On Thu, Oct 13, 2022, 22:37 Joshua Rüsweg via internals <
internals@lists.php.net> wrote:

Hi

Tim Düsterhus and I have created an RFC to add new methods that solve
commonly encountered use cases to \Random\Randomizer. Specifically
creating a random string consisting of specific bytes and generating
random floating point values.

You can find the RFC at:

https://wiki.php.net/rfc/randomizer_additions

Proof of concept implementation is in:

https://github.com/php/php-src/pull/9664

https://github.com/php/php-src/pull/9679

Some open questions to start the discussion:

Are you missing other commonly useful operations that are also useful
to have in core?

For completeness, it would be good to have nextBool() as well.

Do you agree with the method names? Within the PR we received comments

that "alphabet" might not be an appropriate term.

Yes, alphabet is fine by me but I can suggest also: character set,
character list, character dictionary, byte set, byte list, byte dictionary
or some shortcuts of it, like charset.

Shall an option be added to getFloat() that changes the logic to

select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?

No, IMO. Mathematically it doesn't really make sense and talking about
floats, it will also be a very corner case not reached in tests that might
happen in production rarely and break things.

We're looking forward to your feedback.

Cheers

Joshua Rüsweg

--

To unsubscribe, visit: https://www.php.net/unsub.php

I am having another small issue.
As the Randomizer class is final, I guess this will not be perfectly
polyfillable in userland.
So... , if accepted, would it be completely wrong to have these new methods
in PHP 8.2? What can it break?

As the Randomizer class is final, I guess this will not be perfectly
polyfillable in userland.

What case do you envision for this? Probably, most cases can be polyfilled
by code: https://3v4l.org/JludD

And it seems that someone has already done such an implementation (which is
very nice and I would like to express my utmost appreciation).

https://github.com/arokettu/php-random-polyfill

Regards
Go Kudo

2 years ago by Dan Ackroyd — view source

unread

Hello Joshua,

Shall an option be added to getFloat() that changes the logic to
select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?

An enum would probably be nice, and possibly be for all four cases of
min_(inclusive|exclusive)max(inclusive|exclusive) unless there is a
technical reason to not include all of them.

Generating a random string containing specific characters...thus requires multiple lines of code for what effectively is a very simple operation.

Yeah, though those lines of code add distinction and emphasis for is
meant by character.

In particular, users might be surprised when they give this string
"abc😋👨‍👩‍👦"* and get a non-ascii result.

You're going to need to be really precise on the naming and I'm not at
all sure there is a single version that would be useful enough to
belong in core.

whereas a 64 Bit engine could generate randomness for 8 characters at once.

I'm really not sure that many programs are going to be speed limited
by random number generation.

For those that are, writing their own generator to consume all 64 bits
of randomness for each call sounds reasonably sensible, unless a
useful general api can be thought of.

For the float side of the RFC, as there are technical limitations on
which platforms it would be usable on, there needs to be a way of
determining whether the nextFloat and getFloat methods are going to
work. The way this is done on Imagick is to put appropriate defines in
the stub file and in the C code implementations so that the methods
aren't available on the class for the platforms where it isn't going
to function correctly.

I made a PR for that to Tim's repo, though I don't know of an
environment where it can be tested.

cheers
Dan
Ack

For those who get a mangled version, the "characters" there are 'a'
'b' 'c' 'smily-face' 'man + zerowidth joiner + woman + zerowidth
joiner + child'. The last is a character that takes a mere 18 bytes to
represent.

2 years ago by tim@bastelstu.be — view source

unread

Hi

Shall an option be added to getFloat() that changes the logic to
select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?

An enum would probably be nice, and possibly be for all four cases of
min_(inclusive|exclusive)max(inclusive|exclusive) unless there is a
technical reason to not include all of them.

No technical reason. The paper for the γ-section algorithm by Prof.
Goualard includes implementations for all four combinations.

The [closed, open) and [closed, closed] variants together are the most
useful combination, though. The former can be cleanly split into
subintervals, whereas the latter is symmetric which is useful if your
use case is as well.

With rejection sampling the [closed, closed] variant can also be turned
into any of the other three without introducing any bias.

Generating a random string containing specific characters...thus requires multiple lines of code for what effectively is a very simple operation.

Yeah, though those lines of code add distinction and emphasis for is
meant by character.

In particular, users might be surprised when they give this string
"abc😋👨‍👩‍👦"* and get a non-ascii result.

That's why the method includes 'bytes' in its name. This term is also
used in ->shuffleBytes() which was renamed in
https://wiki.php.net/rfc/random_extension_improvement due to this exact
problem.

In fact ->shuffleBytes() can be considered the companion method to the
proposed ->getBytesFromAlphabet():

->shuffleBytes() allows to simulate a Multivariate hypergeometric
distribution ("sampling without replacement") by shuffling the input
string and then using 'substr' on the result to select a number of bytes
'n'.

->getBytesFromAlphabet() directly maps to a Multinomial distribution
("replacement sampling").

You're going to need to be really precise on the naming and I'm not at
all sure there is a single version that would be useful enough to
belong in core.

Personally I believe the proposed ->getBytesFromAlphabet() to be useful
enough to belong in core. There are quite a few use cases that are
naturally restricted to ASCII: Basically everything that can be
considered an identifier.

Arbitrary numeric strings alone likely provide plenty of use cases:

Backup codes for multi-factor authentication (restricting the input to
digits allows you to leverage a numeric keyboard, reducing the chance
for input errors).
Random phone numbers for testing purposes.
Random credit card numbers (you just need to calculate the checksum
yourself).

While hexadecimal strings can easily be generated by applying bin2hex to
the output of ->getBytes(), that is also unintuitive, because the result
length is twice the number of bytes.

whereas a 64 Bit engine could generate randomness for 8 characters at once.

I'm really not sure that many programs are going to be speed limited
by random number generation.

The syscall cost to retrieve bytes from the 'Secure' engine (which is
the default engine) can be expensive. Especially on older operating
system versions and depending on how many more Meltdown/Spectre-style
vulnerabilities they find.

A userland implementation that generates 1000 random numeric strings
with 100 characters each using the 'Secure' engine requires 146ms on my
computer. The native implementation without optimization requires 101ms
and the optimized native implementation 26ms.

For Xoshiro256** (the fastest engine) the numbers are 89ms, 7ms and 3ms
respectively.

Benchmark attached.

For those that are, writing their own generator to consume all 64 bits
of randomness for each call sounds reasonably sensible, unless a
useful general api can be thought of.

This cannot be reasonably done in userland, because you pay an increased
cost to turn the bytes into numbers and then to perform the necessary
bit fiddling to debias the numbers.

For the float side of the RFC, as there are technical limitations on
which platforms it would be usable on, there needs to be a way of
determining whether the nextFloat and getFloat methods are going to
work. The way this is done on Imagick is to put appropriate defines in
the stub file and in the C code implementations so that the methods
aren't available on the class for the platforms where it isn't going
to function correctly.

I made a PR for that to Tim's repo, though I don't know of an
environment where it can be tested.

I've seent he PR and we shortly discussed this in chat. I would defer
this to the code review, because this amounts to an implementation
detail. Especially since all reasonable server platforms use IEEE 754.

Best regards
Tim Düsterhus

2 years ago by Jordan LeDoux — view source

unread

This cannot be reasonably done in userland, because you pay an increased
cost to turn the bytes into numbers and then to perform the necessary
bit fiddling to debias the numbers.

To add to this, I'm going to link to a userland implementation I wrote for
my Fermat library. A couple of notes:

I am not at all claiming to be a subject matter expert, and this was my
first pass at the feature, not a fully researched and optimized version.
The main issue at hand for me was that my 'numbers' could be arbitrary
precision, which would require me to directly ask for bytes and then pull
out numbers from them depending on the number of integer digits possible
within the range.

https://github.com/JordanRL/Fermat/blob/5d186eac99ceb98b089865099c2d7622428bfdaa/src/Samsara/Fermat/Provider/RandomProvider.php#L128

Now again, this was my first pass, but the way I tackled this was to get a
string of bytes that fully contained the integer range. The issue of
debiasing the numbers with that approach is so difficult, that I opted
instead to simply recursively call the operation until a number within the
desired range was generated. This avoids introducing bias, but is obviously
less than optimal. Some kind of ability to directly provide a set of
possible characters and generating from bytes directly would at least
reduce the recursion space further in my case, though my particular
algorithm still needs additional improvement obviously.

Generating random numbers that have an arbitrary range is not easy,
particularly if your range might exceed PHP_INT_MAX. Any utilities that
core can provide to improve performance and reduce the number of places
that bias might be introduced in such a situation would be much
appreciated. Personally, I decided for my implementation that introducing
bias was a greater concern than performance.

Jordan

2 years ago by Larry Garfield — view source

unread

Generating a random string containing specific characters...thus requires multiple lines of code for what effectively is a very simple operation.

Yeah, though those lines of code add distinction and emphasis for is
meant by character.

In particular, users might be surprised when they give this string
"abc😋👨‍👩‍👦"* and get a non-ascii result.

That's why the method includes 'bytes' in its name. This term is also
used in ->shuffleBytes() which was renamed in
https://wiki.php.net/rfc/random_extension_improvement due to this exact
problem.

In fact ->shuffleBytes() can be considered the companion method to the
proposed ->getBytesFromAlphabet():

"Alphabet" here still, to me, implies a character set, not a byte stream. Maybe getBytesFromString? getBytesFromList? getBytesFrom() (because you're getting it "from" the string that's provided, so you don't need another noun there.)?

I'm not opposed to the functionality, but "alphabet" doesn't seem like the right word for it.

--Larry Garfield

2 years ago by Joshua Rüsweg via internals — view source

unread

Hello!

[As Larry kindly pointed out to me, I only sent the email to Larry and
not to the mailing list.]

"Alphabet" here still, to me, implies a character set, not a byte stream. Maybe getBytesFromString? getBytesFromList? getBytesFrom() (because you're getting it "from" the string that's provided, so you don't need another noun there.)?

I'm not opposed to the functionality, but "alphabet" doesn't seem like the right word for it.

I'm not really happy with the name either and was hoping for some
suggestions. getBytesFrom sounds incomplete to me. getBytesFromList
sounds to me as if an array is being passed there and not a string.
getBytesFromString sounds good to me, though. Better than
getBytesfromAlphabet!

Cheers

Joshua Rüsweg

2 years ago by Jordan LeDoux — view source

unread

On Fri, Oct 28, 2022 at 12:30 PM Joshua Rüsweg via internals <
internals@lists.php.net> wrote:

Hello!

[As Larry kindly pointed out to me, I only sent the email to Larry and
not to the mailing list.]

"Alphabet" here still, to me, implies a character set, not a byte
stream. Maybe getBytesFromString? getBytesFromList? getBytesFrom()
(because you're getting it "from" the string that's provided, so you don't
need another noun there.)?

I'm not opposed to the functionality, but "alphabet" doesn't seem like
the right word for it.

I'm not really happy with the name either and was hoping for some
suggestions. getBytesFrom sounds incomplete to me. getBytesFromList
sounds to me as if an array is being passed there and not a string.
getBytesFromString sounds good to me, though. Better than
getBytesfromAlphabet!

Cheers

Joshua Rüsweg

Not to try and bikeshed further, but wouldn't getBytesFromChars or
getBytesFromCharList be more clear while being nearly as accurate?

Jordan

2 years ago by Claude Pache — view source

unread

Le 28 oct. 2022 à 23:43, Jordan LeDoux jordan.ledoux@gmail.com a écrit :

On Fri, Oct 28, 2022 at 12:30 PM Joshua Rüsweg via internals <
internals@lists.php.net> wrote:

Not to try and bikeshed further, but wouldn't getBytesFromChars or
getBytesFromCharList be more clear while being nearly as accurate?

Jordan

In the face of multibyte character sets such as UTF-8, I wouldn’t use “char” to mean “byte” (even if, in practice, the most common use will be strings of 1-byte chars). “Alphabet” or “string” might be ambiguous (is it an alphabet of bytes or an alphabet of characters?), but at least they are not contradictory.

—Claude

2 years ago by Jordan LeDoux — view source

unread

Le 28 oct. 2022 à 23:43, Jordan LeDoux jordan.ledoux@gmail.com a
écrit :

On Fri, Oct 28, 2022 at 12:30 PM Joshua Rüsweg via internals <
internals@lists.php.net> wrote:

Not to try and bikeshed further, but wouldn't getBytesFromChars or
getBytesFromCharList be more clear while being nearly as accurate?

Jordan

In the face of multibyte character sets such as UTF-8, I wouldn’t use
“char” to mean “byte” (even if, in practice, the most common use will be
strings of 1-byte chars). “Alphabet” or “string” might be ambiguous (is it
an alphabet of bytes or an alphabet of characters?), but at least they are
not contradictory.

—Claude

Well... perhaps. But "get bytes from character list" would do exactly what
it says it will do, even from UTF-8 strings. It will use any of the bytes
from the character list, even if one character in the list may contribute
multiple bytes to choose from. It is not contradictory, just potentially
confusing about the result

Jordan

2 years ago by Jordan LeDoux — view source

unread

On Sat, Oct 29, 2022 at 12:21 PM Jordan LeDoux jordan.ledoux@gmail.com
wrote:

Well... perhaps. But "get bytes from character list" would do exactly what
it says it will do, even from UTF-8 strings. It will use any of the bytes
from the character list, even if one character in the list may contribute
multiple bytes to choose from. It is not contradictory, just potentially
confusing about the result

Jordan

On second thought, isn't it possible to simply choose a random integer to
correspond with the char list position instead, sidestepping the UTF-8
issue entirely? In which case, it might be more accurate to name it just
getRandomCharFromList

Jordan

2 years ago by Jordan LeDoux — view source

unread

On Sat, Oct 29, 2022 at 6:02 PM Jordan LeDoux jordan.ledoux@gmail.com
wrote:

On Sat, Oct 29, 2022 at 12:21 PM Jordan LeDoux jordan.ledoux@gmail.com
wrote:

Well... perhaps. But "get bytes from character list" would do exactly
what it says it will do, even from UTF-8 strings. It will use any of the
bytes from the character list, even if one character in the list may
contribute multiple bytes to choose from. It is not contradictory, just
potentially confusing about the result

Jordan

On second thought, isn't it possible to simply choose a random integer to
correspond with the char list position instead, sidestepping the UTF-8
issue entirely? In which case, it might be more accurate to name it just
getRandomCharFromList

Jordan

Nevermind, disregard. I just remembered (again) how strings are handled in
PHP and why this wouldn't work.

Jordan

2 years ago by Joshua Rüsweg via internals — view source

unread

Hi

You can find the RFC at:

https://wiki.php.net/rfc/randomizer_additions

Tim Düsterhus and I have updated the RFC and have broadly adopted the
proposals.

Firstly we have renamed the getBytesFromAlphabet to
getBytesFromString as Larry Garfield suggested.

Secondly, we have extended the getFloat method with a parameter that
specifies which type of interval should be generated (Open, Closed,
Right Half-Open and Left Half-Open).

Are you happy with the enum names? Have you any other suggestions?

Cheers

Joshua Rüsweg

2 years ago by Jordan LeDoux — view source

unread

On Fri, Oct 28, 2022 at 12:23 PM Joshua Rüsweg via internals <
internals@lists.php.net> wrote:

Hi

You can find the RFC at:

https://wiki.php.net/rfc/randomizer_additions

Tim Düsterhus and I have updated the RFC and have broadly adopted the
proposals.

Firstly we have renamed the getBytesFromAlphabet to
getBytesFromString as Larry Garfield suggested.

Secondly, we have extended the getFloat method with a parameter that
specifies which type of interval should be generated (Open, Closed,
Right Half-Open and Left Half-Open).

Are you happy with the enum names? Have you any other suggestions?

I had a discussion with Tim about this naming topic as well, and want to
convey my suggestion along with the reasoning.

I think the enum should be Random\IntervalBoundary with the enum cases:

IntervalBoundary::Open
IntervalBoundary::Closed
IntervalBoundary::OpenRight or IntervalBoundary::HalfOpenRight
IntervalBoundary::OpenLeft or IntervalBoundary::HalfOpenLeft

First, I think the enum should be IntervalBoundary because an enum is
used to explicitly represent a program state in a human readable format.
Because of this, I think that the readability for enums should be the
highest priority, even above brevity. Additionally, it should describe the
thing that has the properties, in this case the interval itself. These
interval settings are not a property of floats, it's just that interval
boundary types don't usually matter outside of floats.

The enum should be in the Random\ namespace, because we are only
guaranteeing that this is a complete list of interval boundary properties
in the domain of random numbers. It should not be in the
Random\Randomizer namespace, even though that's the only place it would
be used, because these properties of an interval boundary would be shared
by anything that has interval properties in the domain of random numbers.
Enums don't describe a hierarchy, like interfaces do, so it doesn't make
sense to me to do it the same way the engines are done, for example.

I think OpenRight and OpenLeft are clear enough, since it clearly
describes something that is only half open, but I wouldn't argue strongly
against HalfOpenRight and HalfOpenLeft if someone else felt strongly
about it.

Anyway, that's my notes. :)

Jordan

2 years ago by tim@bastelstu.be — view source

unread

Hi

I had a discussion with Tim about this naming topic as well, and want to
convey my suggestion along with the reasoning.

I think the enum should be Random\IntervalBoundary with the enum cases:

Yes, I agree this is a much better name for the enum itself and just
adjusted the implementation and RFC.

IntervalBoundary::Open
IntervalBoundary::Closed
IntervalBoundary::OpenRight or IntervalBoundary::HalfOpenRight
IntervalBoundary::OpenLeft or IntervalBoundary::HalfOpenLeft

But for the enum members I prefer the existing naming of
(Closed|Open)(Closed|Open), unless someone has a good argument for a
different naming scheme:

It's very explicit, which I consider useful for two reasons which are
likely related:

(a) I don't have to think about what "OpenRight" means for the left
boundary.
(b) As not-a-native-speaker-of English I find OpenRight or HalfOpenRight
not completely obvious and would likely need to look up what exactly
OpenRight means, whereas ClosedOpen is understandable more intuitively.
Especially since a half-open interval is sometimes referred to as
half-open and sometimes as half-closed. The preferred variant might even
differ across languages.

It's nicely symmetric from the naming.
It's the naming used by Prof. Goualard in the Drawing Random
Floating-Point Numbers from an Interval paper (though in a shortened CC,
CO, OC, OO variant).

Best regards
Tim Düsterhus

2 years ago by Go Kudo — view source

unread

2022年10月14日(金) 4:38 Joshua Rüsweg via internals internals@lists.php.net:

Hi

Tim Düsterhus and I have created an RFC to add new methods that solve
commonly encountered use cases to \Random\Randomizer. Specifically
creating a random string consisting of specific bytes and generating
random floating point values.

You can find the RFC at:

https://wiki.php.net/rfc/randomizer_additions

Proof of concept implementation is in:

https://github.com/php/php-src/pull/9664

https://github.com/php/php-src/pull/9679

Some open questions to start the discussion:

Are you missing other commonly useful operations that are also useful
to have in core?

Do you agree with the method names? Within the PR we received comments
that "alphabet" might not be an appropriate term.

Shall an option be added to getFloat() that changes the logic to
select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?

We're looking forward to your feedback.

Cheers

Joshua Rüsweg

--

To unsubscribe, visit: https://www.php.net/unsub.php

(I made a mistake in replying to your message, so I will reply again here.)

Hi

I have read the RFC. I think the content is generally good and adequately
compensates for the areas that could not be completed.

I am skeptical only about getFloat(). The use cases are limited and seem
somewhat excessive. Do you have examples of how this is supported in other
languages?

Regards
Go Kudo

2 years ago by Go Kudo — view source

unread

2022年11月6日(日) 0:34 Go Kudo zeriyoshi@gmail.com:

2022年10月14日(金) 4:38 Joshua Rüsweg via internals internals@lists.php.net:

Hi

Tim Düsterhus and I have created an RFC to add new methods that solve
commonly encountered use cases to \Random\Randomizer. Specifically
creating a random string consisting of specific bytes and generating
random floating point values.

You can find the RFC at:

https://wiki.php.net/rfc/randomizer_additions

Proof of concept implementation is in:

https://github.com/php/php-src/pull/9664

https://github.com/php/php-src/pull/9679

Some open questions to start the discussion:

Are you missing other commonly useful operations that are also useful
to have in core?

Do you agree with the method names? Within the PR we received comments
that "alphabet" might not be an appropriate term.

Shall an option be added to getFloat() that changes the logic to
select from [$min, $max] (i.e. allowing the maximum to be returned)? And
how should that look like? Boolean parameter? Enum?

We're looking forward to your feedback.

Cheers

Joshua Rüsweg

--

To unsubscribe, visit: https://www.php.net/unsub.php

(I made a mistake in replying to your message, so I will reply again here.)

Hi

I have read the RFC. I think the content is generally good and adequately
compensates for the areas that could not be completed.

I am skeptical only about getFloat(). The use cases are limited and seem
somewhat excessive. Do you have examples of how this is supported in other
languages?

Regards
Go Kudo

Sorry, I did not read the updated RFC. The latitude and longitude samples
were very clear and even my limited understanding of floating point numbers
helped me understand why they were necessary.

I am so far in favor of this RFC and positive about both features. I also
agree that as a member of the maintainer I will maintain it properly.

Thank you for your very coherent writing and implementation.

Best Regards
Go Kudo

2 years ago by tim@bastelstu.be — view source

unread

Hi

I am skeptical only about getFloat(). The use cases are limited and seem
somewhat excessive. Do you have examples of how this is supported in other
languages?

Yes, unfortunately getFloat() became pretty complex, but that is because
"generating random floats correctly" is pretty complicated, due to how
floats work.

The getFloat() method as proposed implements the γ-section algorithm as
published in: Drawing Random Floating-Point Numbers from an Interval.
Frédéric Goualard, ACM Trans. Model. Comput. Simul., 32:3, 2022.
https://doi.org/10.1145/3503512

This publication is just 7 months old and explains how the
implementation in every other programming language is broken in one way
or another and proposes the γ-section algorithm as a not-broken algorithm.

As floats are not uniformly dense and do not allow representing all
values, it is very easy to introduce a bias or generate incorrect values.

An example taken from the publication:

php > $r = new Random\Randomizer();

We generate a random float in [0, 1) (allowing 0, but not 1), by
dividing a random int between 2^53 - 1 by 2^53. This is effectively what
->nextFloat() does. This creates a uniformly distributed float with as
many different values as possible, because a double (the underlying
representation) has 53 bits of precision.

The nextFloat() method is often the only thing that is available in
other languages, e.g. JavaScript with Math.random() [1]

php > $f = $r->getInt(0, (253 - 1)) / (253);
php > var_dump($f);
float(0.6942225382038698)

Now we want to turn this into a random float between [3.5, 4.5) (not
allowing 4.5), because that's what we need. It's also the formula given
in MDN for JavaScript's Math.random():

php > $min = 3.5;
php > $max = 4.5;
php > var_dump($min + ($max - $min) * $f);
float(4.19422253820387)

The simple formula appears to do the correct thing and it would be
correct if floats could represent all value values. But what happens if
the random integer is 2^53 - 1 (i.e. the maximum integer we allowed to
generate)?

php > $f = (253 - 1) / (253);
php > var_dump($f);
float(0.9999999999999999)
php > var_dump($min + ($max - $min) * $f);
float(4.5)

In this case the result was rounded to 4.5, because the exact result was
not representable. Now an invalid value was generated!

Likewise if you generate a random float between 0 and 1000 with this
method, some values will appear more often than others due to rounding
and the changing density of floats for each power of two.

With the γ-section algorithm by Prof. Goualard all these issues are
eliminated and that's what is used for getFloat(). The getFloat() method
supports all 4 possible boundary combinations to ensure that users have
a safe solution for all possible use cases, so that they don't need to
build an unsafe solution in userland.

Best regards
Tim Düsterhus

[1]
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random

2 years ago by Jordan LeDoux — view source

unread

Likewise if you generate a random float between 0 and 1000 with this
method, some values will appear more often than others due to rounding
and the changing density of floats for each power of two.

With the γ-section algorithm by Prof. Goualard all these issues are
eliminated and that's what is used for getFloat(). The getFloat() method
supports all 4 possible boundary combinations to ensure that users have
a safe solution for all possible use cases, so that they don't need to
build an unsafe solution in userland.

This is a much more subtle problem than many people realize, and I'm sure
there are people who will vote or use this feature that don't have a deep
understanding of either the float/double type, or of the math behind this,
so I'm going to re-explain this using some analogies and examples.

NOTE: This is mainly for historical record if someone looks back on
this internals thread, not in response to any individual in the thread. I'm
trying to lay out in language that any passer-by could follow what the
issue is and why it is important. Accordingly, I'll be referencing base-10
numbers instead of base-2 numbers through most of this for clarity, but the
ideas behind it apply to the base-2 numbers that are actually used, just
with different boundaries at powers of 2 instead of powers of 10.

Suppose you have a number type that can store exactly 10 digits. Well over
the interval [0, 1] you can represent any number between 0.000 000 000 and
1.000 000 000 without issue. In fact, since you can store exactly 10
digits, you can represent any number between 0.000 000 000 and 9.999 999
999.

But what happens if you want the interval [0, 10]? Well, at 10 you can only
represent the digits 10.000 000 00, since you have to use one of the digits
that used to represent the decimal part to represent the whole part,
because you have a fixed amount of digits you can use. In order to
represent larger numbers, you lose the ability to represent certain numbers
that you used to be able to represent. The "density" of numbers has gone
down.

So if you tell the program "I want a random number, with decimal places,
between 0 and 1000", it runs into the following dilemma: between [0, 10),
(this means the interval that includes 0 but does not include 10), you have
1 billion values between each whole number which could be chosen. However,
for the numbers between [10, 100), you only have 100 million values between
each whole number that could be chosen, because you lost a digit after the
decimal point. This means that if you do a naive multiplication or
selection of your number, the intervals between 0 to 10 are individually 10
times more likely to be chosen than any of the individual intervals between
10 to 100, because those intervals have 10 times as many values mapped to
them.

The result would actually be nearly equally likely to be in the range [0,
10) as it would [10, 100), (because there are nearly as many possible
values between [0, 10) as there are between [10, 100)), even though
mathematically the second range is nearly 10 times the size of the first
range. Each possible representable value would be equally likely to be
chosen, but because the density of values is different at different parts
of the range, this actually skews the probability of a number landing
within an arbitrary part of the range of outputs.

This means that if your code does something like if ($rand->getFloat(0, 1000) < 500), you're NOT actually going to get a 50% chance of true and
a 50% chance of false. In fact, using the naive way of mapping floats
that Tim mentioned (with base-10 math instead of base-2), you'd get true
over two thirds of the time.

Accounting for this is not very easy. What proportion of your total result
set has an expanded representable value? In my example, that might be easy
to calculate, but the actual math for a program like this is done in
base-2, not base-10. To actually calculate this, you need all sorts of logs
you need to take to figure out how many powers of 2 you're spanning, and
how that affects the probability of different intervals. It is extremely
easy to get this math wrong, even if you know what you're doing. As Tim
said, virtually no implementations that exist actually do this correctly.

Also consider that the requested range may not nicely straddle a power of
2. What if you want a float for the interval [5, 10]? Neither 5 nor 10 sit
cleanly on a power of 2, so how does that affect your math?

Like cryptography and security, this is an area where very specific and
very confident expertise is necessary, and it is highly preferable to have
everyone use a single vetted solution than roll their own. This makes it
very poor as something to leave to userland, and makes it highly
desirable to include it in core.

It's very difficult to do, but there is also only one actually correct
result, which makes it perfect for inclusion in core.

Jordan

2 years ago by Joshua Rüsweg via internals — view source

unread

Hi

You can find the RFC at:

https://wiki.php.net/rfc/randomizer_additions

Proof of concept implementation is in:

https://github.com/php/php-src/pull/9664

https://github.com/php/php-src/pull/9679

we believe we resolved all open questions with the RFC and there was no
recent feedback with regard to the newest naming. Therefore we plan to
open voting on Wednesday, November, 9th, unless someone speaks up now
:). There will be two votes, one for the random string generation, one
for the floating point generators. Voting will run for two weeks, a 2/3
majority will be required (as with all votes). You can find the details
within the RFC in the "Proposed Voting Choice" section:
https://wiki.php.net/rfc/randomizer_additions#proposed_voting_choices

The PoC implementations have been updated with the latest RFC changes,
they will be cleaned up after the vote concludes.

Cheers

Joshua Rüsweg