3 years ago by Go Kudo — view source

unread

It's been a while Internals. I'm Go Kudo.

First of all, I would like to apologize for leaving my previous RFC, object
scoped RNG, and the preliminary RFC, split random extension, without any
progress.

The implementation of these RFCs was not sophisticated and failed to be
tested for a long time. In particular, we could not find the time to work
on them on Windows, and they were abandoned.

You may have noticed that my email address has changed. This is because I
have been assigned to work on this RFC and implementation as part of my
company. This has given me the equipment and time to debug in a variety of
environments.

Some people may have a negative impression about this change, but don't
worry. The RFC is still in the form of a feature proposal, and all rights
are reserved by the PHP Group.

Before we get into the RFC, let me explain again how this RFC came to be
proposed.

First, I started to implement an object-level scoped RNG in PHP to solve
the current issue. This is still available in PECL today.

https://pecl.php.net/package/orng
https://github.com/zeriyoshi/php-ext-orng

Later, I found in the PHP Internals ML logs that an implementation of RNG
with object-level scoping had been considered before, and was well received
at the time. However, it seems that the actual implementation was not done
at that time.

https://externals.io/message/98021#98130

After that, I started the first RFC and Vote. The result was a Decline, but
that was mainly due in part to the strange APIs.

https://wiki.php.net/rfc/object_scope_prng

The refreshed RFC and implementation are available at the following URL:

https://wiki.php.net/rfc/rng_extension
https://github.com/php/php-src/pull/8094

If there are no specific comments, I would like to start voting as soon as
the two-week pre-announcement phase is over.

Regards.
Go Kudo

3 years ago by timwolla@bastelstu.be — view source

unread

The refreshed RFC and implementation are available at the following URL:

https://wiki.php.net/rfc/rng_extension
https://github.com/php/php-src/pull/8094

If there are no specific comments, I would like to start voting as soon as
the two-week pre-announcement phase is over.

XorShift128+ has a 128 Bit internal state, but takes an integer seed
within the constructor. Thus only 64 Bits of seed can be provided.

Maybe the seed parameter should be a 16-byte string instead?
Initializing the generator with a completely random seed would then be:

new XorShift128Plus(\random_bytes(16));

instead of the much more complicated:

new XorShift128Plus(\random_int(\PHP_INT_MIN, \PHP_INT_MAX));

Perhaps the following API would be even clearer:

XorShift128Plus::fromSeed(\random_bytes(16));
XorShift128Plus::fromGenerator(new Secure()); // Takes 16 bytes from the
given generator.

I would adjust the 'Randomizer' to use the 'Secure' generator as a
safe default. If absolute performance or a reproducible sequence is
required then one can use a custom generator, but the default will be
the secure CSPRNG, making it harder to misuse.
The RFC is inconsistent in the example code. Is it 'stringShuffle' or
'shuffleString'?
The RFC should document the 'NumberGenerator' interface. Specifically
I'm interested in the return type of the 'generate' method. Does it
return bytes or integers? Is it legal to implement the interface in
userland code?

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月14日(月) 20:40 Tim Düsterhus timwolla@bastelstu.be:

Hi

The refreshed RFC and implementation are available at the following URL:

https://wiki.php.net/rfc/rng_extension
https://github.com/php/php-src/pull/8094

If there are no specific comments, I would like to start voting as soon
as
the two-week pre-announcement phase is over.

XorShift128+ has a 128 Bit internal state, but takes an integer seed
within the constructor. Thus only 64 Bits of seed can be provided.

Maybe the seed parameter should be a 16-byte string instead?
Initializing the generator with a completely random seed would then be:

new XorShift128Plus(\random_bytes(16));

instead of the much more complicated:

new XorShift128Plus(\random_int(\PHP_INT_MIN, \PHP_INT_MAX));

Perhaps the following API would be even clearer:

XorShift128Plus::fromSeed(\random_bytes(16));
XorShift128Plus::fromGenerator(new Secure()); // Takes 16 bytes from the
given generator.

I would adjust the 'Randomizer' to use the 'Secure' generator as a
safe default. If absolute performance or a reproducible sequence is
required then one can use a custom generator, but the default will be
the secure CSPRNG, making it harder to misuse.

The RFC is inconsistent in the example code. Is it 'stringShuffle' or
'shuffleString'?

The RFC should document the 'NumberGenerator' interface. Specifically
I'm interested in the return type of the 'generate' method. Does it
return bytes or integers? Is it legal to implement the interface in
userland code?

Best regards
Tim Düsterhus

XorShift128+ has a 128 Bit internal state, but takes an integer seed
within the constructor. Thus only 64 Bits of seed can be provided.

This is for convenience. Other software that uses XorShift128+, such as
Chromium (V8), also uses a 64-bit value for the initial seed value.
I think that 128-bit value seeding with strings is unintuitive and not very
good for performance.

https://chromium.googlesource.com/v8/v8/+/refs/heads/main/src/base/utils/random-number-generator.h

I would adjust the 'Randomizer' to use the 'Secure' generator as a
safe default. If absolute performance or a reproducible sequence is
required then one can use a custom generator, but the default will be the
secure CSPRNG, making it harder to misuse.

Certainly, this may be appropriate. But, in this case, the Randomizer
generated with the default parameters will not be serializable. Is that
acceptable?

Personally, I think this is a good change.

The RFC is inconsistent in the example code. Is it 'stringShuffle' or
'shuffleString'?

RFC Fixed :)

The RFC should document the 'NumberGenerator' interface. Specifically
I'm interested in the return type of the 'generate' method. Does it return
bytes or integers? Is it legal to implement the interface in userland code?

It returns an int. Also, as pointed out in GH, the generated value is
implicitly treated as the size equivalent of PHP_INT_SIZE (zend_long) on
the environment. This means that it is not possible to implement the
userland Mersenne twister (32-bit) in a 64-bit environment.

https://github.com/php/php-src/pull/8094#pullrequestreview-881660425

3 years ago by timwolla@bastelstu.be — view source

unread

XorShift128+ has a 128 Bit internal state, but takes an integer seed
within the constructor. Thus only 64 Bits of seed can be provided.

This is for convenience. Other software that uses XorShift128+, such as
Chromium (V8), also uses a 64-bit value for the initial seed value.
I think that 128-bit value seeding with strings is unintuitive and not very
good for performance.

https://chromium.googlesource.com/v8/v8/+/refs/heads/main/src/base/utils/random-number-generator.h

I don't think performance for seeding the RNG matters. That's an
operation you only perform once (or a small number of times). The time
for the actual generation of random numbers most certainly dwarfs the
time used for seeding.

Regarding "unintuitive": I disagree. I find it unintuitive that there
are some RNG sequences that I can't access when providing a seed.

I wouldn't object a convenience constructor that also accepts an
integer, but I believe the default should be a seed that is appropriate
to generate the full state space.

I would adjust the 'Randomizer' to use the 'Secure' generator as a
safe default. If absolute performance or a reproducible sequence is
required then one can use a custom generator, but the default will be the
secure CSPRNG, making it harder to misuse.

Certainly, this may be appropriate. But, in this case, the Randomizer
generated with the default parameters will not be serializable. Is that
acceptable?

I consider that acceptable. If I want to serialize the randomizer then I
want a reproducible sequence and in that case I should also be forced to
explicitly decide what type of reproducible sequence I want. If I don't
explicitly decide, then newly generated randomizers might use an
entirely different generator if the default changes.,

The RFC should document the 'NumberGenerator' interface. Specifically
I'm interested in the return type of the 'generate' method. Does it return
bytes or integers? Is it legal to implement the interface in userland code?

It returns an int. Also, as pointed out in GH, the generated value is
implicitly treated as the size equivalent of PHP_INT_SIZE (zend_long) on
the environment. This means that it is not possible to implement the
userland Mersenne twister (32-bit) in a 64-bit environment.

I think the returned value of the generator should be a string
containing raw bytes.

It's very easy to interpret a bytestring as an appropriate integer (e.g.
using unpack()), but if some consumer needs a bytestring then turning
the integer back into a bytestring without accidentally introducing
biases is much harder, because of platform differences and the lack of
unsigned integers in userland.

Unfortunately your PR doesn't compile for me, so I can't test:

make: *** No rule to make target 'php-src/ext/standard/lcg.c', needed by
'ext/standard/lcg.lo'. Stop.

Best regards
Tim Düsterhus

3 years ago by tim@bastelstu.be — view source

unread

Unfortunately your PR doesn't compile for me, so I can't test:

make: *** No rule to make target 'php-src/ext/standard/lcg.c', needed by
'ext/standard/lcg.lo'. Stop.

I've managed to compile it by cleaning the whole directory and rerunning
of the build steps. Not sure what I missed the first time.

I've now been able to play around with it and have some additional
discussion points:

Consider the following script:

<?php

use Random\NumberGenerator\XorShift128Plus;
use Random\Randomizer;

$g1 = new XorShift128Plus(2);
$g2 = clone $g1;

$r1 = new Randomizer($g1);
$r2 = new Randomizer($g2);

var_dump(\bin2hex($r1->getBytes(8)));
var_dump(\bin2hex($r2->getBytes(4)) . \bin2hex($r2->getBytes(4)));

As a user: Would you expect those two 'var_dump' calls to result in the
same output?

Personally I would. For me that implies:

generate() should return raw bytes instead of a number (as I
suggested before).
The 'Randomizer' object should buffer unused bytes internally and
only call generate() if the internal buffer is drained.

Why xorshift instead of xoshiro / xoroshiro?

https://vigna.di.unimi.it/xorshift/ says that:

Information about my previous xorshift-based generators can be found here, but they have been entirely superseded by the new ones, which are faster and better.

That would imply to me that xorshift should not be used in new developments.

Consider the following script:

<?php

use Random\NumberGenerator\XorShift128Plus;

$g1 = new XorShift128Plus(2);

var_dump($g1);

exit;

Should the user be able to see the internal state of the Generator in
the var_dump() output?

Both xorshift as well as xoshiro / xoroshiro's reference
implementations include a 'jump()' function that allows one to easily
retrieve generators with distinct sequences, without needing to generate
seeds manually which might or might nor introduce a bias.

Is this something that we should provide as well?

As a follow-up to (4): Should the 'generate()' method be called
'next()' or 'step()' instead? Perhaps it should even be '__invoke()'?

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月15日(火) 1:46 Tim Düsterhus tim@bastelstu.be:

Hi

Unfortunately your PR doesn't compile for me, so I can't test:

make: *** No rule to make target 'php-src/ext/standard/lcg.c', needed by
'ext/standard/lcg.lo'. Stop.

I've managed to compile it by cleaning the whole directory and rerunning
of the build steps. Not sure what I missed the first time.

I've now been able to play around with it and have some additional
discussion points:

Consider the following script:

<?php

use Random\NumberGenerator\XorShift128Plus;
use Random\Randomizer;

$g1 = new XorShift128Plus(2);
$g2 = clone $g1;

$r1 = new Randomizer($g1);
$r2 = new Randomizer($g2);

var_dump(\bin2hex($r1->getBytes(8)));
var_dump(\bin2hex($r2->getBytes(4)) . \bin2hex($r2->getBytes(4)));

As a user: Would you expect those two 'var_dump' calls to result in the
same output?

Personally I would. For me that implies:

generate() should return raw bytes instead of a number (as I
suggested before).

The 'Randomizer' object should buffer unused bytes internally and
only call generate() if the internal buffer is drained.

Why xorshift instead of xoshiro / xoroshiro?

https://vigna.di.unimi.it/xorshift/ says that:

Information about my previous xorshift-based generators can be found
here, but they have been entirely superseded by the new ones, which are
faster and better.

That would imply to me that xorshift should not be used in new
developments.

Consider the following script:

<?php

use Random\NumberGenerator\XorShift128Plus;

$g1 = new XorShift128Plus(2);

var_dump($g1);

exit;

Should the user be able to see the internal state of the Generator in
the var_dump() output?

Both xorshift as well as xoshiro / xoroshiro's reference
implementations include a 'jump()' function that allows one to easily
retrieve generators with distinct sequences, without needing to generate
seeds manually which might or might nor introduce a bias.

Is this something that we should provide as well?

As a follow-up to (4): Should the 'generate()' method be called
'next()' or 'step()' instead? Perhaps it should even be '__invoke()'?

Best regards
Tim Düsterhus

If there are no objections, I will change the NumberGenerator that
Randomizer uses by default to Secure.

Regarding "unintuitive": I disagree. I find it unintuitive that there are
some RNG sequences that I can't access when providing a seed.

This is also the case for RNG implementations in many other languages. For
example, Java also uses long (64-bit) as the seed value of the argument for
Math.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Random.html#%3Cinit%3E(long)

As mentioned above, V8 also uses a 64-bit value as the seed value, and
generating with XorShift128+.

https://github.com/v8/v8/blob/main/src/base/utils/random-number-generator.h

On the other hand, some languages have access to the complete internal
state. Python, for example, accepts bytes or bytearrays.

https://docs.python.org/3/library/random.html#random.seed

However, making strings available in PHP may lead to incorrect usage.

I think we can safely do this by making the seed argument accept both int
and string, and only using it as the internal state if string is specified
and it's 128-bits long.

I've managed to compile it by cleaning the whole directory and rerunning
of the build steps. Not sure what I missed the first time.

This is probably due to a major change in config.m4. ./buidconf and
./configure need to be rerun properly.

Would you expect those two 'var_dump' calls to result in the same
output?

Added __debugInfo() magic method supports.

https://github.com/php/php-src/pull/8094/commits/78efd2bd1e0ac5db48c272b364a615a5611e8caa

generate() should return raw bytes instead of a number (as I suggested
before).

I don't think this is a very good idea.

The RNG is a random number generator and should probably not be generating
strings.
Of course, I am aware that strings represent binary sequences in PHP.
However, this is not user-friendly.

The generation of a binary string is a barrier when trying to implement
some kind of operation using numeric computation.

If you want to deal with the problem of generated size, it would be more
appropriate to define a method such as getGenerateSize() in the interface.
Even in this case, generation widths greater than PHP_INT_SIZE cannot be
supported, but generation widths greater than 64-bit are not very useful in
the first place.

The 'Randomizer' object should buffer unused bytes internally and only
call generate() if the internal buffer is drained.

Likewise, I think this is not a good idea. Buffering reintroduces the
problem of complex state management, which has been made so easy. The user
will always have to worry about the buffering size of the Randomizer.

Why xorshift instead of xoshiro / xoroshiro?

The XorShift128Plus algorithm is still in use in major browsers and is dead
in a good way.

Also, in our local testing, SplitMix64 + XorShift128Plus performed well in
terms of performance and random number quality, so I don't think it is
necessary to choose a different algorithm.

If this RFC passes, it will be easier to add algorithms in the future. If a
new algorithm is needed, it can be implemented immediately.

Regards,
Go Kudo

3 years ago by tim@bastelstu.be — view source

unread

Regarding "unintuitive": I disagree. I find it unintuitive that there are
some RNG sequences that I can't access when providing a seed.

This is also the case for RNG implementations in many other languages. For
example, Java also uses long (64-bit) as the seed value of the argument for
Math.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Random.html#%3Cinit%3E(long)

java.util.Random is a LCG with only 48 Bits of state. A single 64-bit
signed long is sufficient to represent the state.

On the other hand, some languages have access to the complete internal
state. Python, for example, accepts bytes or bytearrays.

https://docs.python.org/3/library/random.html#random.seed

However, making strings available in PHP may lead to incorrect usage.

I think we can safely do this by making the seed argument accept both int
and string, and only using it as the internal state if string is specified
and it's 128-bits long.

That's a solution that would work for me.

Would you expect those two 'var_dump' calls to result in the same
output?

Added __debugInfo() magic method supports.

https://github.com/php/php-src/pull/8094/commits/78efd2bd1e0ac5db48c272b364a615a5611e8caa

Don't forget to update the RFC accordingly. It would probably be helpful
if you would put the full class stubs into the RFC. I find that easier
to understand than a list of methods.

generate() should return raw bytes instead of a number (as I suggested
before).

I don't think this is a very good idea.

The RNG is a random number generator and should probably not be generating
strings.

I'd say that the 'number' part in RNG is not technically accurate. All
RNGs are effectively generators for a random sequence of bits. The
number part is just an interpretation of those random sequence of bits
(e.g. 64 of them).

Of course, I am aware that strings represent binary sequences in PHP.
However, this is not user-friendly.

The generation of a binary string is a barrier when trying to implement
some kind of operation using numeric computation.

I believe the average user of the RNG API would use the Randomizer
class, instead of the raw generators, thus they would not come in
contact with the raw bytes coming from the generator.

However by getting PHP integers out of the generator it is much harder
for me to process the raw bits and bytes, if that's something I need for
my use case.

As an example if I want to implement the following in userland. Then
with getting raw bytes:

For Randomizer::getBytes() I can just concatenate the raw bytes.
For a random uint16BE I can grab 2 bytes and call unpack('n', $bytes)

If I get random 64 Bit integers then:

For Randomizer::getBytes() I need to use pack and I'm not even sure,
whether I need to use 'q', 'Q', 'J', 'P' to receive an unbiased result.
For uint16BE I can use "& 0xFFFF", but would waste 48 Bits, unless I
also perform bit shifting to access the other bytes. But then there's
also the same signedness issue.

Interpreting numbers as bytes and vice versa in C / C++ is very easy.
However in PHP userland I believe the bytes -> numbers direction is
easy-ish. The numbers -> bytes direction is full of edge cases.

If you want to deal with the problem of generated size, it would be more
appropriate to define a method such as getGenerateSize() in the interface.
Even in this case, generation widths greater than PHP_INT_SIZE cannot be
supported, but generation widths greater than 64-bit are not very useful in
the first place.

The 'Randomizer' object should buffer unused bytes internally and only
call generate() if the internal buffer is drained.

Likewise, I think this is not a good idea. Buffering reintroduces the
problem of complex state management, which has been made so easy. The user
will always have to worry about the buffering size of the Randomizer.

Unfortunately you did not answer the primary question. The ones you
answered were just follow-up conclusions from the answer I would give:

  var_dump(\bin2hex($r1->getBytes(8)));
  var_dump(\bin2hex($r2->getBytes(4)) . \bin2hex($r2->getBytes(4)));

As a user: Would you expect those two 'var_dump' calls to result in the
same output?

Why xorshift instead of xoshiro / xoroshiro?

The XorShift128Plus algorithm is still in use in major browsers and is dead
in a good way.

I believe that that the underlying RNG in web browsers is considered an
implementation detail, no?

For PHP this would be part of the API surface and would need to be
maintained indefinitely. Certainly it would make sense to use the latest
and greatest RNG, instead of something that is outdated when its first
shipped, no?

Also, in our local testing, SplitMix64 + XorShift128Plus performed well in
terms of performance and random number quality, so I don't think it is
necessary to choose a different algorithm.

If this RFC passes, it will be easier to add algorithms in the future. If a
new algorithm is needed, it can be implemented immediately.

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月15日(火) 19:03 Tim Düsterhus tim@bastelstu.be:

Hi

Regarding "unintuitive": I disagree. I find it unintuitive that there
are
some RNG sequences that I can't access when providing a seed.

This is also the case for RNG implementations in many other languages.
For
example, Java also uses long (64-bit) as the seed value of the argument
for
Math.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Random.html#%3Cinit%3E(long)

java.util.Random is a LCG with only 48 Bits of state. A single 64-bit
signed long is sufficient to represent the state.

On the other hand, some languages have access to the complete internal
state. Python, for example, accepts bytes or bytearrays.

https://docs.python.org/3/library/random.html#random.seed

However, making strings available in PHP may lead to incorrect usage.

I think we can safely do this by making the seed argument accept both int
and string, and only using it as the internal state if string is
specified
and it's 128-bits long.

That's a solution that would work for me.

Would you expect those two 'var_dump' calls to result in the same
output?

Added __debugInfo() magic method supports.

https://github.com/php/php-src/pull/8094/commits/78efd2bd1e0ac5db48c272b364a615a5611e8caa

Don't forget to update the RFC accordingly. It would probably be helpful
if you would put the full class stubs into the RFC. I find that easier
to understand than a list of methods.

generate() should return raw bytes instead of a number (as I suggested
before).

I don't think this is a very good idea.

The RNG is a random number generator and should probably not be
generating
strings.

I'd say that the 'number' part in RNG is not technically accurate. All
RNGs are effectively generators for a random sequence of bits. The
number part is just an interpretation of those random sequence of bits
(e.g. 64 of them).

Of course, I am aware that strings represent binary sequences in PHP.
However, this is not user-friendly.

The generation of a binary string is a barrier when trying to implement
some kind of operation using numeric computation.

I believe the average user of the RNG API would use the Randomizer
class, instead of the raw generators, thus they would not come in
contact with the raw bytes coming from the generator.

However by getting PHP integers out of the generator it is much harder
for me to process the raw bits and bytes, if that's something I need for
my use case.

As an example if I want to implement the following in userland. Then
with getting raw bytes:

For Randomizer::getBytes() I can just concatenate the raw bytes.

For a random uint16BE I can grab 2 bytes and call unpack('n', $bytes)

If I get random 64 Bit integers then:

For Randomizer::getBytes() I need to use pack and I'm not even sure,
whether I need to use 'q', 'Q', 'J', 'P' to receive an unbiased result.

For uint16BE I can use "& 0xFFFF", but would waste 48 Bits, unless I
also perform bit shifting to access the other bytes. But then there's
also the same signedness issue.

Interpreting numbers as bytes and vice versa in C / C++ is very easy.
However in PHP userland I believe the bytes -> numbers direction is
easy-ish. The numbers -> bytes direction is full of edge cases.

If you want to deal with the problem of generated size, it would be more
appropriate to define a method such as getGenerateSize() in the
interface.
Even in this case, generation widths greater than PHP_INT_SIZE cannot be
supported, but generation widths greater than 64-bit are not very useful
in
the first place.

The 'Randomizer' object should buffer unused bytes internally and only
call generate() if the internal buffer is drained.

Likewise, I think this is not a good idea. Buffering reintroduces the
problem of complex state management, which has been made so easy. The
user
will always have to worry about the buffering size of the Randomizer.

Unfortunately you did not answer the primary question. The ones you
answered were just follow-up conclusions from the answer I would give:
  var_dump(\bin2hex($r1->getBytes(8)));
  var_dump(\bin2hex($r2->getBytes(4)) . \bin2hex($r2->getBytes(4)));
As a user: Would you expect those two 'var_dump' calls to result in the
same output?

Why xorshift instead of xoshiro / xoroshiro?

The XorShift128Plus algorithm is still in use in major browsers and is
dead
in a good way.

I believe that that the underlying RNG in web browsers is considered an
implementation detail, no?

For PHP this would be part of the API surface and would need to be
maintained indefinitely. Certainly it would make sense to use the latest
and greatest RNG, instead of something that is outdated when its first
shipped, no?

Also, in our local testing, SplitMix64 + XorShift128Plus performed well
in
terms of performance and random number quality, so I don't think it is
necessary to choose a different algorithm.

If this RFC passes, it will be easier to add algorithms in the future.
If a
new algorithm is needed, it can be implemented immediately.

Best regards
Tim Düsterhus

java.util.Random is a LCG with only 48 Bits of state. A single 64-bit
signed long is sufficient to represent the state.

Sorry about that. Java was not affected by this problem.

At first, I updated the RFC to the latest status.

https://wiki.php.net/rfc/rng_extension

I need some time to think about the current issue. I understand its
usefulness, but I feel uncomfortable with the fact that the NumberGenerator
generates a string.

I also wonder about the point of changing RNG to XorShift128Plus. There are
a number of derived implementations, which RNG do you think is more
suitable?

Regards,
Go Kudo

3 years ago by tim@bastelstu.be — view source

unread

At first, I updated the RFC to the latest status.

https://wiki.php.net/rfc/rng_extension

Thank you, the examples are useful.

I need some time to think about the current issue. I understand its
usefulness, but I feel uncomfortable with the fact that the NumberGenerator
generates a string.

Would you feel more comfortable if the interface would be called
\Random\Engine or \Random\Generator (i.e. leaving out the "Number" from
the interface name)?

Engine is the term used by C++:
https://en.cppreference.com/w/cpp/numeric/random/mersenne_twister_engine
Generator is the term used by Java:
https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/random/RandomGenerator.html

With the 'FixedForUnitTest' example you mentioned in the RFC: While for
that specific implementation it appears pretty obvious that increasing
numbers are generated, in practice:

This will result in inconsistent behavior based on the architecture.
I can't test it due to the lack of the necessary architectures, but I
believe the following to be accurate:

$g = new FixedForUnitTest();

$r = new Randomizer($g);

// 0100000000000000 in 64 Bit little endian
// 0100000002000000 in 32 Bit little endian
// 0000000000000001 in 64 Bit big endian
var_dump(bin2hex($r->getBytes(8)));
This analogy completely breaks down for the 'shuffle' functions which
call the generator internally an unspecified number of times:

$g = new FixedForUnitTest();

$r = new Randomizer($g);

var_dump($r->shuffleString("abcdefghijklmnopqrstuvwxyz")); //
wosqyrupatvxznmlkjihgfedcb
var_dump($r->shuffleString("abcdefghijklmnopqrstuvwxyz")); //
fwrtjndlsyvpzuhxbqomkigeca

The resulting strings look somewhat ordered, but there is no clear
pattern, despite the underlying generator being completely predictable!

I also wonder about the point of changing RNG to XorShift128Plus. There are
a number of derived implementations, which RNG do you think is more
suitable?

I'm not an expert in RNGs, but based off this page:
https://prng.di.unimi.it/ and the linked papers it appears that
xoshiro256** is the overall best choice if memory usage is not a concern.

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月15日(火) 22:09 Tim Düsterhus tim@bastelstu.be:

Hi

At first, I updated the RFC to the latest status.

https://wiki.php.net/rfc/rng_extension

Thank you, the examples are useful.

I need some time to think about the current issue. I understand its
usefulness, but I feel uncomfortable with the fact that the
NumberGenerator
generates a string.

Would you feel more comfortable if the interface would be called
\Random\Engine or \Random\Generator (i.e. leaving out the "Number" from
the interface name)?

Engine is the term used by C++:
https://en.cppreference.com/w/cpp/numeric/random/mersenne_twister_engine
Generator is the term used by Java:

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/random/RandomGenerator.html

With the 'FixedForUnitTest' example you mentioned in the RFC: While for
that specific implementation it appears pretty obvious that increasing
numbers are generated, in practice:

This will result in inconsistent behavior based on the architecture.
I can't test it due to the lack of the necessary architectures, but I
believe the following to be accurate:

$g = new FixedForUnitTest();

$r = new Randomizer($g);

// 0100000000000000 in 64 Bit little endian
// 0100000002000000 in 32 Bit little endian
// 0000000000000001 in 64 Bit big endian
var_dump(bin2hex($r->getBytes(8)));

This analogy completely breaks down for the 'shuffle' functions which
call the generator internally an unspecified number of times:

$g = new FixedForUnitTest();

$r = new Randomizer($g);

var_dump($r->shuffleString("abcdefghijklmnopqrstuvwxyz")); //
wosqyrupatvxznmlkjihgfedcb
var_dump($r->shuffleString("abcdefghijklmnopqrstuvwxyz")); //
fwrtjndlsyvpzuhxbqomkigeca

The resulting strings look somewhat ordered, but there is no clear
pattern, despite the underlying generator being completely predictable!

I also wonder about the point of changing RNG to XorShift128Plus. There
are
a number of derived implementations, which RNG do you think is more
suitable?

I'm not an expert in RNGs, but based off this page:
https://prng.di.unimi.it/ and the linked papers it appears that
xoshiro256** is the overall best choice if memory usage is not a concern.

Best regards
Tim Düsterhus

Hi Tim.

Thanks for the good idea. I changed the NumberGenerator to Engine and
changed generate() to return a string as suggested.

The main changes since last time are as follows:

The userland implementation can now specify the width of the random
number sequence that can be generated
Random\Engine::nextByteSize() has been added
Random\Engine::generate() now returns a string
Random\Randomizer::getInt() now accepts an empty argument (like mt_rand())

At the same time, I have updated the RFC.

https://wiki.php.net/rfc/rng_extension

I have not yet come to a final conclusion on whether XorShift128Plus should
be switched to another RNG. For example, what about implementing
XorShift128Plus, but adding Xoshiro256** as well?

3 years ago by tim@bastelstu.be — view source

unread

Thanks for the good idea. I changed the NumberGenerator to Engine and
changed generate() to return a string as suggested.

Thanks, I've already seen the updated PR and played around with it. This
feels much better now.

As a test I implemented xoshiro128++ in pure userland (it being a 32 Bit
RNG avoids the signedness issues in PHP userland) and compared it
against the reference implementation in C.

Find my implementations attached. Both versions give the same results
(little endian encoding):

 $ gcc xoshiro128pp.c ; ./a.out
 fa3c872c
 $ sapi/cli/php test_rng.php
 string(8) "fa3c872c"

The main changes since last time are as follows:

The userland implementation can now specify the width of the random
number sequence that can be generated

Random\Engine::nextByteSize() has been added

Is the nextByteSize() method actually required? PHP strings already know
their own length.

Random\Engine::generate() now returns a string

I've looked into your C implementation and it appears it still is
affected by endianness issues. You can't simply memcpy the raw uintXX_t
bytes into the char array. I believe the following should do the trick
to for a consistent little endian encoding:

bytes[0] = (generated >> 0) & 0xff;
bytes[1] = (generated >> 8) & 0xff;
bytes[2] = (generated >> 16) & 0xff;
bytes[3] = (generated >> 24) & 0xff;

The choice of endianness is arbitrary, but it needs to be consistent for
every platform.

Likewise when converting back to a number from bytes:

number = (bytes[0] << 0) | (bytes[1] << 8) (bytes[2] << 16) | (bytes[3]
<< 24);

I believe the same issue exists when initializing the XorShift with a
string.

I have not yet come to a final conclusion on whether XorShift128Plus should
be switched to another RNG. For example, what about implementing
XorShift128Plus, but adding Xoshiro256** as well?

That would be fine for me as well. But it might make it harder for the
end user to choose an appropriate RNG.

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月16日(水) 20:24 Tim Düsterhus tim@bastelstu.be:

Hi

Thanks for the good idea. I changed the NumberGenerator to Engine and
changed generate() to return a string as suggested.

Thanks, I've already seen the updated PR and played around with it. This
feels much better now.

As a test I implemented xoshiro128++ in pure userland (it being a 32 Bit
RNG avoids the signedness issues in PHP userland) and compared it
against the reference implementation in C.

Find my implementations attached. Both versions give the same results
(little endian encoding):
 $ gcc xoshiro128pp.c ; ./a.out
 fa3c872c
 $ sapi/cli/php test_rng.php
 string(8) "fa3c872c"
The main changes since last time are as follows:

The userland implementation can now specify the width of the random
number sequence that can be generated

Random\Engine::nextByteSize() has been added

Is the nextByteSize() method actually required? PHP strings already know
their own length.

Random\Engine::generate() now returns a string

I've looked into your C implementation and it appears it still is
affected by endianness issues. You can't simply memcpy the raw uintXX_t
bytes into the char array. I believe the following should do the trick
to for a consistent little endian encoding:

bytes[0] = (generated >> 0) & 0xff;
bytes[1] = (generated >> 8) & 0xff;
bytes[2] = (generated >> 16) & 0xff;
bytes[3] = (generated >> 24) & 0xff;

The choice of endianness is arbitrary, but it needs to be consistent for
every platform.

Likewise when converting back to a number from bytes:

number = (bytes[0] << 0) | (bytes[1] << 8) (bytes[2] << 16) | (bytes[3]
<< 24);

I believe the same issue exists when initializing the XorShift with a
string.

I have not yet come to a final conclusion on whether XorShift128Plus
should
be switched to another RNG. For example, what about implementing
XorShift128Plus, but adding Xoshiro256** as well?

That would be fine for me as well. But it might make it harder for the
end user to choose an appropriate RNG.

Best regards
Tim Düsterhus

Thanks Tim

I forgot about the endian problem. Will try to fix it. Thank you very much.

Is the nextByteSize() method actually required? PHP strings already know
their own length.

This is a convenience of the current implementation.

The native implementation does not use strings for performance reasons.
This is because there is no way to know the valid width information they
generate.

I'm trying to think of some good ideas.

That would be fine for me as well.

OK, I'm going to add Xoshiro128++ and Xoshiro256**.

In order to make PHP easier to maintain, I think it is best not to add more
implementations at random, so I will not add any more.

3 years ago by tim@bastelstu.be — view source

unread

Is the nextByteSize() method actually required? PHP strings already know
their own length.

This is a convenience of the current implementation.

You already said that you will think of some good ideas, but I'd like to
be clear that the convenience of the internal implementation should not
be something that affects the user-facing implementation.

In fact with the current implementation I can trivially create a
memory-unsafety bug:

 &lt;?php

 use Random\Engine;
 use Random\Randomizer;

 final class Bug implements Engine {
     public function generate(): string
     {
         return '';
     }

     public function nextByteSize(): int {
         return 7;
     }
 }

 $e = new Bug();
 $g = new Randomizer($e);

 var_dump(\bin2hex($g->getBytes(8)));

Results in:

==116755== Use of uninitialised value of size 8
==116755==    at 0x6180C8: php_bin2hex (string.c:111)
==116755==    by 0x6185D2: zif_bin2hex (string.c:220)
==116755==    by 0x79BDB4: ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER (zend_vm_execute.h:1312)
==116755==    by 0x8194F0: execute_ex (zend_vm_execute.h:55503)
==116755==    by 0x81ED86: zend_execute (zend_vm_execute.h:59858)
==116755==    by 0x75A923: zend_execute_scripts (zend.c:1744)
==116755==    by 0x69C8C4: php_execute_script (main.c:2535)
==116755==    by 0x8E0B19: do_cli (php_cli.c:965)
==116755==    by 0x8E1DF9: main (php_cli.c:1367)
==116755== 
==116755== Use of uninitialised value of size 8
==116755==    at 0x618100: php_bin2hex (string.c:112)
==116755==    by 0x6185D2: zif_bin2hex (string.c:220)
==116755==    by 0x79BDB4: ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER (zend_vm_execute.h:1312)
==116755==    by 0x8194F0: execute_ex (zend_vm_execute.h:55503)
==116755==    by 0x81ED86: zend_execute (zend_vm_execute.h:59858)
==116755==    by 0x75A923: zend_execute_scripts (zend.c:1744)
==116755==    by 0x69C8C4: php_execute_script (main.c:2535)
==116755==    by 0x8E0B19: do_cli (php_cli.c:965)
==116755==    by 0x8E1DF9: main (php_cli.c:1367)
==116755== 
string(16) "0000000000000000"

For userland implementations you really must derive the size from the
returned bytestring. Otherwise it's easy for a developer to create an
implementation where nextByteSize() and generate() disagree. Even if the
memory-unsafety is fixed, this will result in frustration for the user.

For native implementations you can keep some explicit width in the code,
if that helps with performance.

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月16日(水) 21:25 Tim Düsterhus tim@bastelstu.be:

Hi

Is the nextByteSize() method actually required? PHP strings already
know
their own length.

This is a convenience of the current implementation.

You already said that you will think of some good ideas, but I'd like to
be clear that the convenience of the internal implementation should not
be something that affects the user-facing implementation.

In fact with the current implementation I can trivially create a
memory-unsafety bug:
 &lt;?php

 use Random\Engine;
 use Random\Randomizer;

 final class Bug implements Engine {
     public function generate(): string
     {
         return '';
     }

     public function nextByteSize(): int {
         return 7;
     }
 }

 $e = new Bug();
 $g = new Randomizer($e);

 var_dump(\bin2hex($g->getBytes(8)));
Results in:
==116755== Use of uninitialised value of size 8
==116755==    at 0x6180C8: php_bin2hex (string.c:111)
==116755==    by 0x6185D2: zif_bin2hex (string.c:220)
==116755==    by 0x79BDB4: ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER
(zend_vm_execute.h:1312)
==116755==    by 0x8194F0: execute_ex (zend_vm_execute.h:55503)
==116755==    by 0x81ED86: zend_execute (zend_vm_execute.h:59858)
==116755==    by 0x75A923: zend_execute_scripts (zend.c:1744)
==116755==    by 0x69C8C4: php_execute_script (main.c:2535)
==116755==    by 0x8E0B19: do_cli (php_cli.c:965)
==116755==    by 0x8E1DF9: main (php_cli.c:1367)
==116755==
==116755== Use of uninitialised value of size 8
==116755==    at 0x618100: php_bin2hex (string.c:112)
==116755==    by 0x6185D2: zif_bin2hex (string.c:220)
==116755==    by 0x79BDB4: ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER
(zend_vm_execute.h:1312)
==116755==    by 0x8194F0: execute_ex (zend_vm_execute.h:55503)
==116755==    by 0x81ED86: zend_execute (zend_vm_execute.h:59858)
==116755==    by 0x75A923: zend_execute_scripts (zend.c:1744)
==116755==    by 0x69C8C4: php_execute_script (main.c:2535)
==116755==    by 0x8E0B19: do_cli (php_cli.c:965)
==116755==    by 0x8E1DF9: main (php_cli.c:1367)
==116755==
string(16) "0000000000000000"
For userland implementations you really must derive the size from the
returned bytestring. Otherwise it's easy for a developer to create an
implementation where nextByteSize() and generate() disagree. Even if the
memory-unsafety is fixed, this will result in frustration for the user.

For native implementations you can keep some explicit width in the code,
if that helps with performance.

Best regards
Tim Düsterhus

Hi Tim

The following points have been fixed:

nextByteSize(): int has been removed from Random\Engine
If the width of the RNG is statically defined, it will now be used
preferentially
Added Xoshiro256StarStar
Fixed an endianness issue

And updated RFC

https://wiki.php.net/rfc/rng_extension

I also had a PHP implementation of Xorshift128Plus on hand, so I included
it in the tests.

https://github.com/colopl/php-src/blob/upstream_rfc/scoped_rng_for_pr/ext/random/tests/engine/user_xorshift128plus.phpt

Xoshiro128PlusPlus has been dropped from the bundle due to width issues. If
necessary, it will be implemented as an extension to PECL.
However, it is built in as a test of the userland implementation

https://github.com/colopl/php-src/blob/upstream_rfc/scoped_rng_for_pr/ext/random/tests/engine/user_xoshiro128plusplus.phpt

This seems to have solved the whole problem. How about it?

Regards
Go Kudo

3 years ago by tim@bastelstu.be — view source

unread

The following points have been fixed:

nextByteSize(): int has been removed from Random\Engine

If the width of the RNG is statically defined, it will now be used
preferentially

Added Xoshiro256StarStar

Fixed an endianness issue

And updated RFC

https://wiki.php.net/rfc/rng_extension

[...]
This seems to have solved the whole problem. How about it?

Awesome, this is feeling much much better now. As you might've seen I've
made some comments on GitHub regarding implementation bugs.

I have two more conceptional questions:

However I believe you did not answer my question regarding the following
and that's something that should be clear in the RFC and documentation:

 &lt;?php

 use Random\Engine\Xoshiro256StarStar;
 use Random\Randomizer;

 $g1 = new Xoshiro256StarStar(1);
 $g2 = clone $g1;
 $r1 = new Randomizer($g1);
 $r2 = new Randomizer($g2);

 var_dump(\bin2hex($r1->getBytes(8))); // string(16) "c510c70f6daff2b3"
 var_dump(\bin2hex($r2->getBytes(4) . $r2->getBytes(4))); //

string(16) "c510c70fea4c3647"

In this example I get 8 bytes from the randomizer. One time by getting
all 8 bytes at once, the second time by getting 4 bytes and then another
4 bytes.

I think that both lines should result in the same output, because in
both cases I am getting 8 bytes, without any other operations that might
affect the engine state in between. As a user I should not be required
to know how the Randomizer works internally (by always getting 8 bytes
from the engine and throwing away unused bytes).

If you disagree and think this is the correct behavior then this should
be documented accordingly in the RFC. If you agree, then this should be
fixed and a testcase be added.

This ties into (1): Currently any additional bytes returned by the
engine are silently ignored. Either the bytes should be processed in
full, or an error emitted if the returned bytestring is too long.

Consider the attached test case with a Sha1-based RNG. If I grab 20
bytes (the length of a SHA-1 hash) from the Randomizer then the
'generate()' function will be called 3 times, despite it returning
sufficient bytes on the first attempt. If I want to make sure that no
bytes are wasted, then I need to implement a pretty complex construction
(Sha1_2) to always return exactly 8 bytes.

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月17日(木) 19:25 Tim Düsterhus tim@bastelstu.be:

Hi

The following points have been fixed:

nextByteSize(): int has been removed from Random\Engine

If the width of the RNG is statically defined, it will now be used
preferentially

Added Xoshiro256StarStar

Fixed an endianness issue

And updated RFC

https://wiki.php.net/rfc/rng_extension

[...]
This seems to have solved the whole problem. How about it?

Awesome, this is feeling much much better now. As you might've seen I've
made some comments on GitHub regarding implementation bugs.

I have two more conceptional questions:

However I believe you did not answer my question regarding the following
and that's something that should be clear in the RFC and documentation:
 &lt;?php

 use Random\Engine\Xoshiro256StarStar;
 use Random\Randomizer;

 $g1 = new Xoshiro256StarStar(1);
 $g2 = clone $g1;
 $r1 = new Randomizer($g1);
 $r2 = new Randomizer($g2);

 var_dump(\bin2hex($r1->getBytes(8))); // string(16) "c510c70f6daff2b3"
 var_dump(\bin2hex($r2->getBytes(4) . $r2->getBytes(4))); //
string(16) "c510c70fea4c3647"

In this example I get 8 bytes from the randomizer. One time by getting
all 8 bytes at once, the second time by getting 4 bytes and then another
4 bytes.

I think that both lines should result in the same output, because in
both cases I am getting 8 bytes, without any other operations that might
affect the engine state in between. As a user I should not be required
to know how the Randomizer works internally (by always getting 8 bytes
from the engine and throwing away unused bytes).

If you disagree and think this is the correct behavior then this should
be documented accordingly in the RFC. If you agree, then this should be
fixed and a testcase be added.

This ties into (1): Currently any additional bytes returned by the
engine are silently ignored. Either the bytes should be processed in
full, or an error emitted if the returned bytestring is too long.

Consider the attached test case with a Sha1-based RNG. If I grab 20
bytes (the length of a SHA-1 hash) from the Randomizer then the
'generate()' function will be called 3 times, despite it returning
sufficient bytes on the first attempt. If I want to make sure that no
bytes are wasted, then I need to implement a pretty complex construction
(Sha1_2) to always return exactly 8 bytes.

Best regards
Tim Düsterhus

Hi Tim
Thanks for your continued help!

As for your question, buffering the output can lead to counter-intuitive
behavior in code like the following.

&lt;?php

$engine = new \Random\Engine\Xoshiro256StarStar(1234);
$randomizer = new Randomizer($engine);

// Retrieve only 16 bits (the remaining 48 bits will be buffered in
Randomizer)
$str = $randomizer->getBytes(2);

// Generate a new 64 bits (to waste)
$engine->generate();

// Retrieve 64 bits (first 48 bits from buffer, but last 16 bits newly
generated)
// numerical continuity will be lost.
$str2 = $randomizer->getBytes(8);

However, the Engine will probably not be used by itself very often. I think
buffering should be implemented, but I'd like to solicit opinions on this
in the ML.

I'll implement value buffering first.

Regards
Go Kudo

3 years ago by tim@bastelstu.be — view source

unread

As for your question, buffering the output can lead to counter-intuitive
behavior in code like the following.

&lt;?php

$engine = new \Random\Engine\Xoshiro256StarStar(1234);
$randomizer = new Randomizer($engine);

// Retrieve only 16 bits (the remaining 48 bits will be buffered in
Randomizer)
$str = $randomizer->getBytes(2);

// Generate a new 64 bits (to waste)
$engine->generate();

// Retrieve 64 bits (first 48 bits from buffer, but last 16 bits newly
generated)
// numerical continuity will be lost.
$str2 = $randomizer->getBytes(8);

Personally I'd say by using the engine itself or using a different
Randomizer it's clear that any kind of guarantees regarding the result
no longer hold, because the state will be modified.

I've also seen your other email and will reply to it as well.

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月17日(木) 19:25 Tim Düsterhus tim@bastelstu.be:

Hi

The following points have been fixed:

nextByteSize(): int has been removed from Random\Engine

If the width of the RNG is statically defined, it will now be used
preferentially

Added Xoshiro256StarStar

Fixed an endianness issue

And updated RFC

https://wiki.php.net/rfc/rng_extension

[...]
This seems to have solved the whole problem. How about it?

Awesome, this is feeling much much better now. As you might've seen I've
made some comments on GitHub regarding implementation bugs.

I have two more conceptional questions:

However I believe you did not answer my question regarding the following
and that's something that should be clear in the RFC and documentation:
 &lt;?php

 use Random\Engine\Xoshiro256StarStar;
 use Random\Randomizer;

 $g1 = new Xoshiro256StarStar(1);
 $g2 = clone $g1;
 $r1 = new Randomizer($g1);
 $r2 = new Randomizer($g2);

 var_dump(\bin2hex($r1->getBytes(8))); // string(16) "c510c70f6daff2b3"
 var_dump(\bin2hex($r2->getBytes(4) . $r2->getBytes(4))); //
string(16) "c510c70fea4c3647"

In this example I get 8 bytes from the randomizer. One time by getting
all 8 bytes at once, the second time by getting 4 bytes and then another
4 bytes.

I think that both lines should result in the same output, because in
both cases I am getting 8 bytes, without any other operations that might
affect the engine state in between. As a user I should not be required
to know how the Randomizer works internally (by always getting 8 bytes
from the engine and throwing away unused bytes).

If you disagree and think this is the correct behavior then this should
be documented accordingly in the RFC. If you agree, then this should be
fixed and a testcase be added.

This ties into (1): Currently any additional bytes returned by the
engine are silently ignored. Either the bytes should be processed in
full, or an error emitted if the returned bytestring is too long.

Consider the attached test case with a Sha1-based RNG. If I grab 20
bytes (the length of a SHA-1 hash) from the Randomizer then the
'generate()' function will be called 3 times, despite it returning
sufficient bytes on the first attempt. If I want to make sure that no
bytes are wasted, then I need to implement a pretty complex construction
(Sha1_2) to always return exactly 8 bytes.

Best regards
Tim Düsterhus

Hello.

I have been looking into output buffering, but don't know the right way to
do it. The buffering works fine if all RNG generation widths are static,
but if they are dynamic so complicated.

It is possible to solve this problem by allowing generate() itself to
specify the size it wants, but this would significantly slow down
performance.

I've looked at the sample code, but do you really need support for
Randomizer? Engine::generate() can output dynamic binaries up to 64 bits.
You can use Engine directly, instead of Randomizer::getBytes().

What exactly is the situation where buffering by Randomizer is needed?

Also worried that buffering will cut off random numbers at arbitrary sizes.
It may cause bias in the generated results.

If you don't want to waste the generated values, users can still implement
their own buffering structures into Random\Engine following.

&lt;?php

use Random\Engine;

final class BufferedEngine implements Engine
{
    private Iterator $gen;
    private string $buffer;

    public function __construct(string $seed)
    {
        $this->gen = $this->stream($seed);
        $this->buffer = '';
    }

    private function stream($state)
    {
        while (true) {
            $state = \sha1($state, true);
            for ($i = 0; $i < \strlen($state); $i++) {
                yield $state[$i];
            }
        }
    }

    public function generate(): string
    {
        echo __METHOD__, PHP_EOL;

        $result = "";
        for ($i = 0; $i < 8; $i++) {
            $result .= $this->gen->current();
            $this->gen->next();
        }

        return $result;
    }

    public function getBytes(int $length): string
    {
        $result = '';

        while (\strlen($result) < $length) {
            if ($this->buffer === '') {
                $this->buffer = $this->generate();
            }
            $required = $length - \strlen($result);
            $temp = \substr($this->buffer, 0, $required);
            $this->buffer = \substr($this->buffer, $required);
            $result .= $temp;
        }

        return $result;
    }
}

Buffering mechanism that was implemented in a limited way has been reverted
recently.

Best Regards
Go Kudo

3 years ago by tim@bastelstu.be — view source

unread

I have been looking into output buffering, but don't know the right way to
do it. The buffering works fine if all RNG generation widths are static,
but if they are dynamic so complicated.

I believe the primary issue here is that the engines are expected to
return an uint64_t, instead of a buffer with raw bytes. This requires
you to perform many conversions between the uint64 and the raw buffer:

When calling Randomizer::getBytes() for a custom engine the following
needs to happen:

The Engine returns a byte string.
This bytestring is then internally converted into an uint64_t.
Then calling Randomizer::getBytes() this uint64_t needs to be
converted back to a bytestring.

To avoid those conversations without sacrificing too much performance it
might be possible to return a struct that contains a single 4 or 8-byte
array:

 struct four_bytes {
     unsigned char val[4];
 };

 struct four_bytes r;
 r.val[0] = (result >> 0) & 0xff;
 r.val[1] = (result >> 8) & 0xff;
 r.val[2] = (result >> 16) & 0xff;
 r.val[3] = (result >> 24) & 0xff;

 return r;

.val can be treated as a bytestring, but it does not require dynamic
allocation. By doing that the internal engines (e.g. Xoshiro) would be
consistent with the userland engines.

It is possible to solve this problem by allowing generate() itself to
specify the size it wants, but this would significantly slow down
performance.

I don't think it's a good idea to add a size parameter to generate().

I've looked at the sample code, but do you really need support for
Randomizer? Engine::generate() can output dynamic binaries up to 64 bits.
You can use Engine directly, instead of Randomizer::getBytes().

What exactly is the situation where buffering by Randomizer is needed?

I don't need anything. I'm just trying to think of use-cases and
edge-cases. Basically: What would a user attempt to do and what would
their expectations be?

I'm not saying that this buffering must be implemented, but this is
something we need to think about. Because changing the behavior later is
pretty much impossible, as users might rely on a specific behavior for
their seeded sequences. The behavior might also need to be part of the
documentation.

Basically what we need to think about is what guarantees we give. As an
example:

Calling Engine::generate() with the same seed results in the same
sequence (This guarantee we give, and it is useful).
Calling Randomizer::getInt() with the same seeded engine results in
the same numbers for the same parameters (I think this also is useful).
Calling Randomizer::getBytes() with the same seeded engine results in
the same byte sequence (This is something we are currently discussing).
Calling Randomizer::getBytes() simply concatenates the raw bytes
retrieved by the Engine (This ties into (3)).
Calling Randomizer::shuffleArray() with the same seeded engine
results in the same result for the same string (This one is more
debatable, because then we must maintain the exact same shuffleArray()
implementation forever).

All these guarantees should be properly documented within the RFC. The
RFC template (https://wiki.php.net/rfc/template) says:

Remember that the RFC contents should be easily reusable in the PHP
Documentation.

So by thinking about this now and putting it in the RFC, the
explanations can easily be copied into the documentation if the RFC
passes the vote.

One should not need to look into the implementation to understand how
the Engines and the Randomizer is supposed to work.

Also worried that buffering will cut off random numbers at arbitrary sizes.
It may cause bias in the generated results.

If there's bias in specific bits or bytes of the generated number then
getBytes(32) will already be biased even without buffering, as the raw
bytes are what's of interest here. It does not matter if they are at the
1st or 4th position (for a 32-bit engine).

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月18日(金) 19:46 Tim Düsterhus tim@bastelstu.be:

Hi

I have been looking into output buffering, but don't know the right way
to
do it. The buffering works fine if all RNG generation widths are static,
but if they are dynamic so complicated.

I believe the primary issue here is that the engines are expected to
return an uint64_t, instead of a buffer with raw bytes. This requires
you to perform many conversions between the uint64 and the raw buffer:

When calling Randomizer::getBytes() for a custom engine the following
needs to happen:

The Engine returns a byte string.

This bytestring is then internally converted into an uint64_t.

Then calling Randomizer::getBytes() this uint64_t needs to be
converted back to a bytestring.

To avoid those conversations without sacrificing too much performance it
might be possible to return a struct that contains a single 4 or 8-byte
array:
 struct four_bytes {
     unsigned char val[4];
 };

 struct four_bytes r;
 r.val[0] = (result >> 0) & 0xff;
 r.val[1] = (result >> 8) & 0xff;
 r.val[2] = (result >> 16) & 0xff;
 r.val[3] = (result >> 24) & 0xff;

 return r;
.val can be treated as a bytestring, but it does not require dynamic
allocation. By doing that the internal engines (e.g. Xoshiro) would be
consistent with the userland engines.

It is possible to solve this problem by allowing generate() itself to
specify the size it wants, but this would significantly slow down
performance.

I don't think it's a good idea to add a size parameter to generate().

I've looked at the sample code, but do you really need support for
Randomizer? Engine::generate() can output dynamic binaries up to 64 bits.
You can use Engine directly, instead of Randomizer::getBytes().

What exactly is the situation where buffering by Randomizer is needed?

I don't need anything. I'm just trying to think of use-cases and
edge-cases. Basically: What would a user attempt to do and what would
their expectations be?

I'm not saying that this buffering must be implemented, but this is
something we need to think about. Because changing the behavior later is
pretty much impossible, as users might rely on a specific behavior for
their seeded sequences. The behavior might also need to be part of the
documentation.

Basically what we need to think about is what guarantees we give. As an
example:

Calling Engine::generate() with the same seed results in the same
sequence (This guarantee we give, and it is useful).

Calling Randomizer::getInt() with the same seeded engine results in
the same numbers for the same parameters (I think this also is useful).

Calling Randomizer::getBytes() with the same seeded engine results in
the same byte sequence (This is something we are currently discussing).

Calling Randomizer::getBytes() simply concatenates the raw bytes
retrieved by the Engine (This ties into (3)).

Calling Randomizer::shuffleArray() with the same seeded engine
results in the same result for the same string (This one is more
debatable, because then we must maintain the exact same shuffleArray()
implementation forever).

All these guarantees should be properly documented within the RFC. The
RFC template (https://wiki.php.net/rfc/template) says:

Remember that the RFC contents should be easily reusable in the PHP
Documentation.

So by thinking about this now and putting it in the RFC, the
explanations can easily be copied into the documentation if the RFC
passes the vote.

One should not need to look into the implementation to understand how
the Engines and the Randomizer is supposed to work.

Also worried that buffering will cut off random numbers at arbitrary
sizes.
It may cause bias in the generated results.

If there's bias in specific bits or bytes of the generated number then
getBytes(32) will already be biased even without buffering, as the raw
bytes are what's of interest here. It does not matter if they are at the
1st or 4th position (for a 32-bit engine).

Best regards
Tim Düsterhus

I am sorry for the delay in replying.

Thank you for the clear explanation.
It is true that the RFC in its current form lacks explanation. I'll try to
fix this first.

Also, as I look into other languages' implementations, I see the need to
add some RNGs such as PCG. I will update the RFC to include these.

Here is a Rust example:
https://docs.rs/rand/latest/rand/

PCG:
https://www.pcg-random.org/index.html

Regards
Go Kudo

3 years ago by tim@bastelstu.be — view source

unread

I am sorry for the delay in replying.

Don't worry, there was a weekend inbetween and I totally understand that
one wants to take their weekend off.

Thank you for the clear explanation.
It is true that the RFC in its current form lacks explanation. I'll try to
fix this first.

Sounds good.

Also, as I look into other languages' implementations, I see the need to
add some RNGs such as PCG. I will update the RFC to include these.

I suggest you avoid "feature creep" within the RFC. Additional engines
can be added easily later on if the need arises. But for now it's more
important to get a reliable basis that one can build onto.

Having a choice of a multitude of different engine just distracts from
that goal and can be confusing for the user. With xoshiro256** there's a
very good choice that already is part of the RFC, no need to have
something else that might or might not be slightly better in some case.

Best regards
Tim Düsterhus

3 years ago by Go Kudo — view source

unread

2022年2月21日(月) 21:44 Tim Düsterhus tim@bastelstu.be:

Hi

I am sorry for the delay in replying.

Don't worry, there was a weekend inbetween and I totally understand that
one wants to take their weekend off.

Thank you for the clear explanation.
It is true that the RFC in its current form lacks explanation. I'll try
to
fix this first.

Sounds good.

Also, as I look into other languages' implementations, I see the need to
add some RNGs such as PCG. I will update the RFC to include these.

I suggest you avoid "feature creep" within the RFC. Additional engines
can be added easily later on if the need arises. But for now it's more
important to get a reliable basis that one can build onto.

Having a choice of a multitude of different engine just distracts from
that goal and can be confusing for the user. With xoshiro256** there's a
very good choice that already is part of the RFC, no need to have
something else that might or might not be slightly better in some case.

Best regards
Tim Düsterhus

RFC has been updated. Is this up to the required standard?
https://wiki.php.net/rfc/rng_extension

I acknowledge that the previous RFC may have been difficult to discuss. If
the problem has been solved, I would like to make another ML-wide
announcement and wait for two weeks from there.

I added PCG64 because according to the RNG experts, there seems to be a
mild conflict between Xorshiro256 and PCG64. Also, as mentioned in the RFC,
Rust and NumPy also implement PCG64.

In order to verify the feasibility of PCG64, we created a PoC in C. So far,
it seems to work fine.
https://github.com/zeriyoshi/pcg64_example

Regards,
Go Kudo

3 years ago by Mel Dafert — view source

unread

Hi

RFC has been updated. Is this up to the required standard?
https://wiki.php.net/rfc/rng_extension

I acknowledge that the previous RFC may have been difficult to discuss. If
the problem has been solved, I would like to make another ML-wide
announcement and wait for two weeks from there.

I added PCG64 because according to the RNG experts, there seems to be a
mild conflict between Xorshiro256 and PCG64. Also, as mentioned in the RFC,
Rust and NumPy also implement PCG64.

In order to verify the feasibility of PCG64, we created a PoC in C. So far,
it seems to work fine.
https://github.com/zeriyoshi/pcg64_example

Regards,
Go Kudo

Hello,

Two small nits:

The "Backward Incompatible Changes" section is missing the fact that
\Random\Randomizer will be reserved.
In the definition of Randomizer in the "Proposal" section, there is a typo:
function getBytes(int $legnth) should be $length instead.

To me, this RFC looks very good so far - I don't have a vote however.

Regards,
Mel

3 years ago by tim@bastelstu.be — view source

unread

RFC has been updated. Is this up to the required standard?
https://wiki.php.net/rfc/rng_extension

I've just passed my own RFC just yesterday, so I'm certainly not an
expert with regard to RFC standards, however:

I think the explanation in the RFC is much better now. It explains all
the important aspects, so that one does not need to look into the
implementation to understand how it is supposed to work.

However it would benefit from a little more structuring. I suggest to
add some additional sub-headings to the "Proposal" section. Perhaps
something like this:

Proposal

Current Issues

Global State

Mersenne Twister is not state of the art

Random Engine

MT

xorshift

xoshiro

<whatever engine else>

Randomizer

I'm not sure whether all those headlines make sense, you're the expert here.

I acknowledge that the previous RFC may have been difficult to discuss. If
the problem has been solved, I would like to make another ML-wide
announcement and wait for two weeks from there.

I would advice you not to rush the vote. Take your time to make the RFC
perfect. It does not help if you start the vote in 2 weeks and then the
RFC is declined, because of some issues.

Best regards
Tim Düsterhus

[RFC] [Under Discussion] Random Extension 4.0

Proposal

Current Issues

Global State

Mersenne Twister is not state of the art

Random Engine

MT

xorshift

xoshiro

<whatever engine else>

Randomizer