Newsgroups: php.internals
Xref: news.php.net php.internals:118975
Date: Sat, 5 Nov 2022 13:27:16 -0700
References: <5ceebae4-a3fb-5d29-cdb7-dceed7b07c78@wcflabs.de>
To: Tim Düsterhus
Cc: Go Kudo, Joshua Rüsweg, PHP internals
Subject: Re: [PHP-DEV] RFC [Discussion]: Randomizer Additions
From: jordan.ledoux@gmail.com (Jordan LeDoux)

On Sat, Nov 5, 2022 at 9:00 AM Tim Düsterhus wrote:

> Likewise if you generate a random float between 0 and 1000 with this
> method, some values will appear more often than others due to rounding
> and the changing density of floats for each power of two.
>
> With the γ-section algorithm by Prof. Goualard all these issues are
> eliminated and that's what is used for getFloat().
> The getFloat() method supports all 4 possible boundary combinations to
> ensure that users have a safe solution for all possible use cases, so
> that they don't need to build an unsafe solution in userland.

This is a much more subtle problem than many people realize, and I'm sure there are people who will vote on or use this feature who don't have a deep understanding of either the float/double type or the math behind this, so I'm going to re-explain this using some analogies and examples.

**NOTE**: This is mainly for the historical record, in case someone looks back on this internals thread; it is not a response to any individual in the thread. I'm trying to lay out, in language that any passer-by could follow, what the issue is and why it is important. Accordingly, I'll be referencing base-10 numbers instead of base-2 numbers through most of this for clarity. The ideas apply equally to the base-2 numbers that are actually used; the boundaries just fall at powers of 2 instead of powers of 10.

Suppose you have a number type that can store exactly 10 digits. Well, over the interval [0, 1] you can represent any number with up to nine decimal places, from 0.000 000 000 to 1.000 000 000, without issue. In fact, since you can store exactly 10 digits, you can represent any such number between 0.000 000 000 and 9.999 999 999.

But what happens if you want the interval [0, 10]? Well, at 10 you can only represent the digits 10.000 000 00, since you have to use one of the digits that used to represent the decimal part to represent the whole part, because you have a fixed number of digits you can use.

In order to represent larger numbers, you lose the ability to represent certain numbers that you used to be able to represent. The "density" of numbers has gone down.
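To make the density drop concrete, here is a minimal Python sketch of the 10-digit toy type described above. The names `DIGITS`, `spacing` and `values_per_unit` are made up for illustration, and the model is only the base-10 analogy from this email, not how an actual float/double is stored.

```python
from fractions import Fraction

DIGITS = 10  # hypothetical toy type: exactly 10 decimal digits in total


def spacing(x: int) -> Fraction:
    """Gap between neighbouring representable values near the whole number x."""
    whole_digits = len(str(abs(x)))          # digits spent on the whole part
    decimal_digits = DIGITS - whole_digits   # digits left after the decimal point
    return Fraction(1, 10 ** decimal_digits)


def values_per_unit(x: int) -> int:
    """How many representable values fit in [x, x + 1)."""
    return int(1 / spacing(x))


# Each extra whole-number digit costs one decimal digit,
# so the density drops by a factor of 10 at every power of 10.
for x in (1, 10, 100):
    print(x, spacing(x), values_per_unit(x))
```

Under this model there are 10^9 values per unit interval below 10, 10^8 between 10 and 100, and 10^7 between 100 and 1000.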
So if you tell the program "I want a random number, with decimal places, between 0 and 1000", it runs into the following dilemma: within [0, 10) (the interval that includes 0 but does not include 10), you have 1 billion values between each pair of consecutive whole numbers which could be chosen. However, within [10, 100), you only have 100 million values between each pair of consecutive whole numbers, because you lost a digit after the decimal point.

This means that if you do a naive multiplication or selection of your number, each unit interval between 0 and 10 is 10 times more likely to be chosen than any unit interval between 10 and 100, because it has 10 times as many values mapped to it.

The result would actually be nearly as likely to land in [0, 10) as in [10, 100), because there are nearly as many possible values in [0, 10) as there are in [10, 100), even though mathematically the second range is 9 times the size of the first.

Each possible representable value would be equally likely to be chosen, but because the *density* of values is different at different parts of the range, this actually skews the probability of a number landing within an arbitrary part of the range of outputs.

This means that if your code does something like `if ($rand->getFloat(0, 1000) < 500)`, you're NOT actually going to get a 50% chance of `true` and a 50% chance of `false`. In fact, using the naive way of mapping floats that Tim mentioned (with base-10 math instead of base-2), you'd get `true` *over two thirds of the time*.

Accounting for this is not very easy. What proportion of your total result set has an expanded representable value? In my example, that might be easy to calculate, but the actual math for a program like this is done in base-2, not base-10.
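For the base-10 example the calculation can be done by simple counting. The following Python sketch assumes the same hypothetical 10-digit toy type and assumes that every representable value in [0, 1000) is equally likely to be picked; `DIGITS` and `count_values` are illustrative names, not part of any real API.

```python
from fractions import Fraction

DIGITS = 10  # hypothetical toy type: exactly 10 decimal digits in total


def count_values(lo: int, hi: int) -> int:
    """Count representable toy values in [lo, hi) for whole numbers 0 <= lo < hi."""
    total = 0
    for whole in range(lo, hi):
        decimal_digits = DIGITS - len(str(whole))  # digits left after the point
        total += 10 ** decimal_digits              # values in [whole, whole + 1)
    return total


in_0_10 = count_values(0, 10)          # 10 unit intervals of 10^9 values each
in_10_100 = count_values(10, 100)      # 90 unit intervals of 10^8 values each
in_100_1000 = count_values(100, 1000)  # 900 unit intervals of 10^7 values each

# Chance of landing below 500 if every representable value is equally likely.
p_below_500 = Fraction(count_values(0, 500), count_values(0, 1000))
print(in_0_10, in_10_100, in_100_1000)
print(p_below_500, float(p_below_500))
```

With these assumptions the count comes out to 23/28, or roughly 82%, rather than the 50% a uniform distribution over [0, 1000) would give.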
To actually calculate this, you need to take all sorts of logarithms to figure out how many powers of 2 you're spanning and how that affects the probability of different intervals. It is *extremely* easy to get this math wrong, even if you know what you're doing. As Tim said, virtually no implementations that exist *actually* do this correctly.

Also consider that the requested range may not nicely straddle a power of 2. What if you want a float for the interval [5, 10]? Neither 5 nor 10 sits cleanly on a power of 2, so how does that affect your math?

Like cryptography and security, this is an area where very specific and very confident expertise is necessary, and it is highly preferable to have everyone use a single vetted solution rather than roll their own. This makes it very poor as something to leave to userland, and makes it *highly* desirable to include it in core.

It's very difficult to do, but there is also only one actually correct result, which makes it perfect for inclusion in core.

Jordan
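P.S. A minimal Python sketch of the [5, 10] question, this time with real base-2 doubles (assuming IEEE 754 binary64 with a 53-bit significand). The helper names `spacing` and `doubles_in` are hypothetical, and this is only an illustration of the density change at 8; it is not the γ-section algorithm.

```python
import math


def spacing(x: float) -> float:
    """Gap to the next larger double for a positive, normal x."""
    _, exponent = math.frexp(x)            # x == m * 2**exponent, 0.5 <= m < 1
    return math.ldexp(1.0, exponent - 53)  # 53-bit significand assumed


def doubles_in(lo: float, hi: float) -> int:
    """Count doubles in [lo, hi); only valid if the range crosses no power of 2."""
    return int((hi - lo) / spacing(lo))


# The interval [5, 10) straddles the power of 2 at 8, where the spacing doubles.
low_part = doubles_in(5.0, 8.0)    # spacing 2**-50
high_part = doubles_in(8.0, 10.0)  # spacing 2**-49

share_by_count = low_part / (low_part + high_part)  # if every double were equally likely
share_by_width = (8.0 - 5.0) / (10.0 - 5.0)         # what a uniform draw should give
print(spacing(5.0), spacing(8.0))
print(share_by_count, share_by_width)
```

Picking each representable double with equal probability would put 75% of results in [5, 8), while that sub-range is only 60% of the width of [5, 10).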