A few weeks ago, I promised this explanation to hatshepsut93; and I half-wrote it at the time. I am making this its own topic, because that’s not the first time here I’ve seen people either ask about this, or make this mistake without even realizing it. For those in a hurry, a code snippet is below.Roll a six-sided die with output
d between 1 and 6, inclusive. Convert the results
d to a 2-bit number
b, using the equation
b = (
d - 1) % 4, where “%” denotes “modulo”.
Here is how all potential inputs map to all potential outputs:
Input d: 1 2 3 4
5 6
Output b: 0 1 2 3
As you can see, the output numbers 2 and 3 each have one way of being chosen; whereas the numbers 0 and 1 each have two ways of being chosen. That is to say, 0 and 1 are twice as likely as 2 and 3.
This is
“modulo bias”; and it must be avoided anytime you need to pick a uniformly distributed output number from a range which mismatches the range of inputs.
Now, consider the common use case of generating a random alphanumeric password. Picking a password alphabet in ASCII order in the 62-character range of [0-9A-Za-z] using a random_octet % 62 (0x3e), we obtain:
'0' '1' '2' '3' '4' '5' '6' '7' '8' '9' 'A' 'B' 'C' ... 'x' 'y' 'z'
-------------------------------------------------------------------
00 01 01 03 04 05 06 07 08 09 0a 0b 0c ... 3b 3c 3d % 0x3e
3e 3f 40 41 42 43 44 45 46 47 48 49 4a ... 79 7a 7b % 0x3e
7c 7d 7e 7f 80 81 82 83 84 85 86 87 88 ... b7 b8 b9 % 0x3e
ba bb bc bd be bf c0 c1 c2 c3 c4 c5 c6 ... f5 f6 f7 % 0x3e
f8 f9 fa fb fc fd fe ff [!!! OOPS !!!]
Observe that the eight characters [0-7] can be picked 5 different ways, whereas the others [8-9A-Za-z] can only be picked 4 different ways! As a result, each character in [0-7] will be picked 5/256 = 1.953125% of the time, whereas each of the others will be picked only 4/256 = 1.5625% of the time. (The decimals given are exact.) Although that doesn’t look like much, it is a relative difference of 8 characters being a whopping
25% more likely than the other 54 characters. You do not want a password with those properties!
Please keep handy and adapt as needed the following algorithm for avoiding modulo bias, here presented as a C snippet which I here copy with minor modifications from
from FreeBSD’s libc (it was copied from OpenBSD, and probably somewhere else before that). The code comment (not written by me) explains how it works. Over the course of years, it will save you many instances of shooting yourself in the foot:
#include
/*
* Add here a source of uniformly distributed,
* cryptographically secure random unsigned 32-bit integers:
*/
uint32_t arc4random(void);
/* Begin (mostly) copied code: */
/*
* Calculate a uniformly distributed random number less than upper_bound
* avoiding "modulo bias".
*
* Uniformity is achieved by generating new random numbers until the one
* returned is outside the range [0, 2**32 % upper_bound). This
* guarantees the selected random number will be inside
* [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
* after reduction modulo upper_bound.
*/
uint32_t
arc4random_uniform(uint32_t upper_bound)
{
uint32_t r, min;
if (upper_bound < 2)
return 0;
/* 2**32 % x == (2**32 - x) % x */
min = -upper_bound % upper_bound;
/*
* This could theoretically loop forever but each retry has
* p > 0.5 (worst case, usually far better) of selecting a
* number inside the range we need, so it should rarely need
* to re-roll.
*/
for (;;) {
r = arc4random();
if (r >= min)
break;
}
return (r % upper_bound);
}
[This thread is self-moderated, based on experience; it is for on-topic technical discussion only.]