Do you think it would be better if I generated more bits of entropy and then keeping 128 of them? Instead of zfill or repeating the process if bits are less than 128.
Yup, in the context of your script, slicing a length-128 string from a larger string of random 1s and 0s would work (that is, it would correct the bias, because the high-order bit would then have a 50/50 chance of being either 1 or 0).
But the way you posed that option as an alternative to zfill makes me think you don't fully understand the source of this bias. I hope you don't find the following explanation tiresome; I just know that if I were in your shoes, I'd really appreciate someone taking pains to help me understand the potential hole in my thinking:
Let's imagine a toy version of this problem where you're trying to generate just 4 (rather than 128) bits of entropy with a coin. Let's not go down the rabbit hole of coin fairness, entropy extraction, and tossing technique. (I have a sometimes-sophomoric, or, as my wife would say, "bloody stupid" sense of humor, especially when I'm in a good mood, so "tossing technique" just gave me the giggles. Don't toss while flipping coins, yeah?)
There are only 16 possible ways for 4 coin-tosses-in-a-row to end up, so let's put them all in a table (ignore the last 4 columns for now):
Outcome | Pattern | Binary | Decimal | f1($pcv) | f2($pcv) | f3($pcv) | f4($pcv) |
#1 | HHHH | 1111 | 15 | "0b1111" | "1111" | "1111" | "HHHH" |
#2 | HHHT | 1110 | 14 | "0b1110" | "1110" | "1110" | "HHHT" |
#3 | HHTH | 1101 | 13 | "0b1101" | "1101" | "1101" | "HHTH" |
#4 | HHTT | 1100 | 12 | "0b1100" | "1100" | "1100" | "HHTT" |
#5 | HTHH | 1011 | 11 | "0b1011" | "1011" | "1011" | "HTHH" |
#6 | HTHT | 1010 | 10 | "0b1010" | "1010" | "1010" | "HTHT" |
#7 | HTTH | 1001 | 9 | "0b1001" | "1001" | "1001" | "HTTH" |
#8 | HTTT | 1000 | 8 | "0b1000" | "1000" | "1000" | "HTTT" |
#9 | THHH | 0111 | 7 | "0b111" | "111" | "0111" | "THHH" |
#10 | THHT | 0110 | 6 | "0b110" | "110" | "0110" | "THHT" |
#11 | THTH | 0101 | 5 | "0b101" | "101" | "0101" | "THTH" |
#12 | THTT | 0100 | 4 | "0b100" | "100" | "0100" | "THTT" |
#13 | TTHH | 0011 | 3 | "0b11" | "11" | "0011" | "TTHH" |
#14 | TTHT | 0010 | 2 | "0b10" | "10" | "0010" | "TTHT" |
#15 | TTTH | 0001 | 1 | "0b1" | "1" | "0001" | "TTTH" |
#16 | TTTT | 0000 | 0 | "0b0" | "0" | "0000" | "TTTT" |
The first four columns are: 1. the possibility/outcome # (1 through 16, so, every possible outcome accounted for), 2. the heads-or-tails pattern corresponding to each outcome (H = heads, T = tails), 3. the same heads-or-tails pattern but in base-2 (1 = heads, 0 = tails), and 4. the heads-or-tails pattern converted from base-2 into base-10.
The last four columns involve the imaginary variable $pcv (previous column's value) and the following definitions:
f1 = lambda x: bin(x)
f2 = lambda x: x[2:]
f3 = lambda x: x.zfill(4)
f4 = lambda x: x.replace('0', 'T').replace('1', 'H')
Now, the first thing to think about when looking at that table is whether there are any "bad" or "low entropy" outcomes in it. The answer is no: as long as each outcome is equally probable, any given outcome is just as "entropic" as any other outcome. HTTH is just as good as TTTT, savvy? That means that, tempting though it is, it's a mistake to use the string length of the values in the f2($pcv) column to decide whether or not you "have enough entropy" (to be clear, it's a mistake to base that decision on anything about a single outcome). If you discard outcomes whenever len(f2($pcv)) != 4, then all you're ensuring is that the pattern will always begin with a heads instead of a tails (check the table to confirm that). Only 8 of the 16 outcomes survive that filter, so you end up with 3 bits of entropy instead of 4.
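If you'd rather see that than take my word for it, here's a minimal sketch of the toy case. The names (WIDTH, kept, padded and so on) are mine and purely illustrative; they aren't taken from your script. It walks every 4-bit value, applies the "discard if too short" rule, and counts which first bits survive, then does the same with zfill for comparison.

```python
from collections import Counter

WIDTH = 4  # toy size; the real script would use 128

# Strategy A: discard any outcome whose bin() string is shorter than WIDTH.
kept = [bin(n)[2:] for n in range(2 ** WIDTH) if len(bin(n)[2:]) == WIDTH]
kept_first_bits = Counter(s[0] for s in kept)

# Strategy B: keep every outcome and restore the leading zeros with zfill.
padded = [bin(n)[2:].zfill(WIDTH) for n in range(2 ** WIDTH)]
padded_first_bits = Counter(s[0] for s in padded)

print("discard short strings:", len(kept), "outcomes, first bits:", dict(kept_first_bits))
print("zfill instead:        ", len(padded), "outcomes, first bits:", dict(padded_first_bits))
```

The discard rule should leave you with half the outcomes, every one of them starting with 1, whereas the zfill version keeps all of them with the first bit split evenly.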
The second thing to think about when looking at that table is whether the final patterns in the f4($pcv) column ever disagree with the source patterns in column 2. They don't. So you can rest assured that the zfill(128) that was present in your original script was correct (that is, it was only restoring 0s that, in some sense, were there from the start and were "lost" during an int to str conversion). You can/should pat yourself on the back for getting it right the first time.
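The round trip can also be checked mechanically. This is just a sketch that reuses the f1 through f4 definitions from above; the loop and the two list names are my own additions for illustration. It rebuilds each of the 16 patterns through the chain, once with the zfill step and once without it.

```python
from itertools import product

# The four helpers exactly as defined earlier in this post.
f1 = lambda x: bin(x)
f2 = lambda x: x[2:]
f3 = lambda x: x.zfill(4)
f4 = lambda x: x.replace('0', 'T').replace('1', 'H')

mismatches_with_zfill = []     # patterns that fail the round trip with f3
mismatches_without_zfill = []  # patterns that fail when f3 is skipped

for tosses in product("HT", repeat=4):
    pattern = "".join(tosses)  # e.g. "HTTH"
    decimal = int(pattern.replace("H", "1").replace("T", "0"), 2)
    if f4(f3(f2(f1(decimal)))) != pattern:
        mismatches_with_zfill.append(pattern)
    if f4(f2(f1(decimal))) != pattern:
        mismatches_without_zfill.append(pattern)

print("with zfill:   ", mismatches_with_zfill)
print("without zfill:", mismatches_without_zfill)
```

With the zfill step in place nothing should be reported; without it, the patterns that begin with T come back shortened, which is exactly the "lost" leading zeros.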
There are at least two other bugs in your script though (beyond the bias bug, which we'll call Bug #1):
Bug #2: You need a zfill(32) before passing the hexadecimal representation of your entropy into bytes.fromhex (otherwise, you'll occasionally pass it an odd number of nibbles, which will cause it to raise a ValueError).
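Here's a small sketch of how that failure shows up. I'm assuming your script gets its hex string from something like hex(n)[2:]; the variable names and the sample value (an entropy whose top nibble happens to be 0) are made up for the demonstration.

```python
# Hypothetical 128-bit entropy whose most significant nibble is 0.
entropy_int = int("0c399b43e82bfbc07f2fe3fd1f13ed41", 16)

raw_hex = hex(entropy_int)[2:]  # leading zero nibble is dropped: 31 hex digits
raised = False
try:
    bytes.fromhex(raw_hex)      # odd number of nibbles
except ValueError as exc:
    raised = True
    print("unpadded:", exc)

# Restoring the lost zero gives bytes.fromhex a full 32 nibbles (16 bytes).
entropy_bytes = bytes.fromhex(raw_hex.zfill(32))
print("padded:  ", entropy_bytes.hex())
```

Note that an entropy with two leading zero nibbles wouldn't raise at all without the zfill; it would quietly give you 15 bytes instead of 16, which is arguably worse.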
Bug #3: You need a zfill(256) at the end of the line that calculates your sha256_bin variable (otherwise, your checksum, and therefore the last word of your mnemonic, will be wrong 50% of the time).
For an example, try manually setting your entropy to cc399b43e82bfbc07f2fe3fd1f13ed41.
Your script will spit out the following mnemonic:
slow smoke special space sausage then witness wise wonder weasel win lobster
Ian's script (and mine) will spit out this instead:
slow smoke special space sausage then witness wise wonder weasel win lion
The reason your wonder weasel only managed to win a lobster instead of a lion is that you're miscalculating the checksum bits as 1001 instead of 0001.
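To reproduce the two checksums yourself, something like the following sketch should do. It assumes your script follows the usual BIP39 scheme (checksum = the first ENT/32 bits of the SHA-256 of the entropy bytes); the function checksum_bits and its pad flag are hypothetical names of mine, and only sha256_bin is borrowed from your script.

```python
import hashlib

ENTROPY_HEX = "cc399b43e82bfbc07f2fe3fd1f13ed41"  # the example entropy from above

def checksum_bits(entropy_hex: str, pad: bool) -> str:
    """Return the leading checksum bits of SHA-256(entropy) as a bit string."""
    digest_hex = hashlib.sha256(bytes.fromhex(entropy_hex)).hexdigest()
    sha256_bin = bin(int(digest_hex, 16))[2:]  # leading zeros are lost here
    if pad:
        sha256_bin = sha256_bin.zfill(256)     # the fix: restore them
    n_checksum_bits = len(entropy_hex) * 4 // 32  # 4 bits for 128-bit entropy
    return sha256_bin[:n_checksum_bits]

unpadded = checksum_bits(ENTROPY_HEX, pad=False)
padded = checksum_bits(ENTROPY_HEX, pad=True)
print("without zfill(256):", unpadded)
print("with zfill(256):   ", padded)
```

Without the padding, the bit string always starts with a 1, so the checksum goes wrong whenever the hash's true first bit is a 0. That's where the 50% figure comes from.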
(I'm going to be too busy to respond for a while. I'll take a peek at this thread again in a few weeks' time.)