
Topic: Possibly improving on BIP039 with language models?

legendary
Activity: 2870
Merit: 7490
Crypto Swap Exchange
The thing is, to memorize a seed phrase people usually invent a story, called a memory peg.

Maybe it's me lacking imagination, but I don't like having to come up with a story from a bunch of words.

Sounds complicated. Aside from the difficulty of creating the story in general, you also need to generate "good" words to make the process of creating a story easier.
Even if you remember the story, I doubt you can remember which words, and in which positions, are supposed to be your seed.

What I was wondering is whether a language model couldn't come up with such a story for me.

I've seen news saying that language models (usually GPT) can create stories, but the annoying part is checking whether the generated story contains exactly your 11/23 words (not 12/24, since the last word encodes the checksum) from the BIP39 word list, each appearing only once.
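A rough sketch of that check, assuming you already have the mnemonic and the BIP39 English word list loaded (the function name and inputs are made up for illustration):

Code:
import re

# Rough sketch, not a real tool: given a 12-word mnemonic and a GPT-generated
# story, check that the first 11 words each appear exactly once, in order, and
# that no other BIP39 word sneaks in to make recovery ambiguous.
def story_encodes_seed(story, mnemonic_words, bip39_wordlist):
    tokens = re.findall(r"[a-z]+", story.lower())
    seed_words = mnemonic_words[:11]          # last word left out, per the checksum point above
    if any(tokens.count(w) != 1 for w in seed_words):
        return False                          # every seed word must appear exactly once
    if any(t in bip39_wordlist and t not in seed_words for t in tokens):
        return False                          # stray wordlist words would confuse recovery
    order = [t for t in tokens if t in seed_words]
    return order == seed_words                # words must appear in mnemonic order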
legendary
Activity: 3472
Merit: 10611
Talking about brainwallets, if I'm not mistaken they basically use BIP39, right?
No, brainwallets take any arbitrary text you like and turn that into your keys. For example, you can use "123" as your brainwallet mnemonic! And that is exactly why this method is a terrible idea: people tend to be very weak sources of entropy. There are lots of brainwallets that have been "hacked" to date because they used very simple strings, from single words to popular quotes, poems, etc.

BIP39 and other similar methods, on the other hand, generate good random entropy using a strong RNG and then convert it to a human-readable form (the mnemonic).
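For illustration, a minimal sketch of the BIP39 side (assuming the python-mnemonic package; the variable names are just examples):

Code:
import secrets
from mnemonic import Mnemonic   # python-mnemonic package (assumed available)

# The entropy comes from a strong RNG first; the words are only a
# human-readable encoding of that entropy.
entropy = secrets.token_bytes(16)                   # 128 bits from the OS CSPRNG
words = Mnemonic("english").to_mnemonic(entropy)    # 12-word mnemonic encoding it
print(words)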
legendary
Activity: 1288
Merit: 1080
If you want to use a poem as your private key, it's easier to create a new poem (or modify an existing one), then perform SHA256 on that poem and use the hash result as your seed.
This is how people end up losing their bitcoins.

I know, what I suggest is essentially a brainwallet and also vulnerable to dictionary attacks.

But at least it's slightly more practical (you don't need fancy hardware) and deterministic (a language model's parameters, by contrast, could change over time).
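Concretely, something like this is what I have in mind (just a sketch, and of course only as strong as the poem itself):

Code:
import hashlib

# Hash a memorized poem and use the digest as the 32-byte seed/entropy.
# This is exactly the brainwallet pattern discussed above, so a guessable
# poem means a sweepable wallet.
poem = (
    "roses are red, violets are blue,\n"
    "this doggerel is my entropy, line two"
)
seed = hashlib.sha256(poem.encode("utf-8")).digest()
print(seed.hex())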

Talking about brainwallets, if I'm not mistaken they basically use BIP39, right?  That is, they give you a list of words, called a seed phrase, and you're supposed to memorize it.

The thing is, to memorize a seed phrase people usually invent a story, called a memory peg.

Maybe it's me lacking imagination, but I don't like having to come up with a story from a bunch of words.  What I was wondering is whether a language model couldn't come up with such a story for me.



legendary
Activity: 1042
Merit: 2805
Bitcoin and C♯ Enthusiast
If you want to use a poem as your private key, it's easier to create a new poem (or modify an existing one), then perform SHA256 on that poem and use the hash result as your seed.
This is how people end up losing their bitcoins.
legendary
Activity: 1288
Merit: 1080
It's pretty straightforward to do this, but the number of bits per word ends up extremely low, so the strings are absurdly long.

GPT-2 and similar models give you a probability distribution over the next token as a function of the prior tokens.

Imagine output tokens as boxes laid along a ruler of unit length. Set the width of each token equal to the probability, so they all add up to 1 and span the whole ruler.  For each token, subdivide it into the tokens for the next symbol, again, sized by their probabilities so they add up to the bounding token.

In this model we just take the private key as a position along the ruler and read off the tokens at its position.  Assuming rounding is handled correctly, this process is totally reversible. See also: https://en.wikipedia.org/wiki/Range_encoding

GPT2TC (https://bellard.org/nncp/gpt2tc.html) will basically do this for you, though due to various implementation details the result is not always reversible, so it loses a little information.


Or you could use https://github.com/sipa/gramtropy which is explicitly made for this purpose, but uses a simpler language model.

But really I think all this stuff is mostly pointless and hurts key security rather than helping it.

Those seem like useful links, thanks.

I knew about Fabrice Bellard's work but somehow it didn't occur to me that text compression is basically the same problem.  Silly me.  Tongue

Quote
Good luck entering that in exactly, character per character.

Well, obviously the idea would be to generate a text that is easy on the memory.  A poem, typically.  Memorizing poems is something people have been doing for millennia.

Anyway, thanks again for the links, notably about range encoding.  I had never heard of this before.
staff
Activity: 4284
Merit: 8808
For example, here is a 256-bit key decoded with gpt2tc and the 1558M model:

Quote
https://www.youtube.com/watch?v=NbOijSrZwmU

by The Official PUBG Channel

Discover these new ways to play PUBG inside the single player campaign mode and compare them to the private servers! Watch Studio and play with other players just like you with 10 saved servers to challenge your group with at the Play menu. Get more points per play and earn 6 skill points each time you earn a kill. Private servers will be unlocked on dec 8th so go explore them now! Help a fellow player and you'll get one of the exclusive premium 'Hearthstone' branded skins wearing mounted orcs priest. Get in now by checking out into Playtesting – click on the button below then shoot the button below!

Good luck entering that in exactly, character per character. Tongue

Here is the same key in hex:

c4c54956ea89760bfb1f1c22752765cbdf7a21606e3dd17f80d81e2668518d4d

And in the gramtropy failmail grammar:

Quote
this bale bewails that sail still her shale snail wails while her quail wails so the railing ale dovetails but some wholesale ale impales so his quail emails his pale tattletale so her shale hail bails and braille jails assail and that female flail travails yet his flailing tails fail wholesale mails yet their quail blackmails his gale

or the breezy grammar:

Quote
ah payees decrease trees and mkay so fees flee and hmm keys caese or um well teas caese or well uh capris tease teas or ah ah bees crease keys and good gosh capris crease greases thus um ah teas decrease fleas or ah bees flee or um jeez bees freeze teas and haha teas please keys or so ah peas caese then ah trustees please pleas

or the silly grammar:

Quote
soon one nurse adjusts cats and saints hear a big arm so evil cats have no tiny buyer and four bleak czars hunt flies so a cat is a man in the big paw but men boil a sleepy leg and shy adults want cats so a dog is a lax blouse

or the english grammar:

Quote
No fit grand new Chinese coders make two sturdy large 15-year-old Irish tin graves. A strong Swiss king paints the many big warm 81-year-old oval brown British chains. No firm queens polish the three known huge cool 53-year-old Algerian leather wires. Five good new kings polish five poor grand hot 89-year-old triangular stone books.

(The failmail and breezy grammars were intended to be purposefully hard to remember; the english grammar had a hard time constructing a 256-bit encoding, so that is four 64-bit encodings.)
staff
Activity: 4284
Merit: 8808
It's pretty straightforward to do this, but the number of bits per word ends up extremely low, so the strings are absurdly long.

GPT-2 and similar models give you a probability distribution over the next token as a function of the prior tokens.

Imagine output tokens as boxes laid along a ruler of unit length. Set the width of each token equal to the probability, so they all add up to 1 and span the whole ruler.  For each token, subdivide it into the tokens for the next symbol, again, sized by their probabilities so they add up to the bounding token.

In this model we just take the private key as a position along the ruler and read off the tokens at its position.  Assuming rounding is handled correctly, this process is totally reversible. See also: https://en.wikipedia.org/wiki/Range_encoding
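A toy sketch of that ruler walk (exact Fractions stand in for careful rounding, and next_token_probs() is a made-up stand-in for a real model such as GPT-2, returning a dict of token -> Fraction probabilities that sum to 1 for a given prefix):

Code:
from fractions import Fraction

def key_to_tokens(key_int, key_bits, next_token_probs, n_tokens):
    # Treat the key as a point in [0,1) and repeatedly descend into whichever
    # token's sub-interval contains it.
    point = Fraction(key_int, 1 << key_bits)      # position along the unit ruler
    lo, hi = Fraction(0), Fraction(1)
    tokens = []
    for _ in range(n_tokens):
        probs = next_token_probs(tokens)
        cursor = lo
        for tok, p in probs.items():
            width = (hi - lo) * p
            if cursor <= point < cursor + width:  # this token's box contains the point
                tokens.append(tok)
                lo, hi = cursor, cursor + width
                break
            cursor += width
    return tokens

def tokens_to_interval(tokens, next_token_probs):
    # Reverse direction: the final [lo, hi) interval contains exactly the keys
    # that decode to these tokens, so with enough tokens the key is recovered.
    lo, hi = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs = next_token_probs(tokens[:i])
        cursor = lo
        for t, p in probs.items():
            width = (hi - lo) * p
            if t == tok:
                lo, hi = cursor, cursor + width
                break
            cursor += width
    return lo, hi

With a real model in place of next_token_probs() and a 256-bit key, you keep emitting tokens until hi - lo drops below 2^-256, at which point the interval pins down the key uniquely.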

GPT2TC (https://bellard.org/nncp/gpt2tc.html) will basically do this for you, though due to various implementation details the result is not always reversible, so it loses a little information.


Or you could use https://github.com/sipa/gramtropy which is explicitly made for this purpose, but uses a simpler language model.

But really I think all this stuff is mostly pointless and hurts key security rather than helping it.

legendary
Activity: 1288
Merit: 1080
These days, artificial intelligence is all the rage and the latest kids on the block are transformer models, and with them language models such as GPT-2 and most recently GPT-3.

These things are pretty cool: with them you can create texts that are often indistinguishable from text written by a human.

This leads me to wonder if these models could be used to improve on BIP039.  I wouldn't feel confident using BIP039 to memorize a private key.  The generated words are very random and uncorrelated; it seems to me the risk that I forget them, or forget their order, is very high.

That's why I wonder if it's possible to use a private key to seed the word completion algorithm of a language model.

That doesn't seem obvious to me, as the probability distribution for each new word is not uniform, so we can't just take, say, the residue modulo ten to pick one word out of ten possible ones.

Yet, my intuition tells me it has to be possible to sample the words while respecting the probability distribution.  I just can't quite figure out how exactly.
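Here is roughly the intuition I have in mind, with made-up probabilities (the full GPT-2 version is the part I can't figure out):

Code:
# Not "key mod 10": treat some key bits as a uniform point in [0,1) and land it
# in whichever word's cumulative-probability bucket contains it.
def pick_word(key_bits_value, n_bits, word_probs):
    point = key_bits_value / (1 << n_bits)     # uniform point in [0, 1)
    cumulative = 0.0
    for word, p in word_probs.items():
        cumulative += p
        if point < cumulative:
            return word                        # chosen in proportion to its probability
    return word                                # float slack: fall back to the last word

print(pick_word(0b1011010, 7, {"the": 0.5, "a": 0.3, "poem": 0.2}))   # -> "a"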

If someone sees how this can be done and can provide a demonstration with GPT-2, that would be fantastic.

PS.  I mean, think about it: GPT-3 can generate poems, for instance.  Wouldn't it be cool if you could encode a bitcoin private key in a poem?