How much rarer are short bitcoin addresses?

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

Quote from: RoxxR on November 25, 2013, 05:03:30 AM

So if I understand this right, I've got my answer:
32 characters are 400,000 times rarer than 33 characters.

Statistically, but in this case, there where were only two 32-chars long addresses generated withing the 20 million random samples so the number is not very precise.
It gives you an idea about the magnitude, but the 400,000 might just as well be 150,000 or 1,000,000

Quote from: RoxxR on November 25, 2013, 05:03:30 AM

So if I understand this right, I've got my answer:
Are 31 characters also 400,000 times rarer than 32 characters, or does the factor differ again?

That's a good question...
Unfortunately I don't know the answer nor have resources to run such a long simulation.
Though back in my high school (which was very long ago), if I had asked my math teacher, she would have probable known how to calculate it

RoxxR

full member

Activity: 209

Merit: 148

Thanks for this interesting discussion.

Code:

34 chars: ~96.026255%
33 chars: ~3.973735%
32 chars: ~0.00001%

So if I understand this right, I've got my answer:
32 characters are 400,000 times rarer than 33 characters.

Are 31 characters also 400,000 times rarer than 32 characters, or does the factor differ again?

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

Quote from: flatfly on November 24, 2013, 05:37:36 PM

Perhaps the function was coded this way to avoid confusing newbies with address lengths that vary too much? Hard to be sure... but i can't think of another reason.

I believe it is to assure that the specific implementation of the decode function will always return 25 bytes, since no padding is being done there.

BTW, in theory there are eight possible "valid" addresses with the length of 26 characters:

Code:

11111111111111111111BZbvjr
11111111111111111111HeBAGj
11111111111111111111QekFQw
11111111111111111111UpYBrS
11111111111111111111g4hiWR
11111111111111111111jGyPM8
11111111111111111111o9FmEC
11111111111111111111ufYVpS

And I was wrong again; removing any of the trailing '1' will cause the satoshi client to assume the address as "invalid", even though it is technically quite valid, since e.g. both; 1BZbvjr and 11111111111111111111BZbvjr represent exactly the same 200-bit number (which a bitcoin address essentially is).

But IMHO it is a flaw in the implementation - e.g. my client considers addresses with redundant trailing '1' characters are valid and equal.
Hell you can even skip the first '1' - the 32-bit checksum that si there anyway should be just fine to assure no typos.

flatfly

legendary

Activity: 1120

Merit: 1016

090930

Quote from: piotr_n on November 24, 2013, 11:58:53 AM

Quote from: flatfly on November 24, 2013, 09:18:10 AM

However the next factors are much, MUCH higher than 58, it seems. I haven't been able to figure it out completely, though. This question is trickier than it seems...

I think I know why.

It is the part of satoshi's EncodeBase58() function:

Code:

// Leading zeroes encoded as base58 zeros
for (const unsigned char* p = pbegin; p < pend && *p == 0; p++)
str += pszBase58[0];

So for every zero-byte at the beginning, the encode function adds '1' in front of the string - effectively forcing the string to come out longer.

Using this function, I've made a simulation on 20 millions random addresses and indeed not even once we get a string shorter than 32 characters. The statistical results are like this:

Code:

34 chars: ~96.026255%
33 chars: ~3.973735%
32 chars: ~0.00001%

But when we modify the EncodeBase58 function to break the loop (after putting the first '1' in front), then after the 20 million rounds we get stats like this:

Code:

34 chars: 95.714475%
33 chars: 4.210685%
32 chars: 0.073435%
31 chars: 0.00139%
30 chars: 0.000015%

So in other words: if you want to replace trailing '11..' with single '1..' - then you get the expected factor 58, for the even shorter addresses.

And the bottom line is that EncodeBase58 function in the satoshi client is stupidly implemented, because there is no reason whatsoever for an address to start with more than one '1', and even if it does you can just remove the extra ones and the address will still work just fine.

Very interesting. Thanks for running the tests! Perhaps the function was coded this way to avoid confusing newbies with address lengths that vary too much? Hard to be sure... but i can't think of another reason.

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

Quote from: flatfly on November 24, 2013, 09:18:10 AM

However the next factors are much, MUCH higher than 58, it seems. I haven't been able to figure it out completely, though. This question is trickier than it seems...

I think I know why.

It is the part of satoshi's EncodeBase58() function:

Code:

// Leading zeroes encoded as base58 zeros
for (const unsigned char* p = pbegin; p < pend && *p == 0; p++)
str += pszBase58[0];

So for every zero-byte at the beginning, the encode function adds '1' in front of the string - effectively forcing the string to come out longer.

Using this function, I've made a simulation on 20 millions random addresses and indeed not even once we get a string shorter than 32 characters. The statistical results are like this:

Code:

34 chars: ~96.026255%
33 chars: ~3.973735%
32 chars: ~0.00001%

But when we modify the EncodeBase58 function to break the loop (after putting the first '1' in front), then after the 20 million rounds we get stats like this:

Code:

34 chars: 95.714475%
33 chars: 4.210685%
32 chars: 0.073435%
31 chars: 0.00139%
30 chars: 0.000015%

So in other words: if you want to replace trailing '11..' with single '1..' - then you get the expected factor 58, for the even shorter addresses.

And the bottom line is that EncodeBase58 function in the satoshi client is stupidly implemented, because there is no reason whatsoever for an address to start with more than one '1', and even if it does you can just remove the extra ones and the address will still work just fine.

flatfly

legendary

Activity: 1120

Merit: 1016

090930

Quote from: piotr_n on November 24, 2013, 06:24:28 AM

I believe the proper calculation for the first factor comes from the value:

Code:

0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58 = 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee

And the chance of 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee having... something to do with mod/div 58 - is the first factor.

Something like that... Haven't had my fifth coffee yet today Wink

EDIT:
If this time I did not get it wrong again, it might go like this (python code):

Code:

>>> x = 0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58
>>> 1 / (1 - float(x - 58**31) / x)
23.337985918792462

That looks much better

However the next factors are much, MUCH higher than 58, it seems. I haven't been able to figure it out completely, though. This question is trickier than it seems...

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

I believe the proper calculation for the first factor comes from the value:

Code:

0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58 = 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee

And the chance of 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee having... something to do with mod/div 58 - is the first factor.

Something like that... Haven't had my fifth coffee yet today Wink

EDIT:
If this time I did not get it wrong again, it might go like this (python code):

Code:

>>> x = 0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58
>>> 1 / (1 - float(x - 58**31) / x)
23.337985918792462

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

Quote from: flatfly on November 24, 2013, 05:30:07 AM

Quote from: piotr_n on November 24, 2013, 04:56:58 AM

Quote from: RoxxR on November 23, 2013, 06:13:00 AM

I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

It's as you say, except that the factor is 58, not 20.
Each one character shorter address should be 58 times rarer...

Nope, the factor is not 58. I haven't done the math (not enough time for this now) but from my quick initial testing, the factor seems to be roughly 22.5, at least for 33-character addresses. Don't forget to take leading zeros into account.

You're right - sorry, my mistake.
The first byte is always zero, which means that the first factor is lower.
~~13.140625, if I did the math right... (58*58/256)~~
But any further shorter ones should use the factor 58 / char.

flatfly

legendary

Activity: 1120

Merit: 1016

090930

Quote from: piotr_n on November 24, 2013, 04:56:58 AM

Quote from: RoxxR on November 23, 2013, 06:13:00 AM

I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

It's as you say, except that the factor is 58, not 20.
Each one character shorter address should be 58 times rarer...

Nope, the factor is not 58. I haven't done the math (not enough time for this now) but from my quick initial testing, the factor seems to be roughly 22.5, at least for 33-character addresses. Don't forget to take leading zeros into account.

piotr_n

legendary

Activity: 2058

Merit: 1416

aka tonikt

Quote from: RoxxR on November 23, 2013, 06:13:00 AM

I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

It's as you say, except that the factor is 58, not 20.
Each one character shorter address should be 58 times rarer...

cr1776

legendary

Activity: 4284

Merit: 1316

They just vary due to encoding, so (somewhat simplified) if you had two:
12345678
00098765 could be written as 98765

That doesn't quantify it, but this explains:

Quote from: RoxxR on November 23, 2013, 06:13:00 AM

I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

BurtW

legendary

Activity: 2646

Merit: 1138

All paid signature campaigns should be banned.

Quote from: michagogo on November 23, 2013, 11:44:52 AM

http://bitcoin.stackexchange.com/questions/2013/why-does-the-length-of-a-bitcoin-key-vary

That does not answer the question about the distribution of Bitcoin address lengths. I do remember seeing a post about that a long time ago. Try the search function to see if you can find it.

michagogo

member

Activity: 80

Merit: 10

http://bitcoin.stackexchange.com/questions/2013/why-does-the-length-of-a-bitcoin-key-vary

RoxxR

full member

Activity: 209

Merit: 148

I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

Topic: How much rarer are short bitcoin addresses? (Read 2524 times)