Author

Topic: How much rarer are short bitcoin addresses? (Read 2524 times)

legendary
Activity: 2058
Merit: 1416
aka tonikt
November 25, 2013, 06:37:08 AM
#14
So if I understand this right, I've got my answer:
32 characters are 400,000 times rarer than 33 characters.
Statistically, but in this case, there where were only two 32-chars long addresses generated withing the 20 million random samples so the number is not very precise.
It gives you an idea about the magnitude, but the 400,000 might just as well be 150,000 or 1,000,000

So if I understand this right, I've got my answer:
Are 31 characters also 400,000 times rarer than 32 characters, or does the factor differ again?
That's a good question...
Unfortunately I don't know the answer nor have resources to run such a long simulation.
Though back in my high school (which was very long ago), if I had asked my math teacher, she would have probable known how to calculate it Smiley
full member
Activity: 209
Merit: 148
November 25, 2013, 05:03:30 AM
#13
Thanks for this interesting discussion.

Code:
34 chars: ~96.026255%
33 chars: ~3.973735%
32 chars: ~0.00001%

So if I understand this right, I've got my answer:
32 characters are 400,000 times rarer than 33 characters.

Are 31 characters also 400,000 times rarer than 32 characters, or does the factor differ again?
legendary
Activity: 2058
Merit: 1416
aka tonikt
November 25, 2013, 04:33:58 AM
#12
Perhaps the function was coded this way to avoid confusing newbies with address lengths that vary too much? Hard to be sure... but i can't think of another reason.
I believe it is to assure that the specific implementation of the decode function will always return 25 bytes, since no padding is being done there.

BTW, in theory there are eight possible "valid" addresses with the length of 26 characters:
Code:
11111111111111111111BZbvjr
11111111111111111111HeBAGj
11111111111111111111QekFQw
11111111111111111111UpYBrS
11111111111111111111g4hiWR
11111111111111111111jGyPM8
11111111111111111111o9FmEC
11111111111111111111ufYVpS

And I was wrong again; removing any of the trailing '1' will cause the satoshi client to assume the address as "invalid", even though it is technically quite valid, since e.g. both; 1BZbvjr and 11111111111111111111BZbvjr represent exactly the same 200-bit number (which a bitcoin address essentially is).

But IMHO it is a flaw in the implementation - e.g. my client considers addresses with redundant trailing '1' characters are valid and equal.
Hell you can even skip the first '1' - the 32-bit checksum that si there anyway should be just fine to assure no typos.
legendary
Activity: 1120
Merit: 1016
090930
November 24, 2013, 05:37:36 PM
#11
However the next factors are much, MUCH higher than 58, it seems. I haven't been able to figure it out completely, though. This question is trickier than it seems...
I think I know why.

It is the part of satoshi's EncodeBase58() function:
Code:
   // Leading zeroes encoded as base58 zeros
    for (const unsigned char* p = pbegin; p < pend && *p == 0; p++)
        str += pszBase58[0];

So for every zero-byte at the beginning, the encode function adds '1' in front of the string - effectively forcing the string to come out longer.

Using this function, I've made a simulation on 20 millions random addresses and indeed not even once we get a string shorter than 32 characters. The statistical results are like this:
Code:
34 chars: ~96.026255%
33 chars: ~3.973735%
32 chars: ~0.00001%

But when we modify the EncodeBase58 function to break the loop (after putting the first '1' in front), then after the 20 million rounds we get stats like this:
Code:
34 chars: 95.714475%
33 chars: 4.210685%
32 chars: 0.073435%
31 chars: 0.00139%
30 chars: 0.000015%

So in other words: if you want to replace trailing '11..' with single '1..' - then you get the expected factor 58, for the even shorter addresses.

And the bottom line is that EncodeBase58 function in the satoshi client is stupidly implemented, because there is no reason whatsoever for an address to start with more than one '1', and even if it does you can just remove the extra ones and the address will still work just fine.

Very interesting. Thanks for running the tests! Perhaps the function was coded this way to avoid confusing newbies with address lengths that vary too much? Hard to be sure... but i can't think of another reason.
legendary
Activity: 2058
Merit: 1416
aka tonikt
November 24, 2013, 11:58:53 AM
#10
However the next factors are much, MUCH higher than 58, it seems. I haven't been able to figure it out completely, though. This question is trickier than it seems...
I think I know why.

It is the part of satoshi's EncodeBase58() function:
Code:
   // Leading zeroes encoded as base58 zeros
    for (const unsigned char* p = pbegin; p < pend && *p == 0; p++)
        str += pszBase58[0];

So for every zero-byte at the beginning, the encode function adds '1' in front of the string - effectively forcing the string to come out longer.

Using this function, I've made a simulation on 20 millions random addresses and indeed not even once we get a string shorter than 32 characters. The statistical results are like this:
Code:
34 chars: ~96.026255%
33 chars: ~3.973735%
32 chars: ~0.00001%

But when we modify the EncodeBase58 function to break the loop (after putting the first '1' in front), then after the 20 million rounds we get stats like this:
Code:
34 chars: 95.714475%
33 chars: 4.210685%
32 chars: 0.073435%
31 chars: 0.00139%
30 chars: 0.000015%

So in other words: if you want to replace trailing '11..' with single '1..' - then you get the expected factor 58, for the even shorter addresses.

And the bottom line is that EncodeBase58 function in the satoshi client is stupidly implemented, because there is no reason whatsoever for an address to start with more than one '1', and even if it does you can just remove the extra ones and the address will still work just fine.
legendary
Activity: 1120
Merit: 1016
090930
November 24, 2013, 09:18:10 AM
#9
I believe the proper calculation for the first factor comes from the value:

Code:
0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58 = 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee

And the chance of 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee having... something to do with mod/div 58 - is the first factor. Smiley

Something like that... Haven't had my fifth coffee yet today Wink


EDIT:
If this time I did not get it wrong again, it might go like this (python code):
Code:
>>> x = 0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58
>>> 1 / (1 - float(x - 58**31) / x)
23.337985918792462

That looks much better Smiley
However the next factors are much, MUCH higher than 58, it seems. I haven't been able to figure it out completely, though. This question is trickier than it seems...
legendary
Activity: 2058
Merit: 1416
aka tonikt
November 24, 2013, 06:24:28 AM
#8
I believe the proper calculation for the first factor comes from the value:

Code:
0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58 = 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee

And the chance of 0x0469ee58469ee58469ee58469ee58469ee58469ee58469ee having... something to do with mod/div 58 - is the first factor. Smiley

Something like that... Haven't had my fifth coffee yet today Wink


EDIT:
If this time I did not get it wrong again, it might go like this (python code):
Code:
>>> x = 0x00FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF / 58
>>> 1 / (1 - float(x - 58**31) / x)
23.337985918792462
legendary
Activity: 2058
Merit: 1416
aka tonikt
November 24, 2013, 06:05:17 AM
#7
I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

It's as you say, except that the factor is 58, not 20.
Each one character shorter address should be 58 times rarer...

Nope, the factor is not 58. I haven't done the math (not enough time for this now) but from my quick initial testing, the factor seems to be roughly 22.5, at least for 33-character addresses. Don't forget to take leading zeros into account.

You're right - sorry, my mistake.
The first byte is always zero, which means that the first factor is lower.
13.140625, if I did the math right... (58*58/256)
But any further shorter ones should use the factor 58 / char.
legendary
Activity: 1120
Merit: 1016
090930
November 24, 2013, 05:30:07 AM
#6
I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

It's as you say, except that the factor is 58, not 20.
Each one character shorter address should be 58 times rarer...

Nope, the factor is not 58. I haven't done the math (not enough time for this now) but from my quick initial testing, the factor seems to be roughly 22.5, at least for 33-character addresses. Don't forget to take leading zeros into account.
legendary
Activity: 2058
Merit: 1416
aka tonikt
November 24, 2013, 04:56:58 AM
#5
I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.

It's as you say, except that the factor is 58, not 20.
Each one character shorter address should be 58 times rarer...
legendary
Activity: 4284
Merit: 1316
November 24, 2013, 02:31:18 AM
#4
They just vary due to encoding, so (somewhat simplified) if you had two:
12345678
00098765  could be written as 98765

That doesn't quantify it, but this explains:

I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.
legendary
Activity: 2646
Merit: 1138
All paid signature campaigns should be banned.
November 23, 2013, 11:56:09 AM
#3

That does not answer the question about the distribution of Bitcoin address lengths.  I do remember seeing a post about that a long time ago.  Try the search function to see if you can find it.
full member
Activity: 209
Merit: 148
November 23, 2013, 06:13:00 AM
#1
I'm not very good at math... but I have noticed that 33-character addresses are about 20 times less common than 34-character addresses. Similarly, are 32-character addresses 20 times less common than 33-character addresses (and so on)?

I would really like to see a math guru confirm this.
Jump to: