Pubkey variable characters distribution

vshoes

newbie

Activity: 8

Merit: 2

Thanks for the explanations (and for picking up my sloppy nomenclature). All seems clear now. I didn't run a Grubbs' test as there's no need: the explanations cover why this is the case, and we can take it that there are two distinct populations here.

odolvlobo

legendary

Activity: 4578

Merit: 3526

Quote from: vshoes on November 20, 2024, 04:42:31 AM

Clearly the first cluster of 2 through to P (less I) appears evenly distributed (will run a Grubbs' test to confirm). Can anyone explain why the characters (1, I, Q, R, S, T, U, V, W, X, Y, Z) are so much less likely to appear in the first position? Thanks

The explanation is straightforward. A bitcoin address is a base-58 encoding of a 200-bit (25-byte) number, except that before base-58 encoding takes place, each leading byte with the value 0 is encoded as a '1'. Since the first byte of a legacy address is always 0, most of the time the address is a '1' followed by the base-58 encoding of an effectively random 24-byte number.

Now, 256^24 ≈ 58^32.7758, so an address is typically a '1' followed by 32.7758 base-58 digits. That means that the first digit after the '1' is a digit in the range of 0 - 23 (58^0.7758 ≈ 23.3). In base-58 characters, the digits 1 to 23 are '2' through 'Q'.
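The arithmetic above can be checked in a few lines of Python. This is only a sketch: the variable names are mine, and ALPHABET is assumed to be the standard Bitcoin base-58 alphabet.

```python
import math

# Standard Bitcoin Base58 alphabet (no 0, O, I or l).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

# Number of base-58 digits needed for a 24-byte (192-bit) number.
digits = 24 * math.log(256) / math.log(58)
print(round(digits, 4))               # about 32.7758

# Size of the "fractional" leading digit, 58 ** 0.7758.
print(round(58 ** (digits - 32), 2))  # about 23.34

# Exact integer check: leading digit of the largest 24-byte value.
max_lead = (256**24 - 1) // 58**32
print(max_lead, ALPHABET[max_lead])
```

The integer check avoids floating point entirely, so it is the safer of the two ways to confirm where the cut-off falls.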

Remember that leading '1's are a special case. The first character is always '1' because the first byte is set to 0, but the second byte is random, so it will be 0 only 1/256 of the time, leading to rare cases where an address starts with '11'.

pooya87

legendary

Activity: 3472

Merit: 10611

You keep saying "pubkey", but what you are using is an address, which is derived from the hash of the public key. In the case of P2PKH addresses, that hash is HASH160 (RIPEMD160 of the SHA256 hash of the public key), which is then encoded using Base58 encoding.
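To make the distinction concrete, here is a minimal sketch of how the 25 bytes behind a P2PKH address are assembled from a HASH160. The function name p2pkh_payload is hypothetical, and the HASH160 value is assumed from the Bitcoin wiki's worked example rather than recomputed from a public key (the RIPEMD160 step is left out on purpose).

```python
import hashlib

def p2pkh_payload(hash160: bytes, version: bytes = b"\x00") -> bytes:
    """Return version byte + HASH160 + 4-byte checksum (25 bytes in total)."""
    body = version + hash160
    # The checksum is the first 4 bytes of SHA256(SHA256(version + hash)).
    checksum = hashlib.sha256(hashlib.sha256(body).digest()).digest()[:4]
    return body + checksum

# HASH160 assumed from the Bitcoin wiki example; not derived here.
hash160 = bytes.fromhex("f54a5851e9372b87810a8e60cdd2e7cfd80b6e31")
payload = p2pkh_payload(hash160)
print(len(payload), payload.hex())
```

It is this 25-byte value, not the public key, that gets Base58 encoded into the address string.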

ranochigo

legendary

Activity: 3038

Merit: 4418


Bitcoin version 1 addresses (aka legacy addresses) use base58 encoding. This explains the majority of the observations.

Let's take the example from here[1], from step 8 onwards:

Hex: 00f54a5851e9372b87810a8e60cdd2e7cfd80b6e31c7f18fe8
Decimal: 6014503356492732657644518984173176634541310227850984525800
Address: 1PMycacnJaSqwwJqjawXBErnLsZ7RkXUAs

For the leading 1s: before making the conversion, base58 counts the leading zero bytes and writes one 1 for each of them. As such, any leading 0x00 bytes are directly converted to 1s. As you're aware, it's fairly hard to find many leading 0x00 bytes in a hash output (here RIPEMD160 of SHA256), so extra leading 1s are rare. The bytes that aren't affected by this go to the next step.

base58 then repeatedly divides the decimal representation of the hex by 58 and records each remainder. For example, 6014503356492732657644518984173176634541310227850984525800 % 58 = 50, which corresponds to s on the chart. s is then the final letter of the address, and the process continues on the quotient until it runs out of value; the final remainder is 22 % 58 = 22, which maps to P.
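The two steps described above (leading zero bytes, then repeated division by 58) can be sketched in Python as follows. The function name base58_encode is my own, and this is an illustration rather than a reference implementation.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(payload: bytes) -> str:
    """Base58: leading 0x00 bytes become '1', the rest is repeated divmod by 58."""
    # Each leading zero byte is written directly as the character '1'.
    n_zeros = len(payload) - len(payload.lstrip(b"\x00"))
    value = int.from_bytes(payload, "big")
    chars = []
    while value > 0:
        value, remainder = divmod(value, 58)
        chars.append(ALPHABET[remainder])  # least significant digit comes out first
    return "1" * n_zeros + "".join(reversed(chars))

payload = bytes.fromhex("00f54a5851e9372b87810a8e60cdd2e7cfd80b6e31c7f18fe8")
address = base58_encode(payload)
print(address)
print(len(address))
print(int.from_bytes(payload, "big") % 58)  # index of the final character
```

Because the remainders come out least significant first, the last remainder produced is the character right after the leading 1, which is the position the question is about.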

Typically, this results in addresses which are 34 characters long: 1 + 33 characters derived from base58 encoding. However, certain addresses are shorter (1 + 32 characters derived from base58 encoding), and that explains this phenomenon.

This is going to be a simplified proof of why Bitcoin addresses with base58 encoding can be shorter than 34 characters, and why that matters here:

1. For the decimal representation, the largest possible number is about 6.277 * 10^57 (that is, 256^24), given that the hex consists of 24 bytes after removing the leading 00 network byte.

2. This means that the largest possible leading digit of a 33-digit encoding is 23, since (58^32)*23 is within that limit but (58^32)*24 exceeds it. This means that the first digit after the 1 cannot be R or above.

3. However, this wouldn't be true if your decimal representation is smaller than (58^32)! In that case the encoding has at most 32 digits, so at best you have 33 characters in the address, and the first digit can be anything up to z. Because such small values are rare, it is far less likely for you to generate an address that has anything after Q** on the base58 encoding sequence as your second character.
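The three steps above can be turned into exact expected frequencies for the second character. The sketch below assumes the 24 bytes after the network byte are uniformly random (the checksum is treated as random too), and the helper name leading_digit_counts is hypothetical.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def leading_digit_counts(lo: int, hi: int) -> list:
    """Count integers in [lo, hi) by their most significant base-58 digit."""
    counts = [0] * 58
    power = 1  # 58 ** k, the place value of the leading digit
    while power < hi:
        for d in range(1, 58):
            # Values with leading digit d at this length: [d * power, (d + 1) * power)
            overlap = min((d + 1) * power, hi) - max(d * power, lo)
            if overlap > 0:
                counts[d] += overlap
        power *= 58
    return counts

total = 256**24
# A non-zero second byte means the 24-byte number is at least 256**23.
counts = leading_digit_counts(256**23, total)
probs = {ALPHABET[d]: counts[d] / total for d in range(1, 58)}
probs["1"] = 1 / 256  # second byte is 0x00, so the address starts with '11'

for ch in "2PQRz":
    print(ch, f"{probs[ch]:.5f}")
```

Under these assumptions, the characters up to P share almost all of the probability, Q gets a partial share, and everything from R onwards only appears in the shorter addresses.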

I think it might be clearer if you're able to look at the base58 encoding guide, a nifty one here: https://learnmeabitcoin.com/technical/keys/base58/.

Now, to explain your outlier for the letter I: I suspect you're sampling the characters without considering case sensitivity. Upper-case I is not a valid character in base58, so you likely counted lower-case i as I. The above also explains why it is still less common: lower-case i comes after R in the base58 alphabet.

** The range of the decimal representation also affects the frequency of Q, but not to the same extent as the characters after it.

[1] https://en.bitcoin.it/wiki/Technical_background_of_version_1_Bitcoin_addresses

vshoes

newbie

Activity: 8

Merit: 2

I'm using a Python script and bitcoinlib to generate a vanity address. I'm looking for a public key that looks like 1(name), where (name) is only 4 characters long, so I didn't think it would take me too long; I guessed maybe on the order of days or weeks.

Code:

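from bitcoinlib import wallets

# create new wallet
w = wallets.Wallet.create('TestWallet1')

# get a key from the wallet (a key object with attributes, not a dictionary)
keyInfo = w.get_key()

# address of that key (this is what gets analyzed, not the raw public key)
pubKey = keyInfo.address
# WIF string of that key
privKey = keyInfo.wif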

I ran it for a short time, exporting the results, and analyzed the output. What surprised me was the distribution of the first variable character. After the leading 1, these are the most frequently occurring characters:

https://talkimg.com/images/2024/11/20/b1d0v.png

Clearly the first cluster of 2 through to P (less I) appears evenly distributed (will run a Grubbs' test to confirm). Can anyone explain why the characters (1, I, Q, R, S, T, U, V, W, X, Y, Z) are so much less likely to appear in the first position? Thanks
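For anyone who wants to reproduce this kind of tally without creating wallets, here is a rough sketch. It does not use bitcoinlib: it assumes the 24 bytes after the leading 0x00 behave like uniformly random bytes, and the helper name second_char is hypothetical.

```python
import random
from collections import Counter

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def second_char(payload: bytes) -> str:
    """Second address character for a 25-byte payload whose first byte is 0x00."""
    if payload[1] == 0:
        return "1"  # another leading zero byte is written as another '1'
    value = int.from_bytes(payload[1:], "big")
    while value >= 58:  # drop low-order base-58 digits until only the top one is left
        value //= 58
    return ALPHABET[value]

rng = random.Random(0)  # fixed seed so repeated runs give the same tally
samples = 20000
tally = Counter(
    second_char(b"\x00" + rng.getrandbits(192).to_bytes(24, "big"))
    for _ in range(samples)
)

for ch in ALPHABET:
    print(ch, tally[ch])
```

Keeping the tally keyed by the exact character (upper and lower case separately) avoids merging letters such as i into I.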
