Pages:
Author

Topic: Is the output of hash function evenly distributed? (Read 2265 times)

legendary
Activity: 1932
Merit: 2077
Good point. Collision resistance comes from the fact that its uniformly distributed.  

good distribution --> collision resistance ? No, example: SuperFastHash

http://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Quote
I think the more important takeaway is that there are two classes of algorithms when it comes to collisions:

    collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM
    collisions common: SuperFastHash, Loselose

And then there's the how evenly distributed the hashes are:

    outstanding distribution: Murmur2, FNV-1a, SuperFastHash
    excellent distribution: FNV-1
    good distribution: SDBM, DJB2, DJB2a
    horrible distribution: Loselose


collision resistance --> uniformity    but uniformity != random uniformity  

https://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/
Quote
A hash function ought to distribute its keys uniformly across its output range.
Now, uniformity is different from random uniformity.
hero member
Activity: 692
Merit: 569
I rephrase my question:

1) given arbitrary inputs, does the SHA-256 function really generate outputs that are evenly distributed?

2) given block headers** as inputs, does the SHA-256 function really generate outputs that are evenly distributed?


Yes, yes. If not, it would not be a good hash function and it would be open to attacks.


Good point. Collision resistance comes from the fact that its uniformly distributed. 
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
Looks good. For more than 2 digits distribution, you should do a few million more, or a few billion more, or 2^32 or something, to get a better picture of the distribution. At one point, you'll see that, maybe there's no point going further.

No one needs pi to more than 100 decimal places, but geeks have been computing it to billions of digits.
legendary
Activity: 1932
Merit: 2077
Try a loop. Count from 1 to a billion. Hash those. Include a large random salt.

For example:

xxxxxxxxxxxxxxxxxxxxxxxxxx00000001
xxxxxxxxxxxxxxxxxxxxxxxxxx00000002
xxxxxxxxxxxxxxxxxxxxxxxxxx00000003
xxxxxxxxxxxxxxxxxxxxxxxxxx00000004
...
xxxxxxxxxxxxxxxxxxxxxxxxxx99999996
xxxxxxxxxxxxxxxxxxxxxxxxxx99999997
xxxxxxxxxxxxxxxxxxxxxxxxxx99999998
xxxxxxxxxxxxxxxxxxxxxxxxxx99999999

Ok, I took your advice.
I picked a block (# 277316,  hash 0000000000000001b6b9a13b095e96db41c4a928b97ef2d944a9b31b2cc7bdc4) and I made a loop with the nonce from 0 to 2^20:

Code:
#!/usr/bin/env python
import hashlib

nomefile="dist_hash"
maxnonce=2**20

version='00000002'
previousblockhash='0000000000000002a7bbd25a417c0374cc55261021e8a9ca74442b01284f0569'
merkleroot='c91c008c26e50763e9f548bb8b2fc323735f73577effbc55502c51eb4cc7cf2e'
timestamp='52be093a'
difficultytarget='1903a30c'
nonce=0

def byteswap(a):
    b='';    
    for i in range(0,len(a)/2+1): b=a[2*i:2*i+2]+b
    return b

header=byteswap(version)+byteswap(previousblockhash)+byteswap(merkleroot)+byteswap(timestamp)+byteswap(difficultytarget)

with open(nomefile, 'w') as f:
f.write("NONCE" + "\t\t" + "BLOCK HEADER HASH" + "\n")
for nonce in range(0,maxnonce) :  #924591752 is the best
nonce_st = byteswap("{0:0{1}x}".format(nonce,8)) #format: hex (8 bytes), little endian
hash_result = hashlib.sha256((header+nonce_st).decode('hex')).digest()
hash_result = hashlib.sha256(hash_result).hexdigest()
f.write("{0:0{1}}".format(nonce,8) + "\t" + byteswap(hash_result) + "\n")

Output sorted by ascending hash:
Code:
 NONCE             BLOCK HEADER HASH
00073675         00001161c27946bfbc99e631e5cfd1bf979c27bd68207991cb97e3d054c3192b
00563775         00001f485a9ad567aeb6a18a514f28b807f6a5c43a166debdc74b7ac46a765ea
00108164         000024a77201aaa21f86a43878bc50ef8be511105b403cad9c4b99e6f6b5563c
00172165         00002f3ea713eef6411ca3b2ead29ce859b1330d05df76a298bed3cf9807d666
00661616         000033f44a25e97b6dca43e4ba8dc9fe0e0b48777cc936f774433ce6244c71d4
...
...
00052351 ffffaa0a77c61b614e110c280bbb06c6d48561df1d6b064e4718a75ca97d509f
00845204 ffffc35f3b2a105c5a9d057b03e5eea26bb83f5d6482070be0eaa28a4dcb2f6b
00999281 ffffc4ee9128c730094017f32614562b434c3c782263b4adeb2a64fdc14a8584
00245490 ffffe3579d3529b5fe46b12db6eb472a6d5df707d4a738b7259ea7b2c62bae9b
00282384 ffffef1566a137349da9e6e02ecd638c7508933c08428f099a7f0808f10d7c6f

The last digit distribution:
Code:
TOT = 2^20 = 1048576
last
digit         fr.      prob.       unif. -->  1/16=0.0625
0            62401   0.062401
1            62764   0.062764
2            62608   0.062608
3            62638   0.062638
4            62465   0.062465
5            62695   0.062695
6            62351   0.062351
7            62706   0.062706
8            62093   0.062093   --> MIN = -0.000407 = -0.65%
9            62340   0.06234
10 (a)       62884   0.062884   --> MAX = +0.000384 = +0.61%
11 (b)       62412   0.062412
12 (c)       62455   0.062455
13 (d)       62204   0.062204
14 (e)       62220   0.06222
15 (f)       62764   0.062764

and the last 2 digits distribution:
Code:
last 2
digits      fr.       prob.    unif. -->  1/256=0.00390625

78 (4E)    3751    0.003751    MIN -->  -0.00015525 = -3,97%
...........................................
...........................................
47 (8A)    4117    0.004117    MAX -->  +0.00021075 = +5.40%
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
Try a loop. Count from 1 to a billion. Hash those. Include a large random salt.

For example:

xxxxxxxxxxxxxxxxxxxxxxxxxx00000001
xxxxxxxxxxxxxxxxxxxxxxxxxx00000002
xxxxxxxxxxxxxxxxxxxxxxxxxx00000003
xxxxxxxxxxxxxxxxxxxxxxxxxx00000004
...
xxxxxxxxxxxxxxxxxxxxxxxxxx99999996
xxxxxxxxxxxxxxxxxxxxxxxxxx99999997
xxxxxxxxxxxxxxxxxxxxxxxxxx99999998
xxxxxxxxxxxxxxxxxxxxxxxxxx99999999

Or, let it run for a day or a few hours, then stop whatever you have generated. That should be more than enough for you or most other people. I've already done it, but it's better for you to try it yourself.
legendary
Activity: 1932
Merit: 2077
Obviously block hashes from blockchain are not evenly distributed (because of difficulty and retarget):

Code:
BLOCKS : from 0 to 442033 (TOT = 442034)
height                    hash
0 000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
1 00000000839a8e6886ab5951d76f411475428afc90947ee320161bbf18eb6048
...
10000    0000000099c744455f58e6c6e98b671e1bf7f37346bfd4cf5d0274ad8ee660cb
10001    00000000f01df1dbc52bce6d8d31167a8fef76f1a8eb67897469cf92205e806b
...
50000    000000001aeae195809d120b5d66a39c83eb48792e068f8ea1fea19d84a4278a
50001    000000001c920d495e1eeef2452b6d1c6c229a919b28196c103ecffebabee141
...
100000   000000000003ba27aa200b1cecaad478d2b00432346c3f1f3986da1afd33e506
100001   00000000000080b66c911bd5ba14a74260057311eaeb1982802f7010f1a9f090
...
200000   000000000000034a7dedef4a161fa058a2d67a173a90155f3a2fe6fc132e0ebf
200001   00000000000002e3269b8a00caf315115297c626f954770e8398470d7f387e1c
...
300000   000000000000000082ccf8f1557c5d40b21edabb18d2d691cfbf87118bac7254
300001   000000000000000049a0914d83df36982c77ac1f65ade6a52bdced2ce312aba9
...
400000   000000000000000004ec466ce4732fe6f1ed1cddc2ed4b328fff5224276e3f6f
400001   000000000000000005421b1b2ee6d06d037557d7f5ec96852542413cfed40c22
...
442032   000000000000000000b6a5c60d2509768b29f4b49ceda2679bc6d33f99c8adba
442033   000000000000000000bf88951b0bd245474265cc6a2a1b38ac8ef78c8612158a

So I have checked the last digit distribution of every hash (16 possible values):

Code:
BLOCKS : from 0 to 442033 (TOT = 442034)
last
digit         fr.         prob.           unif. -->  1/16=0.0625
0           27520   0.0622576543886                    
1           27881   0.0630743336485
2           27952   0.0632349547772       MAX -->  =+0.00073495777 = +1,16%
3           27892   0.0630992186121
4           27543   0.0623096865852
5           27339   0.0618481836239
6           27674   0.062606043879
7           27691   0.0626445024591
8           27428   0.0620495256021
9           27203   0.061540514983        MIN --> -0.000959485 = -1.54%
10 (a)      27594   0.0624250623255
11 (b)      27554   0.0623345715488
12 (c)      27863   0.063033612799
13 (d)      27616   0.0624748322527
14 (e)      27619   0.062481619061
15 (f)      27665   0.0625856834542

I think it's ok.

Out of curiosity, I have checked the last 2 digits distribution (256 possible values) too:

Code:
BLOCKS : from 0 to 442033 (TOT = 442034)
last 2
digits        fr.            prob.            unif. -->  1/256=0.00390625
71  (47)    1891    0.0042779514698    MAX -->  =+0.0003717 = +9,516%
...................................................
...................................................
...................................................
138 (8a)    1622    0.0036694009963   MIN --> =-0.000236849 = -6,06%


and the last 3 digits distribution (4096 possible values):

Code:
BLOCKS : from 0 to 442033 (TOT = 442034)
last 3
digits        fr.            prob.            unif. -->  1/4096=0.000024414
2686 (A7E)   147     0.000332553604474   MAX --> 0.00008841 = +36,21%
.................................................
.................................................
.................................................
3139 (C43)   75     0.000169670206364   MIN --> -0.000074470419 = -30,5%

How do you think about these results? These distributions are okay?
newbie
Activity: 17
Merit: 0
I rephrase my question:

1) given arbitrary inputs, does the SHA-256 function really generate outputs that are evenly distributed?

2) given block headers** as inputs, does the SHA-256 function really generate outputs that are evenly distributed?


Yes, yes. If not, it would not be a good hash function and it would be open to attacks.
legendary
Activity: 1050
Merit: 1016
Ignoring the fact that the likelihood of 2 blocks outputting the same hash, in the same consensus round (10 mins) is practically zero.

It is still zero regardless of time. The human brain isnt equipped to deal with very large/small numbers. It's just too far from the realm of normal experience.

SHA2 is regarded as being evenly distributed. Even if that's wrong it's uniform enough for current purposes,  makes little difference in the end as far as a collision is concerned.

and if I said it was zero there would be posts stating that in theory it isn't :p
newbie
Activity: 1
Merit: 0
Ignoring the fact that the likelihood of 2 blocks outputting the same hash, in the same consensus round (10 mins) is practically zero.

It is still zero regardless of time. The human brain isnt equipped to deal with very large/small numbers. It's just too far from the realm of normal experience.

SHA2 is regarded as being evenly distributed. Even if that's wrong it's uniform enough for current purposes,  makes little difference in the end as far as a collision is concerned.
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
I rephrase my question:

1) given arbitrary inputs, does the SHA-256 function really generate outputs that are evenly distributed?

2) given block headers** as inputs, does the SHA-256 function really generate outputs that are evenly distributed?

**block header = Version (4 bytes) + Previous Block Hash (32 bytes) + Merkle Root (32 bytes) + Timestamp (4 bytes) + Difficulty Target (4 bytes) + Nonce (4 bytes) = 80 byte = 640 bit  --> fixed length

block hash = block header hash = SHA256(SHA256(Block_Header))

But we assume that the output of the hash function is evenly distributed (so we can calculate target and difficulty for proof-of-work algorithm).

How can we know for sure the output is evenly distributed?
I did a few million hashes of number sequences and other random stuff, for playing cards and dice simulations. It looked evenly distributed.

In your case, what were your inputs? What kind of number sequences? Fixed length (more or less than 256 bit) or not?


Why don't you try your #2 and see... we have more than 400k blocks already.

I believe my inputs were more than 256 bits. I used a bunch of "random" characters ... at least 60 in length.
legendary
Activity: 1932
Merit: 2077
I found out some interesting information about hash function:

http://www.denimgroup.com/know_artic_secure_hash_functions.html
Quote
A secure hash function has three properties: preimage resistance, collision resistance, and second preimage resistance.

A good collision resistant hash function should have each hash value be about as evenly distributed as possible

http://crypto.stackexchange.com/questions/12822/are-the-sha-family-hash-outputs-practically-random
Quote
The SHA-256 (as well as any cryptographically secure hash algorithm) produces output that will appear like an uniformly random sequence to observer who does not know the input.

Quite a few random number generators, for example ANSI X9.31's RNG and NIST SP 800-90 Hash_DRBG use SHA family hash functions for the reason that resulting sequence is hard to distinguish from random.

If the RNG is able to produce entropy, but has large bias, SHA-256 is a very good function for collecting entropy. The output of the function has nearly 1 bit of entropy per output bit, if the input to the function contained at least 256 bits of entropy. Only very little entropy gets lost in this process.

SHA-256 using constructs like Hash_DRBG in NIST SP 800-90 can be used in situation where true entropy is available, but is slow to collect. Once you have been able to collect 256 bits of entropy (and preferable a bit more to be safe), you can use the entropy to instantiate Hash_DRBG/SHA-256, which will be able to serve billions of random numbers.

Remember these are not good uses:

If you feed SHA-256 with too little entropy (or even smaller input than 256 bits), the output may appear random, but it is not. Smart adversary can be able to abuse this
In other words: if your input to hash is short or too predictable, so shall be the output.

Pseudo-Random Number Generator using SHA-256:
https://www.stat.berkeley.edu/~stark/Java/Html/sha256Rand.htm

But in general randomness is not the same as collision avoidancehttp://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
legendary
Activity: 1260
Merit: 1019
Is there something in Bitcoin Core that prevents the block at height 441372 (with a
Previous Block Hash that is identical to it's own Block hash) from being added to the blockchain?
Yes.
No one node will receive it
Because will ignore INV command thinking that already has this block in the blockchain
legendary
Activity: 1932
Merit: 2077
I rephrase my question:

1) given arbitrary inputs, does the SHA-256 function really generate outputs that are evenly distributed?

2) given block headers** as inputs, does the SHA-256 function really generate outputs that are evenly distributed?

**block header = Version (4 bytes) + Previous Block Hash (32 bytes) + Merkle Root (32 bytes) + Timestamp (4 bytes) + Difficulty Target (4 bytes) + Nonce (4 bytes) = 80 byte = 640 bit  --> fixed length

block hash = block header hash = SHA256(SHA256(Block_Header))

But we assume that the output of the hash function is evenly distributed (so we can calculate target and difficulty for proof-of-work algorithm).

How can we know for sure the output is evenly distributed?
I did a few million hashes of number sequences and other random stuff, for playing cards and dice simulations. It looked evenly distributed.

In your case, what were your inputs? What kind of number sequences? Fixed length (more or less than 256 bit) or not?
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
But we assume that the output of the hash function is evenly distributed (so we can calculate target and difficulty for proof-of-work algorithm).

How can we know for sure the output is evenly distributed?
I did a few million hashes of number sequences and other random stuff, for playing cards and dice simulations. It looked evenly distributed.
legendary
Activity: 3472
Merit: 4801
Nope, the hash X is already used for a block X, so block X' using the same hash X would get "soft" rejected by the network.

I'm not understanding what you are saying.  Can you point out the code in Bitcoin Core that explains what you are saying?



As a new example (saying what I think is the same thing as shorena in a different way)...


Lets say we are at block height 441370 with a hash of 0000...08ac (as of the writing of this post, we actually are)


Now lets say a few minutes later a mining pool solves a block.  The new block at block height 441371 has:
Code:
Block Hash:   0000...075c
Previous Block:  0000...08ac (referring to block height 441370)


Now, lets say 15 minutes later a mining pool solves a block.  The new block at block height 441372 has:
Code:
Block Hash:   0000...075c
Previous Block:  0000...75c (referring to block height 441371)

Is there something in Bitcoin Core that prevents the block at height 441372 (with a Previous Block Hash that is identical to it's own Block hash) from being added to the blockchain?


If not, then when a miner creates the next block having:
Code:
Block Hash:   0000...08ac
Previous Block:  0000...075c

Then how will any Bitcoin Core node know if this new block is at height 441373 (Previous Block: 0000...075c refers to block height 441372) or a fork competing with the other block at height 441372 (Previous Block: 0000...075c refers to block height 441371)?
legendary
Activity: 1050
Merit: 1016
And while in the subject...

Although extremely unlikely, what happens if a new block has the same hash as a previous block?

Bitcoin continues normally. It's a perfectly valid occurrence.

Is that so?

Lets say there are blocks A, B, C and D with B and C having the same hash. B refers to A as normal. C refers to B which may work even though its the same hash, Im not sure. D refers to C and B because they have the same hash. What am I missing? Besides that a 256-bit collision will probably never happen.

Ignoring the fact that the likelihood of 2 blocks outputting the same hash, in the same consensus round (10 mins) is practically zero.  There would be a network fork until the next valid blocks on all of the fork chains were created.

The network would then revert to the fork where the next block/s had the most work performed upon it.

A (not very good) diagram:

CHAIN (WORK = 100) -> Block +1 (HASH X  : WORK = 10) -> Block +2 (HASH Z : WORK 12)
CHAIN (WORK = 100) -> Block +1 (HASH X' : WORK = 10) -> Block +2 (HASH Y : WORK 13) <- This one would win

So block X' cant follow X?

Nope, the hash X is already used for a block X, so block X' using the same hash X would get "soft" rejected by the network.
copper member
Activity: 1498
Merit: 1528
No I dont escrow anymore.
And while in the subject...

Although extremely unlikely, what happens if a new block has the same hash as a previous block?

Bitcoin continues normally. It's a perfectly valid occurrence.

Is that so?

Lets say there are blocks A, B, C and D with B and C having the same hash. B refers to A as normal. C refers to B which may work even though its the same hash, Im not sure. D refers to C and B because they have the same hash. What am I missing? Besides that a 256-bit collision will probably never happen.

Ignoring the fact that the likelihood of 2 blocks outputting the same hash, in the same consensus round (10 mins) is practically zero.  There would be a network fork until the next valid blocks on all of the fork chains were created.

The network would then revert to the fork where the next block/s had the most work performed upon it.

A (not very good) diagram:

CHAIN (WORK = 100) -> Block +1 (HASH X  : WORK = 10) -> Block +2 (HASH Z : WORK 12)
CHAIN (WORK = 100) -> Block +1 (HASH X' : WORK = 10) -> Block +2 (HASH Y : WORK 13) <- This one would win

So block X' cant follow X?
legendary
Activity: 1050
Merit: 1016
And while in the subject...

Although extremely unlikely, what happens if a new block has the same hash as a previous block?

Bitcoin continues normally. It's a perfectly valid occurrence.

Is that so?

Lets say there are blocks A, B, C and D with B and C having the same hash. B refers to A as normal. C refers to B which may work even though its the same hash, Im not sure. D refers to C and B because they have the same hash. What am I missing? Besides that a 256-bit collision will probably never happen.

Ignoring the fact that the likelihood of 2 blocks outputting the same hash, in the same consensus round (10 mins) is practically zero.  There would be a network fork until the next valid blocks on all of the fork chains were created.

The network would then revert to the fork where the next block/s had the most work performed upon it.

A (not very good) diagram:

CHAIN (WORK = 100) -> Block +1 (HASH X  : WORK = 10) -> Block +2 (HASH Z : WORK 12)
CHAIN (WORK = 100) -> Block +1 (HASH X' : WORK = 10) -> Block +2 (HASH Y : WORK 13) <- This one would win
copper member
Activity: 1498
Merit: 1528
No I dont escrow anymore.
And while in the subject...

Although extremely unlikely, what happens if a new block has the same hash as a previous block?

Bitcoin continues normally. It's a perfectly valid occurrence.

Is that so?

Lets say there are blocks A, B, C and D with B and C having the same hash. B refers to A as normal. C refers to B which may work even though its the same hash, Im not sure. D refers to C and B because they have the same hash. What am I missing? Besides that a 256-bit collision will probably never happen.

Edit: a more in depth description of this problem can be found in post #8 by Danny below.

Edit2: answered:

Is there something in Bitcoin Core that prevents the block at height 441372 (with a
Previous Block Hash that is identical to it's own Block hash) from being added to the blockchain?
Yes.
No one node will receive it
Because will ignore INV command thinking that already has this block in the blockchain


reference -> https://bitcoin.org/en/developer-reference#getblocks
legendary
Activity: 3878
Merit: 1193
And while in the subject...

Although extremely unlikely, what happens if a new block has the same hash as a previous block?

Bitcoin continues normally. It's a perfectly valid occurrence.
Pages:
Jump to: