Author

Topic: Modelling and Simulating Hash Functions (Read 5215 times)

hero member
Activity: 560
Merit: 517
September 17, 2013, 04:09:49 AM
#4
Quote
But bare single SHA-2 can accept as input an infinite set, right?
I see that mentioned a lot around these parts.  To be really strict about it, SHA-2's input set is finite, because it uses a Merkle–Damgård construction.  In other words, before a message is chucked into SHA-2's meat grinder, it's padded and the message length (in bits) is appended as a 64-bit big endian integer.  If I remember MD compliance correctly, this forces the message length to be less than 2^64 bits in length (~2.3 exabytes).  So ... technically finite, but only Google would care.

Quote
don't know about PBKDF2 to be honest, but at least SHA256^2 has then the same order
You're right, those are good points about the properties if the model were a bijection.  Regarding PBKDF2, you'd get an order ~63% the size of the input set.  You get the same result when XORing random integers.  e.g. if you XOR 256 random integers, each within the range [0, 255], you can expect to see only ~161 unique outputs on average.
legendary
Activity: 3416
Merit: 1912
The Concierge of Crypto
September 17, 2013, 03:22:29 AM
#3
But bare single SHA-2 can accept as input an infinite set, right? It can accept any string from 1 byte to 1 gigabyte. Only the output is limited to 256 bits.

And then there is SHA-512.
legendary
Activity: 1135
Merit: 1166
September 17, 2013, 12:55:06 AM
#2
TL;DR: Recursively applying hash functions like SHA-2 insignificantly weakens the function, if one ignores computational complexity.  Blah, blah, blah, programming is fun and cryptography is awesome.

I'm not a cryptographer either, but I think your finding can be easily argued qualitatively also theoretically-only.  If SHA256 is a bijection (or equivalently injective or surjective) on the set of possible 256 bit values, then chained execution doesn't change what you call its "order" at all (don't know about PBKDF2 to be honest, but at least SHA256^2 has then the same order).  If that is not true (which seems plausible if we intuitively think of hashing as involving something "random"), then SHA256^2 is necessarily "weaker" than a single SHA256 - although probably only insignificantly.  I haven't looked at your code, but your numbers seem to give a nice and interesting expectation of about how strong this weakening should be.
hero member
Activity: 560
Merit: 517
September 16, 2013, 10:55:48 PM
#1
Programmer Adventures in Modelling and Simulating Hash Functions

I've been curious about the properties of hash functions lately.  I had three questions, that I attempted to answer:

  • What is the order of SHA-2 (x)?
  • What is the order of SHA-2 (SHA-2 (x))?
  • What is the order of PBKDF2 (x)?

For this research, I define order to mean the size of the output set of the given function (in other words, how many unique outputs there are).  Ideally, this would be 2^256 for SHA-256.  If the cost of executing the function is considered to be one operation, than the complexity of finding a collision is directly related to the order of the function.

When the input set is small, finding the order of a given function is easy.  Just run all possible inputs through the function, and count the number of unique results.  Unfortunately, the input set of SHA-256 is impossibly large (though technically finite).  Since I'm a programmer by nature, I chose to answer the above questions by modelling and simulating.

The model I use is a random oracle.  These represent perfect hash functions; they are impossible to invert except through brute-force.  For small input sizes, one can generate a random oracle easily.  It's just a random LUT!  To answer the first question using this model, we need to know the expected order of a random oracle.  This is equal to the average of all possible random oracles (for a given input size).  Again, we run into a problem, because the set of possible random oracles is too large.  So, instead, I estimate the expected order by generating a suitably large number of random oracles and averaging their orders.

To answer the last two questions, we do the same, but apply the random oracle in the specified construct.  e.g. Oracle (Oracle (x)).  With enough samples, we should hopefully get good estimates for the expected orders and thus have answers to my questions.  The Python code for all of this is listed at the bottom of this post.  In the meantime...

Results

Code:
Testing 8 bits
Oracle:  162.212890625
Exponential:  19.470703125
Chained:  120.920898438

Testing 12 bits
Oracle:  2589.109375
Exponential:  95.3359375
Chained:  1920.23925781

Testing 16 bits
Oracle:  41426.6992188
Exponential:  1278.45996094
Chained:  30704.7412109

Testing 17 bits
Oracle:  82854.8759766
Exponential:  2516.46582031
Chained:  61408.5410156

Testing 18 bits
Oracle:  165699.530273
Exponential:  5046.56347656
Chained:  122825.053711

Testing 19 bits
Oracle:  331412.545898
Exponential:  10073.0644531
Chained:  245653.294922

Testing X bits means running the simulations with an input set of size 2 ^ X.  Oracle is just the plain random oracle model.  Exponential is 101 iterations of a random oracle iteratively applied (e.g. Oracle (Oracle (x)), but with 101 Oracles).  Chained is similar to PBKDF2 (password and salt combined, no HMAC).

The results are certainly interesting.  Over all the tested input sizes, we see consistently that the estimated expected order of a hash function is only 63% of the input set.  In other words, based on these models, one would expect the order of SHA-256 to be ~2^255.3.  Pretty good!  On the other hand, SHA-256 in an Exponential construct is weaker, resulting in an order of ~2^250.3.  SHA-256 in PBKDF2 would be ~2^254.9.

None of this is particularly surprising, but it's educational to have some concrete numbers.  Of particular interest is the result for the Exponential construct.  Some have conjectured that double SHA-2 is weaker than regular SHA-2.  While this seems to be strictly true, the weakness is nowhere near significant.  After 100 iterations of SHA-2, one would expect to see the output space cut down to 1% of its usual size.  But 1% of 2^256 is still impossibly massive and impossible to search.  If we factor in increased computational complexity, searching the space is still more expensive than single SHA-2.  (NOTE: Doubling hash functions can have other benefits, too.  e.g. HMAC_SHA1 should remain strong in spite of SHA1's weaknesses)

Chained (pseudo-PBKDF2) is interesting too.  At its core it uses the hash function recursively like Exponential does, so seemingly the larger your iteration parameter for PBKDF2 is, the less pseudo entropy you're injecting into the system.  But the expected order is still quite large, and of course computational complexity grows drastically.

Perhaps all of this could have been easily proven using modern algebra, but I'm a better programmer than I am a mathematician.  So to all those crypto-interested programmers out there, I hope this has been useful.

The Code

Code:
import random
import os


# If True, causes the PRNG to be reseeded with entropy often.
EXTRA_RANDOM = True


# Represents an instance of a PseudoRandom Function
class PRF (object):
def __init__ (self, func, input_bits):
self.input_bits = input_bits
self.func = func

# Calculates the size of the output set
def output_order (self):
outputs = set ()

for x in xrange (2 ** self.input_bits):
y = self.func (x)
outputs.add (y)

return len (outputs)

# Estimates the size of the output set
def estimate_output_order (self, num_samples=2**16):
num_samples = min (num_samples, 2 ** self.input_bits)
inputs = random.sample (xrange (2 ** self.input_bits), num_samples)
outputs = set ()

for x in inputs:
y = self.func (x)
outputs.add (y)

return len (outputs) * (2 ** self.input_bits) / num_samples


# A real random oracle generator
class RandomOracle (object):
def __init__ (self, input_bits):
self.input_bits = input_bits

def generate (self):
if EXTRA_RANDOM:
random.seed (os.urandom (32))

permutation = [random.randrange (2 ** self.input_bits) for i in xrange (2 ** self.input_bits)]

def model (x):
return permutation[x]

return PRF (model, self.input_bits)


# Generates exponential versions of the given PRF generator.
# i.e. y = PRF (PRF (PRF (x)))
class ExponentialPRF (object):
def __init__ (self, prf, iterations):
self.prf = prf
self.iterations = iterations

def generate (self):
prf = self.prf.generate ()

def model (x):
U = x

for i in xrange (self.iterations):
U = prf.func (U)

return U

return PRF (model, prf.input_bits)


# Generates chained versions of the given PRF generator.
# i.e. y = PRF (x) ^ PRF (PRF (x)) ^ PRF (PRF (PRF (x))) ^ ...
class ChainedPRF (object):
def __init__ (self, prf, iterations):
self.prf = prf
self.iterations = iterations

def generate (self):
prf = self.prf.generate ()

def model (x):
U = x
T = 0

for i in xrange (self.iterations):
U = prf.func (U)
T = T ^ U

return T

return PRF (model, prf.input_bits)


# Given a PRF generator, estimates the expected size of the output set.
def estimate_family_order (generator, num_samples=1024):
total = 0

for i in xrange (num_samples):
prf = generator.generate ()
total += prf.output_order ()

return total / float (num_samples)


# Some tests
for bits in [8, 12, 16, 17, 18, 19, 20, 21, 22, 23, 24]:
print "Testing %d bits" % bits

oracle = RandomOracle (bits)
exponential = ExponentialPRF (oracle, 101)
chained = ChainedPRF (oracle, 101)

print "Oracle: ", estimate_family_order (oracle)
print "Exponential: ", estimate_family_order (exponential)
print "Chained: ", estimate_family_order (chained)
print ""

TL;DR: Recursively applying hash functions like SHA-2 insignificantly weakens the function, if one ignores computational complexity.  Blah, blah, blah, programming is fun and cryptography is awesome.
Jump to: