Author

Topic: SHA-256 as a boolean function (Read 11358 times)

sr. member
Activity: 507
Merit: 253
April 30, 2013, 11:03:39 AM
#24
Yes, it's certainly interesting, but computationally it appears it can be no more efficient than brute force…
Also, if it does work better than even ASIC brute-force devices, it wouldn't adversely affect Bitcoin, even if Bitcoin doesn't have a contingency plan for partially-breaking SHA-256 yet.
legendary
Activity: 1050
Merit: 1000
You are WRONG!
April 30, 2013, 02:53:27 AM
#23
sr. member
Activity: 507
Merit: 253
April 29, 2013, 11:34:52 PM
#22
Check out this.
full member
Activity: 210
Merit: 100
January 08, 2012, 01:14:17 PM
#21
I think something like this http://hashcat.net/wiki/mask_attack would be more reasonable/realistic. It used to take me 6 hours to brute force a WPA PSK of 8 digits, now I can do it in under 30 minutes.

Let's quote from the page you're linking to: In Mask attack we know about humans and how they design passwords.

Mask attack is NO BETTER than a traditional brute-force attack. You're breaking your own (I sure hope so) WPA key faster only because it's a shitty key.
Attempting to break a sufficiently random passphrase, like RU8wNkg4R2uTQ0tAx will only mean that you'll have to try most of the possible passphrase-space.

I regret to say, the example you posted (breaking a dictionary-based WPA passphrase) is completely orthogonal to bitcoin mining.
staff
Activity: 4326
Merit: 8951
January 02, 2012, 01:39:01 AM
#20
I think something like this http://hashcat.net/wiki/mask_attack would be more reasonable/realistic. It used to take me 6 hours to brute force a WPA PSK of 8 digits, now I can do it in under 30 minutes.

This is already how mining works, (barring surprising weakness in SHA256) the searched keyspace has uniform prior probability... the parts that have known acceptable values are already fixed.
hero member
Activity: 532
Merit: 500
January 01, 2012, 07:20:27 PM
#19
I think something like this http://hashcat.net/wiki/mask_attack would be more reasonable/realistic. It used to take me 6 hours to brute force a WPA PSK of 8 digits, now I can do it in under 30 minutes.
staff
Activity: 4326
Merit: 8951
January 01, 2012, 07:05:11 PM
#18
And then I haven't yet tried boolean function optimization and statistical early rejection.

I tried early statistical early rejection using about a  giga-hash-second-month of shares data (or so) from real miners and wasn't able to find anything useful given intermediate state bits or state bit pairs. Perhaps with your graph techniques you'll find something more complicated which is useful...

but I'm doubtful— we have something like 3x the number of rounds of the best attack and the complete construction gets complete diffusion many times over and the later your early out distinguisher has to run the better it has to be in order to be worthwhile.
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
December 31, 2011, 11:17:02 PM
#17
Wow.
I'm surprised.
OK this is something I've been looking at for a while and not got very far with it.
My main thought about it all this time (over a few months) has been why I've heard no one else even mention it.
My first and last comment on the subject unless I complete my work on it some time in the far future Smiley
Edit: I should add that my work on it is related but follows a different path, with what I hope to be a better chance of performance increase.
I was hoping 10x ... but that could be dreaming Smiley
legendary
Activity: 1022
Merit: 1033
December 30, 2011, 05:32:20 PM
#16
Concrete numbers: one SHA-256, 32 boolean variables in nonce position, no BDD optimization, just DAG nodest:

 * 99940  operations to compute last 1 bit of output
 * 101349 operations to compute last 16 bits of output
 * 102510 operations to compute last 32 bits of output
 * 104780 operations for 64 bits

(includes OR'ing them).

So computing more bits costs more, but not very much.

Now let's compare it with a traditional SHA-256 implementation, this page:https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU#Why_are_AMD_GPUs_faster_than_Nvidia_GPUs.3F gives an idea: "~3250 to execute the SHA-256 compression function" (on Nvidia GTX 590).

From https://en.bitcoin.it/wiki/Mining_hardware_comparison we get that 1024 ALUs working at 1215MHz make 193 Mhash/sec. 1024*1215/193 = 6446. Mining requires 2 SHA-256 has operations, so let's half that: 6446/2=3223. Quite similar to number on wiki page, so I guess that's it.

To compute hash for 32 different combinations NVIDIA GPU requires 3223*32 = 103136 cycles.

This is same number as required by boolean function implemention, so they should be roughly equal in performance, and perhaps boolean function would be even faster with additional optimizations (although second SHA-256 might not optimize as well as the first one).

So on the optimization potential, at least 2000 DAG nodes can be reduced to sufficiently simple (< 1000 leafs) BDDs. This number is going to get higher after I remove 5 variables to compute 32 combos in parallel.

Then there are 'hybrid' BDDs which combines dynamic nodes with static ones, it is going to penetrate DAG even further.

And then I haven't yet tried boolean function optimization and statistical early rejection.

So some speedup for NVIDIA GPUs and CPU mining isn't ruled out. AMD hardware is much harder to beat. But I haven't yet looked at what advanced instructions it supports, things like BFI_INT might replace several graph nodes.
legendary
Activity: 1022
Merit: 1033
December 30, 2011, 08:07:16 AM
#15
Off-topic.  You're not talking about boolean logic implementation anymore.  These optimizations apply to traditional implementations just as well.

Sure, I've just illustrated that SHA-256 round's diffusion isn't as good as people think about it. (BTW, even though these optimizations are applicable, I'm not sure that existing miners exploit all of them. Early miners like cpuminer do not seem to take advantage of them at all.)

And if it is not so good, there are optimization possibilities on bit level too (which aren't accessible on word level).

Quote from: jetmine
Your point was to calculate just one output bit using a big and wide input logic function (and then parallelize to make good use of available instruction sets).

I think you've missed the main part of my reply -- I'm not calculating just one bit of output, I can calculate all bits of output using this method. So your criticism isn't applicable.

Let's say H is content of H register on 64th round, and H[0] is first bit of it, H[1] is second and so on.

We need to find input such that

Code:
     H[0] or H[1] or ... H[31] = 0

     or alternatively:

     not (H[0]) and not (H[1]) ... not(H[31]) = 1
 

That would be true for approximately one in 4 billion inputs. If that's not enough, we can include bits of G into a target function. Or we can compute fewer bits of H if that helps. Or we can express values of H[0] in terms of other expressions and do early rejection on that level.

Quote from: jetmine
A modest speedup is not achievable, only a very tiny one (if any) compared to an equally well optimized traditional implementation.

I don't know yet. So far, I've got pretty good results with BDD, better than I expected.

Quote from: jetmine
want to calculate more bits of it.  But you've already used 98% of the budget.  Now what?  Calculate a traditional result (98+100=198%)?  Or calculate another single bit (98+98=196%)?

Calculating another single bit would take only 0.1% because it depends on exactly same inputs as the first bit. Or see above -- it is possible to formulate whole mining problem in a form of boolean expression.

Quote from: jetmine
I hope I explained it sufficiently clear this time.

It was actually sufficiently clear the first time too, but for some reason you're missing that it is possible to reuse computations to calculate all bits of output.

In that case 2% speedup would be 2% speedup, plain and clear.

Right now I see BDDizing fringe nodes as only viable optimization, so I guess we need to wait until I'll BDDize all what is BDDizable before jumping to conclusions.

But thanks for looking into this, so far you're the one who paid most attention.
newbie
Activity: 53
Merit: 0
December 30, 2011, 07:30:10 AM
#14
so ~11% of computations go away.

Off-topic.  You're not talking about boolean logic implementation anymore.  These optimizations apply to traditional implementations just as well.

Your point was to calculate just one output bit using a big and wide input logic function (and then parallelize to make good use of available instruction sets).

It is true that I cannot chop away, like, 99% of computation, but some modest speedup might be possible.

A modest speedup is not achievable, only a very tiny one (if any) compared to an equally well optimized traditional implementation. Yet even a modest speedup in the boolean part would not translate into a total algorithm speedup.  Imagine that you for example need only 98% for one bit (compared to a traditional algorithm).  There is a 50% chance that the bit is 0 and you want to calculate more bits of it.  But you've already used 98% of the budget.  Now what?  Calculate a traditional result (98+100=198%)?  Or calculate another single bit (98+98=196%)?  It could be 0, too, and further complicate your life.

To calculate just a single bit makes only sense if you can achieve a generous speedup, so you can use it as "fast oracle" to decide whether or not to run the "expensive full calculation" on a particular nonce.

As said, the cryptographic nature of SHA2 makes it impossible to achieve the generous speedup, nor a modest one.  Therefore the whole idea is doomed to fail.

I hope I explained it sufficiently clear this time.
legendary
Activity: 1022
Merit: 1033
December 30, 2011, 07:14:47 AM
#13
For people curious about possibility of optimization of 'fringe' nodes, here are some BDD (binary decision diagram) complexity results: http://paste.lisp.org/display/126772

These are results for all subtrees of height 13 in computation DAG after constant propagation and other simple optimizations. Vars is effective number of meaningful variables in BDD after optimization, leafs means number of distinct outcomes.

E.g. a fully random boolean expression of 32 vars would have 32 vars and something like 2^32 leafs.

Here maximum we have is 32 vars, 19676 leafs, while most are very simple, some even constants (zero variables).

Of course, it doesn't mean that SHA-256 is broken as registers after all 64 rounds have height of ~3500. At high heights it becomes close to random boolean function, with billions of distinct results. But low heights can be pre-computed and optimized out. I don't know how much yet.
legendary
Activity: 1050
Merit: 1000
You are WRONG!
December 30, 2011, 06:55:13 AM
#12
If possible, what would the benefits of this be?

Let's say 10-20% speedup in an ideal case.

almost none. we could calculate sha256 a little bit faster, but only per block. we would need to optimize for every block...

I believe optimization can be done fairly quickly, thus its cost would be negligible. Especially if you use GPU for main computation and CPU for code generation. I would only be concerned with OpenCL/CUDA compiler not being fast enough, but that's just a technical difficulty.
dream on...
legendary
Activity: 1022
Merit: 1033
December 30, 2011, 06:34:34 AM
#11
If possible, what would the benefits of this be?

Let's say 10-20% speedup in an ideal case.

almost none. we could calculate sha256 a little bit faster, but only per block. we would need to optimize for every block...

I believe optimization can be done fairly quickly, thus its cost would be negligible. Especially if you use GPU for main computation and CPU for code generation. I would only be concerned with OpenCL/CUDA compiler not being fast enough, but that's just a technical difficulty.
legendary
Activity: 1022
Merit: 1033
December 30, 2011, 06:17:18 AM
#10
From what I understand (here + usenet) you are near to need "almost as few" resources to calculate one bit, as a traditional method needs to calculate the full result.  If this is correct, it matches quite well with what I would expect from following through your idea mentally.

I was thinking about this approach initially, but that doesn't make sense as that one bit shares a lot of computations with bits nearby (it is fairly obvious, I just wanted to verify that actual results match intuitive ones and there is no unexpected speedup). Now I'm aiming at computing as many bits as necessary. E.g. 32 or 64 of them. It is true that computing one bit costs about as much as computing 32 of them due to re-use of common dependencies. Computing 64 bits would require somewhat more due to a round structure of SHA-256 (see below).

Quote from: jetmine
The problem is the non-linearity from the adders.

That's right.

Quote from: jetmine
input will cover (almost) all bits.  Therefore the logic function for the MSB will need all input bits.  A function for the near-MSB will need almost all input bits. Etc.

But half of result bits are pretty far from MSB. Smiley They will be mixed eventually, but only after a few rounds.

Quote from: jetmine
Since SHA2 is cryptografically good, there is not much to optimize away by using logic functions.  The dynamic inputs avalanche over the static ones very quickly.  A few steps into the algorithm,

Many people say that, and initially I had a same impression. But try inspecting SHA2 compression function closer: one round does very little. It modifies only two registers (and shifts the rest), eating only 32 bits of input block at once.

Thus when you do SHA-256 for bitcoin mining, you don't need to compute first three rounds (of 64) at all. That's already a ~5% speedup. Then fourth round boils to a single addition. Then fifth is pretty simple too, as input bits are constant, so you just need to compute S0 (which boils down to two XORs per bit if you ignore ROTR) and do an addition with constant. And so on. So it only requires all operations by round 8 or so.

Likewise, we are only interest in H on 64th round. Which is just G at round 63. Which is F at round 62. Which is E at round 61. Which is D + S1 + Ch + W[60] + K[60] at round 60.

So you can chop off last four rounds. Only 57 rounds out of 64 are required, so ~11% of computations go away. Furthermore, a lot of leftover rounds can be simplified out. Furthermore, a lot of block expansion parts can be simplified. Furthermore, some bits in first few meaningful rounds have very few dependencies and can be simplified.

Quote from: jetmine
and almost every bit is somehow influence by an MSB or near-MSB (which means that all input bits are relevant now).  You just cannot save much.

It is true that I cannot chop away, like, 99% of computation, but some modest speedup might be possible.

legendary
Activity: 1050
Merit: 1000
You are WRONG!
December 30, 2011, 05:43:56 AM
#9
If possible, what would the benefits of this be?
almost none. we could calculate sha256 a little bit faster, but only per block. we would need to optimize for every block...
hero member
Activity: 742
Merit: 500
December 29, 2011, 07:38:09 PM
#8
If possible, what would the benefits of this be?
newbie
Activity: 53
Merit: 0
December 29, 2011, 03:13:46 PM
#7
Anybody wants to discuss?

It's an interesting thought, but in my opinion it is doomed to fail.

From what I understand (here + usenet) you are near to need "almost as few" resources to calculate one bit, as a traditional method needs to calculate the full result.  If this is correct, it matches quite well with what I would expect from following through your idea mentally.

The problem is the non-linearity from the adders.  While you can easily bitslice the other logic, or adders alone, you can't do it with adders + logic.  As soon as you merge the two domains, for example when you perform a logic function on an adder output, you need to know the actual values of bits.

When adders are expressed as logic functions, the fan-in is huge.  It's natural.  If you want to know the MSB of an adder result, you need to know the carry.  For the carry, you need to know the lower bits and their carries.  As a consequence, the input will cover (almost) all bits.  Therefore the logic function for the MSB will need all input bits.  A function for the near-MSB will need almost all input bits. Etc.

Since SHA2 is cryptografically good, there is not much to optimize away by using logic functions.  The dynamic inputs avalanche over the static ones very quickly.  A few steps into the algorithm, and almost every bit is somehow influence by an MSB or near-MSB (which means that all input bits are relevant now).  You just cannot save much.

To calculate one output bit you need to do (almost) as many calculations as a traditional implementation.  I think this is where you are now, except that your adder replacements are less efficient in terms of "operation count" (because you've separated the single add operation out as several logic operations) and therefore you get a higher operation count.

If I've misunderstood something and you're not really at this point, you should try harder.  It must be possible to get there.

However, there is another thing to think about.  You are calculating just one bit of the result, and will probably (50%) want to calculate more bits.  To do that, you will need to process again (almost) all inputs.  This is where the real inefficiency starts.  Your algorithm is inherently not able to "cache" partial intermediates.  You can think of the traditional algorithm with adders as calculating partial results which are then used several times as inputs into the next stages.

Your algorithm, on the other hand, has to re-calculate everything from scratch, because it didn't separate out what is result of an adder for example, but rather fusioned every single add operation with thousands of other operations into one big logic blob.

This is my biggest complain, and makes me think the approach is doomed.

  • You are missing out on the partial result reuse.
  • You can't get significantly below 100% resource usage for a single bit calculation either.
  • You would need to get below 50% for it to make sense.
  • The cryptographical properties of SHA2 make it impossible to achieve 50%.
legendary
Activity: 1050
Merit: 1000
You are WRONG!
December 28, 2011, 07:57:59 AM
#6
Yep, right. So it is 768 bits in general case. But it doesn't make any difference in this context.
yes because you need to optimise for every single block, as the static-variables, will change with every block.
also the internal state is larger then 256 bit, and it can be precomputed per block, but it will still change every block.

i think it will take longer time to optimise for every block, then it will to just do the plain work.
legendary
Activity: 1022
Merit: 1033
December 28, 2011, 07:53:50 AM
#5
Yep, right. So it is 768 bits in general case. But it doesn't make any difference in this context.
legendary
Activity: 1050
Merit: 1000
You are WRONG!
December 28, 2011, 06:59:50 AM
#4
More-or-less so. The aim is, indeed, to 'reduce needless duplication of work'. However, it's not possible to process all inputs at once.

Generally speaking, main part of SHA-256 is a function of 512 boolean variables. We're going to change only 32 boolean variables to look for certain result, so we can assume other 480 variables fixed. So for a certain block header we can reduce SHA-256 to a function of 32 variables instead of 512 ones, optimizing it in process.

Furthermore, as we're using hardware which can do 32, 64, 128 or 256 binary operations at once, we can utilize this by considering many combinations at once. Let's say for a 64-bit hardware we can, essentially, vary 6 bits at once. Thus we already have a function of (32-6) = 26 variables, so we need only 2^26 combinations. Furthermore, it is possible to optimize things a bit for each such combination.

Processing 64 (or more) combinations at once merely makes it competitive with traditional implementation, it does not reduce number of bit ops done by CPU. Number of bin ops should be reduced at a step when we assume certain bits fixed.


you are forgetting the SHA also is having a internal state...
legendary
Activity: 1022
Merit: 1033
December 28, 2011, 06:52:27 AM
#3
More-or-less so. The aim is, indeed, to 'reduce needless duplication of work'. However, it's not possible to process all inputs at once.

Generally speaking, main part of SHA-256 is a function of 512 boolean variables. We're going to change only 32 boolean variables to look for certain result, so we can assume other 480 variables fixed. So for a certain block header we can reduce SHA-256 to a function of 32 variables instead of 512 ones, optimizing it in process.

Furthermore, as we're using hardware which can do 32, 64, 128 or 256 binary operations at once, we can utilize this by considering many combinations at once. Let's say for a 64-bit hardware we can, essentially, vary 6 bits at once. Thus we already have a function of (32-6) = 26 variables, so we need only 2^26 combinations. Furthermore, it is possible to optimize things a bit for each such combination.

Processing 64 (or more) combinations at once merely makes it competitive with traditional implementation, it does not reduce number of bit ops done by CPU. Number of bin ops should be reduced at a step when we assume certain bits fixed.

donator
Activity: 266
Merit: 252
I'm actually a pineapple
December 28, 2011, 06:21:53 AM
#2
It seems like the main benefit from your proposed approach is that you're considering the function over a bunch of (uniformly distributed) inputs simultaneously, so you'd essentially be trying to reduce needless duplication of work across separate calls to the hash function? At first glance, it does seem like it could be a fruitful direction to pursue.
legendary
Activity: 1022
Merit: 1033
December 28, 2011, 05:58:00 AM
#1
I'm experimenting with an alternative way to compute SHA-256 -- through tracing boolean functions.

Here's the idea:

1. Represent SHA-256 as a huge-ass DAG (directed acyclic graph) with nodes being boolean operations -- AND, OR, NOT, XOR. (One SHA256 block expansion + transform has 121k nodes after some optimization.)

2. Since we go through ~4 billion different nonces it makes sense to assume all other variables constant and optimize some nodes away. I get 86k nodes with 32 variables left.

3. Pre-compute fringe nodes with few dependencies via binary decision diagrams, normal forms or whatever. Optimize out parts of graph which do not need to be computed with certain configurations of nonce.

4. Now we can compute SHA-256 with about as many CPU instructions as there were nodes in DAG. It is possible to compute many combinations at once in parallel, e.g. on CPU which can do 64-bit operations in parallel it is possible to compute 64 combinations.

I've got some preliminary results, without precomputation it, theoretically, should be about as fast as traditional implementation of SHA-256 on CPUs, probably a bit slower due to required load/store operations. With pre-computation and other optimizations it might be faster. Or maybe not. I don't know.

Anybody wants to discuss? I'm not an expert in discrete math, so maybe some SAT solver or boolean function optimization techniques I don't know would be useful here.

Here's discussion in sci.crypt: http://groups.google.com/group/sci.crypt/browse_thread/thread/06ffe40905f725fe

Theoretical considerations:

 * traditional implementation makes advantage of native integer addition, it is very expensive to express addition in form of boolean ops
 * boolean function implementation, however, does not need rotateright -- it is optimized away at graph building stage
 * further, boolean function implementation can take advantage of pre-computation (I see some GPU implementations pre-compute some things, but they cannot go deep).

So this might be of interest for implementation on GPUs which do not have ROTR.
Jump to: