Off-topic. You're not talking about boolean logic implementation anymore. These optimizations apply to traditional implementations just as well.
Sure, I was just illustrating that the diffusion of a SHA-256 round isn't as good as people think it is. (BTW, even though these optimizations are applicable, I'm not sure existing miners exploit all of them. Early miners like cpuminer don't seem to take advantage of them at all.)
And if the diffusion isn't that good, there are optimization possibilities at the bit level too (ones that aren't accessible at the word level).
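For anyone who wants to sanity-check the diffusion claim, here is a quick sketch of my own (the round function follows the FIPS 180-4 definitions; the random state and the single-round experiment are just an illustration, not anything from this thread):

```python
# Flip one input bit, run a single SHA-256 round, and count how many of the
# 256 state bits change. When a bit of register b is flipped, only the new a
# (via Maj and carry propagation) and the new c (which is just the old b) can
# differ, so diffusion after one round is limited to a handful of bits.
import random

MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_round(state, k, w):
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ ((~e) & g)
    t1 = (h + s1 + ch + k + w) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def hamming(u, v):
    return sum(bin(x ^ y).count('1') for x, y in zip(u, v))

random.seed(0)
state = tuple(random.getrandbits(32) for _ in range(8))
k, w = random.getrandbits(32), random.getrandbits(32)

flipped = list(state)
flipped[1] ^= 1                      # flip bit 0 of register b
print(hamming(sha256_round(state, k, w),
              sha256_round(tuple(flipped), k, w)))
```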
Your point was to calculate just one output bit using a big, wide-input logic function (and then parallelize to make good use of the available instruction sets).
I think you've missed the main part of my reply -- I'm not calculating just one bit of the output; I can calculate all bits of the output using this method. So your criticism doesn't apply.
Let's say H is the content of the H register after the 64th round, H[0] is its first bit, H[1] the second, and so on.
We need to find an input such that
H[0] or H[1] or ... or H[31] = 0
or, equivalently:
not(H[0]) and not(H[1]) and ... and not(H[31]) = 1
That would be true for approximately one in 4 billion inputs. If that's not enough, we can include bits of G in the target function. Or we can compute fewer bits of H if that helps. Or we can express the value of H[0] in terms of other sub-expressions and do early rejection at that level.
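To make this concrete, here is a minimal bit-sliced sketch of my own (the names h_bits and surviving_lanes and the lane layout are assumptions, not something from the thread): bit j of each of the 32 lane words holds H[i] for candidate j, so OR-ing the words together evaluates the whole reject predicate for 64 candidates at once.

```python
# Bit-sliced evaluation of  not(H[0]) and not(H[1]) and ... and not(H[31]):
# a candidate survives only if its bit is 0 in every lane word, i.e. its
# whole H register is zero. How the lane words themselves are produced
# (the actual bit-sliced SHA-256 circuit) is outside this sketch.
LANES = 64  # one bit position per candidate nonce

def surviving_lanes(h_bits):
    """Return the lane indices whose 32 H bits are all zero."""
    assert len(h_bits) == 32
    reject = 0
    for word in h_bits:                      # OR-reduce the output bits
        reject |= word
    accept = ~reject & ((1 << LANES) - 1)    # lanes with every H[i] == 0
    return [j for j in range(LANES) if (accept >> j) & 1]

# Made-up lane words: every candidate except number 5 has junk bits set.
junk = sum(1 << j for j in range(LANES) if j != 5)
print(surviving_lanes([junk] * 32))          # -> [5]
```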
A modest speedup is not achievable, only a very tiny one (if any) compared to an equally well optimized traditional implementation.
I don't know yet. So far, I've got pretty good results with BDD, better than I expected.
...want to calculate more bits of it. But you've already used 98% of the budget. Now what? Calculate a traditional result (98+100=198%)? Or calculate another single bit (98+98=196%)?
Calculating another single bit would take only about 0.1% more, because it depends on exactly the same inputs as the first bit. Or see above -- it is possible to formulate the whole mining problem as a single boolean expression.
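As a toy illustration of that sharing (mine, not from the discussion): two output bits of a bit-sliced three-input addition, the kind of structure SHA-256 is built from, reuse a common intermediate, so the second bit costs only a few extra gates on top of the first.

```python
# Count gates while computing bits 0 and 1 of x + y + z per lane.
# Both outputs depend on the same inputs, and they share the x ^ y term,
# so computing them together is cheaper than computing them separately.
GATES = {"count": 0}

def gate(op, a, b):
    GATES["count"] += 1
    return {"xor": a ^ b, "and": a & b, "or": a | b}[op]

def add3_low_bits(x, y, z):
    t = gate("xor", x, y)                         # shared sub-expression
    bit0 = gate("xor", t, z)                      # x ^ y ^ z
    bit1 = gate("or", gate("and", x, y),
                       gate("and", z, t))         # majority(x, y, z)
    return bit0, bit1

bit0, bit1 = add3_low_bits(0b1011, 0b0110, 0b1101)
print(GATES["count"])   # 5 gates for both bits; separately it would be 2 + 4
```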
I hope I explained it sufficiently clearly this time.
It was actually sufficiently clear the first time too, but for some reason you're missing that the computations can be reused to calculate all bits of the output.
In that case a 2% speedup would be a 2% speedup, plain and clear.
Right now I see BDDizing the fringe nodes as the only viable optimization, so I guess we need to wait until I've BDDized everything that is BDDizable before jumping to conclusions.
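For anyone who wants to play with the idea, here is a toy-scale sketch of what BDDizing a tiny fringe of the expression looks like, using the Python dd package (a real BDD library; I'm assuming its autoref interface from memory, and the two-gate "fringe" is my own stand-in, nowhere near the scale of the real SHA-256 circuit):

```python
# Build BDDs for one bit of Ch and Maj, then conjoin the "both outputs are
# zero" constraint, which is the shape of the early-reject test above.
from dd.autoref import BDD

bdd = BDD()
bdd.declare('e0', 'f0', 'g0', 'a0', 'b0', 'c0')
e0, f0, g0 = bdd.var('e0'), bdd.var('f0'), bdd.var('g0')
a0, b0, c0 = bdd.var('a0'), bdd.var('b0'), bdd.var('c0')

ch0 = (e0 & f0) | (~e0 & g0)                 # one bit of Ch(e, f, g)
maj0 = (a0 & b0) | (a0 & c0) | (b0 & c0)     # one bit of Maj(a, b, c)

target = ~ch0 & ~maj0                        # require both output bits == 0

print(len(bdd))                # total BDD nodes built for this fringe
print(bdd.count(target, 6))    # satisfying inputs out of 2**6 (16 here)
```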
But thanks for looking into this; so far you're the one who has paid the most attention.