Anyway, interesting work you've done on this, but I'm skeptical of "I'd guesstimate a further 10x as the room for improvement" - it's difficult to make accurate guesstimates like that.
In regards to my guesstimation: of course it is not accurate... it's in the word. My way of thinking: a new graphics card simply delivers 2-3x more performance than my graphics card. There are AES implementations which are ~2x more performant than mine. And since this is my first OpenCL code, it is probably not as good as it gets... a further 2x may be possible here just from how you deploy the workload onto the graphics card. 2-3x * ~2x * 2x ≈ 10x.
In regards to the lookup table: you have ~250,000 (let's call them:) cache entries of 4KB each, ~1GB in total. With the lookup table you have to calculate every cache entry once (256k calculations), which involves generating 64 SHA-512 hashes per cache entry (64 hashes x 64-byte digests = 4KB). If you calculate them on the fly, you have to calculate every cache entry (in the very worst case) 16 times. Now that seems like a bad number at first, because it would take 16 times as long to calculate these cache entries, but you have to put it in context! Or more specifically: what would be the overall performance decrease?
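To make that cost model concrete, here is a minimal sketch (in Java, since that is what I benchmarked with) of what generating one cache entry could look like. The seeding and chaining here are hypothetical - the real HOdl code may differ - it only illustrates the relationship 64 SHA-512 hashes x 64-byte digests = one 4KB entry:

Code:

import java.nio.ByteBuffer;
import java.security.MessageDigest;

public class CacheEntry {
    static final int HASHES_PER_ENTRY = 64;              // 64 x 64-byte digests = 4KB
    static final int ENTRY_SIZE = HASHES_PER_ENTRY * 64;

    // Hypothetical derivation: chain SHA-512 starting from (seed || index).
    // The real algorithm may seed and chain differently; the cost is the same:
    // one cache entry = 64 SHA-512 invocations.
    static byte[] generateEntry(byte[] seed, long index) throws Exception {
        MessageDigest sha512 = MessageDigest.getInstance("SHA-512");
        sha512.update(seed);
        sha512.update(ByteBuffer.allocate(8).putLong(index).array());
        byte[] digest = sha512.digest();                  // hash 1 of 64
        byte[] entry = new byte[ENTRY_SIZE];
        for (int i = 0; i < HASHES_PER_ENTRY; i++) {
            System.arraycopy(digest, 0, entry, i * 64, 64);
            if (i + 1 < HASHES_PER_ENTRY)
                digest = sha512.digest(digest);           // hashes 2..64
        }
        return entry;
    }
}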
One way of doing it is mathematically, in which case you would compare the cost of the extra SHA-512 computations against the total cost of the algorithm... but that comparison depends entirely on how well each primitive is implemented.
So instead I do it the other way: I implemented both variants (with and without the scratchpad) and these are the results I am measuring on the CPU:
Calculated 256k cache entries in (ms): 233364
Total time (ms): 251740
So the performance penalty for not using a 1GB lookup table is 70% in this scenario! And this leaves out the fact that the SHA-512 step has much more room for optimization and parallelization than AES-CBC (independent cache entries can be hashed in parallel, while CBC encryption is inherently sequential), which would further decrease the relative performance hit after optimization.
You end up with this trade-off: use 1GB of memory, or take a 70% performance penalty (i.e. ~1.7x the runtime) in the very, very worst case.
Just for the sake of completeness: I got these numbers on a CPU without AES-NI, in a single thread, using the native Java crypto library, which is faster than my own AES implementation. The worse your AES implementation is, the lower the relative performance hit. The better your SHA implementation is, the lower the relative performance hit.
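If anyone wants to reproduce the shape of this measurement, the skeleton below is a simplified sketch, not my actual benchmark: generateEntry is the hypothetical helper from the sketch above, and the single AES-CBC call at the end only stands in for the much larger search step of the real algorithm.

Code:

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class PenaltyBench {
    public static void main(String[] args) throws Exception {
        byte[] seed = new byte[32];
        long t0 = System.currentTimeMillis();

        // The work a 1GB lookup table would save: generating all 256k entries once.
        byte[] entry = null;
        for (long i = 0; i < 256 * 1024; i++) {
            entry = CacheEntry.generateEntry(seed, i);
        }
        System.out.println("Calculated 256k cache entries in (ms): "
                + (System.currentTimeMillis() - t0));

        // The AES-CBC part, using the native Java crypto provider as in my test.
        Cipher aes = Cipher.getInstance("AES/CBC/NoPadding");
        aes.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(new byte[16], "AES"),
                new IvParameterSpec(new byte[16]));
        byte[] out = aes.doFinal(entry);                  // 4096 bytes, block-aligned

        System.out.println("Total time (ms): " + (System.currentTimeMillis() - t0));
    }
}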
Conclusion:
Advertising the HOdl algorithm as ASIC-proof is total bullshit... the performance penalty when calculating cache entries on the fly, as opposed to storing them in the 1GB lookup table, makes every ASIC builder and their grandmother laugh...
What I don't understand / what I would appreciate if someone could answer:
Where does the algorithm originate from? I don't mean things like "it is similar to Cryptonight", but who actually made this algo? Also, there seems to be zero documentation about the algo except for the code itself, which is a bad thing.
Not all is lost, I guess... from what I understand, the algorithm itself could be changed via an update? I would strongly suggest it, since as of now the algo doesn't live up to what it promises imho.
Again: I am _not_ a crypto expert or mathematician. I am simply your average Joe software engineer. And if I can spot a problem like this... everyone taking a closer look will spot it too.
EDIT:
@FreeTrade: I just re-read your post: "if you can calculate faster than you can retrieve from memory, that'll be optimal, but I doubt it" - that is totally beside the point! It is not about which is faster! It is about the time/space requirements of the algorithm! Even if the variant without the lookup table takes ~1.7x as long, it requires ~250,000x less memory! That is a great trade-off and makes ASICs possible which run the slower variant of the algorithm, but run it, say, 1000x faster - roughly 600x net (of course this is not an accurate number again, but you hopefully get the point)!
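And just to spell out that arithmetic (illustrative numbers only, taken from the discussion above, not measurements):

Code:

public class TradeOff {
    public static void main(String[] args) {
        double penaltyFactor = 1.7;      // on-the-fly variant: ~1.7x the runtime (70% penalty)
        long memoryReduction = 250_000;  // ~1GB lookup table vs ~4KB working set
        double asicRawSpeedup = 1000.0;  // assumed hardware advantage, purely illustrative
        System.out.printf("Memory: %dx less, net advantage: ~%.0fx%n",
                memoryReduction, asicRawSpeedup / penaltyFactor);  // => ~588x
    }
}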