I won't reply again unless this gets more interesting; it is repetitive now.
While moving some strings around to get compilers better at SIMD processing (btw ICC is crappy too in non-array use, I tested it), I think I stumbled on something which could be... well.... interesting...
I was going over the asm files of known hashtypes, in files like this:
https://github.com/pooler/cpuminer/blob/master/sha2-x64.S ... to check what level of packed instructions could be used to cut cycles in half.
(Don't mind that the case above is about bitcoin (which is now ASICed) - this can be pretty relevant for a whole lotta cases out there, including cpu altcoin mining).
So I'm going over the lines... and I'm like, where can I pack stuff together to make a difference, you know... And while I saw some stuff that can be packed, there generally isn't much of it, because the action is serial... one permutation after the other. Then I searched the file to see how many registers it uses. It's up to XMM11. So there are quite a few extra registers to play around with. And then BAM. It hit me.
The way these hashtypes are used for mass-hashing is all wrong. You can't do it one by one.
With hashing, you insert some data and run sequential operations, where one permutation feeds the next, altering the data in some way, moving bits around etc etc, and in the end you get the hash. One input, one output, no parallelism, except in... another thread. You can do that, say, 4 times on a quad core.
But that's wrong because you get no SIMD action and packing per thread, to cut the processing clock cycles in half.
What is needed is a mining program that does this:
1) The miner's main routine sends to the hashing routine 2 (SSE) (or 4 for AVX) inputs for checking, not one.
2) A hashing routine with 2 (or 4) inputs and 2 (or 4) outputs.
While inside the hashing routine, and since the routine will be doing the exact same thing for 2 hashes, these operations can be done in SIMD fashion - with packed instructions, instead of serial/scalar. Say the first step of the hash is "we do this to that"; since we also have another "this", we can pack them both and do it once. And then there are some more benefits in the maths involving the prime number tables, where you can do it in parallel by loading the prime with a movddup into a register, moving the data from the first hash to the lower part of an xmm register and from the second hash to the upper part of the xmm register. Then you do them both, and you're -1 mov too. LOL. The more "fixed data" there is in the hash, the better for parallelism within the same routine - if you load 2 or 4 hashes.
The routine will have to be custom written to process at least 2 or 4 hashes in parallel in order to be able to use packed instructions. In those stages where packing can't be done for whatever reason, the routine will process the stages in a serial fashion (as it would, normally).
3) In the end the routine returns 2 (or 4) outputs to the main mining routine.
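To make the three steps above a bit more concrete, here is a minimal sketch in C with SSE2 intrinsics rather than hand-written asm. Everything in it is an assumption for illustration: toy_step, toy_step_x4 and scan_x4 are made-up names, and the mixing step is a toy built from SHA-256-style rotates plus one fixed constant, not a real hash. It also assumes 32-bit words, so one 128-bit xmm register holds four lanes rather than two.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* First SHA-256 round constant, used here only as an example of "fixed data". */
#define K0 0x428a2f98u

/* Packed rotate right on four 32-bit lanes; n must be a literal with 0 < n < 32. */
#define ROTR32X4(x, n) \
    _mm_or_si128(_mm_srli_epi32((x), (n)), _mm_slli_epi32((x), 32 - (n)))

/* Scalar rotate right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

/* Toy mixing step for ONE input (hypothetical, not a real hash). */
static uint32_t toy_step(uint32_t x) {
    uint32_t s = rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
    return x + s + K0;
}

/* The same toy step for FOUR independent inputs: 4 in, 4 out. */
static void toy_step_x4(const uint32_t in[4], uint32_t out[4]) {
    __m128i x = _mm_loadu_si128((const __m128i *)in);  /* one input per lane */
    __m128i k = _mm_set1_epi32((int)K0);               /* constant broadcast to all lanes */
    __m128i s = _mm_xor_si128(
        _mm_xor_si128(ROTR32X4(x, 6), ROTR32X4(x, 11)),
        ROTR32X4(x, 25));
    __m128i r = _mm_add_epi32(_mm_add_epi32(x, s), k); /* x + s + K0 in every lane */
    _mm_storeu_si128((__m128i *)out, r);
}

/* Hypothetical caller side: hand over four consecutive nonces per call. */
static void scan_x4(uint32_t first_nonce, uint32_t out[4]) {
    uint32_t in[4] = { first_nonce, first_nonce + 1,
                       first_nonce + 2, first_nonce + 3 };
    toy_step_x4(in, out);
}
```

Each lane still goes through its steps one after the other; the packing is across independent inputs, which is why the serial nature of the hash doesn't get in the way. The constant is loaded once and shared by all lanes, which is the same idea as the movddup trick for 64-bit data.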
Supposing the bottleneck is CPU and not RAM (but even if it is RAM, the CPU will be finishing faster) we are talking about gains that could be very serious (triple digit % on AVX).
The implication of the above is that, well, every single cpu altcoin is currently cripple-mined.
How is that for interesting?