Did increasing the thread count beyond the number of CPUs improve performance because you somehow "utilized" the cores better? Or did just the opposite happen because all you did was introduce scheduling contention?
When the thread count was somewhat greater than the core count, I utilized the cores better.
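For anyone who wants to measure this rather than argue about it, here is a rough sketch of a timing loop. It is only an illustration under several assumptions: Python threads stand in for miner threads, the helper names `hash_all` and `time_workers` are made up for this post, and it relies on CPython's `hashlib` releasing the GIL while hashing large buffers. The numbers will differ from machine to machine, so treat them as a hint, not proof.

```python
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor


def hash_all(buffers, workers):
    """Hash every buffer using a pool of `workers` threads (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).hexdigest(), buffers))


def time_workers(buffers, worker_counts):
    """Return {worker_count: seconds} for hashing the same buffers."""
    timings = {}
    for n in worker_counts:
        start = time.perf_counter()
        hash_all(buffers, n)
        timings[n] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # 64 buffers of 1 MiB each; large enough that thread start-up cost is small
    buffers = [os.urandom(1 << 20) for _ in range(64)]
    # Compare one thread, one per core, and oversubscribed thread counts
    for n, secs in time_workers(buffers, [1, cores, 2 * cores, 4 * cores]).items():
        print(f"{n:3d} threads: {secs:.3f} s")
```

If the time keeps dropping a little past the core count and then flattens or rises, that would match the "somewhat greater than the core count" observation; if it rises right away, the extra threads are only adding scheduling overhead.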
You can decouple the fetching and decoding from the execution. Instructions do not execute until they are ready.
Again, pipelines can't help much when an instruction depends on the result of a previous one. SHA256 has 64 steps, and each step depends on the result of the previous one. There are a number of independent instructions within each step, but that number is not unlimited.
1. The number of independent instructions is much, much greater than the number of steps. So as far as dependencies are concerned, it is trivial.
2. If dependencies were an issue, a multithreaded processor can even execute instructions from another SHA hash calculation, which WILL be completely independent.
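To make the dependency argument concrete, here is a minimal Python sketch of the SHA-256 round structure. It is an illustration, not anyone's miner code: the function names (`sha256_round`, `compress_lanes`, `sha256_pair`) are invented for this post, and the constants are derived from the primes instead of being typed out. Inside one round, `s1`, `ch`, `s0` and `maj` only read the incoming state, so none of them waits on another. The 64 rounds of one hash form a serial chain, but round t of a second hash does not depend on the first hash at all.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF  # keep every value within 32 bits


def icbrt(n):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_primes(count):
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes


PRIMES = first_primes(64)
# Standard SHA-256 constants: fractional bits of square roots (H0) and cube roots (K)
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def schedule(block):
    """Expand one 64-byte block into the 64 message words."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    return w


def sha256_round(state, k, w):
    a, b, c, d, e, f, g, h = state
    # These four values depend only on the incoming state, not on each other
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g)
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t1 = (h + s1 + ch + k + w) & MASK
    t2 = (s0 + maj) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def compress_lanes(chains, blocks):
    """Advance several independent compressions in lock step."""
    schedules = [schedule(b) for b in blocks]
    states = [tuple(h) for h in chains]
    for t in range(64):  # within one hash, round t needs the result of round t-1
        # round t of each lane is independent of every other lane
        states = [sha256_round(s, K[t], w[t]) for s, w in zip(states, schedules)]
    return [[(x + y) & MASK for x, y in zip(h, s)] for h, s in zip(chains, states)]


def pad(msg):
    return msg + b"\x80" + b"\x00" * ((55 - len(msg)) % 64) + (8 * len(msg)).to_bytes(8, "big")


def sha256(msg):
    data, h = pad(msg), list(H0)
    for off in range(0, len(data), 64):
        h = compress_lanes([h], [data[off:off + 64]])[0]
    return b"".join(x.to_bytes(4, "big") for x in h)


def sha256_pair(msg_a, msg_b):
    """Hash two short (under 56 bytes, single-block) messages side by side."""
    out = compress_lanes([list(H0), list(H0)], [pad(msg_a), pad(msg_b)])
    return [b"".join(x.to_bytes(4, "big") for x in h) for h in out]


# Sanity check against the standard library
assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

The list comprehension in `compress_lanes` is where a multithreaded or wide machine gets its slack: every lane's round can be issued without waiting on any other lane.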
This is a disadvantage of VLIW4/5 and has nothing to do with GPGPU.
Inadequate register/shared memory is a disadvantage of any GPU, not only VLIW ones. That makes them much less suitable for memory-intensive algorithms, even if they are embarrassingly parallel. Moreover, resource-limited occupancy is a general GPGPU problem, far from being bitcoin-related or VLIW-related; it is a problem even for ALU-bound kernels like the bitcoin one.
I quoted the section on ALUPacking, not GPR usage.
SHA256 hashing only needs 1 thing: lots and lots of ALUs. Stuff like AVX, threads, and x86 just introduce structural dependencies that slow down the process.
Note that never once in my original post did I say that bitcoin mining just requires tons and tons and tons of ALUs on the GPU. You just assumed it that way because you work extensively with GPUs.
Of course you would increase the GPR count if you increase the number of ALUs per CU. But note that a CU generates a structural dependency, a dependency that we created in order to accommodate GPGPU.
What makes you think you would increase the register count if you increase the number of ALUs? I see... mmmm... no relation between the two.
You just said that bitcoin mining is limited by the number of GPRs... there aren't enough of them per ALU.
If you were to make an ASIC miner, you sure as hell don't need a crapton of GPRs or CUs or "wavefronts"... all you would need is tons and tons of ALUs.
Care to elaborate on what "ALU" means in terms of an ASIC?
Units which do AND, OR, XOR, NOT, bit shifts, and maybe integer arithmetic should be sufficient.
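As a small illustration of how little such a unit needs, the sketch below builds the two non-trivial pieces SHA-256 relies on, a 32-bit rotate and a 32-bit modular add, out of nothing but AND, OR, XOR and shifts. The names `rotr32` and `add32` are made up for this post, and a real ASIC would use a proper adder circuit rather than a loop; this only shows that the "integer arithmetic" part reduces to the same basic operations.

```python
MASK = 0xFFFFFFFF  # 32-bit word


def rotr32(x, n):
    """Rotate right by n bits (0 < n < 32) using two shifts and an OR."""
    return ((x >> n) | (x << (32 - n))) & MASK


def add32(x, y):
    """Addition modulo 2**32 using only AND, XOR and a left shift."""
    while y:
        carry = ((x & y) << 1) & MASK  # bit positions that generate a carry
        x = x ^ y                      # sum of the bits, ignoring carries
        y = carry                      # feed the carries back in
    return x
```

The loop in `add32` runs at most 32 times, because each pass pushes the remaining carries one bit to the left until they fall off the top of the word.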