Interesting. So the GPU threads stall until the memory read completes (given that with the full scratchpad, each blockmix cycle needs a 128-byte read from an address generated by the previous blockmix).
Yes, and the GPU implements something like hyperthreading, but significantly beefed up (not just 2 virtual threads per core as on a CPU, but a lot more). A stalled GPU thread does not mean that the GPU's ALU resources sit idle; they are simply allocated to executing other threads.
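As a purely illustrative example (made-up round numbers): if a random DRAM access costs ~400 cycles and a thread has only ~40 cycles of blockmix arithmetic to do per access, then roughly ten resident threads per ALU lane are needed to keep that lane busy. GPUs keep dozens of warps/wavefronts in flight per compute unit for exactly this reason.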
Regarding bandwidth vs. latency: fortunately, reads in 128-byte chunks are a perfect fit for SDRAM. SDRAM is generally optimized for large burst reads/writes, which are used for cache line fills and evictions, and processor cache lines are roughly in the same ballpark (typically even smaller than 128 bytes). With such large bursts, the memory bandwidth can be fully utilized without any problems, and the latency can be hidden.
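For scale (assuming Litecoin's scrypt parameters, N = 1024 and r = 1, and ignoring the first loop's writes): the second loop reads N x 128 bytes = 128 KiB of scratchpad per hash, so a device with, say, 25 GB/s of usable bandwidth tops out around 25e9 / 131072, or roughly 190,000 hashes per second, on memory traffic alone, regardless of ALU speed.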
That makes sense given the huge number of threads available on a GPU, but I wonder whether the same approach works on an FPGA too (using external SDRAM): use internal block RAM to hold the per-thread state (B/Bo) and switch threads while waiting for the SDRAM. Not sure that actually works. Food for thought, thanks.
There is a widely used software optimization technique called pipelining, which makes it possible to fully hide the memory access latency for scrypt. In the Cell/BE miner (which was developed long before mtrlt's GPU miner) I was calculating 8 hashes at once per SPU core. These hashes were split into two groups of 4 hashes for pipelining purposes. So the second loop, where the addresses depend on previous calculations, looks like this:
dma request the initial four 128-byte chunks for the first group
dma request the initial four 128-byte chunks for the second group
loop {
    check dma transfer completion and do calculations for the first group
    dma request the next needed four 128-byte chunks for the first group
    check dma transfer completion and do calculations for the second group
    dma request the next needed four 128-byte chunks for the second group
}
The idea is that while the DMA transfer from external memory to local memory is in progress, we simply do calculations for the other group of hashes instead of blocking. The actual code for this loop is here:
https://github.com/ssvb/cpuminer/blob/058795da62ba45f4/scrypt-cell-spu.c#L331. The Cell in the PlayStation 3 has enough memory bandwidth headroom (roughly 25 GB/s in total) and is limited only by the ALU performance of its 6 SPU cores (or 7 SPU cores with hacked firmware). So there was no need to implement scratchpad lookup-gap compression for that particular hardware.
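To make the pattern easier to adapt, here is a simplified self-contained C sketch of the same two-group interleaving (an illustration, not the actual SPU code): dma_get()/dma_wait() are hypothetical stand-ins for the non-blocking mfc_get()-style DMA primitives (modeled with a plain memcpy so the skeleton compiles anywhere), and blockmix() is a trivial stub instead of the real salsa20/8-based BlockMix:

#include <stdint.h>
#include <string.h>

#define GROUPS 2          /* two groups of hashes, interleaved          */
#define LANES  4          /* hashes per group                           */
#define CHUNK  128        /* one scrypt block for r = 1: 128 bytes      */
#define N      1024       /* scratchpad entries (Litecoin scrypt)       */

typedef struct { uint8_t b[CHUNK]; } chunk_t;

/* Stand-in for a non-blocking DMA request; on the SPU this would queue
 * an mfc_get() and return immediately instead of copying synchronously. */
static void dma_get(chunk_t *dst, const chunk_t *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/* Stand-in for waiting on a DMA tag group; a no-op in this sketch. */
static void dma_wait(int tag)
{
    (void)tag;
}

/* Stub standing in for the salsa20/8-based blockmix; it just XORs the
 * fetched chunk into the state and derives the next lookup index.     */
static uint32_t blockmix(chunk_t *x, const chunk_t *v)
{
    size_t k;
    for (k = 0; k < CHUNK; k++)
        x->b[k] ^= v->b[k];
    return (uint32_t)x->b[0] | ((uint32_t)x->b[1] << 8);
}

/* The second scrypt loop for GROUPS * LANES hashes, pipelined so that
 * while one group's chunks are in flight, the other group computes.   */
static void second_loop(chunk_t x[GROUPS][LANES],
                        chunk_t buf[GROUPS][LANES],
                        chunk_t scratch[GROUPS][LANES][N],
                        uint32_t idx[GROUPS][LANES])
{
    int g, i, j;

    /* Prime the pipeline: request the initial chunks for both groups. */
    for (g = 0; g < GROUPS; g++)
        for (i = 0; i < LANES; i++)
            dma_get(&buf[g][i], &scratch[g][i][idx[g][i]]);

    for (j = 0; j < N; j++) {
        for (g = 0; g < GROUPS; g++) {
            /* While this group's data was in flight we were computing
             * the other group, so this wait should rarely stall.      */
            dma_wait(g);
            for (i = 0; i < LANES; i++)
                idx[g][i] = blockmix(&x[g][i], &buf[g][i]) % N;
            if (j + 1 < N)
                for (i = 0; i < LANES; i++)
                    dma_get(&buf[g][i], &scratch[g][i][idx[g][i]]);
        }
    }
}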
I believe that similar pipelining to hide the latency of external DRAM accesses can also be implemented fairly easily on an FPGA or ASIC. But an FPGA or ASIC still needs a lot of memory bandwidth even after the scratchpad size reduction; otherwise the external memory becomes the performance bottleneck. Beating GPUs equipped with fast GDDR5 is going to be a tough challenge.
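Some rough numbers for perspective (approximate, from memory): a single 64-bit DDR3-1600 channel gives ~12.8 GB/s, while a GDDR5 card like the Radeon HD 7970 delivers around 264 GB/s. And note that lookup-gap compression with gap G mainly shrinks the scratchpad's size and the first loop's write traffic: each second-loop lookup still reads one 128-byte chunk (plus up to G-1 recomputed blockmix steps), so the per-hash read bandwidth requirement barely drops.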
PS: ssvb, you have some very interesting threads linked in your post history. Thank you for posting here; I'm late to this party and this helps enormously.
Well, I've been away from the party for a long time now.