At least ArtForz was mistaken about Cell earlier.
Let's just do some simple math. The PlayStation 3 has 6 SPE cores, each clocked at 3.2GHz, and 25GB/s of total memory bandwidth. Calculating one hash needs approximately 434176 ADD/ROL/XOR operations on 128-bit vectors in the performance-critical part of salsa20/8, and these are executed in the even pipe (shuffles and the other instructions go to the odd pipe). Calculating one hash also needs 256KB of memory traffic (128KB is written sequentially, 128KB is read in scattered 128-byte chunks).

Taking into account that an SPE core can execute one instruction from the even pipe each cycle, the theoretical performance limit based on computational power is (6 * 3200000000) / 434176 ~= 44.2 khash/s. The theoretical performance limit based on memory bandwidth is 25GB/s / 256KB ~= 95.4 khash/s. So there is a lot of headroom in the memory bandwidth, and the arithmetic is the bottleneck. Moreover, Cell has precise control over memory operations via explicitly scheduled DMA transfers and can overlap those transfers with computation, which lets it use memory bandwidth very efficiently for the scrypt algorithm.
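To make the arithmetic easy to check, here is the same back-of-the-envelope calculation as a tiny C program (the figures are just the ones quoted above, not measurements of mine):

```c
/* Rough throughput bounds for scrypt(1024,1,1) on the PS3's Cell,
 * using the numbers quoted above. A sketch, not a benchmark. */
#include <stdio.h>

int main(void)
{
    const double spe_cores  = 6;            /* SPE cores on a PS3 */
    const double clock_hz   = 3.2e9;        /* per-SPE clock */
    const double even_ops   = 434176;       /* vector ADD/ROL/XOR ops per hash (even pipe) */
    const double mem_bw     = 25e9;         /* total memory bandwidth, bytes/s */
    const double bytes_hash = 256 * 1024;   /* 128KB written + 128KB read per hash */

    /* one even-pipe instruction per cycle per SPE caps the compute rate */
    printf("compute-bound limit:   %.1f khash/s\n",
           spe_cores * clock_hz / even_ops / 1e3);      /* ~44.2 */
    printf("bandwidth-bound limit: %.1f khash/s\n",
           mem_bw / bytes_hash / 1e3);                  /* ~95.4 */
    return 0;
}
```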
This page seems to say that the HD 6990 has 320GB/s of memory bandwidth. And here ArtForz tells us that it is possible to achieve < 20% of peak BW with a GPU. Doing some math again, we get 320GB/s * 0.2 / 256KB ~= 244 khash/s. Looks rather believable to me.
edit: corrected HD 6990 memory bandwidth (it is 320GB/s and not 350GB/s)
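The same kind of estimate for the GPU side, with the <20% bandwidth efficiency baked in (again just arithmetic on the quoted numbers, not a measurement):

```c
/* Bandwidth-bound estimate for an HD 6990, assuming only ~20% of the
 * 320GB/s peak is usable for scrypt's scattered 128-byte reads. */
#include <stdio.h>

int main(void)
{
    const double peak_bw    = 320e9;        /* HD 6990 peak memory bandwidth, bytes/s */
    const double efficiency = 0.20;         /* assumed <20% of peak for non-coalesced access */
    const double bytes_hash = 256 * 1024;   /* memory traffic per scrypt(1024,1,1) hash */

    printf("bandwidth-bound limit: %.0f khash/s\n",
           peak_bw * efficiency / bytes_hash / 1e3);    /* ~244 */
    return 0;
}
```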
What about SMix and the mul operations in scrypt? I thought the reason for the speed of Cell as implemented in the PS3 (~35 kh/s) was due to the 256KB of onboard local store per SPE... The slowdown in scrypt(1024,1,1) has little to do with the raw speed of the memory and everything to do with the speed of random accesses to that memory. Cache (or onboard local store in the case of Cell) is way, way faster in terms of random access to data (L1 and L2 are about 4 and 10 clock cycles respectively on an i7).
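For anyone who hasn't looked at it, a stripped-down sketch of scrypt's SMix shows where those random accesses come from. block_mix() below is only a placeholder for the real Salsa20/8 block mix and the seed is arbitrary, so this doesn't produce real scrypt output; it's just the memory access pattern:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N 1024            /* scrypt cost parameter */
#define BLOCK_WORDS 32    /* one 128-byte block = 32 x 32-bit words (r = 1) */

static uint32_t V[N][BLOCK_WORDS];   /* the 128KB scratchpad */

/* placeholder for the real Salsa20/8-based block mix */
static void block_mix(uint32_t X[BLOCK_WORDS])
{
    for (int i = 0; i < BLOCK_WORDS; i++)
        X[i] = ((X[i] << 13) | (X[i] >> 19)) ^ X[(i + 1) % BLOCK_WORDS];
}

static void smix(uint32_t X[BLOCK_WORDS])
{
    /* phase 1: 128KB of sequential writes while filling V */
    for (int i = 0; i < N; i++) {
        memcpy(V[i], X, sizeof V[i]);
        block_mix(X);
    }
    /* phase 2: 1024 reads of 128-byte blocks at data-dependent ("random")
     * indices -- this is what rewards a big local store/cache and punishes DRAM */
    for (int i = 0; i < N; i++) {
        uint32_t j = X[16] % N;   /* roughly scrypt's Integerify: index taken from X */
        for (int k = 0; k < BLOCK_WORDS; k++)
            X[k] ^= V[j][k];
        block_mix(X);
    }
}

int main(void)
{
    uint32_t X[BLOCK_WORDS];
    for (int i = 0; i < BLOCK_WORDS; i++)
        X[i] = 0x9E3779B9u * (uint32_t)(i + 1);   /* arbitrary seed, not real scrypt input */
    smix(X);
    printf("X[0] after SMix: %08x\n", (unsigned)X[0]);
    return 0;
}
```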
With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access - in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… - another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on “the degree of randomness”. This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it - similarly to any other processor.
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

GPUs generally have little onboard cache (16-32KB) because the data they process is intended to be sequential (and it usually is for 3D applications).
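A toy model of that coalescing point: count how many 128-byte memory transactions a group of 32 threads generates when their addresses are contiguous versus when each thread chases its own scrypt-style 128-byte block at a random offset. The 32-thread group size and the 128-byte segment granularity are assumptions for the illustration, not specs of any particular card:

```c
/* Toy coalescing model: 32 "threads" each want 4 bytes. Contiguous addresses
 * collapse into a single 128-byte transaction; random 128-byte-aligned scrypt
 * lookups need up to one transaction per thread. */
#include <stdio.h>
#include <stdlib.h>

#define THREADS 32
#define SEGMENT 128   /* assumed coalescing granularity in bytes */

/* count distinct 128-byte segments touched by the thread addresses */
static int transactions(const unsigned long addr[THREADS])
{
    unsigned long seen[THREADS];
    int n = 0;
    for (int t = 0; t < THREADS; t++) {
        unsigned long seg = addr[t] / SEGMENT;
        int dup = 0;
        for (int i = 0; i < n; i++)
            if (seen[i] == seg) { dup = 1; break; }
        if (!dup)
            seen[n++] = seg;
    }
    return n;
}

int main(void)
{
    unsigned long contiguous[THREADS], scattered[THREADS];
    for (int t = 0; t < THREADS; t++) {
        contiguous[t] = 4ul * t;                   /* thread t reads word t */
        scattered[t]  = 128ul * (rand() % 1024);   /* thread t reads V[j], j random */
    }
    printf("contiguous: %d transaction(s)\n", transactions(contiguous));   /* 1 */
    printf("scattered:  %d transaction(s) (up to %d)\n",
           transactions(scattered), THREADS);
    return 0;
}
```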