I will definitely look into further optimizations for Round 0. I think sp_ was not talking about reusing the Blake calculations, but about executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was idle quite often during memory transfers, but I did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.
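To make the staggered launch concrete, here is a minimal host-side sketch in Python. It is only a model: `solver_instance` and `run_staggered` are hypothetical names, and each "round" is just a log entry standing in for a real kernel launch. The second instance is held back by an event until the first one has finished round 0.

```python
import threading

NUM_ROUNDS = 9  # Equihash rounds 0 through 8


def solver_instance(name, log, log_lock, wait_for=None, round0_done=None):
    """Stand-in for one kernel instance; a 'round' is only a log entry here."""
    if wait_for is not None:
        wait_for.wait()  # hold back until the other instance has finished round 0
    for rnd in range(NUM_ROUNDS):
        with log_lock:
            log.append((name, rnd))  # placeholder for the real work of this round
        if rnd == 0 and round0_done is not None:
            round0_done.set()  # round 0 is finished, the next instance may start


def run_staggered():
    """Run two instances, the second one starting after the first ends round 0."""
    log, log_lock = [], threading.Lock()
    first_round0_done = threading.Event()
    a = threading.Thread(
        target=solver_instance, args=("A", log, log_lock, None, first_round0_done)
    )
    b = threading.Thread(
        target=solver_instance, args=("B", log, log_lock, first_round0_done, None)
    )
    b.start()  # started first on purpose: it still has to wait for the event
    a.start()
    a.join()
    b.join()
    return log
```

Calling `run_staggered()` returns the interleaved list of `(instance, round)` pairs. In a real miner the event would be replaced by whatever signal the host code gets when the round 0 kernel completes.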
But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps and sometimes doesn't. With proper code, you can make sure that this always happens. No need for 5 threads. (nicehash dj-ezo kernel on the gtx 1080)
At first, I didn't fully understand what you meant, but I think I do now. Your idea is the following: while round0 is being executed, you would like to execute other rounds in parallel with it, but with a different nonce, so that the resources of the card are better utilized (round0 has few memory ops and mostly ALU ops, while rounds 1+ have more memory ops and fewer ALU ops). I had this idea too, but here is the problem for CUDA: you would need to be able to launch two kernels at the same time, and I am not talking about launching them from different threads, but about actually making the NVIDIA driver execute two kernels in parallel. That is not how CUDA works, to my knowledge. CUDA, at driver level, will always execute one kernel and then move on to the next one. To achieve parallel solving of rounds, you would need to do it in your own code (e.g. have each odd thread block do round0 and each even thread block do round1), but then the different needs of round0 and round1 would lower your occupancy and probably make everything slower (round0 doesn't need shared memory but needs more registers; round1 needs lots of shared memory but fewer registers).
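The occupancy argument can be put into rough numbers. The sketch below is only back-of-the-envelope arithmetic: the per-SM limits are assumed compute capability 6.1 figures, the register and shared memory needs of round0 and round1 are made-up values chosen to show the effect, and allocation granularity is ignored. A kernel that handles both rounds must be compiled for the larger register count and the larger shared memory size, so it inherits the tighter limit of the two.

```python
from dataclasses import dataclass

# Assumed per-SM limits (compute capability 6.1 figures, illustrative only)
REGS_PER_SM = 65536
SMEM_PER_SM = 96 * 1024  # bytes
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32


@dataclass(frozen=True)
class KernelNeeds:
    regs_per_thread: int
    smem_per_block: int  # bytes


def blocks_per_sm(k, threads_per_block):
    """Resident blocks per SM, limited by registers, shared memory and threads."""
    limit_regs = REGS_PER_SM // (k.regs_per_thread * threads_per_block)
    if k.smem_per_block:
        limit_smem = SMEM_PER_SM // k.smem_per_block
    else:
        limit_smem = MAX_BLOCKS_PER_SM  # no shared memory, so no limit from it
    limit_threads = MAX_THREADS_PER_SM // threads_per_block
    return min(limit_regs, limit_smem, limit_threads, MAX_BLOCKS_PER_SM)


def fused(a, b):
    """One kernel doing both jobs needs the maximum of each resource."""
    return KernelNeeds(
        max(a.regs_per_thread, b.regs_per_thread),
        max(a.smem_per_block, b.smem_per_block),
    )


# Hypothetical resource needs, not measured from any real kernel
round0 = KernelNeeds(regs_per_thread=128, smem_per_block=0)
round1 = KernelNeeds(regs_per_thread=32, smem_per_block=16 * 1024)
THREADS_PER_BLOCK = 256

print(blocks_per_sm(round0, THREADS_PER_BLOCK))                 # 2
print(blocks_per_sm(round1, THREADS_PER_BLOCK))                 # 6
print(blocks_per_sm(fused(round0, round1), THREADS_PER_BLOCK))  # 2
```

With these made-up numbers, round1 alone could keep 6 blocks resident per SM, but inside the combined kernel it drops to 2 because of round0's register needs, and every block reserves shared memory that the round0 blocks never use.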