can I be doing better with something else?
Yeah, for Vertcoin there is a bit of a performance cliff, which you can see firsthand when running
autotune with the -D flag. At some warp number (dependent on the block count) the performance
drops drastically. So the configurations that are ideal for normal scrypt, with warp numbers at the
kernel's limit (x24), don't work for Vertcoin.
I think this is due to the memory controller being saturated/overloaded.
Have you ever tried autotuning the lowercase "t" kernel for Vertcoin? This used to be the fastest
kernel before NVIDIA submitted something better. This "t" kernel implements two different memory
access schemes. One is called SIMPLE (used for Yacoin), the other is called ANDERSEN (used for scrypt).
We should experiment with using SIMPLE for scrypt with N>=2048 as well. Those with access to the source code
and a working compiler environment can already play with swapping ANDERSEN for SIMPLE in the kernel launches (the places with triple angle brackets, like <<< >>>) to check whether there are any benefits to be had.
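
To make the swap a bit more concrete, here is a rough sketch of what such an experiment could look like. It is plain host-side C++ so it builds without the CUDA toolkit, and it is not the actual kernel source: only the names ANDERSEN and SIMPLE and the N>=2048 threshold come from this post, while MemoryAccess, scrypt_core, launch_scrypt_core and SIMPLE_MIN_N are made-up stand-ins. It also assumes the scheme is passed as a template argument at the launch site.

```cpp
#include <cassert>
#include <cstdint>

// ANDERSEN and SIMPLE are the names from the post; the enum name is hypothetical.
enum MemoryAccess { ANDERSEN, SIMPLE };

// Hypothetical stand-in for a kernel templated on the access scheme.
// A real kernel would do the scratchpad reads and writes; this one only
// reports which scheme it was instantiated with.
template <MemoryAccess SCHEME>
MemoryAccess scrypt_core() {
    return SCHEME;
}

// Threshold suggested in the post. Whether SIMPLE really wins from here on
// is exactly what the experiment has to show.
constexpr std::uint32_t SIMPLE_MIN_N = 2048;

// Template arguments are fixed at compile time, so choosing a scheme at
// run time needs one branch per instantiation.
MemoryAccess launch_scrypt_core(std::uint32_t N) {
    // scrypt requires N to be a power of two greater than 1.
    assert(N >= 2 && (N & (N - 1)) == 0);
    if (N >= SIMPLE_MIN_N) {
        // CUDA form would be: scrypt_core<SIMPLE><<<grid, threads>>>(...);
        return scrypt_core<SIMPLE>();
    }
    // CUDA form would be: scrypt_core<ANDERSEN><<<grid, threads>>>(...);
    return scrypt_core<ANDERSEN>();
}
```

With this, launch_scrypt_core(1024) takes the ANDERSEN branch and launch_scrypt_core(2048) takes the SIMPLE branch. For a quick first test it is even simpler to hard-code SIMPLE at one launch site, rebuild, and rerun autotune to compare.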
Christian