First implementation:
http://pastebin.com/UccPBr4B
Slower than dcct's plotter.
Takes a single array of workgroupsize * 262160 bytes as its parameter. Each 262160-byte chunk holds one plot: the first 262144 bytes are written as the output for that plot, and the last 16 bytes must contain the address/nonce on input.
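For anyone trying to call the first kernel, here is a rough host-side sketch in plain C of how that buffer could be laid out. The function and macro names are made up for illustration, and the byte order (address first, then nonce, both big-endian) is an assumption; check it against what the kernel actually reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLOT_SIZE  262144u                  /* output bytes per plot */
#define SEED_SIZE  16u                      /* address/nonce bytes per plot */
#define CHUNK_SIZE (PLOT_SIZE + SEED_SIZE)  /* 262160 bytes per chunk */

/* Total size of the single buffer handed to the kernel. */
size_t first_buffer_size(size_t workgroupsize) {
    return workgroupsize * (size_t)CHUNK_SIZE;
}

/* Byte offset where plot i's output starts. */
size_t first_plot_offset(size_t i) {
    return i * (size_t)CHUNK_SIZE;
}

/* Byte offset of the 16 address/nonce bytes at the end of chunk i. */
size_t first_seed_offset(size_t i) {
    return i * (size_t)CHUNK_SIZE + PLOT_SIZE;
}

/* Fill in the address/nonce for plot i.
   ASSUMPTION: 8 bytes of address followed by 8 bytes of nonce, big-endian. */
void first_write_seed(uint8_t *buf, size_t i, uint64_t address, uint64_t nonce) {
    assert(buf != NULL);
    uint8_t *p = buf + first_seed_offset(i);
    for (int b = 0; b < 8; b++) {
        p[b]     = (uint8_t)(address >> (56 - 8 * b));
        p[8 + b] = (uint8_t)(nonce   >> (56 - 8 * b));
    }
}
```

After the kernel finishes, the plot data for item i would be the 262144 bytes starting at first_plot_offset(i).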
Seems to work ok.
Second attempt:
http://pastebin.com/pQAhvbc2
Don't know whether it works, since it's too slow to test properly.
Parameters: an array of 16 bytes * workgroupsize, with the address/nonce filled into each 16 bytes; an array of 262144 * workgroupsize bytes for output; and a local buffer of 32768 bytes.
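To make the sizes for the second version concrete, here is a small C sketch of the arithmetic. The struct and function names are hypothetical, and it only computes buffer sizes; it does not set up any OpenCL objects.

```c
#include <assert.h>
#include <stddef.h>

#define SEED_SIZE  16u      /* address/nonce bytes per work item */
#define PLOT_SIZE  262144u  /* output bytes per work item */
#define LOCAL_SIZE 32768u   /* local buffer bytes */

/* Sizes of the three buffers the second kernel expects (hypothetical type). */
typedef struct {
    size_t seeds_bytes;   /* 16 * workgroupsize, address/nonce input */
    size_t output_bytes;  /* 262144 * workgroupsize, plot output */
    size_t local_bytes;   /* 32768, local scratch buffer */
} second_buffers;

second_buffers second_buffer_sizes(size_t workgroupsize) {
    assert(workgroupsize > 0);
    second_buffers b;
    b.seeds_bytes  = (size_t)SEED_SIZE * workgroupsize;
    b.output_bytes = (size_t)PLOT_SIZE * workgroupsize;
    b.local_bytes  = LOCAL_SIZE;
    return b;
}

/* Offsets of work item i in the seed and output arrays. */
size_t second_seed_offset(size_t i) { return i * (size_t)SEED_SIZE; }
size_t second_plot_offset(size_t i) { return i * (size_t)PLOT_SIZE; }
```

With a workgroupsize of 64, for example, that works out to 1024 bytes of address/nonce input and 16777216 bytes (16 MiB) of output.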
I can't figure out why it's slower. I restructured the code to avoid conditionals and used a rotating buffer so that reads come from faster local memory instead of hammering global memory. The problem seems to be writing the results to global memory: commenting those writes out makes it run fast, but then it obviously won't give correct results.
Any input would be appreciated, since I'm about ready to give up on making a GPU plotter.