I thought the NDRangeKernel went like this...
Edit: Oh, and Dia, in your CommandQueue try adding this property: "cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE"
https://github.com/ckolivas/cgminer/blob/master/ocl.c#L710
clState->commandQueue = clCreateCommandQueue(clState->context, devices[gpu],
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &status);
cgminer has used this for a very long time.
As I wrote, I think OoE mode is not supported on AMD GPUs ... is there a debug or verbose message that shows whether that mode was successfully activated?
It is successfully activated on Windows and Linux, but fails on OS X. It does not improve throughput with current GPUs, but it is harmless to enable in case future GPUs do benefit from it.
I saw a significant increase in average nonces being found and a 3 Mhash/sec higher throughput.
Modified from Dia's code I used the following...
self.commandQueue = cl.CommandQueue(self.context, self.device, cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
Then I saw that the read buffer was set to this:
cl.enqueue_read_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=True)
Since OoE mode will NOT work if is_blocking=True, I set them all to False and re-enabled self.commandQueue.finish().
Similarly, I changed the write buffer:
cl.enqueue_write_buffer(self.commandQueue, self.output_buf, self.output, is_blocking=False)
On self.output_buf I changed the mem flags to cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR | cl.mem_flags.ALLOC_HOST_PTR
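For anyone on a newer PyOpenCL, here is a rough sketch of the changes described above. It is not Dia's actual code. It assumes `cl.enqueue_copy` in place of the older `enqueue_read_buffer`, and the helper names `make_queue`, `make_output_buffer` and `read_output` are made up for illustration. One thing to watch: an out-of-order queue does not promise that the read runs after the kernel, so the sketch passes the kernel's event through `wait_for`.

```python
try:
    import pyopencl as cl
except ImportError:
    cl = None  # PyOpenCL missing: the helpers below will not be usable


def make_queue(context, device):
    """Create a command queue that requests out-of-order execution."""
    props = cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE
    return cl.CommandQueue(context, device, properties=props)


def make_output_buffer(context, output):
    """Allocate the output buffer with the mem flags quoted in the post."""
    mf = cl.mem_flags
    flags = mf.READ_WRITE | mf.COPY_HOST_PTR | mf.ALLOC_HOST_PTR
    return cl.Buffer(context, flags, hostbuf=output)


def read_output(queue, output_buf, output, kernel_event):
    """Start a non-blocking read, then wait for the whole queue to drain."""
    # wait_for makes the read depend on the kernel; without it an
    # out-of-order queue is free to run the read first.
    read_event = cl.enqueue_copy(queue, output, output_buf,
                                 wait_for=[kernel_event], is_blocking=False)
    # The host array is only safe to inspect after finish() returns.
    queue.finish()
    return read_event
```

Calling `finish()` straight after the read gives up most of the overlap a non-blocking transfer could offer; waiting on `read_event` later would be the alternative if the host has other work to do in between.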
For async to work, the 11.11 AMD drivers tell you to add an environment variable to your system: GPU_ASYNC_MEM_COPY=2
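If you would rather not set it system-wide, one option is to set the variable from the miner script itself. This is only a sketch, and it assumes the driver reads the variable when the OpenCL runtime is first loaded, so it has to run before `import pyopencl`.

```python
import os

# Assumption: the AMD runtime picks this up at load time, so set it
# before anything imports pyopencl (or otherwise loads OpenCL).
os.environ["GPU_ASYNC_MEM_COPY"] = "2"

# import pyopencl as cl  # would go here, after the variable is set
```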
Again, this might only be a 69xx feature, but on my 6970 I turned off BFI_INT and GOFFSET and increased my memory speed, and VECTORS8 was running at over 446 MHash/s. Now it finds between 5 and 14 nonces per minute without choking up or freezing the system. Before, it was struggling to find 5 nonces per minute, if any at all.
Next, I want to add the async functions:
event_t async_work_group_copy (__local T *dst, const __global T *src,
size_t num_gentypes, event_t event)
event_t async_work_group_copy (__global T *dst, const __local T *src,
size_t num_gentypes, event_t event)
The first copies from global memory into local memory; the second copies from local memory back out to global memory.
Then add a prefetch into the global cache:
void prefetch (const __global T *p, size_t num_gentypes)
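A rough idea of how those three built-ins could fit together in one kernel, wrapped in a small PyOpenCL host script. The kernel name `staged_increment`, the `+ 1` placeholder work and the `run_demo` helper are all invented for illustration; none of this comes from an actual miner kernel.

```python
KERNEL_SRC = r"""
__kernel void staged_increment(__global const uint *src,
                               __global uint *dst,
                               __local uint *tile)
{
    const size_t lsize = get_local_size(0);
    const size_t lid   = get_local_id(0);
    const size_t group = get_group_id(0);
    const size_t base  = group * lsize;

    /* Hint: pull the next group's input toward the global cache. */
    if (group + 1 < get_num_groups(0))
        prefetch(src + base + lsize, lsize);

    /* global -> local; every work-item must make the same call. */
    event_t ev = async_work_group_copy(tile, src + base, lsize, 0);
    wait_group_events(1, &ev);

    tile[lid] += 1;                 /* placeholder for real work */
    barrier(CLK_LOCAL_MEM_FENCE);   /* make all writes visible   */

    /* local -> global */
    ev = async_work_group_copy(dst + base, tile, lsize, 0);
    wait_group_events(1, &ev);
}
"""


def run_demo(n_groups=4, local_size=64):
    """Build and run the kernel; needs numpy, PyOpenCL and an OpenCL device."""
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    n = n_groups * local_size
    host_in = np.arange(n, dtype=np.uint32)
    host_out = np.empty_like(host_in)
    src = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host_in)
    dst = cl.Buffer(ctx, mf.WRITE_ONLY, host_out.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    # One uint (4 bytes) of local memory per work-item for the tile.
    prg.staged_increment(queue, (n,), (local_size,), src, dst,
                         cl.LocalMemory(local_size * 4))
    cl.enqueue_copy(queue, host_out, dst)  # blocking by default
    return host_in, host_out
```

Whether the staging through local memory or the prefetch helps at all will depend on the card and the driver, so treat it as something to benchmark rather than a guaranteed win.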