Anyway, why did you remove that one?
I understand that workgroup size is configurable in poclbm. However, in most cases, 64 should be the best one. Also, hardcoding the required workgroup size helps the OpenCL compiler to do better the register allocation stuff as it "knows" the workgroup size in compile time and do not make worst-case assumptions. You are losing performance due to this.
Another thing is (don't know if that's possible with pyopencl) - don't use clenqueuereadbuffer() (or whatever it's equivalent is). Use clenqueuemapbuffer() instead. It's noticably faster. Hm really started wondering about modifying some python miner to incorporate that kernel there, looks like a quick way to make it portable to windows. Besides, there are obvious problems with the non-ocl part which are due to code inmaturity.
Tried the first... No measurable difference (technically I first tried it with 64, but it was much slower, then changed it to 256, and there was no difference with the version without the attribute). Tried the second... Mmmmh... perhaps there was a difference, but it was very small. Tried even changing the size of the output buffer. You use a one-unit array, poclbm uses a 256 sized array, but no difference. The memory bandwitdth of the graphical adaptor is probably great enough that 1kb of data every second or so isn't truly measurable.
Tomorrow I'll modify the poclbm kernel to work with your frontend. It shouldn't be much difficult.