Hi Diapolo,
I just wanted to say thank you for your most recent kernel, the 2011-12-21 version is by far the best.
Im getting 455 Mhash/sec on my 6970 using Phoenix 1.7.1 (Aggression=12 Worksize=128 VECTORS2)
It easily averages 7 shares/minute which is fantastic. I had to lower my memory clock to 200 Mhz for optimal performance as it actually runs slower above this speed, which is fine with me as I can raise the core clock a bit higher with the lower temperature (When using VECTORS4 I had to set memory clock to 375 Mhz to get similar performance, but then I had to lower clock speed and it wasn't worth it. Worksize of 256 becomes really slow after a while regardless of settings)
So far this is the best kernel to use, much more efficient than any version of phatk2 if you find the right speed and setting. I wish VECTORS3 would work though, but the way you calculate Worksize it's impossible.
I'm running 12.1 preview drivers with new 2.6 SDK. With a customer forked poclbm or even guiminer, I can set Worksize in "32" increments.
I've been trying to run VECTORS3 with Worksize=192 as this should be divisible, and I've used this worksize before with phatk2.2, which is slightly faster for VECTORS4.
But, when I try to use your kernel with a different Worksize (other than 64,128,256) it just gives me an error saying "this worksize is not valid" even though it should be possible, and with the Worksizes it does accept, I don't think the ratedivisor will accept them. I was hoping there was some way to allow worksize in 32 increments as I think Worksize=192 would be a good test for VECTORS3.
Anyways, currently I'm just using VECTORS2 and so far the results are outstanding on my AMD 6970!
Hi blandead, thanks for your positive feedback. About the WORKSIZE, I read through AMDs OpenCL guide and they say it's best to use a multiple of 64 as WORKSIZE, because a Wavefront on modern AMD GPUS runs 64 work-items in parallel. But it's not that hard to change the code to accept 32 as a divider ... I'll consider this. For the vec3 thing, I'm trying hard to figure out what the problem is, but so far I have no clue :-/.
Edit: The WORKSIZE
can be set to any value smaller than the HW max (256 for current AMD GPUs) and which is divisible by 2, so 32 works ... tried it, but cripples Wavefronts!
Edit 2: From the OpenCL manual: "If local_work_size is specified, the values specified in global_work_size[0],... global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0],... local_work_size[work_dim - 1]." So the global worksize needs to be evenly divisable by the supplied WORKSIZE, which seems to not be true for WORKSIZE=96 for Phoenix.
Edit 3: I edited the __init__.py to allow WORKSIZE to be every value, which is able to evenly divide the global worksize. This is in the next release.
Dia
PS.: Guys if you wish to support my work keep donating, there were nearly no donations in the last weeks!