Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13 - page 2. (Read 106700 times)

newbie
Activity: 46
Merit: 0
Your donation has just arrived, thank you :)

Sounds pretty interesting, and I would like to receive a copy of that PDF. Can you upload it somewhere or send me a link via PM? I saw that there is a new cl_amd_media_ops2 extension in the latest drivers, but I could not find any documentation for it (the first one, cl_amd_media_ops, is used for BFI_INT patching). It would be very nice if BFI_INT were directly accessible via OpenCL, so that we could kick the binary patching out. The vec3 bug is really strange; I guess it happens in the Python host code and not in the kernel, because KernelAnalyzer will run it just fine.

I'm looking forward to further discussions!

Dia

I'm not sure where I downloaded it, but I can easily e-mail it to you. The cl_amd_media_ops2 extension is for mapping 3D images, so that doesn't help us. But if you look at the AMD 11.12 driver, they tell you to set the environment variable "GPU_ASYNC_MEM_COPY=2" to make use of a new feature. There is a preview driver for OpenCL 1.2 that adds some functionality. They are lifting the rule of only one overloaded function and will allow you to code directly in C++. Here is a reference card of commands: http://www.khronos.org/files/opencl-1-2-quick-reference-card.pdf

Here is the basis for one of the new extensions (cl_khr_fp64): http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_fp64.html. It adds double-precision floating point.
Only works on AMD 69xx devices though, and probably the GCN cards

I'm trying to find a direct link to this nice pdf I found with excellent examples. I have the file on my computer though.

Ah! Found it... http://www.bu.edu/pasi/files/2011/01/AndreasKloeckner3-07-1000.pdf Look at pages 56-60.

This code should look familiar to anybody who took a programming class.
hero member
Activity: 535
Merit: 500
Why not improve the kernel for phatk2 instead? It's faster than the first version and still faster than phatk_dia.
legendary
Activity: 3472
Merit: 1721
I recall that I mentioned this kernel is for SDK 2.6+, sorry!
It's totally ok for this kernel to not work well for older SDK versions.

Dia

My fault then :P

Wasn't SDK 2.6 the one that was significantly slower? Which driver version would you recommend to work along with SDK 2.6?
hero member
Activity: 769
Merit: 500
minus 12 Mhash/s for HD 6850
minus 25 Mhash/s for each of my HD 5850s

compared to guiminer from July 1st :/

PS. Yes, I did experiment with flags, etc.

drivers: 11.5 and 2.3 stream SDK
OS: win 7 64 pro

I recall that I mentioned this kernel is for SDK 2.6+, sorry!
It's totally ok for this kernel to not work well for older SDK versions.

Dia
legendary
Activity: 3472
Merit: 1721
minus 12 Mhash/s for HD 6850
minus 25 Mhash/s for each of my HD 5850s

compared to guiminer from July 1st :/

PS. Yes, I did experiment with flags, etc.

drivers: 11.5 and 2.3 stream SDK
OS: win 7 64 pro
sr. member
Activity: 256
Merit: 250
There is no documentation yet. Those are the strings carved from libamdocl64.so. Additionally, I've tested most of them (excluding max3/min3 and the SAD ones) and they work. For some reason, you need to compile with -Dcl_amd_media_ops2, because the pragma alone does not enable it.
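
As a rough illustration of the workaround described above, here is a minimal Python sketch of how a host program might both define the macro at build time and keep the pragma in the kernel source. The helper names build_options and with_extension_pragma, and the EXAMPLE_FLAG define, are made up for this example; only the -Dcl_amd_media_ops2 option and the extension name come from the post, and nothing here was tested against a real driver.

```python
# Sketch only: helper names are hypothetical, not part of Phoenix or any SDK.
EXTENSION = "cl_amd_media_ops2"


def build_options(extra_defines=()):
    """Return compiler options that define the extension name as a macro."""
    options = ["-D" + EXTENSION]
    options.extend("-D" + name for name in extra_defines)
    return options


def with_extension_pragma(kernel_source):
    """Prepend the standard OpenCL pragma that requests the extension."""
    pragma = "#pragma OPENCL EXTENSION " + EXTENSION + " : enable\n"
    return pragma + kernel_source


if __name__ == "__main__":
    # EXAMPLE_FLAG is a placeholder for any additional define.
    print(" ".join(build_options(["EXAMPLE_FLAG"])))
    print(with_extension_pragma("__kernel void search() {}\n"))
```

Assuming the host code uses PyOpenCL, the resulting list could be handed to the options argument of Program.build(); whether that alone is enough on a given driver is something to verify locally.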

For the full list see this thread:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=157516&messid=1274705&parentid=1274660&FTVAR_FORUMVIEWTMP=Branch
hero member
Activity: 769
Merit: 500
Hello,

Unfortunately the cl_amd_media_ops2 extension has nothing to do with BFI_INT. There are amd_bfe() and amd_bfm() functions defined, but nothing that maps to BFI_INT.

Can I have that PDF too, please?

Have you got a link to the cl_amd_media_ops2 documentation?

Thanks,
Dia
sr. member
Activity: 256
Merit: 250
Hello,

Unfortunately the cl_amd_media_ops2 extension has nothing to do with BFI_INT. There are amd_bfe() and amd_bfm() functions defined, but nothing that maps to BFI_INT.

Can I have that pdf too please?
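
For readers wondering why BFI_INT matters here, the following is a small Python reference model, written as a sketch and not taken from the kernel. It assumes the commonly described semantics of the instruction (take bits from one operand where a mask bit is 1, from another where it is 0) and checks that the SHA-256 Ch and Maj functions can each be written as a single such select. The function names are chosen for this example only.

```python
import random

MASK32 = 0xFFFFFFFF


def bfi_int(mask, b, c):
    """Assumed BFI_INT semantics: bits of b where mask is 1, bits of c elsewhere."""
    return ((mask & b) | (~mask & c)) & MASK32


def ch(x, y, z):
    """SHA-256 Ch as defined in the standard."""
    return ((x & y) ^ (~x & z)) & MASK32


def maj(x, y, z):
    """SHA-256 Maj as defined in the standard."""
    return ((x & y) ^ (x & z) ^ (y & z)) & MASK32


def ch_via_bfi(x, y, z):
    """Ch is a plain select with x as the mask."""
    return bfi_int(x, y, z)


def maj_via_bfi(x, y, z):
    """Where x and z differ the majority is y; where they agree it is x."""
    return bfi_int(x ^ z, y, x)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        x, y, z = (rng.getrandbits(32) for _ in range(3))
        assert ch(x, y, z) == ch_via_bfi(x, y, z)
        assert maj(x, y, z) == maj_via_bfi(x, y, z)
    print("Ch and Maj match their single-select forms")
```

Each select replaces several AND/XOR/NOT operations, which is presumably why a directly accessible BFI_INT would be welcome; the exact macro forms used in phatk_dia may differ from this model.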
hero member
Activity: 769
Merit: 500
Hey Dia,

So I had to do a fresh install on my computer, but I sent you a small donation just now, lemme know if it went through : D

Anyways, while I was in the process of installing AMD drivers I saw an awesome article about OpenCL 1.2 Preview with SDK 2.6. I'm testing the preview drivers out now since they add a couple new extensions, though I have to figure out the best place to use them. I ran your kernel through the latest APP KernelAnalyzer. I think there are many places it can be optimized as I'm seeing BFI_INT directly from the GPU ISA for many of the rounds, and it looks like there are a lot of new patterns they added to do so.

I also found a really cool PDF on new optimizations that are recommended for OpenCL 1.2. It is supposed to provide a pretty good performance increase on the VLIW4 architecture, and one part of it might solve your VECTORS3 issue or even offer a better way of achieving it. If you have time, send me a PM and I can send you the PDF.

Anyways, the new kernel is a little faster with VECTORS4, but for some reason the temperature is higher. That could just be because of the fresh wipe I did. Did anyone else notice their GPU running hotter?

Your donation has just arrived, thank you :)

Sounds pretty interesting, and I would like to receive a copy of that PDF. Can you upload it somewhere or send me a link via PM? I saw that there is a new cl_amd_media_ops2 extension in the latest drivers, but I could not find any documentation for it (the first one, cl_amd_media_ops, is used for BFI_INT patching). It would be very nice if BFI_INT were directly accessible via OpenCL, so that we could kick the binary patching out. The vec3 bug is really strange; I guess it happens in the Python host code and not in the kernel, because KernelAnalyzer will run it just fine.

I'm looking forward to further discussions!

Dia
newbie
Activity: 46
Merit: 0
Hey Dia,

So I had to do a fresh install on my computer, but I sent you a small donation just now, lemme know if it went through : D

Anyways, while I was in the process of installing AMD drivers I saw an awesome article about OpenCL 1.2 Preview with SDK 2.6. I'm testing the preview drivers out now since they add a couple new extensions, though I have to figure out the best place to use them. I ran your kernel through the latest APP KernelAnalyzer. I think there are many places it can be optimized as I'm seeing BFI_INT directly from the GPU ISA for many of the rounds, and it looks like there are a lot of new patterns they added to do so.

I also found a really cool PDF on new optimizations that are recommended for OpenCL 1.2. It is supposed to provide a pretty good performance increase on the VLIW4 architecture, and one part of it might solve your VECTORS3 issue or even offer a better way of achieving it. If you have time, send me a PM and I can send you the PDF.

Anyways, the new kernel is a little faster with VECTORS4, but for some reason the temperature is higher. That could just be because of the fresh wipe I did. Did anyone else notice their GPU running hotter?
full member
Activity: 216
Merit: 100
FASTLOOP is great with AGGRESSION=6 for good desktop responsiveness, I did indeed need it. Mind you, this newest kernel doesn't seem to improve the performance of my 5970 with Catalyst 12.1.
legendary
Activity: 1512
Merit: 1032
worksize:                     64      128     256
phatk2   VECTORS   1000MHz    197     205     195
dia_new  VECTORS2  1000MHz    215.71  220.37  212.23

phatk2 VECTORS WORKSIZE=128: 61.54 MH/s
phatk_dia VECTORS2 WORKSIZE=128: 67.15 MH/s
That corresponds closely with the two-vector results I quoted. However, when looking for the highest possible output from a GPU with VECTORS4 (at WORKSIZE 64 or 128, depending on the card), phatk2 still ekes out a win for me.
hero member
Activity: 769
Merit: 500
Uploaded a fixed version, which corrects an error with FASTLOOP=True:
Download version 2012-01-13: http://www.mediafire.com/?xzk6b1yvb24r4dg

There are no other changes in this version!

Dia
hero member
Activity: 769
Merit: 500
Is FASTLOOP broken? I get:

Code:
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
  File "twisted\internet\defer.pyc", line 361, in callback
  File "twisted\internet\defer.pyc", line 455, in _startRunCallbacks
  File "twisted\internet\defer.pyc", line 542, in _runCallbacks
  File "QueueReader.pyc", line 136, in preprocess
--- <exception caught here> ---
  File "twisted\internet\defer.pyc", line 133, in maybeDeferred
  File "kernels\phatk_dia\__init__.py", line 167, in
  File "kernels\phatk_dia\__init__.py", line 381, in preprocess
  File "kernels\phatk_dia\__init__.py", line 377, in updateIterations
exceptions.UnboundLocalError: local variable 'EXP' referenced before assignment

attempting to use it...

I wrote this in the first posting, yes it is broken currently! I'm looking into it.
Are you sure it's needed for you?

Edit: self.loopExponent = int(max(0, EXP)) causes this error, but I'm not sure yet, why this happens with my init and not the default one ...

Edit 2: Fix is to place another tabstop at the beginning in line 377 in front of self.loopExponent = int(max(0, EXP))! Wow that's a stupid one. Will upload a fixed version later today.

Edit 3: It has to look like this in an editor:
Code:
		if not (rate <= 0):
			# calculate the number of iterations to run
			EXP = max(0, (math.log(rate)/math.log(2)) - (self.AGGRESSION - 8))
			# prevent switching between loop exponent sizes constantly
			if EXP > self.loopExponent + 0.54:
				EXP = round(EXP)
			elif EXP < self.loopExponent - 0.65:
				EXP = round(EXP)
			else:
				EXP = self.loopExponent

			self.loopExponent = int(max(0, EXP))
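
The effect of that extra indentation level can be reproduced outside Phoenix. The following is a minimal sketch, not the kernel's actual host code: the class name LoopTuner, its constructor, and the two method names are made up for illustration, and rate stands for whatever hash-rate value the caller passes in. Only the body of the if block mirrors the snippet above.

```python
import math


class LoopTuner:
    """Hypothetical stand-in for the object that owns loopExponent."""

    def __init__(self, aggression=8):
        self.AGGRESSION = aggression
        self.loopExponent = 0

    def update_iterations_buggy(self, rate):
        if not (rate <= 0):
            EXP = max(0, (math.log(rate) / math.log(2)) - (self.AGGRESSION - 8))
        # One level too shallow: this runs even when the branch above was
        # skipped, so EXP may never have been assigned.
        self.loopExponent = int(max(0, EXP))

    def update_iterations_fixed(self, rate):
        if not (rate <= 0):
            # calculate the number of iterations to run
            EXP = max(0, (math.log(rate) / math.log(2)) - (self.AGGRESSION - 8))
            # prevent switching between loop exponent sizes constantly
            if EXP > self.loopExponent + 0.54:
                EXP = round(EXP)
            elif EXP < self.loopExponent - 0.65:
                EXP = round(EXP)
            else:
                EXP = self.loopExponent
            # Inside the branch: only executed when EXP exists.
            self.loopExponent = int(max(0, EXP))


if __name__ == "__main__":
    tuner = LoopTuner()
    try:
        tuner.update_iterations_buggy(0)
    except UnboundLocalError as err:
        print("buggy version:", err)
    tuner.update_iterations_fixed(0)
    print("fixed version keeps loopExponent at", tuner.loopExponent)
```

With a rate of zero the buggy variant raises the same kind of UnboundLocalError shown in the traceback, while the fixed variant simply leaves loopExponent unchanged.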

Dia
full member
Activity: 216
Merit: 100
Is FASTLOOP broken? I get:

Code:
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
  File "twisted\internet\defer.pyc", line 361, in callback
  File "twisted\internet\defer.pyc", line 455, in _startRunCallbacks
  File "twisted\internet\defer.pyc", line 542, in _runCallbacks
  File "QueueReader.pyc", line 136, in preprocess
--- <exception caught here> ---
  File "twisted\internet\defer.pyc", line 133, in maybeDeferred
  File "kernels\phatk_dia\__init__.py", line 167, in
  File "kernels\phatk_dia\__init__.py", line 381, in preprocess
  File "kernels\phatk_dia\__init__.py", line 377, in updateIterations
exceptions.UnboundLocalError: local variable 'EXP' referenced before assignment

attempting to use it...
hero member
Activity: 769
Merit: 500
legendary
Activity: 1512
Merit: 1032
Why is it so necessary for phatk kernel variations to have the memclock at 1k... Some people can't deal with that extra heat...
This is something that has changed in SDK 2.6: after trying all options, the best performance comes at a GPU RAM speed of 1000MHz (stock speed for most cards) instead of at an underclock of 300MHz-370MHz. Version 2.6, included with drivers 11.12 and 12.1, differs significantly from the previous SDKs in how it responds to worksizes, vector settings, and OpenCL programming.

It is a benefit in that one doesn't need to oddly tweak memory speeds away from stock to get the best performance (it was annoying to tell noobs over and over to underclock RAM), but bad in that this old quirk was actually an electricity saver if you did it.
legendary
Activity: 1428
Merit: 1001
Okey Dokey Lokey
legendary
Activity: 1512
Merit: 1032
Benchmarks on a 5770 (VLIW5, 800 stream processors, 980MHz core [scales more like 5870 than 5830]), Catalyst 12.1a/SDK 2.6, Phoenix 1.7.3 exe, win7 x32:

Typical command line (single cpu affinity, realtime priority):
start /AFFINITY 08 /REALTIME phoenix.exe -v -u http://xx/ -k dia VECTORS4 AGGRESSION=12 FASTLOOP=False WORKSIZE=64


worksize:                      64      128     256
phatk2    VECTORS4  1000MHz    223.88  226.34  181.40
phatk2    VECTORS   1000MHz    197     205     195
dia_new   VECTORS4  1000MHz    223.28  225.48  195.75
dia_new   VECTORS2  1000MHz    215.71  220.37  212.23
dia_last  VECTORS4  1000MHz    207.27  200.41

less MH/s than phatk2, peak performance at 1000MHz RAM...
legendary
Activity: 1428
Merit: 1001
Okey Dokey Lokey
Ok, I'll let you first play around a bit, before asking for a performance comparison :D.

I asked what happens with the GPU2 usage "bug" if only one card is mining: does it go up to 99% then? <--- I said yeah, it works flawlessly when running alone.
Are the cards connected via a CrossFire bridge? <--- I said yes. What OS and driver are you on? <--- Win7 x64, SDK 2.6, Catalyst 12.1.

Edit: By the way, did you try to lower mem clock even more via MSI Afterburner and unofficial overclocking mode?

Dia
I never saw a good reason to drop my mem below 600, but I can't do it easily... I'll go do 1000 core / 315 mem and post results, as well as 1000 core / 1000 mem.
Using GUIMiner + PhatkD, MSI Afterburner, SDK 2.6, Catalyst 12.1, crossfired 6870s:

1k core, 300 mem = 255 MH/s, 70°C, fans @ 70%
1k core, 1k mem = 314.8 MH/s, 88°C, fans @ 100%
Using GUIMiner + poclbm, MSI Afterburner, SDK 2.6, Catalyst 12.1, crossfired 6870s:
1k core, 500 mem = 307 MH/s, 77°C, fans @ 80%
1k core, 1k mem = 307 MH/s, overheat shutdown.
