further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13 - page 2.

blandead

newbie

Activity: 46

Merit: 0

Quote

Your donation has just arrived, thank you!

Sounds pretty interesting and I would like to receive a copy of that PDF. Can you upload it somewhere or send me a link via PM? I saw that there is a new cl_amd_media_ops2 extension in the latest drivers, but I could not find any documentation for it (the first one is used for BFI_INT patching). It would be very nice if BFI_INT were directly accessible via OpenCL, so that we could kick the binary patching out. The vec3 bug is really strange. I guess it happens in the Python host code and not in the kernel, because KernelAnalyzer will run it just fine.

I'm looking forward to further discussions!

Dia

I'm not sure where I downloaded it, but I can easily e-mail it to you. The cl_amd_media_ops2 extension is for mapping 3D images, so that doesn't help us. But if you look at the AMD 11.12 driver, they tell you to set the environment variable "GPU_ASYNC_MEM_COPY=2" to make use of a new feature. There is a preview driver for OpenCL 1.2 that adds some functionality. They are lifting the rule of only 1 overloaded function, and will allow you to code directly in C++. Here is a reference card of commands: http://www.khronos.org/files/opencl-1-2-quick-reference-card.pdf
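Since the Phoenix host code is Python, one way to try the environment variable mentioned above is to set it before the miner starts. This is only a sketch: the helper name with_async_mem_copy is made up for illustration, and I'm assuming the variable has to be in place before the OpenCL runtime gets loaded. Whether it changes anything for mining is untested here.

```python
import os

def with_async_mem_copy(base_env=None):
    """Return a copy of the environment with GPU_ASYNC_MEM_COPY set to "2".

    base_env defaults to the current process environment; the original
    mapping is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GPU_ASYNC_MEM_COPY"] = "2"
    return env

# Environment to hand to the miner process when launching it, for example:
#   subprocess.Popen(miner_command, env=miner_env)
# where miner_command is whatever command line you normally use.
miner_env = with_async_mem_copy()
```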

Here is the basis for one of the new extensions (cl_khr_fp64): http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/cl_khr_fp64.html -- adds double-precision floating point.
It only works on AMD 69xx devices though, and probably the GCN cards.

I'm trying to find a direct link to this nice pdf I found with excellent examples. I have the file on my computer though.

Ah! Found it... http://www.bu.edu/pasi/files/2011/01/AndreasKloeckner3-07-1000.pdf Look at pages 56-60.

This code should look familiar to anybody who took a programming class.

BCMan

hero member

Activity: 535

Merit: 500

Why not improve the phatk2 kernel instead? It's faster than the 1st version and still faster than phatk_dia.

malevolent

legendary

Activity: 3472

Merit: 1727

Quote from: Diapolo on January 19, 2012, 03:21:14 AM

I recall that I mentioned this kernel is for SDK 2.6+, sorry!
It's totally ok for this kernel to not work well for older SDK versions.

Dia

My fault then :P

Wasn't SDK 2.6 the one that was significantly slower? Which driver version would you recommend to work along with SDK 2.6?

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: malevolent on January 18, 2012, 06:40:25 PM

minus 12 Mhash/s for HD 6850
minus 25 Mhash/s for each of my HD 5850s

compared to guiminer from July 1st :/

PS. Yes, I did experiment with flags, etc.

drivers: 11.5 and 2.3 stream SDK
OS: win 7 64 pro

I recall that I mentioned this kernel is for SDK 2.6+, sorry!
It's totally ok for this kernel to not work well for older SDK versions.

Dia

malevolent

legendary

Activity: 3472

Merit: 1727

minus 12 Mhash/s for HD 6850
minus 25 Mhash/s for each of my HD 5850s

compared to guiminer from July 1st :/

PS. Yes, I did experiment with flags, etc.

drivers: 11.5 and 2.3 stream SDK
OS: win 7 64 pro

gat3way

sr. member

Activity: 256

Merit: 250

There is no documentation yet. Those are the strings carved from libamdocl64.so. Additionally, I've tested most of them (excluding max3/min3 and the sad ones) and they work. For some reason, you need to compile with -Dcl_amd_media_ops2, because the pragma alone does not enable it.

For the full list see this thread:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=157516&messid=1274705&parentid=1274660&FTVAR_FORUMVIEWTMP=Branch
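A rough sketch of what "pragma plus -D define" looks like from a Python host. The two helper names are hypothetical, and the PyOpenCL call in the comment is an assumption about the host API, not something tested against these drivers.

```python
def extension_pragmas(extensions):
    """Build the '#pragma OPENCL EXTENSION <name> : enable' lines for the kernel source."""
    return "\n".join(
        "#pragma OPENCL EXTENSION %s : enable" % name for name in extensions
    )

def extension_build_options(extensions):
    """Build one -D<name> compiler define per extension."""
    return ["-D" + name for name in extensions]

extensions = ["cl_amd_media_ops2"]
prelude = extension_pragmas(extensions)
options = extension_build_options(extensions)

# With PyOpenCL (assumed) the two pieces would be used roughly like this:
#   program = pyopencl.Program(context, prelude + "\n" + kernel_source).build(options=options)
```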

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: gat3way on January 18, 2012, 05:48:13 AM

Hello,

Unfortunately the cl_amd_media_ops2 extension has nothing to do with BFI_INT. There are amd_bfe() and amd_bfm() functions defined, but nothing that maps to bfi_int.

Can I have that pdf too please?

Have you got a link to the amd_media_ops2 documentation?

Thanks,
Dia

gat3way

sr. member

Activity: 256

Merit: 250

Hello,

Unfortunately the cl_amd_media_ops2 extension has nothing to do with BFI_INT. There are amd_bfe() and amd_bfm() functions defined, but nothing that maps to bfi_int.

Can I have that pdf too please?

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: blandead on January 17, 2012, 10:33:57 PM

Hey Dia,

So I had to do a fresh install on my computer, but I sent you a small donation just now, lemme know if it went through :D

Anyways, while I was in the process of installing AMD drivers I saw an awesome article about OpenCL 1.2 Preview with SDK 2.6. I'm testing the preview drivers out now since they add a couple new extensions, though I have to figure out the best place to use them. I ran your kernel through the latest APP KernelAnalyzer. I think there are many places it can be optimized as I'm seeing BFI_INT directly from the GPU ISA for many of the rounds, and it looks like there are a lot of new patterns they added to do so.

I also found a really cool pdf on new optimizations that are recommended for OpenCL 1.2, and it is supposed to provide a pretty good performance increase for VLIW4 architecture, and there was one part that I think would solve your VECTORS3 issue or even a better way of achieving it. If you have time send me a PM, and I can send you the pdf.

Anyways, new kernel is a little faster with VECTORS4, but for some reason the temperature is higher. That could just be because of the fresh wipe I did, did anyone else notice their GPU running hotter?

Your donation has just arrived, thank you!

Sounds pretty interesting and I would like to receive a copy of that PDF. Can you upload it somewhere or send me a link via PM? I saw that there is a new cl_amd_media_ops2 extension in the latest drivers, but I could not find any documentation for it (the first one is used for BFI_INT patching). It would be very nice if BFI_INT were directly accessible via OpenCL, so that we could kick the binary patching out. The vec3 bug is really strange. I guess it happens in the Python host code and not in the kernel, because KernelAnalyzer will run it just fine.

I'm looking forward to further discussions!

Dia

blandead

newbie

Activity: 46

Merit: 0

Hey Dia,

So I had to do a fresh install on my computer, but I sent you a small donation just now, lemme know if it went through :D

Anyways, while I was in the process of installing AMD drivers I saw an awesome article about OpenCL 1.2 Preview with SDK 2.6. I'm testing the preview drivers out now since they add a couple new extensions, though I have to figure out the best place to use them. I ran your kernel through the latest APP KernelAnalyzer. I think there are many places it can be optimized as I'm seeing BFI_INT directly from the GPU ISA for many of the rounds, and it looks like there are a lot of new patterns they added to do so.

I also found a really cool pdf on new optimizations that are recommended for OpenCL 1.2, and it is supposed to provide a pretty good performance increase for VLIW4 architecture, and there was one part that I think would solve your VECTORS3 issue or even a better way of achieving it. If you have time send me a PM, and I can send you the pdf.

Anyways, new kernel is a little faster with VECTORS4, but for some reason the temperature is higher. That could just be because of the fresh wipe I did, did anyone else notice their GPU running hotter?

TurdHurdur

full member

Activity: 216

Merit: 100

FASTLOOP is great with AGGRESSION=6 for good desktop responsiveness, I did indeed need it. Mind you, this newest kernel doesn't seem to improve the performance of my 5970 with Catalyst 12.1.

deepceleron

legendary

Activity: 1512

Merit: 1036

Quote from: deepceleron on January 13, 2012, 03:10:41 PM

                              worksize:   64       128      256
phatk2     VECTORS    1000MHz             197      205      195
dia_new    VECTORS2   1000MHz             215.71   220.37   212.23

Quote from: Diapolo on January 13, 2012, 05:37:18 PM

phatk2 VECTORS WORKSIZE=128: 61.54 MH/s
phatk_dia VECTORS2 WORKSIZE=128: 67.15 MH/s

That corresponds closely with the two-vector results I quoted. However, when looking for the highest output possible from a GPU, VECTORS4 (at WORKSIZE 64 or 128, depending on the card), phatk2 still ekes out a win for me.
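To put a number on "corresponds closely", the relative gains can be worked out from the two sets of WORKSIZE=128 figures quoted above. A minimal sketch; the helper name gain_percent is just for illustration.

```python
def gain_percent(new, old):
    """Relative speedup of 'new' over 'old', in percent."""
    return (float(new) / float(old) - 1.0) * 100.0

# My 5770 at WORKSIZE=128: dia_new VECTORS2 vs phatk2 VECTORS
gain_5770 = gain_percent(220.37, 205)

# Diapolo's numbers at WORKSIZE=128: phatk_dia VECTORS2 vs phatk2 VECTORS
gain_dia = gain_percent(67.15, 61.54)

print("5770:    %.1f%% faster" % gain_5770)
print("Diapolo: %.1f%% faster" % gain_dia)
```

That works out to roughly 7.5% on my card and roughly 9.1% on Dia's, so the two-vector gain is in the same ballpark on both.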

Diapolo

hero member

Activity: 772

Merit: 500

Uploaded a fixed version, which corrects an error with FASTLOOP=True:
Download version 2012-01-13: http://www.mediafire.com/?xzk6b1yvb24r4dg

There are no other changes in this version!

Dia

Diapolo

hero member

Activity: 772

Merit: 500

Quote from: TurdHurdur on January 13, 2012, 06:26:28 PM

Is FASTLOOP broken? I get:

Code:

Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
  File "twisted\internet\defer.pyc", line 361, in callback

  File "twisted\internet\defer.pyc", line 455, in _startRunCallbacks

  File "twisted\internet\defer.pyc", line 542, in _runCallbacks

  File "QueueReader.pyc", line 136, in preprocess

--- <exception caught here> ---
  File "twisted\internet\defer.pyc", line 133, in maybeDeferred

  File "kernels\phatk_dia\__init__.py", line 167, in 

  File "kernels\phatk_dia\__init__.py", line 381, in preprocess

  File "kernels\phatk_dia\__init__.py", line 377, in updateIterations

exceptions.UnboundLocalError: local variable 'EXP' referenced before assignment

attempting to use it...

I wrote this in the first posting, yes it is broken currently! I'm looking into it.
Are you sure it's needed for you?

Edit: self.loopExponent = int(max(0, EXP)) causes this error, but I'm not sure yet, why this happens with my init and not the default one ...

Edit 2: Fix is to place another tabstop at the beginning in line 377 in front of self.loopExponent = int(max(0, EXP))! Wow that's a stupid one. Will upload a fixed version later today.

Edit 3: It has to look like this in an editor:

Code:

		if not (rate <= 0):
			# calculate the number of iterations to run
			EXP = max(0, (math.log(rate)/math.log(2)) - (self.AGGRESSION - 8))
			# prevent switching between loop exponent sizes constantly
			if EXP > self.loopExponent + 0.54:
				EXP = round(EXP)
			elif EXP < self.loopExponent - 0.65:
				EXP = round(EXP)
			else:
				EXP = self.loopExponent

			self.loopExponent = int(max(0, EXP))
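If you want to see why that one tab matters, here is a minimal standalone sketch. The function names and the free-standing form (no self) are mine for illustration and not the actual Phoenix code; the thresholds 0.54 and 0.65 are the ones from the snippet above.

```python
import math

def update_buggy(rate, aggression, loop_exponent):
    """Broken layout: the final assignment sits outside the 'if'."""
    if not (rate <= 0):
        EXP = max(0, (math.log(rate) / math.log(2)) - (aggression - 8))
        if EXP > loop_exponent + 0.54:
            EXP = round(EXP)
        elif EXP < loop_exponent - 0.65:
            EXP = round(EXP)
        else:
            EXP = loop_exponent
    # EXP was never assigned when rate <= 0, so this line raises
    # UnboundLocalError in that case.
    loop_exponent = int(max(0, EXP))
    return loop_exponent

def update_fixed(rate, aggression, loop_exponent):
    """Fixed layout: the assignment only runs when EXP was computed."""
    if not (rate <= 0):
        EXP = max(0, (math.log(rate) / math.log(2)) - (aggression - 8))
        # prevent switching between loop exponent sizes constantly
        if EXP > loop_exponent + 0.54:
            EXP = round(EXP)
        elif EXP < loop_exponent - 0.65:
            EXP = round(EXP)
        else:
            EXP = loop_exponent
        loop_exponent = int(max(0, EXP))
    return loop_exponent
```

With a rate of 0 (nothing measured yet), update_buggy fails exactly like the traceback, while update_fixed just keeps the old loop exponent.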

Dia

TurdHurdur

full member

Activity: 216

Merit: 100

Is FASTLOOP broken? I get:

Code:

Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
  File "twisted\internet\defer.pyc", line 361, in callback

  File "twisted\internet\defer.pyc", line 455, in _startRunCallbacks

  File "twisted\internet\defer.pyc", line 542, in _runCallbacks

  File "QueueReader.pyc", line 136, in preprocess

--- <exception caught here> ---
  File "twisted\internet\defer.pyc", line 133, in maybeDeferred

  File "kernels\phatk_dia\__init__.py", line 167, in 

  File "kernels\phatk_dia\__init__.py", line 381, in preprocess

  File "kernels\phatk_dia\__init__.py", line 377, in updateIterations

exceptions.UnboundLocalError: local variable 'EXP' referenced before assignment

attempting to use it...

Diapolo

hero member

Activity: 772

Merit: 500

deepceleron

legendary

Activity: 1512

Merit: 1036

Quote from: Fiyasko on January 13, 2012, 03:19:00 PM

Why is it so necessary for phatk kernel variations to have the memclock at 1k.... Some people can't deal with that extra heat...

This is something that has changed in SDK 2.6; The best performance at the best settings after trying all options comes at a GPU RAM speed of 1000MHz (stock speed for most cards) instead of at an underclock of 300MHz-370MHz. Version 2.6, included with driver 11.12 and 12.1, is significantly different in how it responds to worksizes, vector settings, and OpenCL programming than the previous SDKs.

It is a benefit in that one doesn't need to oddly tweak memory speeds from stock to get the best performance (it was annoying to tell noobs over and over to underclock RAM), but bad in that this old quirk was actually an electricity saver if you did it.

Fiyasko

legendary

Activity: 1428

Merit: 1001

Okey Dokey Lokey

deepceleron

legendary

Activity: 1512

Merit: 1036

Benchmarks on a 5770 (VLIW5, 800 stream processors, 980MHz core [scales more like 5870 than 5830]), Catalyst 12.1a/SDK 2.6, Phoenix 1.7.3 exe, win7 x32:

Typical command line (single cpu affinity, realtime priority):

start /AFFINITY 08 /REALTIME phoenix.exe -v -u http://xx/ -k dia VECTORS4 AGGRESSION=12 FASTLOOP=False WORKSIZE=64

                              worksize:   64       128      256
phatk2     VECTORS4   1000MHz             223.88   226.34   181.40
phatk2     VECTORS    1000MHz             197      205      195
dia_new    VECTORS4   1000MHz             223.28   225.48   195.75
dia_new    VECTORS2   1000MHz             215.71   220.37   212.23
dia_last   VECTORS4   1000MHz             207.27   200.41

less MH/s than phatk2, peak performance at 1000MHz RAM...

Fiyasko

legendary

Activity: 1428

Merit: 1001

Okey Dokey Lokey

Quote from: Diapolo on January 13, 2012, 01:20:07 PM

Ok, I'll let you first play around a bit, before asking for a performance comparison :D.

I asked what's happening with the GPU2 usage "bug" if only one card is mining; does it go up to 99% then? <--- I said yeah, it works flawlessly when running alone.
Are the cards connected via a CrossFire bridge? <--- I said yes.
What OS and driver are you on? <--- Win7 x64, SDK 2.6, Cat 12.1

Edit: By the way, did you try to lower mem clock even more via MSI Afterburner and unofficial overclocking mode?

Dia

I never saw a good reason to drop my mem below 600, but I can't do it easily... I'll go do 1000 core / 315 mem and post results, as well as 1000 core / 1000 mem.

Using GUIminer+PhatkD, MSIa, SDK 2.6, Cat 12.1, crossfired 6870s:
1k core, 300 mem = 255 MH/s, 70°C, fans @ 70%
1k core, 1k mem = 314.8 MH/s, 88°C, fans @ 100%

Using GUIminer+poclbm, MSIa, SDK 2.6, Cat 12.1, crossfired 6870s:
1k core, 500 mem = 307 MH/s, 77°C, fans @ 80%
1k core, 1k mem = 307 MH/s, overheat shutdown.

Topic: further improved phatk_dia kernel for Phoenix + SDK 2.6 - 2012-01-13 - page 2. (Read 107022 times)