Topic: Modified Kernel for Phoenix 1.5 - page 3. (Read 96713 times)

hero member
Activity: 560
Merit: 517
August 14, 2011, 05:11:42 AM
I've compiled a Win32 EXE for my poclbm fork (which has phatk, phatk2, phatk2.1, and phatk2.2 support):

http://www.bitcoin-mining.com/poclbm-progranism-win32-20110814a.zip
md5sum - df623a45f8cb0a50fcded92728f12c14

Let me know if it works, I was only able to test it on one machine so far.

Quote
Well I've been talking to a few people about this but got no real response from anyone, that it was possible ...
The optimization you've spelled out is more or less already implemented in most, if not all GPU miners.

The way GPU miners currently work is that they check in the GPU code whether h7==0. If it is, the result (a nonce) is returned; otherwise nothing is returned. It is the responsibility of the CPU software to do any further difficulty checks if needed.

Since the only thing the GPU miners care about is H7, they completely skip the last 3 rounds (stopping after the 61st round).

Also note that GPU miners don't calculate the first 3 rounds of the first pass. Those rounds are pre-computed, because the inputs to those rounds remain constant for a given unit of getwork. So a GPU miner really only computes a grand total of 122 rounds, minus various other small pre-calculations here and there.
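
The claim that H7 is already fixed after the 61st round of the second pass can be checked on the CPU. What follows is a minimal Python sketch, not miner code: the helper names (`run_rounds`, `second_pass_schedule`, `h7_after_61_rounds`) are made up for illustration, the constants are derived from the first 64 primes as in the SHA-256 definition, and the full message schedule is computed for simplicity even though the last three words are never used.

```python
import hashlib
import struct
from math import isqrt

MASK = 0xFFFFFFFF

def first_primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found

def icbrt(x):
    # Integer cube root by binary search (avoids float rounding).
    lo, hi = 0, 1 << (x.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
IV = [isqrt(p << 64) & MASK for p in PRIMES[:8]]  # fractional bits of sqrt(p)
K = [icbrt(p << 96) & MASK for p in PRIMES]       # fractional bits of cbrt(p)

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def second_pass_schedule(first_digest):
    # The 32-byte first hash padded to a single 64-byte block.
    block = first_digest + b"\x80" + bytes(23) + struct.pack(">Q", 256)
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        x, y = w[i - 15], w[i - 2]
        s0 = rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
        s1 = rotr(y, 17) ^ rotr(y, 19) ^ (y >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def run_rounds(state, w, start, stop):
    # Compression rounds i = start .. stop-1.
    a, b, c, d, e, f, g, h = state
    for i in range(start, stop):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]

def h7_after_61_rounds(first_digest):
    w = second_pass_schedule(first_digest)
    e = run_rounds(IV, w, 0, 61)[4]   # 'e' produced by the 61st round
    # Rounds 62-64 only shift this value through f, g and h.
    return (IV[7] + e) & MASK

first = hashlib.sha256(bytes(range(80))).digest()   # bogus 80-byte header
reference = struct.unpack(">8I", hashlib.sha256(first).digest())
print(h7_after_61_rounds(first) == reference[7])
```

The comparison against `hashlib` is the point of the sketch: the last word of the final digest matches the value obtained without running the last three rounds.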
legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
August 13, 2011, 10:03:42 PM
Well I've been talking to a few people about this but got no real response from anyone, that it was possible ...
(Woke up with this idea back on the 4th of August ...)

So I guess I need to post in a thread where someone works on a CL kernel and just let them implement it if they don't already do it

I've written it in pseudo-code coz I still don't follow how the CL file actually does 2^n checks and returns the full list of valid results.
Yeah I've programmed in almost every language known to man (except C# and that's avoided by choice) but I still don't quite get the interface from C/C++ to the CL and how that matches what happens

What I am discussing, is the 2nd call to SHA256 with the output of the first call (not the first call)

Anyway, to explain, here's the end of the SHA256 pseudo code from the wikipedia:
==================
  for i from 0 to 63
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := t1 + t2

  Add this chunk's hash to result:
  h0 := h0 + a
  h1 := h1 + b
  h2 := h2 + c
  h3 := h3 + d
  h4 := h4 + e
  h5 := h5 + f
  h6 := h6 + g
  h7 := h7 + h

Then test if h0..h7 is a share (CHECK0, CHECK1, ?)
==================

Firstly, I added that last line of course.
I understand that with current difficulty, if h0 != 0 then we don't have a share (call this CHECK0)
If h0=0 then check some leading part of h1 based on the current difficulty (call this CHECK1)
... feel free to correct this anyone who knows better

If a difficulty actually gets to checking h2 then my optimisation can be made even better by going back one more step (adding an i := 61) in the pseudo code shown below

A reasonably simple optimisation of the end code for when we are about to check if h0..h7 is a share (i.e. only the 2nd hash)

==================
 for i from 0 to 61
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := t1 + t2

 i := 62
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

 tmpa := t1 + t2
 tmpb := h1 + tmpa (this is the actual value of h1 at the end)
 if CHECK1 on tmpb then abort - not a share
  (i.e. return false for a share)

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b
    b := a
    a := tmpa

 i := 63
    s0 := (a rightrotate 2) xor (a rightrotate 13) xor (a rightrotate 22)
    maj := (a and b) xor (a and c) xor (b and c)
    t2 := s0 + maj
    s1 := (e rightrotate 6) xor (e rightrotate 11) xor (e rightrotate 25)
    ch := (e and f) xor ((not e) and g)
    t1 := h + s1 + ch + k[i] + w[i]

 tmpa := h0 + t1 + t2 (this is the actual value of h0 at the end)
 if CHECK0 on tmpa then abort - not a share
  (i.e. return false for a share)

    h := g
    g := f
    f := e
    e := d + t1
    d := c
    c := b

 Add this chunk's hash to result:
 h0 := tmpa
 h1 := tmpb
 h2 := h2 + c
 h3 := h3 + d
 h4 := h4 + e
 h5 := h5 + f
 h6 := h6 + g
 h7 := h7 + h

It's a share - unless we need to test h2?
==================

Firstly the obvious (as I've said twice above):
This should only be done when calculating a hash to be tested as a share.
Since the actual process is a double-hash, the first hash should not, of course, do this.

In i=62:
If the tmpb test (CHECK1) says it isn't a share it avoids an entire loop (i=63), the 'e' calculation at i=62 and any unneeded assignments after that
and also we don't care about the actual values of h0-h7 so there is no need to assign them anything (or do the additions) except whatever is needed to affirm the result is not a share (e.g. set h0=-1 if h0..h7 must be examined later - or just return false if that is good enough - I don't know which the code actually needs)

CHECK1's probability of failure is high, so it easily covers the cost of the extra calculation (h1 + tmpa) needed to do it.

In i=63:
If the tmpa test (CHECK0) says it isn't a share it avoids the 'e' calculation at i=63 and any unneeded assignments after that
and also we don't care about the actual values of h0-h7 so there is no need to assign them anything (or do the additions) except whatever is needed to affirm the result is not a share (e.g. set h0=-1 if h0..h7 must be examined later - or just return false if that is good enough - I don't know which the code actually needs)


P.S. any and all mistakes I've made - oh well but the concept is there anyway


Any mistakes? Comments?
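
The arithmetic of the shortcut above can be checked against a reference implementation. This is a rough Python sketch under stated assumptions: `early_share_test`, `check1` and `check0` are hypothetical names, the two checks are passed in as plain predicates that return True while the hash may still be a share, and only the second SHA256 call is modelled. It confirms that tmpb and tmpa equal the final h1 and h0; it says nothing about whether the early exit pays off on a GPU.

```python
import hashlib
import struct
from math import isqrt

MASK = 0xFFFFFFFF

def first_primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found

def icbrt(x):
    # Integer cube root by binary search (avoids float rounding).
    lo, hi = 0, 1 << (x.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= x:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
IV = [isqrt(p << 64) & MASK for p in PRIMES[:8]]  # initial h0..h7
K = [icbrt(p << 96) & MASK for p in PRIMES]       # round constants k[i]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def second_pass_schedule(first_digest):
    # The 32-byte first hash padded to a single 64-byte block.
    block = first_digest + b"\x80" + bytes(23) + struct.pack(">Q", 256)
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        x, y = w[i - 15], w[i - 2]
        s0 = rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
        s1 = rotr(y, 17) ^ rotr(y, 19) ^ (y >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def run_rounds(state, w, start, stop):
    # Compression rounds i = start .. stop-1, as in the pseudo code.
    a, b, c, d, e, f, g, h = state
    for i in range(start, stop):
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & MASK
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [a, b, c, d, e, f, g, h]

def early_share_test(first_digest, check1, check0):
    """Return False as soon as h1 or h0 rules the hash out."""
    w = second_pass_schedule(first_digest)
    state = run_rounds(IV, w, 0, 63)       # i = 0..62
    tmpb = (IV[1] + state[0]) & MASK       # final h1, known before i = 63
    if not check1(tmpb):
        return False                       # round 63 is never executed
    state = run_rounds(state, w, 63, 64)   # i = 63
    tmpa = (IV[0] + state[0]) & MASK       # final h0
    return check0(tmpa)

first = hashlib.sha256(bytes(range(80))).digest()   # bogus 80-byte header
reference = struct.unpack(">8I", hashlib.sha256(first).digest())
print(early_share_test(first, lambda h1: h1 == reference[1],
                       lambda h0: h0 == reference[0]))
```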
full member
Activity: 219
Merit: 120
August 13, 2011, 01:38:02 AM

I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).


Looking forward to this !!

Just sent one coin your way, and there's another once the work is done.

Quote
We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.

Out of curiosity, have you looked into trying to code a version
directly in AMD's assembly language and bypassing OpenCL entirely ?
(I'm thinking: since we're already patching the ELF output, this seems
like the logical next step)

Also, have you looked at AMD CAL ? I know this is what ufasoft's miner
uses (https://bitcointalk.org/index.php?topic=3486.500), and also what
zorinaq considers the most efficient way to access AMD hardware (somewhere
on http://blog.zorinaq.com)



Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done the GPU is already at >99% of the theoretical maximum throughput. (ALU packing) And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA256.

Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.
full member
Activity: 154
Merit: 100
August 12, 2011, 03:19:20 PM
Hi, I used phatk 2.2 on my 5 rigs and I had restarting/BSOD errors occurring on all machines (5850 multi/single, 6850) on several occasions already.

Yes, there was an increase in hashrate; however, it seemed to have a memory leak or something. Just thought I'd inform you about this. Anyway, great work still. Looking forward to further improvements on the project. But for now, I'll revert to my previous settings.
newbie
Activity: 52
Merit: 0
August 12, 2011, 03:01:13 PM
I took a look at the comparison between version 2.2 and version 2.1
Could it be the __constant uint ConstW[128] change that broke VECTORS4?

That change is inconsequential (I was trying some things that required the change but did not keep them).. the compiler doesn't use those values, so the code should be exactly the same doing it either way (you can try and replace the code with the old code if you want to check).

You keep saying that it is broken.. if it does not run, post the errors.

I have found that on my card, VECTORS4 is much slower in version 2.2 than 2.1, but this is not a bug... it seems to be because openCL does not like allocating that many registers... Version 2.1 uses around 99.7% of instruction slots with VECTORS4 and I have tried many many ways to make it faster and more reliable (in 2.1), but I have given up on it.  It is still in the release because I don't see any point in taking it out...  but getting 2.2 to run as fast as 2.1 with VECTORS4 is not going to happen.  Also, the differences between 2.1 and 2.2 with VECTORS are very tiny anyway (less than .5%)...

Getting into more detail about it: If you look at the graph on the main page of the thread, you can see the graph of VECTORS4 in version 2.1... in version 2.2 for some reason, the spike (and corresponding valley) is located higher (somewhere around 500), this could mean that it would be just as fast if you had 1500 MHz memory, but I have no idea why openCL reacts this way to changing the memory speed.  There are way too many GPU architecture/GPU bios/PCIe bus/CPU-GPU transfer/driver/openCL implementation unknowns to try to predict this behavior.


-Phateus
member
Activity: 77
Merit: 10
August 12, 2011, 02:38:31 PM
I took a look at the comparison between version 2.2 and version 2.1
Could it be the __constant uint ConstW[128] change that broke VECTORS4?
full member
Activity: 140
Merit: 100
August 12, 2011, 01:50:47 PM
Thanks for the answer. Can you indicate the version of Diapolo's kernel you are referring to?

https://bitcointalksearch.org/topic/m.428882
newbie
Activity: 52
Merit: 0
August 12, 2011, 01:30:17 PM
Sent another donation your way.  Look forward to your work on cgminer.

Thanks
full member
Activity: 182
Merit: 100
August 12, 2011, 01:04:56 PM
Sent another donation your way.  Look forward to your work on cgminer.
newbie
Activity: 52
Merit: 0
August 12, 2011, 12:53:22 PM


What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

Would definitely be interested in a cgminer fork.  Don't get me wrong, phoenix is great and has always given me the best performance overall but it does lack some of the more refined features, which the other poster listed above.  Failover and nice static but updated command line "UI".  Seems like you and diapolo are hitting the ceiling with phoenix anyway.

I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).

We are hitting a ceiling with opencl in general (and perhaps with the current hardware).  In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.

Now that doesn't mean that there is NO room for improvement, just that any other improvement will probably have to be faster hardware, a more efficient implementation of openCL by AMD or figuring out a better way to finagle the current openCL implementation to reduce the implementation overhead.  But, unless there is a problem with pyopenCL, c and python should give equivalent speeds as long as they are just calling the openCL interface (the actual miner uses negligible resources).  I suppose it could be possible to access the hardware drivers directly and run the kernel that way... but I don't see that as being feasible.

But, with all of that said, I have looked through some of his code, and it is some really clean code.  Part of the reason I want to add these features is to learn more python (this is the first thing I have programmed in python), but it probably will just be easier modifying the cgminer code.  Thanks for pointing out cgminer to me
legendary
Activity: 1148
Merit: 1001
Radix-The Decentralized Finance Protocol
August 12, 2011, 08:35:19 AM
In theory, fewer ALU ops translate to less energy consumption. In practice, each ALU op uses a slightly different amount of power, and a kernel which executes instruction A 10 times may burn more power than one which executes instruction B 12 times. Unfortunately, instruction power numbers aren't documented anywhere, so it is almost impossible to optimize in a theoretical sense, and they could vary from GPU to GPU (due to minor manufacturing defects).

One of Diapolo's recent kernels lowered operating temperature by ~3C without changing hashrate significantly. Presumably that particular kernel is ~10% more power efficient than others.

Thanks for the answer. Can you indicate the version of Diapolo's kernel you are referring to?
full member
Activity: 140
Merit: 100
August 12, 2011, 08:28:56 AM
There is a thing I don't understand about the results of these modifications. They increase the hash rate but they also increase consumption, and I always thought that since they are making the kernel more efficient (same task with fewer instructions, less work for the GPU per hash) they should increase the hash rate without changing consumption too much. Does anyone know why the more efficient kernel is not also more energy efficient?

Also, if one of you guys is out of ideas to make the cards run faster, it could be interesting to target energy efficiency instead of speed. A lot of us are not interested in running our cards at the maximum MHash/s rate but are more interested in having a better MHash/J rate.


In theory, fewer ALU ops translate to less energy consumption. In practice, each ALU op uses a slightly different amount of power, and a kernel which executes instruction A 10 times may burn more power than one which executes instruction B 12 times. Unfortunately, instruction power numbers aren't documented anywhere, so it is almost impossible to optimize in a theoretical sense, and they could vary from GPU to GPU (due to minor manufacturing defects).

One of Diapolo's recent kernels lowered operating temperature by ~3C without changing hashrate significantly. Presumably that particular kernel is ~10% more power efficient than others.
member
Activity: 224
Merit: 10
August 12, 2011, 08:23:02 AM
It is more efficient - the more output per unit time you have, the more efficient it is since the card will be wasting less power sitting idle.

If you want to increase efficiency, that is a hardware thing - namely undervolt your card.
legendary
Activity: 1148
Merit: 1001
Radix-The Decentralized Finance Protocol
August 12, 2011, 01:54:36 AM
There is a thing I don't understand about the results of these modifications. They increase the hash rate but they also increase consumption, and I always thought that since they are making the kernel more efficient (same task with fewer instructions, less work for the GPU per hash) they should increase the hash rate without changing consumption too much. Does anyone know why the more efficient kernel is not also more energy efficient?

Also, if one of you guys is out of ideas to make the cards run faster, it could be interesting to target energy efficiency instead of speed. A lot of us are not interested in running our cards at the maximum MHash/s rate but are more interested in having a better MHash/J rate.

full member
Activity: 182
Merit: 100
August 11, 2011, 10:46:11 PM


What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

Would definitely be interested in a cgminer fork.  Don't get me wrong, phoenix is great and has always given me the best performance overall but it does lack some of the more refined features, which the other poster listed above.  Failover and nice static but updated command line "UI".  Seems like you and diapolo are hitting the ceiling with phoenix anyway.
hero member
Activity: 812
Merit: 502
August 11, 2011, 10:24:26 PM
Using the latest 2.2 version got quite a noticeable increase:

Before:
4x 440Mh/s = 1760Mh/s

After:
4x 446Mh/s = 1784Mh/s

My best settings are:
Worksize = 256
Aggression = 12
VECTORS
legendary
Activity: 1512
Merit: 1036
August 11, 2011, 10:12:32 PM
Big Edit:

I looked again at the AMD APP SDK v2.5, trying to get it to not suck. I did one more thing, not only did I install the 2.5 SDK (on Catalyst 11.6), but I also re-compiled pyopencl 0.92 against the newer SDK. On phatk 2.2, changing just from 2.4 SDK to 2.5 SDK with a matching pyOpenCL gets a hair more mhash:
SDK 2.4: 309.97
SDK 2.5: 310.10

Just to let people know, regarding the APP SDK, the version installed as well as the version used to compile pyopencl both seem to matter (not that this helps you if you are using just the prepackaged Windows phoenix.exe.)

Using a pyOpenCL newer than 0.92 gives a deprecation warning:

[0 Khash/sec] [0 Accepted] [0 Rejected] [RPC]kernels\phatk\__init__.py:414: DeprecationWarning: 'enqueue_read_buffer' has been deprecated in version 2011.1. Please use enqueue_copy() instead.
  self.commandQueue, self.output_buf, self.output)
[11/08/2011 21:10:22] Server gave new work; passing to WorkQueue
[291.32 Mhash/sec] [0 Accepted] [0 Rejected] [RPC (+LP)]kernels\phatk\__init__.py:427: DeprecationWarning: 'enqueue_write_buffer' has been deprecated in version 2011.1. Please use enqueue_copy() instead.
  self.commandQueue, self.output_buf, self.output)


Using pyOpenCL 2011.1.2 with the kernel in its current form gets me less mhash though:
SDK 2.4: 307.98
SDK 2.5: 307.84

(5830@955/350; Catalyst 11.6; Win7; py 2.6.6)
full member
Activity: 219
Merit: 120
August 11, 2011, 03:44:55 PM

What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).

In most cases you won't see much if any decrease in the number of getwork requests by running multiple kernels behind the same work queue. The reason for having a work queue in the first place is so that the miner only needs to ask for more work when the queue falls below a certain size. During normal operation Phoenix won't request more work than absolutely necessary. There might be a small benefit to doing this when the block changes, but aside from that the getwork count for a single instance running 2 kernels compared to 2 instances will be very close.
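
The point about getwork counts can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Phoenix's actual WorkQueue: the class, the `low_water` threshold and the `simulate` helper are hypothetical, every kernel consumes work at the same rate, and block changes (where a shared queue might save a few requests) are not modelled.

```python
from collections import deque

class WorkQueue:
    """Toy queue: asks for more work only when it holds fewer than low_water units."""

    def __init__(self, low_water=2):
        self.low_water = low_water
        self.items = deque()
        self.getwork_calls = 0

    def _refill(self):
        while len(self.items) < self.low_water:
            self.getwork_calls += 1            # stands in for one getwork request
            self.items.append(self.getwork_calls)

    def take(self):
        self._refill()
        unit = self.items.popleft()
        self._refill()
        return unit

def simulate(instances, kernels_per_instance, units_per_kernel):
    # Each instance owns one queue; its kernels all draw from that queue.
    queues = [WorkQueue() for _ in range(instances)]
    for _ in range(units_per_kernel):
        for q in queues:
            for _ in range(kernels_per_instance):
                q.take()
    return sum(q.getwork_calls for q in queues)

print(simulate(1, 2, 100))   # one instance, two kernels  -> 202
print(simulate(2, 1, 100))   # two instances, one kernel each -> 204
```

In this model the only difference is the small standing buffer each extra queue keeps, which matches the "very close" estimate above.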

That said, I am interested to see the results of the other changes you mentioned. Feel free to PM me if you have any questions.
newbie
Activity: 52
Merit: 0
August 11, 2011, 11:50:32 AM
As of version 2.1, phatk now has command line option "VECTORS4" which can be used instead of "VECTORS".
This option works on 4 nonces per thread instead of 2 and may increase speed mainly if you do not underclock your memory, but feel free to try it out.  Note that if you use this, you will more than likely have to decrease your WORKSIZE to 128 or 64.

I'm using a 6770 @ 1.01Ghz with phatk 2.2.  When I run the memory clock at 300Mhz with the VECTORS option, I get 234.5Mhps.  However, I can't seem to reap the benefits of VECTORS2 or VECTORS4 at a higher memory clock (i.e. 1.2Ghz).  I've reduced the WORKSIZE from 256 to 128 and 64 and can only seem to peak at 213Mhps.  With these options, I can only achieve between 204 and 213 Mhps.

I have found that VECTORS4 is extremely unreliable... even tiny changes in the kernel and other factors affect the hashrate tremendously...  OpenCL gets really weird when you use a lot of registers.  I added it in 2.1 because it was comparable to VECTORS in some situations, but changing the kernel slightly in 2.2 seems to have broken it (even though KernelAnalyzer says it uses fewer registers and fewer ALU ops... *sigh*)

Anyone wondering about any new kernel improvements, I seem to be at a standstill... I have tried the following:
  • Removing all control flow operations (about 1MH/s slower)
  • Sending all kernel arguments in a buffer (about 1MH/s slower)
  • Using an atomic counter for the output so that the output buffer is written sequentially (about the same speed and only works on ATI xxx cards and newer)
  • Using an internal loop in the kernel to process multiple nonces (Either significantly slower or massive desktop lag)
  • Calling set_arg only once per getwork instead of once per kernel call (only faster when using very low aggression and FASTLOOP, I will add this to my next kernel release)

-Phateus
newbie
Activity: 52
Merit: 0
August 11, 2011, 11:33:14 AM
Just did a test:

Rig setup:
  Linuxcoin v0.2b (Linux version 2.6.38-2-amd64)
  Dual HD5970 (4 GPU cores in the rig)
  Mem clock @ 300Mhz
  Core clock @ 800Mhz
  VCore @ 1.125v
  AMD SDK 2.5
  Phoenix r100
  Phatk v2.2
  -v -k phatk BFI_INT VECTORS WORKSIZE=256 AGGRESSION=11 FASTLOOP=false

Result:
  Overall Rig rate: 1484 MH/s
  Rate per core: 371 MH/s

This is ~4MH/s faster than Diapolo's latest.

On 5970, phatk 2.2 is current king of the hill.

For the world to be perfect, this kernel needs to be integrated into cgminer



The last kernel releases show that it is a bit of trial and error to find THE perfect kernel for a specific setup. Phateus and I try to use the KernelAnalyzer and our setups as a first measurement of whether a new kernel got "faster". But there are many different factors that come into play, like OS, driver, SDK, miner software and so on.

I would suggest that we try to create a kernel which is based on the same kernel parameters for phatk and phatk-Diapolo, so that users are free to choose which kernel is used. One thing is that the CGMINER kernel uses the switch VECTORS2, where Phoenix used only VECTORS (which I changed to VECTORS2 in my last kernel releases). It doesn't even matter whether the kernels use the same variable names (in fact they are different sometimes) as long as the main miner software passes the expected values to the kernel in a defined sequence.

Dia

A good idea.

A further improvement: I'd like to have an option in my miner that spends ~2 min
benchmarking all the kernels available in the current directory (without talking to
a pool, i.e. doing pure SHA256 on bogus nonces), and picking the fastest for the
current rig.

For people with lots of different rigs/setups, that would save them the headache
of having to hand-tune each instance.
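
A rough idea of what such a self-benchmark could look like, as a Python sketch only. Everything here is hypothetical: the two "kernels" are plain CPU loops over `hashlib` standing in for real OpenCL kernels, and `pick_fastest` and its `clock` parameter are invented names. It hashes bogus nonces without contacting a pool and reports the fastest candidate.

```python
import hashlib
import struct
import time

def double_sha256(header76, nonce):
    data = header76 + struct.pack("<I", nonce)
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def kernel_scalar(header76, start, count):
    # Stand-in kernel: one nonce per iteration.
    for n in range(start, start + count):
        double_sha256(header76, n)

def kernel_pairs(header76, start, count):
    # Stand-in kernel: two nonces per iteration (count assumed even).
    for n in range(start, start + count, 2):
        double_sha256(header76, n)
        double_sha256(header76, n + 1)

def pick_fastest(kernels, count=2000, clock=time.perf_counter):
    """Time every kernel on the same bogus work and return (best_name, rates)."""
    header76 = bytes(76)                      # bogus work, no pool involved
    rates = {}
    for name, fn in kernels.items():
        t0 = clock()
        fn(header76, 0, count)
        elapsed = clock() - t0
        rates[name] = count / elapsed if elapsed > 0 else float("inf")
    best = max(rates, key=rates.get)          # highest hashes per second wins
    return best, rates

best, rates = pick_fastest({"scalar": kernel_scalar, "pairs": kernel_pairs})
print(best, rates)
```

A real version would have to repeat the run per WORKSIZE/VECTORS/AGGRESSION combination, since the thread shows those settings change the ranking from rig to rig.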


What I am currently working on is a modified version of phoenix which runs multiple kernels with a single instance and a single work queue (to decrease excessive getwork).
I am also working on plugin support for it, so you can use various added features (such as built-in gui, Web interface, logger, autotune, variable aggression for when computer is idle, overclocking support, etc...)
This would make it tremendously easier for anyone to add features and you can still use whichever kernel works best for you.

As for cgminer support, I haven't tried it, are there any benefits over phoenix?  I may fork that instead of phoenix and make the plugin support via command-line, lua or javascript, although I find that python is much easier to code than c (especially for cross platform support).