
Topic: Modified Kernel for Phoenix 1.5 - page 2. (Read 96725 times)

hero member
Activity: 658
Merit: 500
August 18, 2011, 03:23:43 AM
I'm getting hardware errors on phatk 2.2; I didn't get them on Diapolo's or 2.1.

The three are about indistinguishable in terms of speed for me.
newbie
Activity: 51
Merit: 0
August 18, 2011, 01:28:30 AM
Excellent work on cgminer; I'm seeing about the same performance as phoenix and am enjoying the fancy cg features. Donation on its way.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
August 17, 2011, 01:44:39 AM
If you want to restore your tree without losing your changes, create a new branch and reset master to the last commit before your changes:

git checkout master
git branch newphatk
git reset --hard 58eb4d58599521933a3fef599e1dcba4f996dadc
git pull

That will pull my changes into the master branch, and your personal changes will be in newphatk. Unfortunately, your github account has a messed-up master now, so

git push -f

will force the changes to propagate. Do not use this command normally, as it makes it impossible for people pulling from your branch to stay in sync.
newbie
Activity: 52
Merit: 0
August 17, 2011, 12:51:08 AM
Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However, I don't want multiple phatk kernels, so just replace the current one in situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k, so feel free to do it your way, but be aware I plan to change it back.

Ok, the source is up... I am trying to figure out how to compile this for windows without the cygwin layer (I really haven't done any of this before... I am soooo lost)...

https://github.com/Phateus/cgminer

ckolivas... if you want to merge this into your code at some point, let me know what I have to do... I literally installed git yesterday, and there is only so much you can learn on the internet in a day ;-)

As for the buffer, my kernel only uses WORKSIZE+1 parts of your buffer, but I left the buffer size intact.
Very good work. Nice of you to figure out how to do git and all as well. Don't worry about the merge; I've taken care of everything and cherry-picked your changes as I needed to. I've modified a few things too, to be consistent with cgminer's code, and there is definitely a significant speed advantage thanks to your changes. Note that if you're ever working in git doing your own changes, do them on a branch that's not called master, as you may end up making it impossible to pull back my changes, since I won't necessarily take all your code. Thanks again, and I'm sure the cgminer users will be most grateful. Smiley


Ah, that's how that works... good to know.  This whole git seems really useful for working together.  Thanks

-Phateus
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
August 17, 2011, 12:14:20 AM
Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However, I don't want multiple phatk kernels, so just replace the current one in situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k, so feel free to do it your way, but be aware I plan to change it back.

Ok, the source is up... I am trying to figure out how to compile this for windows without the cygwin layer (I really haven't done any of this before... I am soooo lost)...

https://github.com/Phateus/cgminer

ckolivas... if you want to merge this into your code at some point, let me know what I have to do... I literally installed git yesterday, and there is only so much you can learn on the internet in a day ;-)

As for the buffer, my kernel only uses WORKSIZE+1 parts of your buffer, but I left the buffer size intact.
Very good work. Nice of you to figure out how to do git and all as well. Don't worry about the merge; I've taken care of everything and cherry-picked your changes as I needed to. I've modified a few things too, to be consistent with cgminer's code, and there is definitely a significant speed advantage thanks to your changes. Note that if you're ever working in git doing your own changes, do them on a branch that's not called master, as you may end up making it impossible to pull back my changes, since I won't necessarily take all your code. Thanks again, and I'm sure the cgminer users will be most grateful. Smiley
newbie
Activity: 52
Merit: 0
August 16, 2011, 11:13:55 PM
Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However, I don't want multiple phatk kernels, so just replace the current one in situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k, so feel free to do it your way, but be aware I plan to change it back.

Ok, the source is up... I am trying to figure out how to compile this for windows without the cygwin layer (I really haven't done any of this before... I am soooo lost)...

https://github.com/Phateus/cgminer

ckolivas... if you want to merge this into your code at some point, let me know what I have to do... I literally installed git yesterday, and there is only so much you can learn on the internet in a day ;-)

As for the buffer, my kernel only uses WORKSIZE+1 parts of your buffer, but I left the buffer size intact.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
August 16, 2011, 05:18:23 PM
Seems to me like you've got it all under control, so I'll leave you to finish up. Thanks for your involvement. However, I don't want multiple phatk kernels, so just replace the current one in situ and don't bother enumming a different kernel. As for the output code, I prefer to use 4k, so feel free to do it your way, but be aware I plan to change it back.
newbie
Activity: 52
Merit: 0
August 16, 2011, 12:49:36 PM
Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass..


Yeah, mingw is most certainly a giant PITA.

To compile cgminer with mingw, the trick is to use msys and get pkg-config and libcurl installed properly.

For pkg-config, the best approach is to install this: http://ftp.gnome.org/pub/gnome/binaries/win32/gtk+/2.22/gtk+-bundle_2.22.1-20101227_win32.zip

Once you have that, libcurl is rather easy.

Quote
trying a full cygwin install next...

Mmmh. Not sure this'll get you very far.

If your main dev box is windows and your goal is to integrate phatk into cgminer, your best bet is probably to install a small virtual machine (qemu or vmplayer) running ubuntu inside your windows box and work on cgminer directly on Linux in there.

That's exactly what I do (the other way round) when I have to try windows-specific things or a piece of code.


Yeah, I think I will stay away from using the mingw environment from now on... Cygwin was easy as pie. No issues; I think I can cross-compile from cygwin using mingw if I want native Win32 support. Apparently, getting pkg-config (I think) working without POSIX support is terrible. I got my kernel working around 5am last night, linking against the cygwin dlls, so tonight I will release the changes when I get home.

Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus
You could just tell me what to do to interface it with cgminer (i.e. what new variables you want) and I'd copy most of your kernel across. Only the return code and define macros are actually different in cgminer in the kernel itself.

Yeah, if you want, I can send you the changes tonight so you can put them in your release. The only modifications I had to make to the kernel are changing VECTORS to VECTORS2, hardcoding OUTPUT_SIZE = 4095, and hardcoding WORKSIZE = 256 (I really do need this passed to the kernel, though). Also, my kernel only uses WORKSIZE+1 entries in the buffer; it would be better if you made the buffer that size.
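
For illustration, here is a minimal sketch of how the host side can pass those defines to the kernel at compile time via clBuildProgram (the function name, option string, and error handling are my own assumptions, not cgminer's actual code):

Code:
/* Hypothetical sketch: feed WORKSIZE and the vector-width switch to the
   OpenCL compiler as -D defines, so the kernel need not hardcode them. */
#include <stdio.h>
#include <CL/cl.h>

static int build_phatk(cl_program program, cl_device_id device,
                       size_t worksize)
{
	char options[128];

	/* e.g. "-DWORKSIZE=256 -DVECTORS2", matching the defines above */
	snprintf(options, sizeof(options),
	         "-DWORKSIZE=%zu -DVECTORS2", worksize);

	cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);
	if (err != CL_SUCCESS) {
		fprintf(stderr, "clBuildProgram failed: %d\n", err);
		return -1;
	}
	return 0;
}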

As for the changes in the miner, I think I only had to change the precalc_hash() function, the kernel input and output file names, and the queue_phatk_kernel() function.
What I will do tonight is add KL_PHATK_2_2 to the cl_kernel enum, copy the function code, add the corresponding command line argument (right now I have just replaced PHATK with mine), and add a -DWORKSIZE= argument for the kernel.

Anyway, I will give you more details tonight when I am in front of my code.
My fork is https://github.com/Phateus/cgminer; I will upload the changes tonight (as soon as I figure out git... never used that before).

-Phateus

P.S. thanks for the easy to read code Smiley
member
Activity: 77
Merit: 10
August 16, 2011, 10:20:46 AM
It seems like your latest kernel and mine have problems if BFI_INT gets forced off via (BFI_INT=false) ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit, and solved: the non-BFI_INT Ch has to be:
Code:
#define Ch(x, y, z) bitselect(z, y, x)

If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!

Dia

Awesome, thank you! I was under the assumption that BFI_INT and bitselect were the same operation; apparently, the operand order is different. I will fix it in my next release.

Thank you everyone for your support (both in BTC and discussion).

I should have a drop-in version of the kernel available for cgminer soon, so anyone wanting to try out the pre-release, I'll be posting it tonight.

@BOARBEAR
*sigh*.... come on man... do you even read my posts? There is no single cause of the bad performance. 2.2 executes fewer instructions and uses fewer registers than 2.1, but as I said, there is some weird issue which makes OpenCL slower behind the scenes. My best guess is that it has to do with register allocation.

The GPU has a total of 256x32x4 registers (8192 UINT4). At most, there are 256 threads per workgroup (8192/256 = 32 registers per thread). Using VECTORS, the number of registers is far below this number, so the hardware can run the maximum allowable number of threads at a time. However, when you compile with VECTORS4, there are more than 32 registers per thread. OpenCL must determine how to allocate the threads, and the utilization of the video card is sub-optimal. Below is a diagram of what I think is going on.


4 thread groups running simultaneously with VECTORS (2 running at a time):
[1111111122222222]
[3333333344444444]

Using an optimal version of VECTORS4, it would look much like this (double the work is done per thread):
[1111111111111111]
[2222222222222222]
[3333333333333333]
[4444444444444444]

Now, making it use slightly fewer resources will make it slower, because the threads are out of sync and there will be overhead in syncing and tracking data within threadgroups:
[1111111111111112]
[2222222222222233]
[3333333333333444]
[4444444444445555]

Now, I may be waaaaay off here, but something like this is what makes sense to me.  Especially, since this would explain why decreasing the memory actually improves performance in some cases (by forcing synchronization).

Anyway, enough of my off-topic analysis...


I will release a version that will work with cgminer early next week (looks like he has already implemented Diapolo's old version).


Looking forward to this!!

Just sent one coin your way, and there's another once the work is done.

Quote
We are hitting a ceiling with OpenCL in general (and perhaps with the current hardware). In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical minimum number of instructions in the kernel, unless we are missing something.

Out of curiosity, have you looked into trying to code a version directly in AMD's assembly language and bypassing OpenCL entirely? (I'm thinking: since we're already patching the ELF output, this seems like the logical next step Smiley)

Also, have you looked at AMD CAL? I know this is what ufasoft's miner uses (https://bitcointalk.org/index.php?topic=3486.500), and also what zorinaq considers the most efficient way to access AMD hardware (somewhere on http://blog.zorinaq.com).



Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done, the GPU is already at >99% of the theoretical maximum throughput (ALU packing). And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA256.

Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.

Agreed, the kernel itself is pretty optimal. I might look into calling lower-level CAL functions to manage the (OpenCL-compiled) GPU threads instead of using OpenCL, but I doubt this will give any speedup (although I might be able to reduce the CPU overhead).
I understand what you are saying. Perhaps version 2.1 will be the last version that works well with VECTORS4. You said the work that has been done on the GPU is already at >99% of the theoretical maximum throughput, but VECTORS4 alone gives me about a 1.5% boost (a contradiction?). That is why I tried hard to find a way to make VECTORS4 work, so that future versions can use it.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
August 16, 2011, 07:07:38 AM
Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus
You could just tell me what to do to interface it with cgminer (i.e. what new variables you want) and I'd copy most of your kernel across. Only the return code and define macros are actually different in cgminer in the kernel itself.
newbie
Activity: 52
Merit: 0
August 15, 2011, 09:35:42 PM
Alright... I'm getting a little delayed on the prerelease for cgminer... mingw is a pain in the ass.. trying a full cygwin install next...

Bear with me, hopefully I'll get it running tomorrow.

-Phateus
newbie
Activity: 52
Merit: 0
August 15, 2011, 12:54:59 PM
It seems like your latest kernel and mine have problems if BFI_INT gets forced off via (BFI_INT=false) ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit, and solved: the non-BFI_INT Ch has to be:
Code:
#define Ch(x, y, z) bitselect(z, y, x)

If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!

Dia

Awesome, thank you! I was under the assumption that BFI_INT and bitselect were the same operation; apparently, the operand order is different. I will fix it in my next release.
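
For the curious, here is a minimal standalone sketch of why the operand order matters (plain C over scalar uints; the bitselect32 helper is mine, written to match OpenCL's bitselect semantics): Ch takes bits of y where x is 1 and bits of z where x is 0, while bitselect(a, b, c) takes bits of b where c is 1 and bits of a where c is 0, so the mask x has to go last.

Code:
/* Illustrative only: SHA-256 Ch versus OpenCL-style bitselect.
   bitselect(a, b, c) = (a & ~c) | (b & c): bits come from b where the
   mask c is 1, and from a where it is 0. */
#include <assert.h>
#include <stdint.h>

static uint32_t Ch(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) ^ (~x & z);	/* textbook SHA-256 definition */
}

static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
	return (a & ~c) | (b & c);	/* same semantics as OpenCL bitselect */
}

int main(void)
{
	uint32_t x = 0xdeadbeef, y = 0x12345678, z = 0xcafef00d;

	/* The mask x must be the third argument... */
	assert(Ch(x, y, z) == bitselect32(z, y, x));
	/* ...whereas bitselect32(x, y, z) would mask with z instead. */
	return 0;
}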

Thank you everyone for your support (both in BTC and discussion).

I should have a drop-in version of the kernel available for cgminer soon, so anyone wanting to try out the pre-release, I'll be posting it tonight.

@BOARBEAR
*sigh*.... come on man... do you even read my posts? There is no single cause of the bad performance. 2.2 executes fewer instructions and uses fewer registers than 2.1, but as I said, there is some weird issue which makes OpenCL slower behind the scenes. My best guess is that it has to do with register allocation.

The GPU has a total of 256x32x4 registers (8192 UINT4). At most, there are 256 threads per workgroup (8192/256 = 32 registers per thread). Using VECTORS, the number of registers is far below this number, so the hardware can run the maximum allowable number of threads at a time. However, when you compile with VECTORS4, there are more than 32 registers per thread. OpenCL must determine how to allocate the threads, and the utilization of the video card is sub-optimal. Below is a diagram of what I think is going on.


4 thread groups running simultaneously with VECTORS (2 running at a time):
[1111111122222222]
[3333333344444444]

Using an optimal version of VECTORS4, it would look much like this (double the work is done per thread):
[1111111111111111]
[2222222222222222]
[3333333333333333]
[4444444444444444]

Now, making it use slightly fewer resources will make it slower, because the threads are out of sync and there will be overhead in syncing and tracking data within threadgroups:
[1111111111111112]
[2222222222222233]
[3333333333333444]
[4444444444445555]

Now, I may be waaaaay off here, but something like this is what makes sense to me.  Especially, since this would explain why decreasing the memory actually improves performance in some cases (by forcing synchronization).

Anyway, enough of my off-topic analysis...


I will release a version that will work with cgminer early next week (looks like he has already implemented Diapolo's old version).


Looking forward to this!!

Just sent one coin your way, and there's another once the work is done.

Quote
We are hitting a ceiling with OpenCL in general (and perhaps with the current hardware). In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical minimum number of instructions in the kernel, unless we are missing something.

Out of curiosity, have you looked into trying to code a version directly in AMD's assembly language and bypassing OpenCL entirely? (I'm thinking: since we're already patching the ELF output, this seems like the logical next step Smiley)

Also, have you looked at AMD CAL? I know this is what ufasoft's miner uses (https://bitcointalk.org/index.php?topic=3486.500), and also what zorinaq considers the most efficient way to access AMD hardware (somewhere on http://blog.zorinaq.com).



Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done, the GPU is already at >99% of the theoretical maximum throughput (ALU packing). And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA256.

Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.

Agreed, the kernel itself is pretty optimal. I might look into calling lower-level CAL functions to manage the (OpenCL-compiled) GPU threads instead of using OpenCL, but I doubt this will give any speedup (although I might be able to reduce the CPU overhead).
legendary
Activity: 1904
Merit: 1007
August 15, 2011, 08:24:41 AM
Sent another donation your way.  Look forward to your work on cgminer.
+1
hero member
Activity: 772
Merit: 500
August 14, 2011, 02:47:31 PM
It seems like your latest kernel and mine have problems if BFI_INT gets forced off via (BFI_INT=false) ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit, and solved: the non-BFI_INT Ch has to be:
Code:
#define Ch(x, y, z) bitselect(z, y, x)

If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!

Dia
hero member
Activity: 560
Merit: 517
August 14, 2011, 01:30:05 PM
Quote
Still, why not do the share/H6 test in GPU - it would certainly be faster - shares are also rare compared to a job (about 1 in 2 billion)
Is that an issue with the CL not being able to be changed based on the difficulty?
There are several reasons:

- 99.99% of the time the mining software only needs to look for Difficulty 1 (a share, H7==0), so there is rarely a need to check for anything else.
- GPUs absolutely hate branching; a full Difficulty check involves many branches.
- Smaller GPU programs are better GPU programs.
- The CPU runs in parallel to the GPU. Since the CPU is fully capable of checking for extra Difficulty levels, why would you burden the GPU with such work?
- The CPU should double-check the GPU's results anyway, to detect errors. Since the CPU will thus be recomputing the full two SHA-256 passes for each result returned by the GPU, it again makes sense to only check for higher difficulties on the CPU.
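
To make that division of labor concrete, here is a small hypothetical sketch of the CPU side (the sha256d_hash() helper and the mask-based higher-difficulty test are invented for illustration): the GPU only ever reports nonces with H7 == 0, and the CPU re-hashes to verify and applies any stricter test itself.

Code:
/* Hypothetical CPU-side check; sha256d_hash() is assumed to exist
   elsewhere and to write the eight 32-bit state words of the full
   double SHA-256 into out[0..7]. */
#include <stdbool.h>
#include <stdint.h>

void sha256d_hash(uint32_t out[8], const uint32_t *header, uint32_t nonce);

bool check_result(const uint32_t *header, uint32_t nonce,
                  uint32_t h6_target_mask)
{
	uint32_t h[8];

	sha256d_hash(h, header, nonce);	/* recompute both passes on the CPU */

	if (h[7] != 0)
		return false;		/* GPU error: not even a share */

	/* Higher difficulty: also require the masked bits of H6 to be 0. */
	return (h[6] & h6_target_mask) == 0;
}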
member
Activity: 77
Merit: 10
August 14, 2011, 12:58:50 PM
I tried to figure out the reason version 2.2 does not work well with VECTORS4, but I could not find out why, as I do not have enough knowledge. Here are some results I found.

Replacing this block of code in version 2.1 with the corresponding block from version 2.2 makes VECTORS4 much slower:

Code:
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define P3(x)  W[x-7]
#define P4(x)  W[x-16]


//Partial Calcs for constant W values
#define P1C(n) ((rotate(ConstW[(n)-2],15)^rotate(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotate(ConstW[(n)-15],25)^rotate(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n)  Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n) Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n) Barrier(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier(n) t1 = t1C((n) % 64)
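
A side note on the (3 + 128 - (n)) % 8 indexing in the sharound macros, since it is easy to misread: under my reading of the code (the V() helper and the S0/S1/Ch/Maj/K/W names below are assumptions for illustration), the kernel never shuffles the eight state words the way the textbook does (h=g; g=f; ...); it leaves them in place and renames them by rotating the indices, so each round writes only two words.

Code:
/* Sketch of the rotating-index trick (my reading, not the kernel's
   exact macros).  Vals[] holds the 8 SHA-256 state words; at round n,
   word 'd' lives at index (3 + 128 - n) % 8 and 'h' at (7 + 128 - n) % 8.
   The +128 keeps the modulo argument non-negative over both passes. */
#define V(i, n) Vals[((i) + 128 - (n)) % 8]

#define ROUND(n) \
	do { \
		uint t1 = V(7, n) + S1(V(4, n)) + Ch(V(4, n), V(5, n), V(6, n)) \
		        + K[(n) % 64] + W[(n) % 64]; \
		uint t2 = S0(V(0, n)) + Maj(V(0, n), V(1, n), V(2, n)); \
		V(3, n) += t1;	/* new e = d + t1 */ \
		V(7, n) = t1 + t2;	/* new a = t1 + t2, overwriting old h */ \
	} while (0)

/* Since (i + 128 - (n+1)) % 8 == ((i-1) + 128 - n) % 8, round n+1
   automatically sees the renamed words with no data movement. */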

And this block is not the only thing that causes the problem.

I am guessing it has something to do with the rotC function (it is a guess only).
full member
Activity: 140
Merit: 100
August 14, 2011, 11:46:33 AM
One of my cards (a 5850 at 835 MHz; downclocking to 810 MHz still failed) seems unable to work well with phatk 2.1/2.2. It dies after a while (<1h), and I have to restart the miner (win32) or reset (linux).
Diapolo's 2011.7.17 kernel is OK @ 835 MHz.
hero member
Activity: 772
Merit: 500
August 14, 2011, 11:05:07 AM
It seems like your latest kernel and mine have problems if BFI_INT gets forced off via (BFI_INT=false) ... it seems the results are invalid every time.
Any idea Phateus?

Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?

Edit: Could be my setup if no one else has this error Cheesy.

Dia
legendary
Activity: 4592
Merit: 1851
Linux since 1997 RedHat 4
August 14, 2011, 06:47:00 AM
...
Quote
Well I've been talking to a few people about this but got no real response from anyone that it was possible ...
The optimization you've spelled out is more or less already implemented in most, if not all, GPU miners.

The way GPU miners currently work is that they check in the GPU code whether H7==0. If so, the result (a nonce) is returned; otherwise nothing is returned. It is the responsibility of the CPU software to do any further difficulty checks if needed.

Since the only thing the GPU miners care about is H7, they completely skip the last 3 rounds (stopping after the 61st round).

Also note that GPU miners don't calculate the first 3 rounds of the first pass. Those rounds are pre-computed, because the inputs to those rounds remain constant for a given unit of getwork. So a GPU miner really only computes a grand total of 122 rounds, minus various other small pre-calculations here and there.
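
As a rough illustration of that flow (the output-buffer layout here is invented, not phatk's or cgminer's real one), the end of the kernel reduces to a single compare, and only matching nonces ever travel back to the CPU:

Code:
/* OpenCL fragment, illustrative only: h7 is state word 7 after the
   61st round of the second pass; the remaining rounds are skipped. */
if (h7 == 0u) {
	/* Difficulty-1 hit: append the nonce for the CPU to verify.
	   Slot 0 of this invented layout holds a count. */
	uint slot = atomic_inc(&output[0]) + 1u;
	output[slot] = nonce;
}
/* Anything harder than difficulty 1 is left to the CPU. */
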
OK, so I've got the H's back-to-front (H7 is the first one, not H0); then yeah, that makes sense of doing even fewer steps than what I said.
Still, why not do the share/H6 test in GPU - it would certainly be faster - shares are also rare compared to a job (about 1 in 2 billion)
Is that an issue with the CL not being able to be changed based on the difficulty?
Yet it could be done as a simple pre-calculated number to AND against the H6 value (extra calculation) when H7 is zero.
(I should work out what's the difficulty value high enough to need to test H5 ... though that may be so large it would never be reached)

Edit: of course, if a nonce (H7=0) is the requirement of a share, then there is no more testing (no testing of H6) required.
I need to read pushpool more closely to determine exactly what a share is ... unless someone feels like answering that ... Smiley

Edit2: so skipping the first 3 rounds of the first pass is possible (128 - 3 = 125),
but there are actually 3.5 rounds not needed at the end of the 2nd pass - though I guess you already do that.
Round 60 (2nd pass) becomes only the calculations necessary to get t1 (s1 & ch), since s0 and maj (and of course t2) are unneeded.
hero member
Activity: 504
Merit: 502
August 14, 2011, 05:57:41 AM
You may be one, but you are the champion of many Tongue

It's working great on my lazy spare windows machine, thanks Smiley