Topic: OFFICIAL CGMINER mining software thread for linux/win/osx/mips/arm/r-pi 4.11.0 - page 632. (Read 5805971 times)

-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
min(x,y) http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/commonMin.html
gets implemented low-level as
Code:
w: MIN_UINT    R0.w,  R0.x,  PV1350.y
, which *should*  (I know, AMD... Roll Eyes) be rather stable. The big problem with the alternative (&) is the huge number of false positives, since it's bitwise, like 01010011 & 10101100 = 00000000, which is bad for the branch predictor. I'm testing now with a conservative approach (just this one change from default),
Code:
#elif defined VECTORS2
bool result = min(W[117].x,W[117].y);
if (!result) {
if (!W[117].x)
output[FOUND] = output[NFLAG & W[3].x] = W[3].x;
if (!W[117].y)
output[FOUND] = output[NFLAG & W[3].y] = W[3].y;
}
and got a slight (3~4MH/s) increase (5850, SDK 2.5 from Cat 11.11).
You can do the maths on false positives. You're greatly exaggerating the "HUGE NUMBER". It's about 1 share for 1 false positive. More so on 4 vectors (but no one uses them). That is not remotely common...

Increase eh?

Call me sceptical to the core.

EDIT: I will look into it, but I'm so terrified of unintentionally breaking shit like I did last time. It was in this code specifically where the slowdown was, so you can imagine why I'm so resistant.
Vbs
hero member
Activity: 504
Merit: 500
Been testing some changes on phatk with the KernelAnalyzer and my own personal testing.

Using a VECTORS2 example,
Code:
bool result = W[117].x & W[117].y;

gives a lot of false positives, changing it to
Code:
bool result = min(W[117].x,W[117].y);

is guaranteed to give yummy results!  Grin

(same ALU #ops and fetch, no false positives on the next 'if')  Cool
See now this is dangerous. Do you REALLY  know how fast the "min" function is on all SDKs? Don't expect AMD to do the right thing and to guarantee it's as fast as &.

min(x,y) http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/commonMin.html
gets implemented low-level as
Code:
w: MIN_UINT    R0.w,  R0.x,  PV1350.y
, which *should*  (I know, AMD... Roll Eyes) be rather stable. The big problem with the alternative (&) is the huge number of false positives, since it's bitwise, like 01010011 & 10101100 = 00000000, which is bad for the branch predictor. I'm testing now with a conservative approach (just this one change from default),
Code:
#elif defined VECTORS2
bool result = min(W[117].x,W[117].y);
if (!result) {
if (!W[117].x)
output[FOUND] = output[NFLAG & W[3].x] = W[3].x;
if (!W[117].y)
output[FOUND] = output[NFLAG & W[3].y] = W[3].y;
}
and got a slight (3~4MH/s) increase (5850, SDK 2.5 from Cat 11.11).
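
To make the difference between the two combined tests concrete, here is a minimal host-side sketch in plain C (not kernel code). It enumerates small bit widths and counts how often each test reports "maybe a hit" (the combined value is zero) even though neither input is zero. The function names are made up for illustration, and a toy enumeration like this says nothing about real false-positive rates or speed on a GPU.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C stand-in for OpenCL's min() on unsigned ints. */
static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Count pairs (a, b), both NON-zero and `bits` wide, for which the
 * bitwise-AND test still evaluates to zero, i.e. a false positive. */
static unsigned long false_hits_and(unsigned bits)
{
    assert(bits >= 1 && bits <= 12);   /* keep the enumeration small */
    unsigned long n = 0;
    uint32_t limit = 1u << bits;
    for (uint32_t a = 1; a < limit; a++)
        for (uint32_t b = 1; b < limit; b++)
            if ((a & b) == 0)
                n++;
    return n;
}

/* Same count for the min() test: min(a, b) is zero only if a or b is zero,
 * so with both inputs non-zero this can never fire. */
static unsigned long false_hits_min(unsigned bits)
{
    assert(bits >= 1 && bits <= 12);
    unsigned long n = 0;
    uint32_t limit = 1u << bits;
    for (uint32_t a = 1; a < limit; a++)
        for (uint32_t b = 1; b < limit; b++)
            if (min_u32(a, b) == 0)
                n++;
    return n;
}
```

For 8-bit inputs, false_hits_and(8) comes out at 6050 (3^8 pairs with no common bits, minus the 511 pairs that contain a real zero), while false_hits_min(8) is 0.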
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
SDK 2.4:
GPU 1:  51.5C 1569RPM | 375.7/375.7Mh/s | A: 98 R:0 HW:0 U:  4.86/m I:10
GPU 2:  55.0C 1569RPM | 375.7/375.7Mh/s | A: 97 R:0 HW:0 U:  4.81/m I:10

SDK 2.1:
GPU 0:  82.5C 3840RPM | 375.6/375.4Mh/s | A:457 R:2 HW:0 U: 5.27/m I:10
GPU 1:  82.5C 3840RPM | 375.4/375.4Mh/s | A:477 R:0 HW:0 U: 5.50/m I:10
So it seems that 2.4 is very slightly better at 300 memclocks and I 10, 1 thread.

I would say the difference is below noise levels, so I would say they perform identically on that hardware/software combo.
member
Activity: 121
Merit: 10
Can you try 950 / 300 on SDK 2.1 and 2.4 and see what the difference is ( make sure to delete the bins etc. ) ?

Maybe also try 960 core / 300 memory ?

What OS btw ?

Thanks !

Sorry, can't. The cards are in different computers, and the 5870 on the SDK 2.4 machine is on an extender, which slows down the hashrate somewhat.

What I can do however, is give you the difference between a 5970 @ 810/300 on 2.4 and a 5970 at the same clocks on 2.1.

SDK 2.4:
GPU 1:  51.5C 1569RPM | 375.7/375.7Mh/s | A: 98 R:0 HW:0 U:  4.86/m I:10
GPU 2:  55.0C 1569RPM | 375.7/375.7Mh/s | A: 97 R:0 HW:0 U:  4.81/m I:10

SDK 2.1:
GPU 0:  82.5C 3840RPM | 375.6/375.4Mh/s | A:457 R:2 HW:0 U: 5.27/m I:10
GPU 1:  82.5C 3840RPM | 375.4/375.4Mh/s | A:477 R:0 HW:0 U: 5.50/m I:10

You can multiply that by 960/810 or 950/810 to get a good estimate of a 5870's performance at 960 and 950 clocks, respectively.

So it seems that 2.4 is very slightly better at 300 memclocks and I 10, 1 thread. Haven't had time to test other settings, it could vary. Oh and OS is 64-bit Lubuntu.
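
The clock-ratio estimate above can be written out as a small C helper. This is only a sketch of that rule of thumb: it assumes hashrate scales linearly with core clock, which is the poster's assumption and not something measured here, and scale_hashrate is a made-up name.

```c
#include <assert.h>

/* Estimate hashrate at a different core clock, assuming linear scaling
 * with engine clock (rule of thumb from the post, not a measurement). */
static double scale_hashrate(double mhs, double from_mhz, double to_mhz)
{
    assert(from_mhz > 0.0);
    return mhs * to_mhz / from_mhz;
}
```

With the 375.7 Mh/s figure at 810 MHz, scale_hashrate(375.7, 810.0, 960.0) gives roughly 445.3 Mh/s and scale_hashrate(375.7, 810.0, 950.0) roughly 440.6 Mh/s.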
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
I've got a nice idea for VECTORS2 and the nonce-check ^^ ... so the chance to get 2 positive nonces within a single uint2 work-item is extremely small, right?
Will play around with it tomorrow and perhaps I'll do another commit for diakgcn.

Dia
The chance of getting a positive nonce is ALWAYS the same for each hash you do, no matter when you do it.

If a single thread is idle it is wasted.

Edit: and aborting all threads when you find a nonce means you on average double the overhead of setting up work.
(i.e. time wasted when the GPU could be mining)
Vbs
hero member
Activity: 504
Merit: 500
Thanks for this mate. This means that the probability of finding 2 hashes in the same vector is 1/(4.3e9*4.3e9), which is infinitesimally close to 1/inf ~= 0. This allows for a further optimization of the code. Using a VECTORS2 example,
Code:
#elif defined VECTORS2
bool result = min(W[117].x,W[117].y);
if (!result) {
if (!W[117].x)
output[FOUND] = output[NFLAG & W[3].x] = W[3].x;
else //if (!W[117].y)
output[FOUND] = output[NFLAG & W[3].y] = W[3].y;
}
Since min() takes care of the false positives, the 'else' branch is only true when W[117].y==0. The result in the KernelAnalyzer for a 5870 is:
Code:
phatk 120223 -> cycles: min:67.65, max:68.15, avg:67.82, alu:1363
phatk "new" -> cycles: min:67.65, max:67.90, avg:67.78, alu:1362

 Grin
This looks okay but it's in the output path so not hit very often so unlikely to make a demonstrable performance change :\

True, and the better the branch prediction works with "if (!result)", the less often it will be taken. I'll check how min() gets implemented at the low level.
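
For anyone who wants to poke at the if/else variant outside the kernel, here is a rough host-side emulation in plain C. hx/hy stand in for W[117].x/.y and nx/ny for W[3].x/.y. The FOUND and NFLAG values are assumptions picked for the sketch, not taken from the kernel source, and check_vec2_else / check_vec2_both are made-up names.

```c
#include <assert.h>
#include <stdint.h>

#define FOUND 0x80u   /* assumed index of the "found" slot */
#define NFLAG 0x7Fu   /* assumed mask selecting the nonce slot */

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* if/else variant: one combined test on the main path, then at most one
 * store. Returns the number of nonces written to `output`. */
static int check_vec2_else(uint32_t hx, uint32_t hy,
                           uint32_t nx, uint32_t ny, uint32_t *output)
{
    assert(output != 0);
    if (min_u32(hx, hy) != 0)
        return 0;                          /* main path: nothing found */
    if (hx == 0)
        output[FOUND] = output[NFLAG & nx] = nx;
    else                                   /* min == 0 and hx != 0, so hy == 0 */
        output[FOUND] = output[NFLAG & ny] = ny;
    return 1;
}

/* Two independent ifs: also stores ny when BOTH lanes hit. */
static int check_vec2_both(uint32_t hx, uint32_t hy,
                           uint32_t nx, uint32_t ny, uint32_t *output)
{
    assert(output != 0);
    int n = 0;
    if (min_u32(hx, hy) != 0)
        return 0;
    if (hx == 0) { output[FOUND] = output[NFLAG & nx] = nx; n++; }
    if (hy == 0) { output[FOUND] = output[NFLAG & ny] = ny; n++; }
    return n;
}
```

The only case where the two differ is when both lanes are zero at once: the else version stores nx and silently drops ny. That is the 1/(4.3e9*4.3e9) case being traded away.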
hero member
Activity: 518
Merit: 500
Great job on the 2.3.1! I gained some 1% or even a bit more with my 5k series cards on both SDK 2.1 and SDK 2.4 systems, compared to 2.3.0 on phatk kernel.   Smiley

What kernel is this ? Still phatk one ?

I got 5870s. Can memory still be underclocked to 300 and you get still good performance ?

Thanks !

phatk as I mentioned  Wink

And on 2.1 & 2.4 SDK, yes they can. Not sure about 2.6, never used that with 5870s.

Currently hashing away at 444.4MH/s on a 950/300 5870, SDK 2.1 -g 1 -I 10 -w 256 -v 2, although -g 1 is probably not a good idea, just something I've stuck with . SDK 2.4 gives me the same, perhaps even very slightly faster hashrate Smiley


Can you try 950 / 300 on SDK 2.1 and 2.4 and see what the difference is ( make sure to delete the bins etc. ) ?

Maybe also try 960 core / 300 memory ?

What OS btw ?

Thanks !
member
Activity: 121
Merit: 10
Great job on the 2.3.1! I gained some 1% or even a bit more with my 5k series cards on both SDK 2.1 and SDK 2.4 systems, compared to 2.3.0 on phatk kernel.   Smiley

What kernel is this ? Still phatk one ?

I got 5870s. Can memory still be underclocked to 300 and you get still good performance ?

Thanks !

phatk as I mentioned  Wink

And on 2.1 & 2.4 SDK, yes they can. Not sure about 2.6, never used that with 5870s.

Currently hashing away at 444.4MH/s on a 950/300 5870, SDK 2.1 -g 1 -I 10 -w 256 -v 2, although -g 1 is probably not a good idea, just something I've stuck with . SDK 2.4 gives me the same, perhaps even very slightly faster hashrate Smiley
hero member
Activity: 518
Merit: 500
Great job on the 2.3.1! I gained some 1% or even a bit more with my 5k series cards on both SDK 2.1 and SDK 2.4 systems, compared to 2.3.0 on phatk kernel.   Smiley

What kernel is this ? Still phatk one ?

I got 5870s. Can memory still be underclocked to 300 and you get still good performance ?

Thanks !
hero member
Activity: 772
Merit: 500
I've got a nice idea for VECTORS2 and the nonce-check ^^ ... so the chance to get 2 positive nonces within a single uint2 work-item is extremely small, right?
Will play around with it tomorrow and perhaps I'll do another commit for diakgcn.

Dia
member
Activity: 121
Merit: 10
Great job on the 2.3.1! I gained some 1% or even a bit more with my 5k series cards on both SDK 2.1 and SDK 2.4 systems, compared to 2.3.0 on phatk kernel.   Smiley
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Thanks for this mate. This means that the probability of finding 2 hashes in the same vector is 1/(4.3e9*4.3e9), which is infinitesimally close to 1/inf ~= 0. This allows for a further optimization of the code. Using a VECTORS2 example,
Code:
#elif defined VECTORS2
bool result = min(W[117].x,W[117].y);
if (!result) {
if (!W[117].x)
output[FOUND] = output[NFLAG & W[3].x] = W[3].x;
else //if (!W[117].y)
output[FOUND] = output[NFLAG & W[3].y] = W[3].y;
}
Since min() takes care of the false positives, the 'else' branch is only true when W[117].y==0. The result in the KernelAnalyzer for a 5870 is:
Code:
phatk 120223 -> cycles: min:67.65, max:68.15, avg:67.82, alu:1363
phatk "new" -> cycles: min:67.65, max:67.90, avg:67.78, alu:1362

 Grin
This looks okay but it's in the output path so not hit very often so unlikely to make a demonstrable performance change :\
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
Been testing some changes on phatk with the KernelAnalyzer and my own personal testing.

Using a VECTORS2 example,
Code:
bool result = W[117].x & W[117].y;

gives a lot of false positives, changing it to
Code:
bool result = min(W[117].x,W[117].y);

is guaranteed to give yummy results!  Grin

(same ALU #ops and fetch, no false positives on the next 'if')  Cool
See now this is dangerous. Do you REALLY  know how fast the "min" function is on all SDKs? Don't expect AMD to do the right thing and to guarantee it's as fast as &.
Vbs
hero member
Activity: 504
Merit: 500
Ok, one unrolled branch then!  Wink

Code:
#elif defined VECTORS2
          if (!W[117].x) {
               output[FOUND] = FOUND;
       output[NFLAG & W[3].x] = W[3].x;
            if (!W[117].y)
                         output[NFLAG & W[3].y] = W[3].y;
          }
          else if (!W[117].y) {
               output[FOUND] = FOUND;
       output[NFLAG & W[3].y] = W[3].y;
          }
Heh, you're not a coder are you? That's still two branches unless it's positive on the first branch.

To save ckolivas from more frustration maybe I can help.

vbs: AMD (and I assume Nvidia) GPUs take a horrible hit on branches. The total number of checks is irrelevant. What matters is the number of branches on the main path.

Only one in 4.3 billion hashes will be a share thus 99.999999976716935634613037109375% of the time any subsequent share checks are never executed.  Optimizing the path which occurs one in 4.3 billion executions is silly right?

We want to make the one that occurs 4.29999999999 billion out of 4.3 billion attempts as fast as possible. Given the massive hit that AMD GPUs take on branches (and I do mean massive; forget what you think you know about C++ compilers on x86 hardware), that means making the main path have as few branches as possible.

Neither of your code snippets do that.
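
The numbers behind this argument are easy to reproduce. The sketch below assumes a share corresponds to one 32-bit hash word being zero (the "one in 4.3 billion" figure) and that vector lanes are independent. The helper names are illustrative only.

```c
#include <assert.h>

/* 2^32: number of possible values of one 32-bit hash word. */
#define HASH_SPACE 4294967296.0

/* Probability that a single hash is a share candidate. */
static double p_share(void) { return 1.0 / HASH_SPACE; }

/* Probability that ALL `lanes` lanes of one vector hit at the same time,
 * assuming the lanes are independent. */
static double p_all_lanes_hit(int lanes)
{
    assert(lanes >= 1);
    double p = 1.0;
    for (int i = 0; i < lanes; i++)
        p *= p_share();
    return p;
}

/* Percentage of hashes that stay on the main (no-share) path. */
static double pct_main_path(void) { return 100.0 * (1.0 - p_share()); }
```

pct_main_path() reproduces the 99.99999997...% figure quoted above, and p_all_lanes_hit(2) is about 5.4e-20, the 1/(4.3e9*4.3e9) value for two hits in one uint2.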

Thanks for this mate. This means that the probability of finding 2 hashes in the same vector is 1/(4.3e9*4.3e9), which is infinitesimally close to 1/inf ~= 0. This allows for a further optimization of the code. Using a VECTORS2 example,
Code:
#elif defined VECTORS2
bool result = min(W[117].x,W[117].y);
if (!result) {
if (!W[117].x)
output[FOUND] = output[NFLAG & W[3].x] = W[3].x;
else //if (!W[117].y)
output[FOUND] = output[NFLAG & W[3].y] = W[3].y;
}
Since min() takes care of the false positives, the 'else' branch is only true when W[117].y==0. The result in the KernelAnalyzer for a 5870 is:
Code:
phatk 120223 -> cycles: min:67.65, max:68.15, avg:67.82, alu:1363
phatk "new" -> cycles: min:67.65, max:67.90, avg:67.78, alu:1362

 Grin
donator
Activity: 1218
Merit: 1079
Gerald Davis
VDDC: 1.084 V, VDDC current: 144 A so the card is using about 150W, got to measure wall power consumption.

Just a heads up.
VDDC isn't all power consumed by card.

There is also VDDCI (which handles things like PCIe interface, memory controller, and ancillary ASICS) and then a separate memory VRM which isn't adjustable/reported.

VDDC of 150W simply means the "cores" are using 150W.
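
As a quick check of the "about 150W" figure, the core-rail power is just voltage times current. This sketch covers only the VDDC rail from the quoted readings. VDDCI and the memory VRM are not included because no numbers for them were given, and rail_power_watts is a made-up name.

```c
#include <assert.h>

/* Power drawn on one rail: P = V * I. Covers only the rail whose
 * voltage and current are passed in. */
static double rail_power_watts(double volts, double amps)
{
    assert(volts >= 0.0 && amps >= 0.0);
    return volts * amps;
}
```

rail_power_watts(1.084, 144.0) gives about 156.1 W for the cores alone, so the draw at the wall will be higher.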
Vbs
hero member
Activity: 504
Merit: 500
Been testing some changes on phatk with the KernelAnalyzer and my own personal testing.

Using a VECTORS2 example,
Code:
bool result = W[117].x & W[117].y;

gives a lot of false positives, changing it to
Code:
bool result = min(W[117].x,W[117].y);

is guaranteed to give yummy results!  Grin

(same ALU #ops and fetch, no false positives on the next 'if')  Cool
newbie
Activity: 28
Merit: 0
Excellent update,  Running FASTer +2-3% AND cooler!  So happy!

Linux 11.04 SDK 2.4 cgminer 2.3.1

single rig open air no risers

GPU 0 5870   850/300 392 Mh/s I:9 82.5C
GPU 1 5770   850/300 193 Mh/s I:9 84.0C
GPU 2 6950   840/900 369 Mh/s I:11 73.0C

Efficiency 90%

Total avg 954.8 Mh/s  Was getting 925 with 2.2.7

Running for 2 hours now on Deepbit w/ Triplemining fallback pools



-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
I wonder what settings people are running their 7970s with. Not for a few hours, but days :-)
The weather's hot here at the moment, but..

GPU 0: 718.2 / 713.3 Mh/s | A:5180  R:16  HW:0  U:10.00/m  I:11
74.0 C  F: 79% (4532 RPM)  E: 1200 MHz  M: 1050 Mhz  V: 1.170V  A: 99% P: 5%
Last initialised: [2012-02-24 17:38:34]
Intensity: 11
Thread 0: 357.7 Mh/s Enabled ALIVE
Thread 1: 360.4 Mh/s Enabled ALIVE

Running flat out since the day I installed it a couple of weeks back (note the +5% powertune as well).

Thanks. I will try those. Fan is on auto, right?  
Which kernel, what options?
You need to confirm your GPU will actually run at those speeds. Every card has different top stability levels.
--auto-gpu --auto-fan -I 11 --gpu-engine 450-1200 --gpu-memdiff -150 --gpu-powertune 5

This is driver 8.921 on 64-bit Linux with GL SYNC enabled. This means it ends up being -k poclbm -w 64 -v 1. On Windows you will not be able to run that high an intensity without running into high CPU usage issues (probably -I 9 is max), and there's no way to enable GL SYNC that anyone's aware of.
-ck
legendary
Activity: 4088
Merit: 1631
Ruu \o/
I wonder what settings people are running their 7970s with. Not for a few hours, but days :-)
The weather's hot here at the moment, but..

GPU 0: 718.2 / 713.3 Mh/s | A:5180  R:16  HW:0  U:10.00/m  I:11
74.0 C  F: 79% (4532 RPM)  E: 1200 MHz  M: 1050 Mhz  V: 1.170V  A: 99% P: 5%
Last initialised: [2012-02-24 17:38:34]
Intensity: 11
Thread 0: 357.7 Mh/s Enabled ALIVE
Thread 1: 360.4 Mh/s Enabled ALIVE

Running flat out since the day I installed it a couple of weeks back (note the +5% powertune as well).
full member
Activity: 164
Merit: 100
Hi,
I am running a 5870, and with 2.3.1-2 I get ~415Mh/s, approximately the same result as with
2.2.3 (which was a bit better than 2.2.6).

When I started the new version it was OK, but after a while, when I connected to the console, it looked like this.
Notice the inflated hashrates:
--------------------------------------------------------------------------------
 (5s):5218.6(avg):4154.5Mh/s | Q:171  A:2956 R:5  HW:0  E:173%  U:5.547m
 TQ: 2  ST: 3  SS: 1  DW: 13  NB: 8  LW: 4335 GF: 0  RF: 0
 Connected to http://xxx:xxxx with LP as user xxx.xxx1
 Block: 0000002e559399ea9c7e863264a387ce...  Started: [06:01:52]
--------------------------------------------------------------------------------
 [P]ool management [G]PU management [S]ettings [D]isplay options [Q]uit
 GPU 0: 417.20/414.5h/s | A:296 R:5 HW:0 U: 5.547m I:10
--------------------------------------------------------------------------------

After a while the averages went back to 414.7, but the 5s average has not changed back (yet).

//GoK