[ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 34.

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

there will be cleanup to do in the next version, so... nice for the final binary size

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on December 17, 2014, 06:03:10 AM

yes, this code is wrong.

__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   uint2 result;
   result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
   result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
   return result;
}

This one should work:
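__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   // 64-bit rotate left on two 32-bit halves; valid as written for 0 < n < 64, n != 32
   // (n == 0 or n == 32 would shift a 32-bit word by 32, which C/C++ leaves undefined)
   uint2 result;
   if (n <= 32) {
      // each half keeps its own low bits and takes the other half's top n bits
      result.y = ((v.y << (n)) | (v.x >> (32 - n)));
      result.x = ((v.x << (n)) | (v.y >> (32 - n)));
   }
   else {
      // rotating by more than 32 swaps the halves, then rotates by n - 32
      result.y = ((v.x << (n - 32)) | (v.y >> (64 - n)));
      result.x = ((v.y << (n - 32)) | (v.x >> (64 - n)));
   }
   return result;
}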

There is a way to do it without a conditional statement, but I haven't looked into it yet.
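For illustration only, here is one possible way to write the rotate without a conditional. This is a sketch and not necessarily the approach djm34 had in mind. The type `Word64`, the helpers `rol2_branchless` and `check_rol2`, and the convention that `x` is the low half and `y` the high half are assumptions made for this example. It selects the halves with a mask instead of an `if`, and splits the right shift in two so that no 32-bit word is ever shifted by 32. The code is plain C++ so it can be checked on the host, and the `HD` macro lets the same functions build as device code under nvcc.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper macro: host+device under nvcc, nothing on a plain host compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Stand-in for CUDA's uint2 (assumed layout: x = low 32 bits, y = high 32 bits).
struct Word64 { uint32_t x; uint32_t y; };

// Rotate the 64-bit value y:x left by n (taken modulo 64) without any branch.
HD inline Word64 rol2_branchless(Word64 v, unsigned n) {
    // mask is all ones when bit 5 of n is set (n in 32..63), otherwise zero
    const uint32_t mask = 0u - ((n >> 5) & 1u);
    // rotating by 32 is just a swap of the halves; do it with XOR and the mask
    const uint32_t diff = (v.x ^ v.y) & mask;
    const uint32_t lo = v.x ^ diff;
    const uint32_t hi = v.y ^ diff;
    const unsigned s = n & 31u;  // remaining rotate amount, 0..31
    Word64 r;
    // (w >> 1) >> (31 - s) equals w >> (32 - s) for s >= 1 and gives 0 for s == 0
    r.x = (lo << s) | ((hi >> 1) >> (31u - s));
    r.y = (hi << s) | ((lo >> 1) >> (31u - s));
    return r;
}

// Host-side check against a plain 64-bit rotate for every n in 0..63.
inline void check_rol2() {
    const uint64_t samples[] = {0x0123456789ABCDEFull, 0x8000000000000001ull,
                                0xFFFFFFFF00000000ull};
    for (uint64_t v : samples) {
        for (unsigned n = 0; n < 64; ++n) {
            const uint64_t want = n ? (v << n) | (v >> (64u - n)) : v;
            const Word64 in = {(uint32_t)v, (uint32_t)(v >> 32)};
            const Word64 out = rol2_branchless(in, n);
            assert((((uint64_t)out.y << 32) | out.x) == want);
        }
    }
}
```

When n is a compile-time constant (as it usually is in hash round functions), the compiler can fold the mask and the shifts away, so whether this beats the if/else version on a given GPU would have to be measured.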

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

yes, this code is wrong.

__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   uint2 result;
   result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
   result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
   return result;
}

djm34

Quote from: sp_ on December 17, 2014, 05:45:41 AM

Quote from: djm34 on December 17, 2014, 05:31:36 AM

Quote from: ?? on ??

Quote from: scryptr on December 17, 2014, 01:18:03 AM

NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr

I wonder how that will do - I just cleared 610kh/s on 290X.

current status is: "not yet"
actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help Grin)
If there was a bounty for nvidia that would definitely help (greedy me...) Grin

But for the moment, I am still looking into improvements to lyra... once I am done with that, maybe...

The groestl(256) in the lyra implementation is different. Is it faster than the groestl in x11 (512)? I guess it needs some more work to support 512 rounds?

it is based on post killer groestl code... the only reason it is faster than groestl 512 is that it does less mixing (by definition small_core versus big_core in the sph definition). In principle a killer groestl256 (bit slicing) should be faster... but it isn't obvious how to write it, even from the existing code...

I also tried a 64bit implementation with uint2, but it wasn't faster either...

I think I should get uint2 working for compute 3.0, there was a bug (actually it was just wrong Grin) in the uint2 rotation without funnelshift...

sp_

Quote from: djm34 on December 17, 2014, 05:31:36 AM

Quote from: ?? on ??

Quote from: scryptr on December 17, 2014, 01:18:03 AM

NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr

I wonder how that will do - I just cleared 610kh/s on 290X.

current status is: "not yet"
actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help Grin)
If there was a bounty for nvidia that would definitely help (greedy me...) Grin

But for the moment, I am still looking into improvements to lyra... once I am done with that, maybe...

The groestl(256) in the lyra implementation is different. Is it faster than the groestl in x11 (512)? I guess it needs some more work to support 512 rounds?

djm34

Quote from: ?? on ??

Quote from: scryptr on December 17, 2014, 01:18:03 AM

NEOSCRYPT-

DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr

I wonder how that will do - I just cleared 610kh/s on 290X.

current status is: "not yet"
actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help Grin)
If there was a bounty for nvidia that would definitely help (greedy me...) Grin

But for the moment, I am still looking into improvements to lyra... once I am done with that, maybe...

cbuchner1

hero member

Activity: 756

Merit: 502

Quote from: ZeroFossilFuel on December 16, 2014, 10:37:37 PM

Christian,

A while back I posted a question regarding performance gains to ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04) so I compiled v1.2 with Cuda 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch! Grin

Z

I think the third party code forks around here have even more hash rate improvements.

I've reduced my involvement in the mining scene a lot since last summer.

The last thing I did was a GPU miner for CoinShield's CPU channel (together with ChrisH), but that was never made public. Also it looks like CoinShield was quite a dud (i.e. the coin's value is not taking off and the feature rollout of the wallet is really slow)

sp_

Quote from: Epsylon3 on December 16, 2014, 07:27:25 AM

Quote from: sp_ on December 16, 2014, 06:20:24 AM

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11.

There we are !) good job... i see 2950 with your current git on linux and 2905 on mine, i think i will grab skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow
We were talking about uint2 possible improvements some days ago with djm on irc... nice you made it :p

Check the head on github. I committed the echo improvement (about 50KHASH). You should now get around 3MHASH@x11 on the stock clocked windforce black 750ti.

scryptr

legendary

Activity: 1797

Merit: 1028

NEOSCRYPT-

DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr

ZeroFossilFuel

sr. member

Activity: 346

Merit: 250

Christian,

A while back I posted a question regarding performance gains to ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04) so I compiled v1.2 with Cuda 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch! Grin

Z

djm34

Quote from: sp_ on December 16, 2014, 08:38:41 AM

Quote from: djm34 on December 16, 2014, 08:20:44 AM

I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(actually not sure it will help a lot, the amd is a lot stronger than the nvidia on 64bit calculation, that's why it makes such a difference on nvidia)

In BMW I added a couple of operators: - and SHL/SHR for the uint2 type. The funnelshift on nvidia only calculates 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code it means that a ROL64 is 4-5 instructions while a SHL64 is 1 instruction.
When working in uint2 mode, a lot of the 32bit->64bit merge code is removed. This means that the ROL64 is down to 2 instructions. For bmw that caused the final routine to use fewer registers (no memory spills).
But 33% faster was a surprise for me, since bmw has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).

Yes, but the number of uint32 operations per cycle is more than twice the number of uint64 operations (I think... something like that; a table on this was posted recently in the thread).

sp_

Quote from: djm34 on December 16, 2014, 08:20:44 AM

I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(actually not sure it will help a lot, the amd is a lot stronger than the nvidia on 64bit calculation, that's why it makes such a difference on nvidia)

In BMW I added a couple of operators: - and SHL/SHR for the uint2 type. The funnelshift on nvidia only calculates 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code it means that a ROL64 is 4-5 instructions while a SHL64 is 1 instruction.
When working in uint2 mode, a lot of the 32bit->64bit merge code is removed. This means that the ROL64 is down to 2 instructions. For bmw that caused the final routine to use fewer registers (no memory spills).
But 33% faster was a surprise for me, since bmw has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).
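As a rough illustration of the point about funnel shifts, the sketch below shows a 64-bit rotate and a 64-bit left shift on a pair of 32-bit halves, each built from 32-bit funnel shifts. It is not sp_'s actual BMW code. The names `Word64`, `funnel_l`, `rol2_funnel`, `shl2` and `check_funnel_ops` are hypothetical, and `x` = low half, `y` = high half is an assumed layout. On the device the wrapper calls the CUDA intrinsic `__funnelshift_l` (guarded here for compute capability 3.2 and up, which as far as I know is where the hardware funnel shift appears); elsewhere it falls back to an emulation so the logic can be checked on the host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper macro: host+device under nvcc, nothing on a plain host compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Stand-in for CUDA's uint2 (assumed layout: x = low 32 bits, y = high 32 bits).
struct Word64 { uint32_t x; uint32_t y; };

// Funnel shift left: concatenate hi:lo, shift left by (shift & 31), keep the upper 32 bits.
HD inline uint32_t funnel_l(uint32_t lo, uint32_t hi, uint32_t shift) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 320
    return __funnelshift_l(lo, hi, shift);  // one hardware instruction
#else
    const uint64_t wide = ((uint64_t)hi << 32) | lo;  // emulation for host / older GPUs
    return (uint32_t)((wide << (shift & 31u)) >> 32);
#endif
}

// 64-bit rotate left: two funnel shifts, with the halves swapped when n >= 32.
// With a constant n the branch disappears at compile time.
HD inline Word64 rol2_funnel(Word64 v, unsigned n) {
    Word64 r;
    if (n & 32u) {
        r.y = funnel_l(v.y, v.x, n);
        r.x = funnel_l(v.x, v.y, n);
    } else {
        r.y = funnel_l(v.x, v.y, n);
        r.x = funnel_l(v.y, v.x, n);
    }
    return r;
}

// 64-bit shift left for n in 0..31: one funnel shift for the high half, one plain shift.
HD inline Word64 shl2(Word64 v, unsigned n) {
    Word64 r;
    r.y = funnel_l(v.x, v.y, n);
    r.x = v.x << n;
    return r;
}

// Host-side check against native 64-bit arithmetic.
inline void check_funnel_ops() {
    const uint64_t v = 0x0123456789ABCDEFull;
    const Word64 in = {(uint32_t)v, (uint32_t)(v >> 32)};
    for (unsigned n = 0; n < 64; ++n) {
        const uint64_t want = n ? (v << n) | (v >> (64u - n)) : v;
        const Word64 got = rol2_funnel(in, n);
        assert((((uint64_t)got.y << 32) | got.x) == want);
    }
    for (unsigned n = 0; n < 32; ++n) {
        const Word64 got = shl2(in, n);
        assert((((uint64_t)got.y << 32) | got.x) == (v << n));
    }
}
```

The rotate ends up as two funnel shifts with no code to merge or split a 64-bit register, while the shift becomes two 32-bit operations instead of one 64-bit one, which matches the trade-off described in the post.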

djm34

Quote from: ?? on ??

Quote from: Epsylon3 on December 16, 2014, 07:27:25 AM

Quote from: sp_ on December 16, 2014, 06:20:24 AM

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11.

There we are !) good job... i see 2950 with your current git on linux and 2905 on mine, i think i will grab skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow

We were talking about uint2 possible improvements some days ago with djm on irc... nice you made it :p

I did the uint2 shit on AMD - little help. Replacing one rotate in Blake2b helps, though.

I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(actually not sure it will help a lot, the amd is a lot stronger than the nvidia on 64bit calculation, that's why it makes such a difference on nvidia)

Epsylon3

Quote from: sp_ on December 16, 2014, 06:20:24 AM

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11.

There we are !) good job... i see 2950 with your current git on linux and 2905 on mine, i think i will grab skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow

We were talking about uint2 possible improvements some days ago with djm on irc... nice you made it :p

djm34

Quote from: sp_ on December 16, 2014, 06:20:24 AM

Quote from: djm34 on December 16, 2014, 06:04:51 AM

Quote from: sp_ on December 16, 2014, 05:01:11 AM

Quote from: djm34 on December 13, 2014, 08:06:21 AM

Quote from: sp_ on December 13, 2014, 05:15:50 AM

Quote from: djm34 on December 12, 2014, 09:39:11 AM

That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algo are already pretty fast. The main problem with uint2 is that it does not work at all with compute 30 (the logic should work but it seems there is a problem with register allocation...)

The uint2 keccak made x11 20-25 KHASH faster on a 750ti.

that's what I call "almost anything"

I implemented it now: bmw +32%, keccak +40%, skein +30%. x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.

not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11.

For the 970 and 980, if you use compute 5.2, it can be a bit tricky (register allocation is tricky with uint2 on compute 5.2)

djm34

Quote from: Epsylon3 on December 16, 2014, 05:26:42 AM

me, i was not paid by the vtc team to do it... but somebody was not able to compile it on linux... that's also why i imported it.

The next time you don't want me to use published code (published by VTC, not you), tell us before we commit it... not 2 days after. I was planning to release it today, but... the default difficulty setting is still not clear... 128 and 256 seem ok on most pools i've tested.

It isn't that I don't want you to publish my code, but I prefer to be the first (aside from the VTC team release...) to do it, for obvious reasons (my own publicity obviously... I guess this one is really obvious Grin)

The way it happened (or my perception of events): I sent the code to the VTC team for the test release/release candidate before going to bed. There were a few rounds of corrections before they decided to go to sleep, and when I woke up the next morning someone (you) had published his own version of my code, with no clear reference to the author, on another forum where I don't have an account. Sorry, but without enough coffee it looks pretty sneaky, and with more coffee it still looks sneaky. (And I am pretty sure you would have reacted the same way if someone had done that to you.)

Then I told you to wait (or asked jk_14 to ask you to wait, as I didn't have an account on litecointalk) because it wasn't finished, and as it was a test release it is way better if people report problems from only one release and don't start using code we don't know... (I also told you to wait on the irc channel, as I was planning to move the code to your interface.)

Then later in the week you sent your code (actually the same code, while the RC had already incorporated new parts) to the VTC team with no clear intention, since they asked me what they should do with it. That time it just looked like you were trying to go behind my back on the job I was hired to do... Grin

That's why I feel a bit pissed off about the whole situation, and I am pretty sure you would have felt the same if you had been in my situation.
I don't think all this was intentional from you, but it was kind of careless...

anyway...

Regarding the difficulty, 128 or 256 should do it; just check the pool hashrate and use the one which gives the result closest to the real hashrate.

sp_

Quote from: djm34 on December 16, 2014, 06:04:51 AM

Quote from: sp_ on December 16, 2014, 05:01:11 AM

Quote from: djm34 on December 13, 2014, 08:06:21 AM

Quote from: sp_ on December 13, 2014, 05:15:50 AM

Quote from: djm34 on December 12, 2014, 09:39:11 AM

That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algo are already pretty fast. The main problem with uint2 is that it does not work at all with compute 30 (the logic should work but it seems there is a problem with register allocation...)

The uint2 keccak made x11 20-25 KHASH faster on a 750ti.

that's what I call "almost anything"

I implemented it now: bmw +32%, keccak +40%, skein +30%. x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.

not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11.

djm34

Quote from: sp_ on December 16, 2014, 05:01:11 AM

Quote from: djm34 on December 13, 2014, 08:06:21 AM

Quote from: sp_ on December 13, 2014, 05:15:50 AM

Quote from: djm34 on December 12, 2014, 09:39:11 AM

That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algo are already pretty fast. The main problem with uint2 is that it does not work at all with compute 30 (the logic should work but it seems there is a problem with register allocation...)

The uint2 keccak made x11 20-25 KHASH faster on a 750ti.

that's what I call "almost anything"

I implemented it now: bmw +32%, keccak +40%, skein +30%. x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.

not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...

Epsylon3

me, i was not paid by the vtc team to do it... but somebody was not able to compile it on linux... that's also why i imported it.

The next time you don't want me to use published code (published by VTC, not you), tell us before we commit it... not 2 days after. I was planning to release it today, but... the default difficulty setting is still not clear... 128 and 256 seem ok on most pools i've tested.

fredeq

legendary

Activity: 1537

Merit: 1005

Quote from: djm34 on December 15, 2014, 09:39:50 PM

Quote from: fredeq on December 15, 2014, 08:19:43 PM

Check out this new version for VTC http://cryptomining-blog.com/4049-updated-windows-binary-of-the-ccminer-1-5-1-git-fork-by-sp-for-maxwell/

It isn't a new version, it is (again) the old test version which doesn't support compute 3.0 Roll Eyes

and he wonders why I was unhappy that he released the test code...
"include some djm34 code addition" I like the irony: "some addition" here consists of the whole lyra2re code. Grin

(band of vultures Grin and next time cryptomining will say I haven't done anything in a long time, like the last time...)

newest version here: Lyra2RE with support for compute 3.0 https://github.com/djm34/ccminer
includes m7 (xcn) code optimized for the 9xx series: 980: 12.8MH/s 780: 13.5MH/s

Sorry for the confusion djm. It yielded better results than the miner listed in the VTC thread, so I assumed it was indeed new Wink

Thanks for the explanation.

Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 34. (Read 3427002 times)