
Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 34. (Read 3426980 times)

legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
There will be some cleanup to do in the next version, so... nice for the final binary size
legendary
Activity: 1400
Merit: 1050
yes, this code is wrong.

__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   uint2 result;
   result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
   result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
   return result;
}

This one should work:
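__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   // v.x and v.y are the two 32-bit halves of one 64-bit word.
   // Note: n == 0 and n == 32 shift a 32-bit value by 32 bits,
   // which plain C leaves undefined.
   uint2 result;
   if (n <= 32) {
      // bits shifted out of one half enter the bottom of the other
      result.y = ((v.y << (n)) | (v.x >> (32 - n)));
      result.x = ((v.x << (n)) | (v.y >> (32 - n)));
   }
   else {
      // more than 32: the halves swap, then rotate by n - 32
      result.y = ((v.x << (n - 32)) | (v.y >> (64 - n)));
      result.x = ((v.y << (n - 32)) | (v.x >> (64 - n)));
   }
   return result;
}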

There is a way to do it without a conditional statement, but I haven't looked into it yet.
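For what it's worth, here is a minimal sketch of how a rotate without the if/else could look. It is not taken from djm34's or sp_'s code: the names `ROL2_branchless`, `funnel_l`, `check_rol2` and the `HD` macro are made up for illustration, and it assumes `x` holds the low word and `y` the high word. The idea is to swap the halves when bit 5 of `n` is set (a rotate by 32), then rotate by `n & 31` with a funnel-style shift written so that no 32-bit value is ever shifted by 32. It is plain C++ that should also build under nvcc.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins so the sketch builds without nvcc (assumption: under nvcc the
// built-in uint2 type and the __host__/__device__ qualifiers are used).
#ifdef __CUDACC__
#define HD __host__ __device__ __forceinline__
#else
#define HD inline
struct uint2 { unsigned int x, y; };   // assumed layout: x = low, y = high
#endif

// Upper 32 bits of the 64-bit value (hi:lo) shifted left by s, for s in 0..31.
// Splitting the right shift into (>> 1) and (>> (31 - s)) keeps every shift
// count below 32, so s == 0 needs no special case.
HD uint32_t funnel_l(uint32_t lo, uint32_t hi, uint32_t s)
{
    return (hi << s) | ((lo >> 1) >> (31u - s));
}

// Hypothetical condition-free 64-bit rotate left on a uint2, for n in 0..63.
HD uint2 ROL2_branchless(const uint2 v, const int n)
{
    const uint32_t s = static_cast<uint32_t>(n) & 31u;
    const bool swap = (n & 32) != 0;        // rotating by 32 swaps the halves
    const uint32_t lo = swap ? v.y : v.x;   // selects, no if/else
    const uint32_t hi = swap ? v.x : v.y;
    uint2 r;
    r.x = funnel_l(hi, lo, s);              // new low word
    r.y = funnel_l(lo, hi, s);              // new high word
    return r;
}

// Plain 64-bit rotate, used only as a reference for checking.
inline uint64_t rol64_ref(uint64_t v, int n)
{
    n &= 63;
    return n ? (v << n) | (v >> (64 - n)) : v;
}

// Compares the uint2 rotate against the reference for every n in 0..63.
inline void check_rol2(uint64_t value)
{
    for (int n = 0; n < 64; ++n) {
        uint2 v;
        v.x = static_cast<uint32_t>(value);
        v.y = static_cast<uint32_t>(value >> 32);
        const uint2 r = ROL2_branchless(v, n);
        const uint64_t got = (static_cast<uint64_t>(r.y) << 32) | r.x;
        assert(got == rol64_ref(value, n));
    }
}
```

Calling `check_rol2(0x0123456789ABCDEFULL)` on the host walks through all 64 rotate counts. Whether the two selects really end up as select instructions rather than branches is up to the compiler, and on cards with a funnel shifter the `funnel_l` helper is the kind of operation the `__funnelshift_l` intrinsic is meant to cover, so treat this as a starting point for experiments rather than tuned code.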
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
yes, this code is wrong.

__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   uint2 result;
   result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
   result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
   return result;
}
legendary
Activity: 1400
Merit: 1050
NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr
I wonder how that will do - I just cleared 610kh/s on 290X.
current status is: "not yet"
actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help  Grin)
If there was a bounty for nvidia that would definitely help  (greedy me...) Grin
But for the moment, I am still looking into improvements to lyra... once I am done with that, maybe...

The groestl(256) in the lyra implementation is different. Is it faster than the groestl in x11 (512)? I guess it needs some more work to support 512 rounds?
it is based on post killer groestl code... the only reason it is faster than groestl 512 is that it does less mixing (by definition small_core versus big_core in sph definition). In principle a killer groestl256 (bit slicing) should be faster... but it isn't obvious to write even from the existing code...

I also tried a 64bit implementation with uint2, but it wasn't faster either...

I think I should get uint2 working for compute 3.0, there was a bug (actually it was just wrong Grin) in the uint2 rotation without funnelshift...
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr
I wonder how that will do - I just cleared 610kh/s on 290X.
current status is: "not yet"
actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help  Grin)
If there was a bounty for nvidia that would definitely help  (greedy me...) Grin
But for the moment, I am still looking into improvements to lyra... once I am done with that, maybe...

The groestl(256) in the lyra implementation is different. Is it faster than the groestl in x11 (512)? I guess it needs some more work to support 512 rounds?
legendary
Activity: 1400
Merit: 1050
NEOSCRYPT-

DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr

I wonder how that will do - I just cleared 610kh/s on 290X.
current status is: "not yet"
actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help  Grin)
If there was a bounty for nvidia that would definitely help  (greedy me...) Grin
But for the moment, I am still looking into improvements to lyra... once I am done with that, maybe...
hero member
Activity: 756
Merit: 502
Christian,

A while back I posted a question regarding performance gains to ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04) so I compiled v1.2 with Cuda 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch!  Grin

Z


I think the third party code forks around here have even more hash rate improvements.

I've reduced my involvement in the mining scene a lot since last summer.

The last thing I did was a GPU miner for CoinShield's CPU channel (together with ChrisH), but that was never made public. Also it looks like CoinShield was quite a dud (i.e. the coin's value is not taking off and the feature rollout of the wallet is really slow)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Not sure. Will test this weekend on the 970 and 980.
Note that with the last improvements, and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11. Smiley
There we are !) good job... i see 2950 with your current git on linux and 2905 on mine, i think i will grab skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow
We were talking about uint2 possible improvements some days ago with djm on irc... nice you made it :p

Check the head on github. I committed the echo improvement worth 50 KHASH. You should now get around 3MHASH@x11 on the stock clocked windforce black 750ti.
legendary
Activity: 1797
Merit: 1028
NEOSCRYPT-

DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr
sr. member
Activity: 346
Merit: 250
Christian,

A while back I posted a question regarding performance gains to ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04) so I compiled v1.2 with Cuda 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch!  Grin

Z
legendary
Activity: 1400
Merit: 1050
I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(actually not sure it will help a lot, the amd is a lot stronger than the nvidia on 64bit calculation, that's why it makes such a difference on nvidia)

In BMW I added a couple of operators: - and SHL/SHR for the uint2 types. The funnelshift on nvidia only calculates 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code it means that a ROL64 takes 4-5 instructions while a SHL64 is 1 instruction.
When working in uint2 mode, a lot of the 32bit->64bit merge code is removed. This means that the ROL64 is down to 2 instructions. For bmw that caused the final routine to use fewer registers (no memory spills).
But 33% faster was a surprise for me, since the bmw has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).
yes, but the number of uint32 operations per cycle is larger than twice the number of uint64 operations (I think... something like that, a table on that was posted recently in the thread)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(actually not sure it will help a lot, the amd is a lot stronger than the nvidia on 64bit calculation, that's why it makes such a difference on nvidia)

In BMW I added a couple of operators: - and SHL/SHR for the uint2 types. The funnelshift on nvidia only calculates 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code it means that a ROL64 takes 4-5 instructions while a SHL64 is 1 instruction.
When working in uint2 mode, a lot of the 32bit->64bit merge code is removed. This means that the ROL64 is down to 2 instructions. For bmw that caused the final routine to use fewer registers (no memory spills).
But 33% faster was a surprise for me, since the bmw has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).
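To make the "more than one instruction per shift" point concrete, here is a rough sketch of what 64-bit shift helpers on a uint2 might look like. This is not sp_'s actual BMW code: `SHL2`, `SHR2`, `check_shifts` and the `HD` macro are hypothetical names, `x` is assumed to be the low word and `y` the high word, and only shift counts from 0 to 31 are handled. Each helper shifts both halves and carries the bits that cross the 32-bit boundary, which is the extra work a native 64-bit shift does not show at source level.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins so the sketch builds without nvcc (assumption: under nvcc the
// built-in uint2 type and the __host__/__device__ qualifiers are used).
#ifdef __CUDACC__
#define HD __host__ __device__ __forceinline__
#else
#define HD inline
struct uint2 { unsigned int x, y; };   // assumed layout: x = low, y = high
#endif

// Hypothetical 64-bit logical shift left on a uint2, for n in 0..31.
HD uint2 SHL2(const uint2 v, const unsigned n)
{
    uint2 r;
    r.x = v.x << n;
    // top n bits of the low word move into the bottom of the high word;
    // the split shift keeps the count below 32 even when n == 0
    r.y = (v.y << n) | ((v.x >> 1) >> (31u - n));
    return r;
}

// Hypothetical 64-bit logical shift right on a uint2, for n in 0..31.
HD uint2 SHR2(const uint2 v, const unsigned n)
{
    uint2 r;
    // bottom n bits of the high word move into the top of the low word
    r.x = (v.x >> n) | ((v.y << 1) << (31u - n));
    r.y = v.y >> n;
    return r;
}

// Compares both helpers against native 64-bit shifts for n in 0..31.
inline void check_shifts(uint64_t value)
{
    uint2 v;
    v.x = static_cast<uint32_t>(value);
    v.y = static_cast<uint32_t>(value >> 32);
    for (unsigned n = 0; n < 32; ++n) {
        const uint2 l = SHL2(v, n);
        const uint2 r = SHR2(v, n);
        assert(((static_cast<uint64_t>(l.y) << 32) | l.x) == (value << n));
        assert(((static_cast<uint64_t>(r.y) << 32) | r.x) == (value >> n));
    }
}
```

On hardware with a funnel shifter the carry term could presumably be folded into a single funnel shift per half, which would match the "2 instructions instead of 1" count mentioned above, but that part is an assumption and not something this sketch measures.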
legendary
Activity: 1400
Merit: 1050
Not sure. Will test this weekend on the 970 and 980.
Note that with the last improvements, and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11. Smiley

There we are !) good job... i see 2950 with your current git on linux and 2905 on mine, i think i will grab skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow

We were talking about uint2 possible improvements some days ago with djm on irc... nice you made it :p

I did the uint2 shit on AMD - little help. Replacing one rotate in Blake2b helps, though.
I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(actually not sure it will help a lot, the amd is a lot stronger than the nvidia on 64bit calculation, that's why it makes such a difference on nvidia)
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
Not sure. Will test this weekend on the 970 and 980.
Note that with the last improvements, and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11. Smiley

There we are !) good job... i see 2950 with your current git on linux and 2905 on mine, i think i will grab skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow

We were talking about uint2 possible improvements some days ago with djm on irc... nice you made it :p
legendary
Activity: 1400
Merit: 1050
That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algo are already pretty fast.  The main problem with uint2 is that it does not work at all with compute 30 (the logic should work but it seems there is a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti.
that's what I call "almost anything"
I implemented it now: Bmw +32%, keccak +40%, skein +30%; x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.
not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...

Not sure. Will test this weekend on the 970 and 980.
Note that with the last improvements, and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11. Smiley
For the 970 and 980, if you use compute 5.2, it can be a bit tricky (register allocation is awkward with uint2 and compute 5.2)
legendary
Activity: 1400
Merit: 1050
me, I was not paid by the vtc team to do it... but somebody was not able to compile it on linux... it's also why I imported it.

The next time you don't want me to use published code (by VTC not you) tell us before we commit it... not 2 days after. I was planning to release it today, but... the default difficulty setting is still not clear... 128 and 256 seem ok on most pools I've tested.

It isn't that I don't want you to publish my code, but I prefer to be the first (aside from the VTC team release...) to do it for obvious reason (my own publicity obviously... I guess this one is really obvious Grin)

The way it happened (or my perception of the event): I sent the code to the vtc team for the test release/release candidate before going to bed, had a few rounds of corrections before they decided to go to sleep, and when I woke the next morning someone (you) had published his own version of my code, with no clear reference to the author, on another forum where I don't have an account. Sorry, but without enough coffee it looks pretty sneaky, and with more coffee it still looks sneaky. (and I am pretty sure you would have reacted in the same way, if someone had done that to you).

Then, I told you to wait (or asked jk_14 to ask you to wait, as I didn't have an account on litecointalk) because it wasn't finished, and as it was a test release it is way better if people report problems from only one release and don't start to use code we don't know... (I also told you to wait on the irc channel, as I was planning to move the code to your interface)

Then later in the week you sent your code (actually the same, while the RC had already incorporated new parts) to the vtc team with no clear intention since they asked me what they should do with that. That time it was just looking like you were trying to go behind my back on the job I was hired to do...  Grin

That's why I feel a bit pissed off about the whole situation, and I am pretty sure you would have felt the same if you had been in my situation.
I don't think all this was intentional from you, but it was kind of careless...

anyway...

Regarding the difficulty, 128 or 256 should do it: just check the pool hashrate and use the one which gives the result closest to the real hashrate.


sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algo are already pretty fast.  The main problem with uint2 is that it does not work at all with compute 30 (the logic should work but it seems there is a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti.
that's what I call "almost anything"
I implemented it now: Bmw +32%, keccak +40%, skein +30%; x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.
not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...

Not sure. Will test this weekend on the 970 and 980.
Note that with the last improvements, and my last echo change (which is not checked in yet), a windforce 750ti black edition should do above 3MHASH on stock clocks mining x11. Smiley
legendary
Activity: 1400
Merit: 1050
That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algo are already pretty fast.  The main problem with uint2 is that it does not work at all with compute 30 (the logic should work but it seems there is a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti.
that's what I call "almost anything"

I implemented it now: Bmw +32%, keccak +40%, skein +30%; x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.


not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
me, I was not paid by the vtc team to do it... but somebody was not able to compile it on linux... it's also why I imported it.

The next time you don't want me to use published code (by VTC not you) tell us before we commit it... not 2 days after. I was planning to release it today, but... the default difficulty setting is still not clear... 128 and 256 seem ok on most pools I've tested.
legendary
Activity: 1537
Merit: 1005
It isn't a new version, it is (again) the old test version which doesn't support compute 3.0 Roll Eyes  and he wonders why I was unhappy that he released the test code...
"include some djm34 code addition"  I like the irony: "some addition" here consists in the whole lyra2re code.  Grin
(band of vultures  Grin and next time cryptomining will say I haven't done anything in a long time like the last time... )

newest version here: Lyra2RE with support for compute 3.0 https://github.com/djm34/ccminer
includes m7 (xcn) code optimized for the 9xx series: 980: 12.8MH/s 780: 13.5MH/s

Sorry for the confusion djm. It yielded better results than the miner listed in the VTC thread, so I assumed it's indeed new Wink

Thanks for the explanation.