Topic: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP - page 3.

member
Activity: 75
Merit: 10
I'm still on it, with a different idea. As it turns out, doing Fermat tests on the GPU is not a no-brainer, and getting it fast requires too much effort for now, so I'll try to port something else to the GPU.

I'm still sure a GPU miner is possible, but right now I would say it's a lot harder than for the other coins. The other OpenCL miner project is (amusingly!) also having problems.
sr. member
Activity: 294
Merit: 250
Has anyone tested this yet? Is it working?
hero member
Activity: 532
Merit: 500
It's abandoned. Lol. Probably everyone figured out that this is too difficult. Heck, even mlmrt was having trouble.

LIES! He's managed to match the efficiency of an AMD multi-core.

With an AMD multi-core + an HD 6990.

WOW, we can spend ~194 watts running an AMD multi-core, or ~525 watts running the AMD plus a GPU, and get the same results!
I WANT I WANT I WANT I WANT!
sr. member
Activity: 336
Merit: 250
Cuddling, censored, unicorn-shaped troll.
It's abandoned. Lol. Probably everyone figured out that this is too difficult. Heck, even mlmrt was having trouble.

LIES! He's managed to match the efficiency of an AMD multi-core.

With an AMD multi-core + an HD 6990.
sr. member
Activity: 406
Merit: 250
It's abandoned. Lol. Probably everyone figured out that this is too difficult. Heck, even mlmrt was having trouble.
full member
Activity: 122
Merit: 100
Is this project still active or has it been abandoned?
member
Activity: 75
Merit: 10
Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0; I'm using 5.5 now, which works for me.
Just curious, have you looked at the Mfaktc source code at all? While it's used for trial-factoring Mersenne primes, which may not be directly helpful, the author did get it to sieve entirely on the GPU, which might be.

I looked into it, yes. The code is not very easy to follow, though...
full member
Activity: 145
Merit: 100
I've updated to CUDA 5.5 (and driver 319.21).

Running with cuda-gdb I get the following error:

Code:
Have 2400 candidates after main loop
Cuda start!
[New Thread 0x7fffacc38700 (LWP 14248)]
[Context Create of context 0x7fff700234f0 on Device 0]
[Launch of CUDA Kernel 0 (runPrimeCandidateSearch<<<(25,1,1),(192,1,1)>>>) on Device 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (15,0,0), thread (0,0,0), device 0, sm 3, warp 0, lane 0]
0x00007fff7091b760 in long_multiplication(unsigned int * @generic, unsigned int * @generic, unsigned int * @generic, unsigned int, unsigned int) (
    product=0x3fff6b4, op1=0x3fff734, op2=0x3fff634, num_digits=17,
    prod_capacity=1073741824)
    at primecoin/src/cuda/digit.h:406
406     product[i] = 0;
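The fault above lands in the library's schoolbook long multiplication at digit.h:406 (the "product[i] = 0;" zeroing loop), and the reported prod_capacity of 1073741824 (2^30) looks more like an uninitialized field than a real buffer size, so the product buffer or its capacity is probably already wrong by the time the kernel runs. Purely as a hypothetical sketch (limb layout, names and the return convention are assumptions, not the actual digit.h code), a routine of that shape with a capacity guard would fail cleanly instead of writing out of bounds:

Code:
// Hypothetical sketch only; not the real primecoin/src/cuda/digit.h code.
// Limbs are 32-bit, little-endian; the product of two num_digits-limb
// numbers needs up to 2*num_digits limbs.
__device__ bool long_multiplication(unsigned int *product,
                                    const unsigned int *op1,
                                    const unsigned int *op2,
                                    unsigned int num_digits,
                                    unsigned int prod_capacity)
{
    if (prod_capacity < 2 * num_digits)
        return false;                          // refuse instead of corrupting memory

    for (unsigned int i = 0; i < 2 * num_digits; ++i)
        product[i] = 0;                        // the line that faulted in the trace above

    for (unsigned int i = 0; i < num_digits; ++i) {
        unsigned long long carry = 0;
        for (unsigned int j = 0; j < num_digits; ++j) {
            unsigned long long t = (unsigned long long)op1[i] * op2[j]
                                 + product[i + j] + carry;
            product[i + j] = (unsigned int)t;  // keep the low 32 bits
            carry = t >> 32;                   // propagate the high 32 bits
        }
        product[i + num_digits] = (unsigned int)carry;
    }
    return true;
}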
sr. member
Activity: 406
Merit: 250
Fascinating. This CUDA miner is already vaguely functional? Now that's some community effort. I wonder what the eventual result of this will be. Will fast CPUs and GPUs working together be the new mining rigs?
full member
Activity: 210
Merit: 100
I don't use any kind of messenger; beware of scammers.
Having the same problem as K1773R (GTX 670 using CUDA 5.5 and the driver it includes). Tried it on mainnet as I still cannot connect to testnet for some reason.

Code:
Have 101 candidates after main loop
Cuda start!
{... some block messages i.e. getblocks -1 to blah, accept etc}
Have -1 candidates after main loop
Cuda+host test round finished with -1 candidates (0 host chain tests)
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, unspecified launch failure
ERROR: PrimecoinMiner() : primorial minimum overflow
ERROR: PrimecoinMiner() : primorial minimum overflow
ERROR: PrimecoinMiner() : primorial minimum overflow
ERROR: PrimecoinMiner() : primorial minimum overflow
ERROR: PrimecoinMiner() : primorial minimum overflow
ERROR: PrimecoinMiner() : primorial minimum overflow
ERROR: PrimecoinMiner() : primorial minimum overflow
ERROR: PrimecoinMiner() : primorial minimum overflow

from GDB
Code:
[0] start! 
sizeof(struct) = 400
mpz_print:mpz_capacity: 0
[0] string candidate is 
[0] N is: mpz_capacity: 30 ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[0] E is: mpz_capacity: 30 fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe

Edit: This may just be a PEBKAC/RTFM issue on my part. Just saw your note about running with 1 CPU only.

Still crashed, but I managed to get a few rounds in; it used to crash right away.
Code:
2013-07-25 19:51:56 primemeter         0 prime/h    498885 test/h         0 5-chains/h
2013-07-25 19:52:56 primemeter         0 prime/h   8404040 test/h         0 5-chains/h
2013-07-25 19:53:56 primemeter         0 prime/h   4184750 test/h         0 5-chains/h
hero member
Activity: 532
Merit: 500
Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0; I'm using 5.5 now, which works for me.
Just curious, have you looked at the Mfaktc source code at all? While it's used for trial-factoring Mersenne primes, which may not be directly helpful, the author did get it to sieve entirely on the GPU, which might be.
full member
Activity: 210
Merit: 100
I don't use any kind of messenger; beware of scammers.
Just got this compiled. (Talk about a mess: when my CUDA SDK was installed, the paths were completely different than they should have been, /nvidia-304 vs /nvidia-current etc., plus some fun Qt conflicts.)

Anyone have a working node for testnet they can post? Not having any luck connecting.
legendary
Activity: 1792
Merit: 1008
/dev/null
Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0; I'm using 5.5 now, which works for me.
ACK, will do later and report back ;)
hero member
Activity: 675
Merit: 514
Would it make any difference if we used __restrict__ pointers in the CUDA code?
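For reference, the CUDA spelling of the qualifier is __restrict__, and it can only help where the pointers genuinely never alias; the compiler may then cache loads in registers more aggressively and, on newer GPUs, route const loads through the read-only data cache. A minimal illustrative kernel (made-up name and arguments, not from the miner's source) showing where the qualifier goes:

Code:
// Illustrative only; not a kernel from this repository.
// __restrict__ promises nvcc that out, a and b never overlap.
__global__ void add_limbs(unsigned int *__restrict__ out,
                          const unsigned int *__restrict__ a,
                          const unsigned int *__restrict__ b,
                          int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i];   // compiler may now assume no aliasing between the arrays
}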
member
Activity: 75
Merit: 10
Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0; I'm using 5.5 now, which works for me.
full member
Activity: 224
Merit: 100
More than willing to help perform tests as instructed if a Windows binary is posted. Got an old GTX 475 rattling around that I could put to work.
legendary
Activity: 1792
Merit: 1008
/dev/null
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computation may very well give good speed-ups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as well on the GPU. However, I think if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available; the student library I found and started to extend uses the most basic ones.

mpz_powmod needs fast multiplication of big integers: GMP's algorithm is most likely O(n log n), while the schoolbook multiplication the GPU currently uses is O(n^2). I hoped that for the ~400-bit numbers involved it wouldn't make such a difference. Currently, the new version in my repo does Fermat tests on the GPU (at a rate of 10 per second), but my CPU is still faster due to better algorithms and a better bignum implementation.

But don't worry, I won't give up so fast! The current situation is that I either need to look into porting better algorithms to the GPU, or do something other than Fermat tests on the GPU to sieve candidates (e.g. trial division with the most common primes).

Anybody with a better GPU than the GeForce 570 Ti I own, please test this! My stats (base version is still hp4):

2013-07-24 21:53:38 primemeter     24303 prime/h    490729 test/h        47 5-chains/h

prime/h and test/h fluctuate enormously and seem rather meaningless. As most tests run on the GPU, I have no idea whether they're even being measured correctly. 5-chains/h is accurate, though.

You have to use setgenerate true 1, i.e. one CPU thread for mining.  
Running the current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU; neither heat nor memory usage increases when using the Qt client. Anything special to activate so it mines with the GPU? I've got a powerful GPU to test with ;)

EDIT:
Code:
2013-07-25 13:35:43 primemeter         0 prime/h   34261932 test/h         0 5-chains/h
Seems the miner thread which should launch the CUDA kernel is borked?

EDIT2:
Code:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated
from debug.log

You can also run it with -printmining -printtoconsole to see that output directly. Could you compile the CUDA portion with -G -g (change the Qt project file where it invokes nvcc) and give me the output of cuda-memcheck?

You can also #define CUDA_DEBUG in the .cu file to see the GPU printfs on the console.
I was already running with -g, just waiting for the "Cuda start!" message; stopped it now and recompiled with -D CUDA_DEBUG.
EDIT: It's up and running, waiting for the CUDA init + crash ;)
EDIT2: Why does it take so long until the miner starts the CUDA thread? That seems stupid :S
EDIT3: Here we go, it crashed :)
debug.log
Code:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, unspecified launch failure
stdout
Code:
[0] start! 
sizeof(struct) = 400
mpz_print:mpz_capacity: 0
[0] string candidate is 
[0] N is: mpz_capacity: 30 ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[0] E is: mpz_capacity: 30 fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe
gdb: don't want to spam, sending it via PM since the message is too big -.-
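The quoted post above floats trial division with the most common primes as a GPU-side pre-filter instead of full Fermat tests. A minimal sketch of that idea under assumptions (one candidate per thread, little-endian 32-bit limbs, a precomputed table of small primes; none of the names below come from the repository): each thread drops a candidate as soon as any small prime divides it, so only survivors reach the expensive tests.

Code:
// Sketch only; data layout and names are assumptions, not the repo's structures.
#define NUM_LIMBS 13   // enough room for the ~400-bit candidates discussed above

__global__ void trial_division_filter(const unsigned int *candidates, // NUM_LIMBS limbs each
                                      unsigned char *is_composite,
                                      int num_candidates,
                                      const unsigned int *primes,     // small odd primes
                                      int num_primes)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= num_candidates)
        return;

    const unsigned int *n = &candidates[c * NUM_LIMBS];
    for (int p = 0; p < num_primes; ++p) {
        // Fold in the remainder limb by limb, most significant limb first.
        unsigned long long r = 0;
        for (int i = NUM_LIMBS - 1; i >= 0; --i)
            r = ((r << 32) | n[i]) % primes[p];
        if (r == 0) {              // a small prime divides n: reject it here
            is_composite[c] = 1;
            return;
        }
    }
    is_composite[c] = 0;           // survivor: hand it to the Fermat test
}

Each candidate then costs num_primes * NUM_LIMBS 64-bit modulo operations, which is cheap next to a full modular exponentiation.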
legendary
Activity: 1713
Merit: 1029
If someone could give me some specific compilation directions (or a Windows binary!), I can test on a 780. :)
member
Activity: 75
Merit: 10
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computation may very well give good speed-ups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as well on the GPU. However, I think if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available; the student library I found and started to extend uses the most basic ones.

mpz_powmod needs fast multiplication of big integers: GMP's algorithm is most likely O(n log n), while the schoolbook multiplication the GPU currently uses is O(n^2). I hoped that for the ~400-bit numbers involved it wouldn't make such a difference. Currently, the new version in my repo does Fermat tests on the GPU (at a rate of 10 per second), but my CPU is still faster due to better algorithms and a better bignum implementation.

But don't worry, I won't give up so fast! The current situation is that I either need to look into porting better algorithms to the GPU, or do something other than Fermat tests on the GPU to sieve candidates (e.g. trial division with the most common primes).

Anybody with a better GPU than the GeForce 570 Ti I own, please test this! My stats (base version is still hp4):

2013-07-24 21:53:38 primemeter     24303 prime/h    490729 test/h        47 5-chains/h

prime/h and test/h fluctuate enormously and seem rather meaningless. As most tests run on the GPU, I have no idea whether they're even being measured correctly. 5-chains/h is accurate, though.

You have to use setgenerate true 1, i.e. one CPU thread for mining.  
Running the current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU; neither heat nor memory usage increases when using the Qt client. Anything special to activate so it mines with the GPU? I've got a powerful GPU to test with ;)

EDIT:
Code:
2013-07-25 13:35:43 primemeter         0 prime/h   34261932 test/h         0 5-chains/h
Seems the miner thread which should launch the CUDA kernel is borked?

EDIT2:
Code:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated
from debug.log

You can also run it with -printmining -printtoconsole to see that output directly. Could you compile the CUDA portion with -G -g (change the Qt project file where it invokes nvcc) and give me the output of cuda-memcheck?

You can also #define CUDA_DEBUG in the .cu file to see the GPU printfs on the console.
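To make the cost argument in the quoted post concrete: a base-2 Fermat test computes 2^(n-1) mod n by square-and-multiply, so a ~400-bit n means roughly 400 modular squarings, and with schoolbook multiplication each squaring costs O(k^2) limb products for k limbs, which is where the GPU currently loses to GMP. A single-limb sketch of that loop (illustrative only, not the miner's multi-limb mpz_powmod):

Code:
// Illustrative only; the real code works on multi-limb mpz-style numbers.
// One squaring per exponent bit (plus a multiply for each set bit), so the
// speed of the underlying big-integer multiplication dominates the test.
__host__ __device__ unsigned int powmod32(unsigned int base,
                                          unsigned int exp,
                                          unsigned int mod)
{
    unsigned long long result = 1 % mod;
    unsigned long long b = base % mod;
    while (exp > 0) {
        if (exp & 1)
            result = result * b % mod;   // multiply step
        b = b * b % mod;                 // squaring step
        exp >>= 1;
    }
    return (unsigned int)result;
}

// Base-2 Fermat probable-prime check: n passes if 2^(n-1) == 1 (mod n).
__host__ __device__ bool fermat_base2(unsigned int n)
{
    return n == 2 || (n > 2 && powmod32(2, n - 1, n) == 1);
}

With 32-bit limbs, a ~400-bit number is 13 limbs, so each schoolbook multiplication already costs about 169 limb products per step of that loop, versus a single hardware multiply in the 64-bit sketch above.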
legendary
Activity: 1792
Merit: 1008
/dev/null
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computation may very well give good speed-ups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as well on the GPU. However, I think if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available; the student library I found and started to extend uses the most basic ones.

mpz_powmod needs fast multiplication of big integers: GMP's algorithm is most likely O(n log n), while the schoolbook multiplication the GPU currently uses is O(n^2). I hoped that for the ~400-bit numbers involved it wouldn't make such a difference. Currently, the new version in my repo does Fermat tests on the GPU (at a rate of 10 per second), but my CPU is still faster due to better algorithms and a better bignum implementation.

But don't worry, I won't give up so fast! The current situation is that I either need to look into porting better algorithms to the GPU, or do something other than Fermat tests on the GPU to sieve candidates (e.g. trial division with the most common primes).

Anybody with a better GPU than the GeForce 570 Ti I own, please test this! My stats (base version is still hp4):

2013-07-24 21:53:38 primemeter     24303 prime/h    490729 test/h        47 5-chains/h

prime/h and test/h fluctuate enormously and seem rather meaningless. As most tests run on the GPU, I have no idea whether they're even being measured correctly. 5-chains/h is accurate, though.

You have to use setgenerate true 1, i.e. one CPU thread for mining.  
Running the current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU; neither heat nor memory usage increases when using the Qt client. Anything special to activate so it mines with the GPU? I've got a powerful GPU to test with ;)

EDIT:
Code:
2013-07-25 13:35:43 primemeter         0 prime/h   34261932 test/h         0 5-chains/h
Seems the miner thread which should launch the CUDA kernel is borked?

EDIT2:
Code:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated
from debug.log
After that message it segfaults; going to debug with gdb ;)