
Topic: [XPM] So, why is porting primecoin mining to the GPU so difficult? - page 3. (Read 18723 times)

sr. member
Activity: 490
Merit: 250
Could you explain what the primecoin algorithm actually does?

Why doesn't primecoin, for example, just read some known prime numbers from a database, check whether any of them form a Cunningham chain candidate and ... offer the result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains...?

If you could describe this in layman's terms, I'd be very happy.

+1
legendary
Activity: 1901
Merit: 1024
You sum up the problem quite nicely, but I have to disagree on one point. The Fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors, you will still spend the majority of the time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data of varying length with many conditional branches, as is the case for the primality tests. But then again, I am not sure if the overall speed gain will actually be worth the increased electricity bill.
mtrlt already accomplished sieving on the GPU and Fermat tests on the CPU, and it came out to the same speed as the CPU alone. Data flow between GPU and CPU is also a bottleneck.

If this is true, then only CPU + integrated GPU might get "some" benefit... data flow should be a lot better
sr. member
Activity: 282
Merit: 250
I would offer 5 BTC to the first to provide me with a working GPU miner that the public does not know about. It will need to be at least 10x faster than a CPU.
mtrlt has already been donated over 75 BTC and he has not been able to accomplish more than a ~2x speed increase at last report. Good luck with that.
alex p, better to donate that BTC to the development process.

Let me tell you something: people like you are bad for the ecosystem, people who only want to have the best thing delivered to them first and pay for it afterward, but don't want to contribute during the development process.

I have a quote for you:

Code:
"if you are absent during my struggle, dont expect my presence during my success"
sr. member
Activity: 434
Merit: 250
Could you explain what the primecoin algorithm actually does?

Why doesn't primecoin, for example, just read some known prime numbers from a database, check whether any of them form a Cunningham chain candidate and ... offer the result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains...?

If you could describe this in layman's terms, I'd be very happy.
sr. member
Activity: 490
Merit: 250
I am wondering if, in primecoin, it just finds a probability of primality, so one can start from 1 and go to the next, the next, the next and test each one. Am I wrong?
hero member
Activity: 532
Merit: 500
You sum up the problem quite nicely, but I have to disagree on one point. The Fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors, you will still spend the majority of the time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data of varying length with many conditional branches, as is the case for the primality tests. But then again, I am not sure if the overall speed gain will actually be worth the increased electricity bill.
mtrlt already accomplished sieving on the GPU and Fermat tests on the CPU, and it came out to the same speed as the CPU alone. Data flow between GPU and CPU is also a bottleneck.
hero member
Activity: 532
Merit: 500
I would offer 5 BTC to the first to provide me with a working GPU miner that the public does not know about. It will need to be at least 10x faster than a CPU.
mtrlt has already been donated over 75 BTC and he has not been able to accomplish more than a ~2x speed increase at last report. Good luck with that.
member
Activity: 97
Merit: 10
I would offer 5 BTC to the first to provide me with a working GPU miner that the public does not know about. It will need to be at least 10x faster than a CPU.
member
Activity: 75
Merit: 10
...
I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?
Redacted that statement, but still, even if you made that 25% sieving 1000 times faster, you would still need that other 75% of the time and wouldn't mine much faster.
It's 75% sieving, not 25%. If you make the sieve 10000 times faster, you can eliminate 99.9999% of candidates that need testing by filtering more prime divisors. The purpose of the sieve is to avoid having to do the slow tests, so improving the sieve does actually make more sense.

Ah, sorry, I need some sleep... But you'd still have diminishing returns for sieving, as filtering by 2x as many primes won't filter out 2x more candidates.
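
For a feel of those diminishing returns, here is a rough back-of-the-envelope sketch (a toy estimate, nothing to do with the actual miner code): the fraction of candidates that survives sieving is roughly the product of (1 - 1/p) over the primes used, so each additional, larger prime removes less and less.
Code:
# Rough illustration of diminishing returns when sieving with more primes.
# A prime p removes roughly a fraction 1/p of the candidates still standing,
# so the surviving fraction is about the product of (1 - 1/p).

def primes_up_to(limit):
    """Simple sieve of Eratosthenes, good enough for this estimate."""
    flags = [True] * (limit + 1)
    flags[0] = flags[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if flags[i]:
            flags[i * i :: i] = [False] * len(range(i * i, limit + 1, i))
    return [i for i, is_prime in enumerate(flags) if is_prime]

surviving = 1.0
for count, p in enumerate(primes_up_to(1_000_000), start=1):
    surviving *= 1.0 - 1.0 / p
    if count in (5_000, 10_000, 25_000, 50_000):
        print(f"{count:>6} sieve primes: ~{surviving * 100:.2f}% of candidates left")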
newbie
Activity: 51
Merit: 0
This is a great thread to show the complexity of the problem...

Though, out of curiosity, why are people wanting a GPU miner so badly for Primecoin anyway? I kinda enjoy a CPU only coin, since it lets me mine different coins with my graphics card and my processor at the same time...
People mainly just wanted to chase after the large short-term profits a GPU may provide (assuming the increase in performance was sufficiently large). There was also the argument that a GPU miner would open up the coin to more people and drive up the price, but I can't wrap my head around that logic when everybody already has a CPU. All in all, I don't see any advantage to releasing a GPU miner at this point, and I think that it will just lead to a higher difficulty, a lower market price, or both.
newbie
Activity: 39
Merit: 0
...
I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?
Redacted that statement, but still, even if you made that 25% sieving 1000 times faster, you would still need that other 75% of the time and wouldn't mine much faster.
It's 75% sieving, not 25%. If you make the sieve 10000 times faster, you can eliminate 99.9999% of candidates that need testing by filtering more prime divisors. The purpose of the sieve is to avoid having to do the slow tests, so improving the sieve does actually make more sense.
member
Activity: 75
Merit: 10

The hot spot in a primecoin miner is testing numbers for primality by using Fermat tests (with base 2). Porting the sieve to the GPU is almost pointless, because only 1% of the computing time is spent there.

I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?

Redacted that statement, but still, even if you made that 75% sieving 1000 times faster, you would still need that other 25% of the time and wouldn't mine much faster.
full member
Activity: 126
Merit: 100
This is a great thread to show the complexity of the problem...

Though, out of curiosity, why are people wanting a GPU miner so badly for Primecoin anyway? I kinda enjoy a CPU only coin, since it lets me mine different coins with my graphics card and my processor at the same time...
legendary
Activity: 1901
Merit: 1024
Maybe some low-end GPU, even an integrated one, can help in the sieve part, like Haswell HD4600 OpenCL or AMD integrated graphics?
newbie
Activity: 12
Merit: 0

The hot spot in a primecoin miner is testing numbers for primality by using Fermat tests (with base 2). Porting the sieve to the GPU is almost pointless, because only 1% of the computing time is spent there.

I was under the impression that the sieving part of the algorithm was taking up much more than 1% of the computing time. For example, currently my "Sieve/Test ratio" is at 75%/25%... or am I confusing things?

newbie
Activity: 39
Merit: 0
You sum up the problem quite nicely, but I have to disagree on one point. The Fermat tests are not the bottleneck. They were at the beginning, but using highly optimized mpn routines we can test a few thousand chains (not single primes) per second by now. If you sieve for chains of length 9 with only 50000 prime factors, you will still spend the majority of the time in the sieve. Of course we are speaking of an optimized sieve routine, not the original one. Sieving time will get even higher when the difficulty rises to 10.

I personally would recommend doing a hybrid by porting only the sieve to the GPU and doing the primality tests on the CPU. The GPU has faster memory access and segmented sieves will likely work well. The CPU is good at processing data of varying length with many conditional branches, as is the case for the primality tests. But then again, I am not sure if the overall speed gain will actually be worth the increased electricity bill.
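
To give a feel for what the sieve stage actually does, here is a stripped-down sketch (toy parameters and heavy simplification, not the optimized mpn-based routine): for every small prime, find the offsets in the current window whose candidate it divides and cross them off, so only the survivors go on to the Fermat tests.
Code:
# Stripped-down segmented-sieve sketch (illustration only): candidates are potential
# chain starts of the form base + i*step. A candidate with a small prime factor can
# never be prime, so it is crossed off instead of being Fermat-tested. The real miner
# also sieves the later elements of both Cunningham chain kinds, not just the start.

def sieve_window(base, step, window, primes):
    candidate_ok = [True] * window
    for p in primes:
        if step % p == 0:
            continue  # if p divides the step, divisibility by p does not depend on i
        i0 = (-base * pow(step, -1, p)) % p   # smallest i with base + i*step ≡ 0 (mod p)
        for i in range(i0, window, p):
            candidate_ok[i] = False           # base + i*step is divisible by p
    return [i for i in range(window) if candidate_ok[i]]

# toy numbers: an arbitrary odd base, a small primorial step and a tiny prime list
survivors = sieve_window(base=1234567, step=2 * 3 * 5 * 7 * 11 * 13, window=100_000,
                         primes=[17, 19, 23, 29, 31, 37, 41, 43, 47, 53])
print(len(survivors), "of 100000 offsets left for Fermat testing")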
member
Activity: 75
Merit: 10
TL;DR: some explanations as to why there is currently still no GPU miner for primecoins


Some of you may know me from the other GPU miner thread, where I tried to accelerate the primecoin miner with CUDA.

While I have no idea how mtrlt is doing on his port, I can only imagine that he encounters similar problems. A primecoin miner has two stages: filtering candidates (sieving) and testing numbers for primality using Fermat tests (with base 2). The latest default settings are: on testnet you spend 30% of the time sieving and 70% testing candidates, on mainnet 70% sieving and 30% testing. On mainnet you will filter out 97% of all candidates you started with by sieving.

In order to produce a meaningful speedup, you will have to port both stages to the GPU or make one of them very small. Let's say you keep the original settings, port the sieve to the GPU and interleave Fermat tests on the CPU. In this scenario your speedup is at most ~3x, because you still have to do that 30% candidate testing. OK, you can sieve much faster now, so why not sieve with 10x more primes? You already filter 97%, and each new prime will not kill more than 1/p more candidates. The low-hanging fruit is the smaller primes, and you have already used them. So: diminishing returns on the extra computation as these primes get bigger. At some point, even if you sieve 100x faster, it won't be worth the extra computation and just checking whether that number is prime with a Fermat test makes more sense. (>90% of all Fermat tests still come back composite after sieving, that is, the start of that possible chain is definitely composite. Only for the remaining <10% do you check whether you hit a chain of some type.) Bottom line: to produce meaningful speedups with GPU mining, you will have to port both the sieve and the Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.
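
The "at most ~3x" figure is just the usual serial-fraction arithmetic; a few lines make it explicit (the 70/30 split is the mainnet default mentioned above, the sieve speedup factors are made up):
Code:
# Serial-fraction arithmetic behind the "at most ~3x" estimate: if only the sieve
# (70% of the time) gets faster, the remaining 30% of testing caps the overall gain.

def overall_speedup(sieve_share, test_share, sieve_speedup):
    return 1.0 / (sieve_share / sieve_speedup + test_share)

for factor in (2, 10, 100, 1_000_000):
    print(f"sieve {factor:>7}x faster -> miner {overall_speedup(0.70, 0.30, factor):.2f}x faster")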

OK, why is porting the Fermat test to the GPU so hard?

That Fermat test is actually very simple in principle; it's a one-liner in Python:

2==pow(2,n,n)

In Python, it can also easily be used with bigger numbers:

>>> n = 8229677002348755372716479379024478187948307769969295540186468240943252147753090395716866121321015192338179630120010069556821511582091
>>> 2 == pow(2,n,n)
True
>>>
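
To connect that one-liner to chains, here is a toy sketch (illustration only, not the miner's actual chain evaluation, which also handles chains of the second kind, bi-twin chains and fractional chain lengths) that counts how far a Cunningham chain of the first kind extends from a candidate start:
Code:
# Toy chain check: extend a Cunningham chain of the first kind, n, 2n+1, 4n+3, ...,
# for as long as every element passes the base-2 Fermat test shown above.

def fermat_probable_prime(n):
    return pow(2, n, n) == 2   # true for every odd prime, rarely true for composites

def first_kind_chain_length(n):
    length = 0
    while fermat_probable_prime(n):
        length += 1
        n = 2 * n + 1
    return length

print(first_kind_chain_length(89))   # 89, 179, 359, 719, 1439, 2879 -> length 6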

But a lot of magic is running under the hood, which makes the Fermat test quite a complex problem for GPUs: the numbers involved are bigger than what fits into standard 32/64-bit values, and 2^n can't be evaluated directly. For the first problem, you use e.g. seven 64-bit integers to represent the above number n and emulate multiplication and addition on them with 64-bit instructions.
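
In a few lines of Python, this is what "emulate multiplication and addition" means in practice (illustration only; the limb count matches the seven-limb example, and real miner code does this in C/assembly or CUDA rather than Python):
Code:
# Toy multi-precision addition: a big number is stored as several 64-bit limbs,
# least significant first, and addition has to propagate a carry from limb to limb.
# On a 64-bit CPU each limb addition (with carry) is a single native instruction.

LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(n, count=7):
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]

def from_limbs(limbs):
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & LIMB_MASK)   # the low 64 bits stay in this limb
        carry = s >> LIMB_BITS      # the overflow is carried into the next limb
    return out

n = 2**440 + 123456789   # any number that fits into seven limbs, like the n in the session above
assert from_limbs(add_limbs(to_limbs(n), to_limbs(n))) == 2 * n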

A fast algorithm is known for the second problem, called modular exponentiation (http://en.wikipedia.org/wiki/Modular_exponentiation). The algorithm is simple, but there are many more complex tricks to speed up the computation. For example, with a Montgomery reduction (which, at least on a CPU, is faster than a Barrett reduction but only works for odd moduli) no divisions are required, which is great, since multiplications are much faster than divisions. Memory can be traded for faster execution time by precomputing some values, which, however, won't make much sense on a GPU.
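
For reference, the textbook square-and-multiply version of that algorithm fits in a few lines (no Montgomery or Barrett tricks, purely for illustration):
Code:
# Plain left-to-right square-and-multiply: one squaring per exponent bit plus one
# extra multiplication per set bit, all reduced mod n. Real miners do the same thing
# with Montgomery multiplication on limb arrays instead of Python integers.

def modexp(base, exp, mod):
    result = 1
    for bit in bin(exp)[2:]:                   # walk the exponent bits, MSB first
        result = (result * result) % mod       # always square
        if bit == "1":
            result = (result * base) % mod     # multiply only when the bit is set
    return result

n = 1000003                                    # a prime, so the test should return 2
assert modexp(2, n, n) == pow(2, n, n) == 2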

So doing many Fermat tests means evaluating many independent modular exponentiations. At first, doing many independent modular exponentiations seems to be a great fit for a GPU, since each calculation could be done independently in its own thread. The root of the problem is that a modular exponentiation is an inherently sequential evaluation (= a single one can't be easily parallelized) and that the execution flow is strongly dependent on the input data. That is really a tough problem, since GPUs are "single instruction, multiple data" devices. A CUDA GPU will execute many threads in warps, which on newer CUDA GPUs are on the order of 48 threads. If these warps diverge, which they will do if each thread's execution flow is strongly data dependent, then these threads will be serialized. Whoops, that's a 48x penalty.
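
A back-of-the-envelope model of that serialization (a toy illustration, not a GPU simulator): under SIMT, a warp has to step through every basic block that any of its threads wants to execute, so strongly data-dependent paths add up instead of overlapping.
Code:
# Toy SIMT cost model: a warp executes one instruction stream in lockstep, so when
# threads disagree at a branch the hardware runs every needed path with the other
# threads masked off. The warp's cost is the union of all paths, not a single path.

def warp_cost(per_thread_paths):
    executed = set()
    for path in per_thread_paths:
        executed.update(path)    # the warp steps through every block some thread needs
    return len(executed)

# four threads that share two blocks but each need a different data-dependent block
paths = [("square", "reduce", "case_a"),
         ("square", "reduce", "case_b"),
         ("square", "reduce", "case_c"),
         ("square", "reduce", "case_d")]
print(warp_cost(paths), "blocks for the warp vs", len(paths[0]), "per thread on a CPU")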

The second problem is that at least most NVIDIA GPUs synthesize 64-bit integer instructions from 32-bit instructions. 64-bit addition and subtraction can be synthesized efficiently from 32-bit operations (2 or 3 instructions depending on compute capability). Multiplication requires longer instruction sequences (on the order of 10 to 20 instructions depending on the architecture), so that is another ~10x penalty compared to a 64-bit CPU, which can execute these instructions natively.
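
For the curious, this is roughly what has to happen for a single 64x64-bit multiplication when only 32-bit multiplies are native (schoolbook splitting into halves; the exact instruction counts depend on the architecture):
Code:
# Emulated 64-bit multiplication: split each operand into 32-bit halves, form four
# 32x32 -> 64 partial products and recombine them with shifts and additions.
# A 64-bit CPU does the same job in one native multiply instruction.

MASK32 = (1 << 32) - 1

def mul64(a, b):
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi
    # full 128-bit result = hi_hi*2^64 + (lo_hi + hi_lo)*2^32 + lo_lo
    return (hi_hi << 64) + ((lo_hi + hi_lo) << 32) + lo_lo

a, b = 0xDEADBEEFCAFEBABE, 0x123456789ABCDEF1
assert mul64(a, b) == a * b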

So that's easily a 480x combined penalty: even if a GPU is 100x faster than a CPU at simple instructions, it will still be ~5 times slower than a CPU with the above approach. There might be a clever idea for doing something radically different from what was outlined above, but at this point I'm very sceptical that we will see a practical GPU miner soon, given the higher electricity use of a GPU compared to a CPU.