Pages:
Author

Topic: [XPM] So, why is porting primecoin mining to the GPU so difficult? - page 2. (Read 18739 times)

hero member
Activity: 724
Merit: 500
Base 2 Fermat test is strong enough. It's known that pseudoprimes are extremely rare compared to primes. The term probable prime is misleading to layman because they are actually almost certain primes.

In fact I would be very surprised if you can find a pseudoprime in the block chain.
Pseudoprimes are not that rare! Here is an example:

341, 561, 645, 1105, 1387, 1729, 1905, 2047, 2465, 2701, 2821, 3277, 4033, 4369, 4371, 4681, 5461, 6601, 7957, 8321, 8481, 8911
hero member
Activity: 532
Merit: 500
There is already a fast implementation of the Lucas–Lehmer primality test on AMD cards, I'd guess one for the Fermat primality test would be a matter of time.

http://mersenneforum.org/showthread.php?t=12576&page=176

Mfakto implements sieving as well: http://www.mersennewiki.org/index.php/Mfakto
The Lucas-Lehmer test currently available on GPU is very simple compared to the fermat testing needed for Primecoin.  A mersenne number is stated as 2p-1.  In order for a mersenne number to be prime, P must also be prime.  As stated on http://Mersenne.org/various/math.php, The Lucas-Lehmer primality test is remarkably simple. It states that for P > 2, 2P-1 is prime if and only if Sp-2 is zero in this sequence: S0 = 4, SN = (SN-12 - 2) mod (2P-1).  Trust me when I say the explanation is a lot harder than the actual math involved.  To prove 27-1 is prime take 6 steps, which is 1 less that the exponent of 7.  The current record Mersenne Prime, 257,885,161-1 took 57,885,160 calculations to prove as prime, requiring 15 days on my overclocked i5-2500k (4.3GHz) compared to ~4 days on an nVidia GTX 480 or ~3.5 days on a GTX 580.

Trial factoring is even easier, as all factors of Mersenne numbers can be expressed as 2 * k * P - 1.  Since 2 and P are givens, you simply sieve out prime k's from whatever bit depth you are working at and take 2kp-1 mod (2^P-1).  The fast fournier transforms available make GPUs operates at speeds in excess of 100 times a CPU.

The more complicate P-1 test, which has 'somewhat' successfully been ported to GPU can barely get 2-3x the speed of a good i5/i7.
legendary
Activity: 1205
Merit: 1010
Base 2 Fermat test is strong enough. It's known that pseudoprimes are extremely rare compared to primes. The term probable prime is misleading to layman because they are actually almost certain primes.

In fact I would be very surprised if you can find a pseudoprime in the block chain.
sr. member
Activity: 274
Merit: 250
So, to get my numbers straight (bottom line still is that porting Fermat's test is though):

A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2).

It's not sufficient to do the test with base 2 only. You get just about 99% prim probability with that.

I think that's what this coin does? I mean some chains accepted in this coin are composite?
hero member
Activity: 724
Merit: 500
So, to get my numbers straight (bottom line still is that porting Fermat's test is though):

A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2).

It's not sufficient to do the test with base 2 only. You get just about 99% prim probability with that.
legendary
Activity: 1484
Merit: 1005
There is already a fast implementation of the Lucas–Lehmer primality test on AMD cards, I'd guess one for the Fermat primality test would be a matter of time.

http://mersenneforum.org/showthread.php?t=12576&page=176

Mfakto implements sieving as well: http://www.mersennewiki.org/index.php/Mfakto
hero member
Activity: 566
Merit: 500
Why couldn't each core / thread on the GPU run the whole algorithm from start to finish? yeah, each one would be slow, but as a whole, everything wins. It gets around the threading issue that I know plagues most of algorithm being ported. Then again, I could be totally out in left field.
member
Activity: 75
Merit: 10
Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.
I imagine that it can work like this:
Run the sieve on the GPU until you have lots of candidates.
Then run a single fermat test on the GPU for all of these candidates.
All candidates passing the first test will be tested for chain length 2.
And so on.
I'm not sure how much memory we will need on the GPU if we store all the candidates there.

Problem being, that Fermat tests are not a good fit for current GPUs, because these tests have a strongly data-dependent execution flow. Memory isn't a problem, even for many many candidates.
sr. member
Activity: 274
Merit: 250

The algorithm is kind of complicated to explain in lay terms.  First, I'd like to point out that there is no database of primes of this size that would be even kind of useful.  The numbers used are on the order of 10100.  In that range about 1 in 180 numbers is prime.  That means that in order to store all of the prime numbers between 10100 and 10101 you would more memory than there could ever exist in the universe--more bits than there are atoms.  No such database can ever exist within the current understanding of physics.

So, the algorithm is somewhat of a brute force.  It has some elegance to it, though.  The chains have a very specific form, which gives a lot of opportunity for speedups.  For example, if your chain starts at N then 2N-1 is in the chain (for one of the chain types).  It would be inefficient to check the primality of 2N-1 if N is already shown to be composite--the chain is already proven to be invalid.  Many optimizations take this general form.

The algorithm itself is broken into two main parts: sieving and primality tests.  Sieving is a really fast way to show that lots of numbers are composite, but it is impractical to show that any number is prime by a sieve alone.  The most famous (and oldest) sieve is the Sieve of Eratosthenes, which has a lovely Gif and explanation on its wikipedia page.  Sieves work by eliminating numbers that have small factors;  it is easy and computationally cheap to carry out a single division to determine if a big number, N, is divisible by a small number, k, and about 1 in k divisions will show that N is divisible by k and that N is therefore composite.  This can be taken a step farther by noting that if N is divisible by k then N + a*k is also divisible by k, allowing many candidates to be eliminated by a single division and a few multiplications.  Thus, the mining algorithm first generates a set of candidate numbers based on a hash function and the data in the block, then runs that set of numbers through a sieve.

The next step is to actually test the remaining numbers after the sieve takes out most of them.  This is accomplished by the Fermat (Probable) Primality Test, which holds that if aP-1 ≡ 1 (mod P) for all a and prime P.  That is to say, you can take any number (a) and raise it to the power of your candidate (P) minus one, then divide the result by the candidate; if you get anything other than 1 as the remainder of that division then P is composite, otherwise P is probably prime.  For PrimeCoin a is always taken to be 2, and "probable" is taken to be good enough--the odds that 2 is a "Fermat Liar" for P are much less than the odds that P is prime.

Thus, the general flow is (Protocol to build a block and find the appropriate hash value to start from) -> (form a set of numbers that could be valid as part of prime chains for this block) -> (run the numbers through a sieve) -> (Eliminate numbers that passed the sieve but aren't parts of chains of length 9 or longer (current integer difficulty)) -> (for each chain of possibly prime numbers remaining, execute a Fermat Primality test until you either find a complete chain or find a composite number that proves that no chain long enough will be found there).

You explained very well. Thank you.
member
Activity: 75
Merit: 10
Could you explain what primecoin algorithm actually does?

Why doesn't primecoin for example just read some known prime numbers from some database, compare them if there is any Cunningham chain candidate and ... offer result?

Does this algorithm have to find all these prime numbers by brute force first, and then find the chains..?

If you could describe this in layman's terms, I'd be very happy.

The algorithm is kind of complicated to explain in lay terms.  First, I'd like to point out that there is no database of primes of this size that would be even kind of useful.  The numbers used are on the order of 10100.  In that range about 1 in 180 numbers is prime.  That means that in order to store all of the prime numbers between 10100 and 10101 you would more memory than there could ever exist in the universe--more bits than there are atoms.  No such database can ever exist within the current understanding of physics.

So, the algorithm is somewhat of a brute force.  It has some elegance to it, though.  The chains have a very specific form, which gives a lot of opportunity for speedups.  For example, if your chain starts at N then 2N-1 is in the chain (for one of the chain types).  It would be inefficient to check the primality of 2N-1 if N is already shown to be composite--the chain is already proven to be invalid.  Many optimizations take this general form.

The algorithm itself is broken into two main parts: sieving and primality tests.  Sieving is a really fast way to show that lots of numbers are composite, but it is impractical to show that any number is prime by a sieve alone.  The most famous (and oldest) sieve is the Sieve of Eratosthenes, which has a lovely Gif and explanation on its wikipedia page.  Sieves work by eliminating numbers that have small factors;  it is easy and computationally cheap to carry out a single division to determine if a big number, N, is divisible by a small number, k, and about 1 in k divisions will show that N is divisible by k and that N is therefore composite.  This can be taken a step farther by noting that if N is divisible by k then N + a*k is also divisible by k, allowing many candidates to be eliminated by a single division and a few multiplications.  Thus, the mining algorithm first generates a set of candidate numbers based on a hash function and the data in the block, then runs that set of numbers through a sieve.

The next step is to actually test the remaining numbers after the sieve takes out most of them.  This is accomplished by the Fermat (Probable) Primality Test, which holds that if aP-1 ≡ 1 (mod P) for all a and prime P.  That is to say, you can take any number (a) and raise it to the power of your candidate (P) minus one, then divide the result by the candidate; if you get anything other than 1 as the remainder of that division then P is composite, otherwise P is probably prime.  For PrimeCoin a is always taken to be 2, and "probable" is taken to be good enough--the odds that 2 is a "Fermat Liar" for P are much less than the odds that P is prime.

Thus, the general flow is (Protocol to build a block and find the appropriate hash value to start from) -> (form a set of numbers that could be valid as part of prime chains for this block) -> (run the numbers through a sieve) -> (Eliminate numbers that passed the sieve but aren't parts of chains of length 9 or longer (current integer difficulty)) -> (for each chain of possibly prime numbers remaining, execute a Fermat Primality test until you either find a complete chain or find a composite number that proves that no chain long enough will be found there).
hero member
Activity: 675
Merit: 514
Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.
I imagine that it can work like this:
Run the sieve on the GPU until you have lots of candidates.
Then run a single fermat test on the GPU for all of these candidates.
All candidates passing the first test will be tested for chain length 2.
And so on.
I'm not sure how much memory we will need on the GPU if we store all the candidates there.
hero member
Activity: 524
Merit: 500
So once we are at it - what about fpgas?

I don't know much about FPGA's, but in theory, if you build a simple 1024-bit processor that can multiply,add,subtract and shift, then you could also do Fermat tests. 1024-bit would be enough for current prime records on the primecoin chain.

RSA also heavily relies on modular expansion and a basic FPGA implementation would start with 1024bit, so if you search for research papers on the topic you find papers like that one: http://irep.iium.edu.my/29540/1/FPGA_Implementation_of_RSA_Encryption.pdf

No idea how fast that would be though.


http://www-mtl.mit.edu/researchgroups/icsystems/pubs/conferences/2001/goodman_isscc01_slides.pdf
Primitives performance points (@50 MHz):
Modexp (IF/DL)
 GFexp (DL)
 ECmult (ECDL)
8.2/5.2 ms @ 512b
 8.0ms @ 512b
 7.0ms @ 176b
32.1/17 ms @ 1024b
 31.7ms @ 1024b
 14.5ms @ 256b
hero member
Activity: 639
Merit: 500
I am happy it is proving GPU resistant  Cheesy

Just like LTC is currently ASIC resistant :-)
Yep. Like it was GPU resistant just one and half year ago.  Roll Eyes
member
Activity: 75
Merit: 10
So once we are at it - what about fpgas?

I don't know much about FPGA's, but in theory, if you build a simple 1024-bit processor that can multiply,add,subtract and shift, then you could also do Fermat tests. 1024-bit would be enough for current prime records on the primecoin chain.

RSA also heavily relies on modular expansion and a basic FPGA implementation would start with 1024bit, so if you search for research papers on the topic you find papers like that one: http://irep.iium.edu.my/29540/1/FPGA_Implementation_of_RSA_Encryption.pdf

No idea how fast that would be though.

sr. member
Activity: 274
Merit: 250
Interesting insight, thank you.

Does Nvidia or AMD have a better chance of boosting efficiency in primecoin mining? Is CUDA more flexible in some ways and would that help?
Sy
legendary
Activity: 1484
Merit: 1003
Bounty Detective
So once we are at it - what about fpgas?
member
Activity: 75
Merit: 10
So, to get my numbers straight (bottom line still is that porting Fermat's test is though):

A primecoin miner has two stages, filtering candidates and testing numbers for primality by using Fermat tests (with base 2). Latest default setting are: On testnet you spend 30% sieving and 70% testing candidates, on mainnet 70% sieving and 30% testing. On mainnet you will filter out 97% of all candidates you started with by sieving.

In order to produce a meaningful speedup, you will have to port both stages to the GPU or make one of them very small. Lets say you keep the original settings, port the sieve to GPU, interleave Fermat tests on the CPU. In this scenario your speedup is at most ~3x, because you still have to do that 30% candidate testing. Ok, you can sieve much faster now, so why not sieve with 10x more primes? You already filter 97%, and each new prime will not kill more than 1/p more candidates. The low-hanging fruits are the smaller primes and you have already used them. So, dimishing returns for the extra computations if these primes get bigger. At some point, even if you sieve 100x faster it won't be worth the extra computations and just checking if that number is prime with a Fermat test makes more sense. (90% of all Fermat tests come back positive, that is, the start of that possible chain is definitely composite. Only for the remaining 10% you check if you hit a chain of some type.). Bottom line: to produce meaningful speedups with gpu mining, you will have to port both the sieve and Fermat tests to the GPU. Another argument for this is that transfer between CPU and GPU is costly and can quickly become a bottleneck.
hero member
Activity: 675
Merit: 507
Freedom to choose
I am happy it is proving GPU resistant  Cheesy

Just like LTC is currently ASIC resistant :-)
sr. member
Activity: 784
Merit: 250
DIA | Data infrastructure for DeFi
 I am happy it is proving GPU resistant  Cheesy
Pages:
Jump to: