If I recall correctly, there was a GPU miner for primecoin that somebody claimed to be developing, and took donations for it, but never delivered.
As far as I know there is still no GPU miner for primecoin, mostly because the CPU is still needed to do certain things efficiently, and the communication time between the CPU and GPU killed any performance gains from running parts of the code on the GPU. Or it was something to that effect.
If it were possible to make a GPU miner, I think that would be better long-term for this coin. CPU coins get dominated by botnets, drowning out potential profits for people mining legitimately, since botnet operators don't care about electricity costs. That kills a lot of potential enthusiasm. There are only so many people in the world willing to lose money calculating prime numbers for fun.
Yeah - that was an infamous one.
Barring advances in the algorithmic techniques for RIC, the basics of RIC mining on a GPU look like this.
Given a target number T determined by the diff and the blockchain hash:
(a) Round T up to T' where T' is a multiple of the primorial being used. (easy - do once on CPU for every block).
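For concreteness, a minimal CPU-side sketch using GMP - the helper name and layout are mine, not from any actual miner:

#include <gmp.h>

/* Round T up to the next multiple of the primorial. Runs once per
   block on the CPU, so plain GMP is plenty fast here. */
void round_up_to_primorial(mpz_t Tp, const mpz_t T, const mpz_t primorial)
{
    mpz_t rem;
    mpz_init(rem);
    mpz_mod(rem, T, primorial);        /* rem = T mod primorial */
    mpz_sub(Tp, T, rem);               /* largest multiple <= T */
    if (mpz_sgn(rem) != 0)
        mpz_add(Tp, Tp, primorial);    /* bump up unless already a multiple */
    mpz_clear(rem);
}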
(b) Compute T%p for every prime p being used to sieve
Less easy: Requires bigint math in various forms.
Potentially optimizable: Because the p's are all known in advance and never change, you can compute 1/p on the CPU once per difficulty, and then use it to derive T%p with a multiplication instead of a division.
Requires: Decently fast bigint multiplication on GPU
http://www.hpcs.cs.tsukuba.ac.jp/~nakayama/cump/index.php?CUMP%20Performance%20Evaluation
Note the 1000-decimal-digit results: slower than a dual-processor Opteron. (!)
But perhaps there's some speedup hiding in there if you do many of them in parallel.
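To make the reciprocal trick concrete, here is a sketch - the names are mine, and it assumes 32-bit limbs with magic = floor(2^64 / p) precomputed on the CPU:

#include <stdint.h>

/* T % p via a precomputed reciprocal: Horner evaluation over the limbs
   of T, replacing each divide by a multiply-high and one correction. */
__device__ uint32_t mod_by_sieve_prime(const uint32_t *T_limbs, int n_limbs,
                                       uint32_t p, uint64_t magic)
{
    uint64_t r = 0;
    for (int i = n_limbs - 1; i >= 0; --i) {
        uint64_t x = (r << 32) | T_limbs[i];  /* r < p < 2^32, so x fits */
        uint64_t q = __umul64hi(x, magic);    /* floor(x/p) or one less */
        r = x - q * p;
        if (r >= p) r -= p;                   /* one correction suffices */
    }
    return (uint32_t)r;
}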
(c) Sieve - write zeros to the sieve at the locations indicated by T%p.
Easy: Optimizing this is standard GPGPU programming. It's not trivial, but there are a lot of people who could do it.
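A minimal version might look like this - one thread per sieve prime, with the layout and names purely illustrative; a real miner would bucket small and large primes differently:

#include <stdint.h>

/* Each thread takes one sieve prime p and clears every p-th entry,
   starting from the offset computed in step (b). */
__global__ void sieve_kernel(uint8_t *sieve, uint32_t sieve_len,
                             const uint32_t *primes,
                             const uint32_t *first_offset, int n_primes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_primes) return;
    uint32_t p = primes[i];
    for (uint32_t j = first_offset[i]; j < sieve_len; j += p)
        sieve[j] = 0;   /* zero marks the candidate composite */
}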
(d) Test candidates:
Potentially painful: Requires modular exponentiation on the GPU (2^(n-1) % n).
Algorithmic competitiveness with GMP probably requires using Montgomery reduction.
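For reference, the single-word version of Montgomery reduction (REDC) captures the idea - the real thing needs a multi-limb variant like CIOS. This sketch assumes n odd and below 2^31 so the intermediate sum fits in 64 bits:

#include <stdint.h>

/* REDC with R = 2^32: given t < n*R, return t * R^-1 mod n.
   n_inv = -n^-1 mod 2^32, precomputed once per candidate. */
uint32_t redc32(uint64_t t, uint32_t n, uint32_t n_inv)
{
    uint32_t m = (uint32_t)t * n_inv;          /* m = (t mod R) * n' mod R */
    uint64_t u = (t + (uint64_t)m * n) >> 32;  /* low word cancels to zero */
    return (u >= n) ? (uint32_t)(u - n) : (uint32_t)u;
}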
This is quite a bit of work. There's nothing in this list that is impossible, but it's a substantial engineering challenge to make it worthwhile. Otherwise, even though the GPU has more horsepower, the algorithmic and engineering advantages of GMP will dominate.
There are some fun possibilities for doing this - e.g., because a lot of the pain in bigint code is handling variable-length operands, you could just synthesize the kernel when a new block arrives (or a new diff arrives). But anyone expert enough to pull this off could get a lot more money applying that skill to security applications than to hacking on a cryptocurrency. :-)
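With a current toolkit, the kernel-synthesis idea could be sketched with NVRTC (which postdates the early miners; error handling omitted, and the generator that produces 'source' is left to the imagination):

#include <stdlib.h>
#include <cuda.h>
#include <nvrtc.h>

/* Sketch: compile a kernel generated for the current block's fixed
   bigint width, then load it through the driver API. */
CUfunction build_kernel(const char *source, const char *entry_name)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source, "miner.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL);
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    char *ptx = (char *)malloc(ptx_size);
    nvrtcGetPTX(prog, ptx);
    nvrtcDestroyProgram(&prog);

    CUmodule module;
    CUfunction fn;
    cuModuleLoadData(&module, ptx);
    cuModuleGetFunction(&fn, module, entry_name);
    free(ptx);
    return fn;
}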
Nvidia GTX 780 Ti Classified + Intel i7-4770k
(initialization) Single static array of primes and modular multiplicative inverses stored on the GPU (16,777,216 x 2 items, interleaved)
(a) Done on the CPU, using the primorial 173#. (check)
(b) Only the offsets for the first tuple member p are calculated on the CPU; the offsets for p+4, p+6, p+10, p+12, and p+16 are derived on the GPU using 64-bit integers only.
(c) Done on the GPU using shared memory, slightly faster than the CPU implementation. (check)
(d) Done on the GPU. The modular multiplication code (Montgomery reduction, CIOS) is generated by my Python script for various fixed precisions and dynamically loaded at run-time when the difficulty changes. See the sample output below for 320-bit precision.
Modular exponentiation uses square-and-double because I am using a base of 2; otherwise I would have been forced to use square-and-multiply, which is much slower.
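Roughly, that loop looks like this - mont_sqr, mod_dbl, set_mont_one, and from_mont are stand-ins for the generated routines, not my actual function names:

/* Square-and-double for 2^e mod n in the Montgomery domain. Doubling
   is linear, so it works directly on Montgomery residues as a modular
   addition - no full multiplication is needed for the "multiply" step,
   which is the whole advantage of base 2. */
__device__ void modexp_base2(nvidia_gfn_t r, const uint32_t *e, int e_bits,
                             nvidia_word_t *n)
{
    set_mont_one(r, n);                     /* r = 1 in Montgomery form */
    for (int i = e_bits - 1; i >= 0; --i) {
        mont_sqr(r, r, n);                  /* r = r^2 * R^-1 mod n */
        if ((e[i / 32] >> (i % 32)) & 1)
            mod_dbl(r, n);                  /* r = 2r mod n: add + conditional subtract */
    }
    from_mont(r, n);                        /* convert back out */
}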
Benchmarked by itself, the modular multiplication is 12x faster than GMP with 8 threads on the i7-4770k. However, because of some compiler issues resulting in low occupancy, the entire application is currently only 3x faster on the GPU; the target is 6x. I will have some time to resolve this issue in May.
__device__ void
nvidia_gfn_multiply(nvidia_word_t *rop, const int rop_interleaved,
const nvidia_word_t *op1, const int op1_interleaved,
const nvidia_word_t *op2, const int op2_interleaved,
nvidia_word_t *nvidia_gfn_n)
{
nvidia_gfn_t r;
nvidia_word_t q;
nvidia_word_t c0=0, c1;
nvidia_word_t tasm=0;
r[0]=0;
r[1]=0;
r[2]=0;
r[3]=0;
r[4]=0;
r[5]=0;
r[6]=0;
r[7]=0;
r[8]=0;
r[9]=0;
tasm=0;
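/* Column for op2[0] of the CIOS loop, fully unrolled: accumulate
   op1 * op2[0] into r, threading the carry chain through tasm. */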
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[0]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[0*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[0]) : "r"(op1[0*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[1]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[1*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[1]) : "r"(op1[1*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[2]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[2*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[2]) : "r"(op1[2*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[3]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[3*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[3]) : "r"(op1[3*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[4]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[4*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[4]) : "r"(op1[4*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[5]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[5*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[5]) : "r"(op1[5*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[6]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[6*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[6]) : "r"(op1[6*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[7]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[7*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[7]) : "r"(op1[7*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[8]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[8*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[8]) : "r"(op1[8*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(r[9]) : "r"(tasm));
asm( "madc.hi.u32 %0, %1, %2, 0;" : "=r"(tasm) : "r"(op1[9*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "mad.lo.cc.u32 %0, %1, %2, %0;" : "+r"(r[9]) : "r"(op1[9*op1_interleaved]), "r"(op2[0*op2_interleaved]));
asm( "addc.u32 %0, %0, 0;" : "+r"(tasm));
asm( "add.cc.u32 %0, %0, %1;" : "+r"(c0) : "r"(tasm) );
asm( "addc.u32 %0, 0, 0;" : "=r"(c1));
...
nvidia_word_t overflow;
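/* Final reduction: n is subtracted unconditionally; if the borrow is
   not absorbed by the carry word c0 (overflow != 0), the result went
   negative, so n is added back. */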
asm ( "sub.cc.u32 %0, %0, %1;" : "+r"(r[0]) : "r"(nvidia_gfn_n[0]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[1]) : "r"(nvidia_gfn_n[1]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[2]) : "r"(nvidia_gfn_n[2]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[3]) : "r"(nvidia_gfn_n[3]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[4]) : "r"(nvidia_gfn_n[4]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[5]) : "r"(nvidia_gfn_n[5]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[6]) : "r"(nvidia_gfn_n[6]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[7]) : "r"(nvidia_gfn_n[7]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[8]) : "r"(nvidia_gfn_n[8]) );
asm ( "subc.cc.u32 %0, %0, %1;" : "+r"(r[9]) : "r"(nvidia_gfn_n[9]) );
asm ( "subc.u32 %0, %1, 0;" : "=r"(overflow) : "r"(c0) );
if (overflow!=0)
{
asm ( "add.cc.u32 %0, %0, %1;" : "+r"(r[0]) : "r"(nvidia_gfn_n[0]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[1]) : "r"(nvidia_gfn_n[1]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[2]) : "r"(nvidia_gfn_n[2]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[3]) : "r"(nvidia_gfn_n[3]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[4]) : "r"(nvidia_gfn_n[4]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[5]) : "r"(nvidia_gfn_n[5]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[6]) : "r"(nvidia_gfn_n[6]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[7]) : "r"(nvidia_gfn_n[7]) );
asm ( "addc.cc.u32 %0, %0, %1;" : "+r"(r[8]) : "r"(nvidia_gfn_n[8]) );
asm ( "addc.u32 %0, %0, %1;" : "+r"(r[9]) : "r"(nvidia_gfn_n[9]));
}
rop[0*rop_interleaved]=r[0];
rop[1*rop_interleaved]=r[1];
rop[2*rop_interleaved]=r[2];
rop[3*rop_interleaved]=r[3];
rop[4*rop_interleaved]=r[4];
rop[5*rop_interleaved]=r[5];
rop[6*rop_interleaved]=r[6];
rop[7*rop_interleaved]=r[7];
rop[8*rop_interleaved]=r[8];
rop[9*rop_interleaved]=r[9];
}