I just had a very quick look over the code. I do my sieve in phases, which might help speed things up a little more. (Warning: my system does not have a hardware divide, so I see the benefits very clearly; x86_64 might not see any speed-up at all.)
Phase 1. primes smaller than 2*primorial (I use 210).
A normal sieve with a fast exit, e.g.
if(!(psieve[j>>5] & ( 1U << (j & 0x1f)))) break;
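To make the fast exit concrete, here is a minimal sketch of how it could sit inside a trial-division loop. It is an illustration only, not the code under discussion: it assumes a set bit means "still a candidate", and `sieve_test`, `sieve_clear`, and `sieve_candidate` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit j of a bitmap packed into 32-bit words.
 * Assumption: a set bit means "candidate j has not been eliminated yet". */
static int sieve_test(const uint32_t *psieve, uint32_t j)
{
    return (psieve[j >> 5] & (1U << (j & 0x1f))) != 0;
}

static void sieve_clear(uint32_t *psieve, uint32_t j)
{
    psieve[j >> 5] &= ~(1U << (j & 0x1f));
}

/* Trial-divide the value n (stored in slot j) by each prime in turn,
 * stopping as soon as the slot has been eliminated.
 * Returns the number of divisions actually performed. */
static size_t sieve_candidate(uint32_t *psieve, uint32_t j, uint64_t n,
                              const uint32_t *primes, size_t nprimes)
{
    size_t divisions = 0;
    for (size_t i = 0; i < nprimes; i++) {
        if (!sieve_test(psieve, j))
            break;                      /* fast exit: already composite */
        divisions++;
        if (n % primes[i] == 0)
            sieve_clear(psieve, j);
    }
    return divisions;
}
```

On a machine without a hardware divide, every `%` skipped this way is a software division saved, which is presumably why the effect is so visible there.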
Phase 2. The next "few hundred primes"
Add the "remainder too large" test. Doing this test early in the sieve slows it down, because the test mostly fails, which is why I do it later, e.g.
if(tmp & 0xffffffe0UL) continue;
Only when the remainder has a greater than 50% chance of passing the test does it become time-efficient to include it.
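A rough sketch of the Phase 2 idea, under my own assumptions: `tmp` is taken to be the slot index of a prime's first multiple relative to the start of the current 32-candidate word, so any value of 32 or more misses the word entirely. The function `sieve_word` and the `first_hit` array are hypothetical, not taken from the code under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Clear every hit that lands in one 32-candidate word.
 * first_hit[i] (hypothetical) is the slot, counted from the start of
 * this word, of the first multiple of primes[i]. */
static uint32_t sieve_word(uint32_t word, const uint32_t *primes,
                           const uint32_t *first_hit, size_t nprimes)
{
    for (size_t i = 0; i < nprimes; i++) {
        uint32_t tmp = first_hit[i];
        if (tmp & 0xffffffe0UL)
            continue;                   /* tmp >= 32: no hit in this word */
        for (; tmp < 32; tmp += primes[i])
            word &= ~(1U << tmp);       /* knock out each multiple */
    }
    return word;
}
```

If the remainders are roughly uniform in [0, p), the skip fires with probability about 1 - 32/p, which would put the 50% break-even near p = 64. That is only my reading of the claim above.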
Phase 3. The last few hundred thousand primes.
I do this in line with the scanner, but the main difference is a bulk check of 32 candidates at a time, e.g.
if(!psieve[j>>5]) { j += 31; offset += 210*31; continue; }
This needs the candidate density to be less than ~1 in 64 candidates, which is why you need to sieve the "first few hundred" before you get any benefit.
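The bulk check could look something like the sketch below. It is a guess at the surrounding scanner loop, not the real one: `scan_survivors`, `out`, and `max_out` are hypothetical, and I have added an explicit word-alignment test so the skip can never jump from the middle of one word into the middle of the next.

```c
#include <stddef.h>
#include <stdint.h>

/* Scan a bitmap of nbits candidates (psieve must hold at least
 * (nbits + 31) / 32 words) and record the offset of every survivor.
 * Each slot is 210 apart, matching the step in the snippet above. */
static size_t scan_survivors(const uint32_t *psieve, uint32_t nbits,
                             uint64_t *out, size_t max_out)
{
    size_t found = 0;
    uint64_t offset = 0;
    for (uint32_t j = 0; j < nbits; j++, offset += 210) {
        /* Bulk check: an all-zero word holds no candidates, so skip
         * 31 slots here and let the loop increment take the 32nd. */
        if ((j & 0x1f) == 0 && !psieve[j >> 5]) {
            j += 31;
            offset += 210 * 31;
            continue;
        }
        if ((psieve[j >> 5] & (1U << (j & 0x1f))) && found < max_out)
            out[found++] = offset;
    }
    return found;
}
```

If survivors were independent with density 1 in 64, a word would be completely empty about 60% of the time ((63/64)^32 ≈ 0.60), which fits with the skip only paying off once the sieve is already sparse.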
Regards and as always check my logic,
Perhaps I'm misunderstanding, but:
(a) Using the 40th primorial (plus or minus depending on which of the miners you're talking about) means that you never sieve factors that fit into a word anyway.
(b) The majority of time in my code is spent doing three things, in order:
- Fermat primality test (gmp)
- Calculating T_rounded_up % p (gmp)
- Sieving large primes that still occur multiple times in the maximum number of nonces (primes under 2^29).
Most of this time is actually spent asking one thing: if (offset < sieve_size)
which mostly fails with a sieve of 8M entries and a prime of, e.g., 100m.
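For readers who have not seen this pattern, here is a minimal sketch of the kind of loop I mean. It is not the miner's actual code: it uses a byte per sieve entry for clarity rather than a bitmap, it assumes every prime is larger than `sieve_size` (so at most one hit per segment), and `sieve_large_primes` and `next_hit` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* One pass of the large primes over a single sieve segment.
 * next_hit[i] (hypothetical) is the position of the next multiple of
 * primes[i], counted from the start of the current segment.
 * Assumption: primes[i] > sieve_size, so at most one hit per segment.
 * Returns how many primes actually landed inside the segment. */
static size_t sieve_large_primes(uint8_t *sieve, uint32_t sieve_size,
                                 const uint32_t *primes, uint32_t *next_hit,
                                 size_t nprimes)
{
    size_t hits = 0;
    for (size_t i = 0; i < nprimes; i++) {
        uint32_t offset = next_hit[i];
        if (offset < sieve_size) {      /* rarely true when p >> sieve_size */
            sieve[offset] = 1;          /* mark composite */
            offset += primes[i];
            hits++;
        }
        next_hit[i] = offset - sieve_size;  /* rebase for the next segment */
    }
    return hits;
}
```

With an 8M-entry sieve and a prime around 100M, the branch is taken roughly once in every dozen segments, so most of the work is loading and storing `next_hit` for nothing.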
My guess, though I might be wrong, is that a lot of the optimizations you're looking at become less dominant when you go for a really huge primorial. For example, almost _no_ time is spent checking the actual sieve; as far as I can tell, there's basically zero benefit to trying to optimize finding candidates. The code spends somewhere between 1 and 2 seconds doing primality testing for each iteration through the sieve (8 million bits). The time to check each bit position is a few tens of microseconds of that 1-2 seconds.