Yes, I noticed the "time to block" decreased to a third when using -m 5M. In my miner (for Intel hardware though) I use all primes up to 2^32, so I was used to some larger numbers :-) Note that the displayed time to block may be severely off. You need to apply a correction depending on the primorial you use: a larger primorial increases the likelihood of finding a sextuplet, but OTOH you do miss some sextuplets if your primorial is bigger than 210 (and of course it is).
The first choice I made was to use 32bit everywhere (64bit arithmetic is expensive). This means each thread searches at most the range (nonce -> nonce+2^32).
Sieving with larger primes has a low chance of pruning any given candidate. At some point it becomes cheaper to stop sieving (diminishing returns) and start Fermat testing.
I don't use the pnXX+97 step, but test all 5005 candidates per pn19, so I don't believe I miss any; I am not sure why a correction factor would be required.
Mertens' 3rd theorem fits this kind of sieve very well: it gives exactly the probability that a number survives when you sieve with all the primes up to n. You surely noticed the constant factor between p0/p1/p2... This factor can be calculated as the quotient of the Mertens product (with n as the sieve limit) and the prime density at the given difficulty, which is about 1/ln(2^(diff+256+8+1)). This factor raised to the sixth power (for p6) decreases significantly with the size of the primes used, so it might be worth using primes as big as possible, if your sieve is fast enough (but on an ARM 2^32 is probably too big).
Yes I did; ~34-35 is the observed factor (at the current difficulty, sieving with the first 550k primes), and I use it outside the miner to estimate time to block. In the miner I still use the original gatra code for the time estimate; it is ~10% of what it should be, which isn't a big concern (i.e. low priority).
I think the sieve code is pretty fast, but it is only fast because I stick to the ARM's word size. A sieve of a few hundred thousand primes will be a natural limit for a 32bit implementation; until the difficulty increases (when Fermat testing gets significantly more expensive) I don't see any benefit in removing the 32bit design limit.
The problem I found with the sieve, after I eliminated all the modulo operations, was that it is limited by memory bandwidth. To make more than one core do useful work on it, I needed to divide it into small chunks that fit into the caches.
Yes, memory bandwidth is a big issue on ARM as well. Not only does it use slower memory, hardware prefetch is not as advanced as on x86, and reading & writing to the same cache line has a penalty, as does misalignment. My first few miners all ran at the same general speed no matter what I did; once I started minimizing memory access I was able to get some good gains.
Regards,
--
bsunau7