Alright, the code should be up in the pull requests on
https://github.com/primecoin/primecoin and is also available from
https://github.com/Chemisist/primecoin if you're itching to have a look. Compiled binaries will be available at
https://www.dropbox.com/sh/1y7mwwc4asgfqfb/Eo2sKCsZor as I make and upload them.
(This was originally just going to be a message to Sunny, but I’ll post it pretty much as is because I’m feeling lazy and I need to get back to work on my thesis!)
[msg_to_sunny]
Hey Sunny King-
In studying your algorithm, I’ve learned quite a bit about prime numbers and modular arithmetic, thanks! Unfortunately, I now have to focus more on my thesis, so I’ll be spending a little less time thinking about how to parallelize this complex algorithm. I’ve collected my current thoughts on parallelization below, and I was hoping you could answer some of these questions. Also, do you know if anyone else is working on a GPU implementation of your algorithm?
Also, while trying to figure out what the heck a multiplicative modular inverse is, I came across multiple people demonstrating that the modulus and division operators are very slow (10+ clock cycles) on nVidia GPUs (and I assume ATI as well). Is it possible to weave the required sieve without the modulus operator?
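For what it's worth, here's a minimal C++ sketch of the two operations in question (function names are mine, not the Primecoin source): the multiplicative modular inverse via the extended Euclidean algorithm, and marking multiples of a prime with plain strided addition, which needs no `%` or division in the loop body at all:

```cpp
#include <cassert>
#include <cstdint>

// Extended Euclid: returns x with (a * x) % p == 1, for prime p and 0 < a < p.
// This is the "multiplicative modular inverse" the weave needs per prime.
int64_t ModInverse(int64_t a, int64_t p) {
    int64_t r0 = p, r1 = a;   // remainder sequence
    int64_t t0 = 0, t1 = 1;   // Bezout coefficient sequence
    while (r1 != 0) {
        int64_t q = r0 / r1;
        int64_t r = r0 - q * r1; r0 = r1; r1 = r;
        int64_t t = t0 - q * t1; t0 = t1; t1 = t;
    }
    return t0 < 0 ? t0 + p : t0;  // normalize into [0, p)
}

// Marking multiples of p needs no modulus at all: start at the first offset
// in range and step by p. Only computing the starting offset needs a mod.
void MarkMultiples(uint8_t* sieve, unsigned nSieveSize, unsigned nStart, unsigned p) {
    for (unsigned i = nStart; i < nSieveSize; i += p)
        sieve[i] = 1;  // flag composite candidate
}
```

So the expensive `%` can be confined to once-per-prime setup, with the hot inner loop reduced to adds and compares.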
Timing your Weave algorithm, it seems that about 10-15% of the time is spent in the outer loop, with the remaining 85-90% spent in the inner loop. I’m stuck on how to parallelize this inner loop. Ideally, I would launch threads corresponding to all values of nBiTwinSeq less than nChainLength, but I now realize that the value of bnFixedInverse is path dependent and cannot be parallelized in the direction I initially thought.
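If I’m reading the inner loop right, the path dependence is just the recurrence fixedInverse ← fixedInverse · twoInverse (mod p), applied once per chain position, and a recurrence like that can be unrolled: the value at position k is fixedInverse0 · twoInverse^k (mod p), so every position could be computed independently from precomputed powers. A hypothetical sketch of that idea (my names, not the actual Weave code; assumes p fits in 32 bits so the 64-bit products don’t overflow):

```cpp
#include <cstdint>
#include <vector>

// The sequential update fixedInverse = fixedInverse * twoInverse % p makes
// iteration k depend on k-1. But the closed form
//   fixedInverse_k = fixedInverse0 * twoInverse^k (mod p)
// lets each k be computed on its own thread from a table of powers.
std::vector<uint64_t> InversesPerPosition(uint64_t fixedInverse0,
                                          uint64_t twoInverse,
                                          uint64_t p, unsigned nPositions) {
    std::vector<uint64_t> pow(nPositions, 1);     // twoInverse^k mod p
    for (unsigned k = 1; k < nPositions; ++k)
        pow[k] = pow[k - 1] * twoInverse % p;     // cheap sequential prepass
    std::vector<uint64_t> inv(nPositions);
    for (unsigned k = 0; k < nPositions; ++k)     // each k independent: GPU-friendly
        inv[k] = fixedInverse0 * pow[k] % p;
    return inv;
}
```

The power table is still sequential, but it is tiny (nChainLength entries) next to the per-prime work it unlocks.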
My current thought on how to parallelize the Weave algorithm is a two-stage process that decomposes the nested loop into sequential loops. The values calculated in the outer loop, namely bnFixedInverse and bnTwoInverse, can be calculated and stored in memory to be used in the inner loop. The rationale is that calculating these values for the 10^6 prime numbers stored in the vPrimes array is independent for each prime and therefore perfect for parallelization. Having calculated and stored all those values, each execution of the nested loop can (theoretically) be performed on an individual thread as well, and since the algorithm only ever switches bits from 0 to 1, I don’t think the race conditions that are usually so problematic in parallelization will be relevant.
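One caveat on the race-condition point: 0-to-1 flips are only harmless if each write touches its own byte or word. If the sieve is a packed bitfield, a plain read-OR-write from two threads hitting the same word can drop a neighbor’s bit. A sketch of the safe version using an atomic OR (hypothetical helper, not from the patch):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Stage-2 sketch: many threads flag composites in a shared packed sieve.
// fetch_or makes the read-modify-write on the shared word atomic, so
// concurrent 0->1 flips in the same 32-bit word can never lose each other.
void SetComposite(std::vector<std::atomic<uint32_t>>& sieveWords, unsigned nIndex) {
    sieveWords[nIndex / 32].fetch_or(1u << (nIndex % 32),
                                     std::memory_order_relaxed);
}
```

Using one byte per candidate instead of packed bits avoids the atomics entirely, at 8x the memory.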
Thinking about the relative time it takes to completely create the sieve (2-5 seconds depending on hardware) versus the time it takes to test all primes found (approx. 5-20 ms, depending on the number of primes), it seems the best parallelized implementation will have a single “master” thread testing the sieves woven in parallel by GPUs or additional CPU cores. If the GPU can weave a sieve in about the time it takes to test the primes for chain properties, then the program should pair off physical GPUs with individual cores to achieve maximum performance. Assuming that weaving the complete sieve is slower than the testing (which it probably always will be), creating and testing the incomplete sieve in m and n ms, respectively, will be most efficient when m = n. This essentially maximizes the time spent testing primes instead of spending the overwhelming majority of the time weaving the “perfect” sieve. Interestingly, the sieve is woven to the point of only needing to test ~2000 primes after it has gone through less than 1% of the total vPrimes array.
The evolving weave timer is there primarily to force the sieve-creation time to match the sieve-testing time, and it shows a pretty large performance boost in pps.
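For concreteness, here is one way such a timer could evolve; this is just my sketch of the m = n idea with an assumed smoothing factor, not the actual patch code:

```cpp
#include <algorithm>

// Hypothetical "evolving weave timer": nudge the weave budget (m, in ms)
// toward the measured test time (n, in ms) after every round, so the two
// converge on m == n instead of over-weaving a near-perfect sieve.
double NextWeaveBudgetMs(double currentBudgetMs, double measuredTestMs) {
    const double gain = 0.25;  // assumed smoothing factor: 0 = frozen, 1 = jump
    double next = currentBudgetMs + gain * (measuredTestMs - currentBudgetMs);
    return std::max(1.0, next);  // never starve the weave entirely
}
```

The smoothing keeps a single unusually fast or slow test round from whipsawing the budget.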
A compiled binary (64-bit Linux, built with -O3) is available at
https://www.dropbox.com/sh/1y7mwwc4asgfqfb/Eo2sKCsZor
These are of course just the thoughts of a total hack, but let me know what you think!
Eric
[/msg_to_sunny]
(donations welcome)