Crossposting for exposure;
If anyone would like to give it a shot, there's an extremely optimized sieve of Eratosthenes implementation here:
https://primesieve.googlecode.comFrom the software:
The best sieving performance is achieved with a sieve size of your
CPU's L1 data cache size (usually 32 or 64 KB) when sieving < 10^16
and a sieve size of your CPU's L2 cache size above.
primesieve uses the segmented sieve of Eratosthenes with wheel factorization, this algorithm has a complexity of O (N log log N) operations and uses O (sqrt N) space.
Segmentation is currently the best known practical improvement to the sieve of Eratosthenes. Instead of sieving the interval [2, n] at once one subdivides the sieve interval into a number of equal sized segments that are then sieved consecutively. Segmentation drops the memory requirement of the sieve of Eratosthenes from O(N) to O(sqrt N). The segment size is usually chosen to fit into the CPU's fast L1 or L2 cache memory which significantly speeds up sieving. A segmented version of the sieve of Eratosthenes was first published by Singleton in 1969 [1], Bays and Hudson in [2] describe the algorithm in more detail.
From the source, the sieve size is small (1 * 10^6) by default if I'm interpreting it correctly.
I'm wondering if a massively parallel GPU implementation with higher memory bandwidth also benefiting from the reduced memory requirements will see much better performance.
Also,
primesieve generates the first 50,847,534 primes up to 10^9 in just 0.4 seconds on a single core of an Intel Core i7-920 2.66GHz, this is about 50 times faster than an ordinary C/C++ sieve of Eratosthenes implementation and about 10,000 times faster than trial-division. primesieve outperforms my older ecprime (fastest from 2002 to 2010) by about 30 percent and also substantially outperforms primegen the fastest sieve of Atkin implementation on the web. Here is a list of other fast sieve of Eratosthenes implementations.
If the current implementation is the standard C++ one (I haven't looked at it thoroughly but it seems to be), whoever can race to implement this first may benefit immensely.
For those interested, there's a CUDA implementation here too:
http://www.mersenneforum.org/showthread.php?t=11900