Supercomputing: can you tell us if a GPU could mine 10x faster than a CPU in the near future? Is it possible? Or could other FPGA/ASIC hardware be used to mine Primecoin?
Montgomery multiplication - Coarsely integrated operand scanning (CIOS) @ 256-bit
An Nvidia GTX 580 (reference design) is about 7x faster than an AMD Phenom II X6 1100T using all six cores.
Montgomery multiplication for GPU - must use hand-optimized PTX (unrolled)
Montgomery multiplication for CPU - must use hand-optimized x64 assembly (unrolled)
I would put the GPU at about 7x the CPU, since modular multiplication is the bottleneck, not sieving.
FPGAs will only be useful for primorial searching; they are not cost-effective for modular multiplication.
You are saying that you (or someone else) can create a GPU miner which is 7x faster than a CPU (for the same price)? How long would that take?
Yes, that is correct when using the GPU and CPU below for the baseline.
For most desktop CPUs, it will be more than 7x faster. Also, the AMD Bulldozer and the Nvidia GTX 6xx series took a step backwards when it comes to integer arithmetic throughput (for multiplication). Intel's Sandy Bridge, Ivy Bridge, and Haswell processors are also very good. However, AMD's K10 series is still king for the CPUs, and the GTX Titan is still king for the GPUs.
For me, I am still at the proof of concept stage and it is looking very good.
I found 5 blocks (9-chains) in the last 4 hours while off-loading the Fermat tests to the GPU (a single GTX 580). There is still a lot of work left before a single GTX 580 can aid in finding 3 blocks (10-chains) within 24 hours. The primorial search needs to be off-loaded to a second GPU; it is just as important as sieving for finding 10-chains faster at 320-bit. I would estimate somewhere between 15-20 hours of work, but finding the free time at this time of year is more difficult for me than doing the actual work. Early next year I will have time, and the botnets can enjoy Primecoin mining for a couple of months longer before the knockout punch comes to them.
Baseline GPU: Nvidia GTX 580 (utilizing all 16 processors 512 ALUs)
Baseline CPU: AMD Phenom II X6 1100T (utilizing all 6 cores, SIMD)

Primecoin miner GPU off-load (must follow these rules for the multiprecision arithmetic implementation):
Minimize thread divergence.
Global memory access must be coalesced.
Use shared memory for data exchange.
Precision must be fixed at compile time: e.g. 320-bit.
Use Montgomery Reduction (CIOS).
Must use unrolled PTX code for the Montgomery reduction (madc, mad.cc, addc, add.cc, etc.).
Compile and benchmark with different values for maxrregcount.
Compile and benchmark with different grid, block, and thread organizations.
Compile to .cubin format and profile the code.
Optimize the code using the profiling data.
If the above guidelines are not followed, an x64 CPU will most likely outperform the GPU.