What is the performance cost of emulating 64-bit as 32-bit?
Does it double the cost? For example, does a 64-bit word emulated with 32-bit words use 100% of the GPU, while native 32-bit words use 50%?
Ok, let me elaborate on this a little and give you some numbers for a better estimate of where we are and where we're going:
In my CPU/GPU combination, one CPU core puts an 8% load on the GPU, and that is a situation where a fairly strong CPU meets a midrange GPU (a 2.8-3.7 GHz Skylake E3 Xeon driving a Quadro M2000M - see
http://www.notebookcheck.net/NVIDIA-Quadro-M2000M.151581.0.html). With a stronger GPU (a 1080), it is quite possible that one CPU core can put only a 5-6% load on the GPU.
The current development version of the generator gives me 9 Mkeys/s with all 4 physical cores running, whereas the published version (the one you can download from FTP) gives 7.5 Mkeys/s.
The main difference is that the bloom filter search is done on the GPU in the development version, and the final affine->normalization->64 bytes step has also been moved to the GPU, resulting in an overall speed improvement of about 375,000 keys/s per core.
Up to now, the GPU has behaved like a "magic wand": giving it the bloom filter work didn't raise the GPU load, but it did raise the key rate. The explanation is that the time the GPU needs to do the bloom filter search is basically the time it would otherwise need to transfer the hashed data back to the CPU (which does the bloom filter search in the current public version). The same goes for the affine transformation.
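To make that concrete, here is a minimal sketch of the kind of membership test such a GPU-side bloom filter needs. This is not the generator's actual code - the filter layout, the probe derivation and all the names are made up for illustration:

```cuda
// Hypothetical sketch (not LBC's code): a bloom filter membership test
// run directly on the GPU, so the hash160 results never have to be
// copied back to the host just to be checked.
#include <stdint.h>

#define BLOOM_BITS  (512ULL * 1024 * 1024 * 8)   // 512 MB filter, as mentioned above
#define NUM_HASHES  4                            // number of probe positions (assumed)

__device__ bool bloom_maybe_contains(const uint32_t *bloom,  // filter bits in device memory
                                     const uint32_t *h160)   // 20-byte hash as 5 x 32-bit words
{
    for (int i = 0; i < NUM_HASHES; ++i) {
        // Derive a probe position from the hash itself (placeholder mixing).
        uint64_t pos = ((uint64_t)h160[i % 5] * 0x9E3779B1u + i) % BLOOM_BITS;
        if ((bloom[pos >> 5] & (1u << (pos & 31))) == 0)
            return false;                        // definitely not in the set
    }
    return true;                                 // possibly in the set -> verify on the CPU
}

__global__ void check_hashes(const uint32_t *bloom, const uint32_t *h160s,
                             uint8_t *hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        hits[i] = bloom_maybe_contains(bloom, &h160s[i * 5]);
}
```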
There is nothing left on the CPU except the (heavily optimized) EC computations, so any further speed improvement has to push those to the GPU as well.
In terms of time, one 16M block currently takes around 6.25 seconds on my machine (if I let it compute 8 blocks, to mitigate the startup cost, it takes 50 seconds).
So I thought I'd emulate what's going on on the CPU and move the code over piece by piece. Going backwards, the step before the affine transformation is the Jacobian->affine conversion, where you compute the square and the cube of the Jacobian Z coordinate and multiply X by the former and Y by the latter. All in all, one field element (FE) squaring and 3 FE multiplications.
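In code, that per-point step looks roughly like the sketch below. The fe type and the fe_sqr/fe_mul helpers are placeholders for a real 256-bit field implementation, and in the usual formulation z here is the already-inverted Jacobian Z coordinate (so x = X*z^2, y = Y*z^3):

```cuda
// Sketch of the per-point Jacobian->affine step described above:
// one field squaring and three field multiplications.
#include <stdint.h>

typedef struct { uint32_t limb[8]; } fe;                  // 256-bit field element, 8 x 32-bit limbs

__device__ void fe_sqr(fe *r, const fe *a);               // r = a^2 mod p (placeholder)
__device__ void fe_mul(fe *r, const fe *a, const fe *b);  // r = a*b mod p (placeholder)

__device__ void jacobian_to_affine(fe *x, fe *y, const fe *z)
{
    fe z2, z3;
    fe_sqr(&z2, z);          // z^2           (1 sqr)
    fe_mul(&z3, &z2, z);     // z^3 = z^2 * z (1 mul)
    fe_mul(x, x, &z2);       // X *= z^2      (1 mul)
    fe_mul(y, y, &z3);       // Y *= z^3      (1 mul)
}
```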
I did that with my 128-bit library (based on 64-bit data types) on the GPU and behold! The GPU load went to 100% and the time per block went to 16 seconds! Ugh. Operation successful, patient dead.
-> Back to the drawing board.
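That result is roughly what you would expect if the chip has to emulate 64-bit integer arithmetic from 32-bit instructions. A generic illustration (not code from my 128-bit library) of what that emulation costs:

```cuda
// Generic illustration: a 64-bit value represented as two 32-bit limbs.
// An add costs two 32-bit adds plus a carry; a full multiply costs four
// 32-bit multiplies plus adds, which is why "emulated 64-bit" is
// considerably more than 2x the work of native 32-bit.
#include <stdint.h>

typedef struct { uint32_t lo, hi; } u64emu;

__device__ u64emu add64(u64emu a, u64emu b)
{
    u64emu r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);          // carry from the low limb
    return r;
}

__device__ u64emu mul64(u64emu a, u64emu b)      // low 64 bits of the product
{
    u64emu r;
    r.lo = a.lo * b.lo;
    r.hi = __umulhi(a.lo, b.lo)                  // high half of lo*lo
         + a.lo * b.hi + a.hi * b.lo;            // cross terms (low halves only)
    return r;
}
```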
The same thing with 32-bit data types currently gives 12% GPU load and 5.4 seconds per block (per CPU core). Very promising, but I'm stuck in little/big-endian brain-warp hell, so I have to figure out how to do it more elegantly.
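The endianness pain boils down to repacking big-endian byte strings into 32-bit limbs and back. A hypothetical helper (names and layout are made up) to show what I mean:

```cuda
// A 256-bit value serialized as 32 big-endian bytes is repacked into
// eight 32-bit limbs (least significant limb first) so that 32-bit
// arithmetic on the GPU sees it in its natural order.
#include <stdint.h>

__device__ void be_bytes_to_limbs(uint32_t limbs[8], const uint8_t be[32])
{
    for (int i = 0; i < 8; ++i) {
        const uint8_t *p = be + (7 - i) * 4;     // walk the big-endian words backwards
        limbs[i] = ((uint32_t)p[0] << 24) |
                   ((uint32_t)p[1] << 16) |
                   ((uint32_t)p[2] <<  8) |
                    (uint32_t)p[3];
    }
}
```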
Also, the new version demands more from the GPU, which is something I have to sort out before I can release it. Since the bloom filter search is done on the GPU, an additional 512 MB of GPU memory is used per process. Running 4 processes on my Maxwell GPU with its 4 GB of VRAM is just fine (and since that memory can be freed from the CPU part of the generator, it takes only 100 MB of host memory), but I have also seen segmentation faults on the Kepler machines in the Amazon cloud.
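The memory pattern is roughly the following (shown with the CUDA runtime API purely for illustration - the actual generator may use a different GPU API, and the function here is hypothetical):

```cuda
// Generic sketch of the memory handling described above: the 512 MB
// bloom filter is copied to device memory once, after which the host
// copy can be freed, leaving only a small host footprint per process.
#include <stdint.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLOOM_BYTES (512ULL * 1024 * 1024)

uint32_t *upload_bloom(uint32_t *host_bloom)      // host_bloom: 512 MB, loaded from disk
{
    uint32_t *dev_bloom = NULL;
    cudaMalloc((void **)&dev_bloom, BLOOM_BYTES); // 512 MB of GPU memory per process
    cudaMemcpy(dev_bloom, host_bloom, BLOOM_BYTES, cudaMemcpyHostToDevice);
    free(host_bloom);                             // host copy no longer needed
    return dev_bloom;
}
```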
So the goal really is for one CPU core to be able to put at least a 50% load on one GPU.
It's no small engineering feat, but at the moment LBC is the fastest key generator on the planet (some 20% faster than oclvanitygen), and I believe twice the speed of oclvanitygen is achievable. That's my goal and motivation, and I still have 65% of my GPU capacity left untapped to get there.
And am I wrong in assuming that even 32-bit is emulated, specifically on the Pascal/Maxwell chips? I read the white paper and it says they do half-width integers as well.
I'm not familiar in detail with the specific hardware internals. At the moment I have a Maxwell chip for my testing, and I tend to support newer architectures/chip families rather than the old stuff. To put it another way: I will not sacrifice any speed to support some "old" chip from 2009. ;-)
Sidenote:
If anyone wants to be at the true forefront of development
and have a great workstation-replacement notebook, buy a Lenovo P50 (maybe a P51 to be slightly ahead), because that's what I am developing on, and LBC will naturally be slightly tailored to it. For example, it also has an Intel GPU, which I use for the display. So currently I can work on the notebook basically without any limitations: the Intel graphics are untouched, and since I have 4 logical cores left for my own interaction, I can watch videos, browse, etc., while the notebook is churning out 9 Mkeys/s. Ok, the fan noise is distracting, because normally the notebook gets by with passive cooling.
Rico