Arulbero has sent me a new version of his ECC lib which may be (I haven't tested it yet, but I believe his tests/claims) 20% faster.
A little explanation (and update) for the LBC users:
The LBC generator has 3 components:
1) ECC generator --> 2) sha256 + ripemd160 --> 3) bloom search
1) this component computes the public key corresponding to each private key. We have only a CPU version of this library.
2) each public key produces two 160-bit strings (addresses = hash of the public key), one for the uncompressed and one for the compressed form; Rico developed 2 optimized versions of this component, one for CPU and one for GPU. In the GPU case the output of the CPU is sent to the GPU.
3) check whether the addresses generated in step 2) are in the bloom filter (CPU version and GPU version).
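To make the three steps concrete, here is a minimal Python sketch of the same flow. It is not the LBC code: it uses slow textbook affine arithmetic on secp256k1, a plain Python set as a stand-in for the bloom filter, and all function names are made up for illustration. It needs Python 3.8+ for the modular inverse via pow(). If the local OpenSSL build does not offer ripemd160, the sketch falls back to a truncated sha256, which is only a placeholder and does not give real address hashes.

```python
import hashlib

# secp256k1: y^2 = x^3 + 7 over the prime field P, with generator G.
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def ec_add(a, b):
    """Add two affine points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1, -1, P) % P             # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P        # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def public_key(priv):
    """Step 1: private key -> public key (double-and-add, priv * G)."""
    result, addend = None, G
    while priv:
        if priv & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        priv >>= 1
    return result

def serialize(point):
    """Return the (uncompressed, compressed) encodings of a public key."""
    x, y = point
    xb = x.to_bytes(32, "big")
    uncompressed = b"\x04" + xb + y.to_bytes(32, "big")
    compressed = (b"\x03" if y & 1 else b"\x02") + xb
    return uncompressed, compressed

def hash160(data):
    """Step 2: ripemd160(sha256(data)), a 160-bit string."""
    sha = hashlib.sha256(data).digest()
    try:
        return hashlib.new("ripemd160", sha).digest()
    except ValueError:
        # Placeholder only: ripemd160 is unavailable in this OpenSSL build.
        return hashlib.sha256(sha).digest()[:20]

def search(priv, known):
    """Step 3: return the hashes of priv's public key found in `known`
    (a set used here as a stand-in for the bloom filter)."""
    return [h for h in map(hash160, serialize(public_key(priv))) if h in known]
```

For example, search(5, known) returns a non-empty list only if `known` contains the hash of the uncompressed or the compressed public key of private key 5.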
For GPU-limited systems (GPU used at 90% or more) it is crucial to improve component 2).
For CPU-limited systems (example: CPU used at 100% and GPU at 50%) it is instead important to improve the speed of the ECC generator, i.e. the rate at which the CPU feeds the GPU.
The ECC generator is the component I have been working on over the past months.
I'm improving the ECC library as much as I can. At this moment the library is almost 30% faster than the current one. On my Kaby Lake CPU it takes only 1.16 s to generate 16.7 M public keys, against 1.64 s for the latest release, i.e. about 29% less time.
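The two timings can be read in two ways, and a few lines of Python show both. The only inputs are the figures quoted above; the variable names are mine. The time per batch drops by about 29%, which corresponds to roughly 14.4 M keys/s against 10.2 M keys/s, i.e. about 41% more keys per second.

```python
KEYS = 16.7e6                 # public keys generated per run
T_NEW, T_OLD = 1.16, 1.64     # seconds: new library vs latest release

rate_new = KEYS / T_NEW       # keys per second, new library
rate_old = KEYS / T_OLD       # keys per second, latest release

time_saved = 1 - T_NEW / T_OLD        # fraction of run time saved
throughput_gain = T_OLD / T_NEW - 1   # fractional gain in keys per second

print(f"new: {rate_new / 1e6:.1f} M keys/s, old: {rate_old / 1e6:.1f} M keys/s")
print(f"time saved: {time_saved:.1%}, throughput gain: {throughput_gain:.1%}")
```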
For a CPU-limited system, the new library could give a speedup of up to 30% (fully exploiting the GPU's speed).
In that case I think the LBC generator will be faster than oclvanitygen on all GPUs (for now the LBC generator is faster only on mid-range/slow GPUs, not on fast ones).
For a GPU-limited system, the only advantage is that you could use 3 cores to get (almost) the same performance as 4 cores give now.
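A simple way to picture the CPU-limited and GPU-limited cases is a bottleneck model: the whole generator runs at the slower of the two rates, the rate at which the CPU cores feed keys and the rate the GPU can absorb. The sketch below is only that toy model. The units are arbitrary (1.0 = one core with the current library), the GPU capacities are invented for illustration, and the 1.3 factor is the "almost 30% faster" figure taken as a throughput gain.

```python
def pipeline_rate(cores, per_core_rate, gpu_capacity):
    """Overall rate = the slower of CPU feed rate and GPU capacity."""
    return min(cores * per_core_rate, gpu_capacity)

OLD, NEW = 1.0, 1.3   # per-core ECC rate: current library vs new one (assumed)

# CPU-limited system: the GPU could absorb more than 4 cores deliver.
fast_gpu = 6.0        # hypothetical GPU capacity
cpu_limited_old = pipeline_rate(4, OLD, fast_gpu)
cpu_limited_new = pipeline_rate(4, NEW, fast_gpu)

# GPU-limited system: the GPU is already saturated by 4 cores.
slow_gpu = 3.9        # hypothetical GPU capacity
gpu_limited_old = pipeline_rate(4, OLD, slow_gpu)
gpu_limited_new_3_cores = pipeline_rate(3, NEW, slow_gpu)

print(cpu_limited_new / cpu_limited_old)           # gain on the CPU-limited box
print(gpu_limited_new_3_cores / gpu_limited_old)   # 3 new cores vs 4 old cores
```

In this model the CPU-limited system gains the full 30%, while the GPU-limited system gains nothing in speed but reaches the same rate with one core fewer.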
For a system without a GPU there is only a small speedup, because in the CPU generator only about 10% of the CPU work is ECC arithmetic; about 65% goes to the sha256/ripemd160 task and about 25% to the bloom search.
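The size of that "small speedup" follows from Amdahl's law: if only a fraction of the work gets faster, the rest limits the overall gain. A quick estimate in Python, assuming the 10% share above and the ECC speedup implied by the 1.64 s vs 1.16 s timings (function and variable names are mine):

```python
def overall_speedup(fraction, part_speedup):
    """Amdahl's law: speedup of the whole when `fraction` of the work
    becomes `part_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup)

ecc_speedup = 1.64 / 1.16                      # from the measured timings
cpu_only = overall_speedup(0.10, ecc_speedup)  # ECC is ~10% of the CPU work

print(f"CPU-only generator: about {cpu_only - 1:.1%} faster overall")
```

With these numbers the CPU-only generator ends up only about 3% faster, since the remaining 90% of the work (hashing and bloom search) is unchanged.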