I've implemented this in Python with hashlib as the only dependency, and yes, it does work. I want to translate it to an OpenCL kernel, but I realized there's a core problem to solve first.
The problem is that address generation is inherently sequential, i.e. you can't simply multi-thread it, because each worker needs the previous worker's result before it can start computing. Say you have several workers and a base public key. Worker 1 does public key += base_ec_point, and worker 2 has to wait for that result before it can do its own public key += base_ec_point, and so on. How do you solve this?
The trick is to remember that: pub_key + (N+M)*base_ec_point == pub_key + N*base_ec_point + M*base_ec_point
Parallelism is possible by computing:
- A row of sequential EC points (pub_key + k*base_ec_point) for k=0,1,2,...,N-1
- A column of EC points i*N*base_ec_point for i=0,1,2,...
...and then adding each possible combination of row and column member in parallel.
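The row/column scheme can be sketched in plain Python. This is only an illustration, not the kernel discussed here: it assumes the curve is secp256k1 and that base_ec_point is its standard generator (called G below), and the helper names ec_add, ec_mul and grid are made up for the example. The row covers k = 0..N-1 so that neighbouring blocks do not overlap. Each cell of the grid needs exactly one point addition and depends on no other cell, which is what makes it suitable for one work-item per cell.

```python
# secp256k1 field prime and generator (standard published constants).
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def ec_add(a, b):
    """Add two affine points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    (x1, y1), (x2, y2) = a, b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                      # a == -b
    if a == b:                           # doubling (curve coefficient a = 0)
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P
    else:                                # ordinary addition
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)

def ec_mul(k, pt):
    """Double-and-add scalar multiplication, k >= 0."""
    result = None
    while k:
        if k & 1:
            result = ec_add(result, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return result

def grid(pub_key, n, rows):
    """Return rows x n points; cell (i, k) is pub_key + (i*n + k)*G."""
    # Row: pub_key + k*G for k = 0..n-1 (sequential, but only n steps).
    row, pt = [], pub_key
    for _ in range(n):
        row.append(pt)
        pt = ec_add(pt, G)
    # Column: i*n*G for i = 0..rows-1 (sequential, but only `rows` steps).
    stride = ec_mul(n, G)
    column, pt = [], None
    for _ in range(rows):
        column.append(pt)
        pt = ec_add(pt, stride)
    # Every cell is one independent addition: this part is the parallel work.
    return [[ec_add(c, r) for r in row] for c in column]
```

Flattening grid(pub_key, N, rows) row by row gives the same points as repeatedly adding G to pub_key, but the sequential setup costs only N + rows additions instead of N * rows. Python 3.8 or later is needed for the three-argument pow with a negative exponent.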
The trickiest part of implementing this in OpenCL is the bigint arithmetic. I'm actually testing a kernel right now. It's monstrous, easily dwarfing the various miner kernels in compiled size, and only moderately fast -- on my GTX 285, it can produce ~1.1Mkey/sec.