You'd probably need to perform the string mangling on the GPU to keep the cores busy as much as possible, because transferring each and every candidate phrase from CPU/system memory to the GPU could end up being a severe bottleneck. Perhaps a deliberately slower algorithm like warp would work better, since the GPU would spend most of its time computing rather than transferring to/from the host... but how common are non-SHA256 brainwallets? As you state, any passphrase-based search (which is essentially a set of random keys) will be slower than a sequential search.
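As a rough sketch of that idea (the kernel and its parameters are hypothetical, not taken from any existing cracker): derive every candidate on-device from a small base phrase plus a thread-unique counter, so individual candidate strings never cross the bus at all.

```cuda
// Hypothetical sketch: each thread builds its candidate in place from a base
// phrase plus a numeric suffix; only the short base phrase is ever uploaded.
__global__ void mangleCandidates(const char *base, int baseLen,
                                 unsigned long long start, char *out) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long idx = start + tid;
    char *cand = out + (unsigned long long)tid * 64;     // 64 bytes per candidate

    for (int i = 0; i < baseLen; i++) cand[i] = base[i]; // copy the base phrase

    // append idx in decimal, e.g. "correct horse" -> "correct horse1234"
    char digits[20]; int n = 0, pos = baseLen;
    do { digits[n++] = '0' + (int)(idx % 10); idx /= 10; } while (idx);
    while (n) cand[pos++] = digits[--n];
    cand[pos] = '\0';
    // real code would hash cand (e.g. SHA-256 -> hash160) right here, in-kernel.
}
```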
I also thought so for a long time.
Not really, as long as the per-thread stack size does not exceed 49152:
size_t stackSize = 49152;
cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, stackSize);
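For reference, a minimal self-contained sketch (CUDA runtime API only; the 49152-byte figure is per thread) that reads the default limit and checks that raising it succeeded:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default per-thread stack: %zu bytes\n", stackSize);   // typically 1024

    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 49152);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```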
..For VanitySearch, a smaller group size is better (this is one of the reasons I worked so much on the DRS62 ModInv implementation). I can double the group size (and I definitely will), but not more: the GPU kernel handles one group per thread and sends the hash160 results back to the CPU, so if the group size is too large, memory transfer and allocation become a problem. Divide and conquer ;)
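A rough sketch of that group-per-thread layout (names and sizes are hypothetical, not VanitySearch's actual source): each thread walks its own group of keys and writes back nothing but 20-byte hash160 results.

```cuda
#include <cstdint>

#define GRP_SIZE 1024   // keys per thread; doubling it doubles the result buffer

// Hypothetical kernel: one group per thread, only hash160 values go back to the host.
__global__ void computeGroup(const uint64_t *startKeys, uint8_t *hash160Out) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    uint8_t *out = hash160Out + (size_t)tid * GRP_SIZE * 20;  // 20 bytes per hash160

    for (int i = 0; i < GRP_SIZE; i++) {
        uint64_t k = startKeys[tid] + i;          // next key in this thread's group
        // real code: EC point addition, SHA-256, RIPEMD-160 -- stubbed out here
        for (int b = 0; b < 20; b++)
            out[i * 20 + b] = (uint8_t)(k >> ((b % 8) * 8));
    }
}
```

With, say, 4096 threads and GRP_SIZE = 1024, the result buffer is already 4096 × 1024 × 20 ≈ 80 MiB per launch, which is why pushing the group size much higher turns allocation and device-to-host transfer into the bottleneck.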
VanitySearch restarts the kernel about 1000 times per second (!!!), and it works fine.
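A kernel launch costs only a few microseconds on current hardware, so a host loop like the following (reusing the hypothetical computeGroup sketch above; the helper functions are placeholders) can relaunch ~1000 times per second with negligible overhead:

```cuda
// Hypothetical host loop: relaunch, pull results, scan on the CPU, repeat.
// d_startKeys, d_hash160, h_hash160, resultBytes are set up beforehand.
while (running) {
    computeGroup<<<nThreads / 256, 256>>>(d_startKeys, d_hash160);
    cudaMemcpy(h_hash160, d_hash160, resultBytes,
               cudaMemcpyDeviceToHost);           // implicitly syncs with the kernel
    checkForMatches(h_hash160, resultBytes);      // placeholder: CPU-side lookup
    advanceStartKeys(d_startKeys);                // placeholder: step to next groups
}
```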
For OpenCL, I don't know.
..and this is one of the main technical reasons why GPU BFs do not exist.
A CPU BF can check huge dictionaries in a few hours; loading such volumes into the GPU is problematic.
A GPU BF should therefore work from a built-in generator, e.g. by brute-forcing seeds.
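A sketch of that built-in-generator idea (the mixer choice is mine, purely illustrative): expand a plain 64-bit counter into seed material entirely on-device, so the only thing the host ever sends is the starting counter.

```cuda
#include <cstdint>

// splitmix64 finalizer: cheap way to spread a sequential counter over 64 bits
__device__ uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Hypothetical kernel: every thread derives its own candidate seed from an index.
__global__ void bruteForceSeeds(uint64_t start, uint64_t *seedOut) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    seedOut[tid] = mix64(start + tid);
    // real code would feed the seed into key derivation + hash160 in-kernel,
    // writing back only matches instead of every seed.
}
```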