I promised a bounty for the Nvidia miner even though all my Nvidia cards are in the cupboard.
tsiv, post a BTC donation address so I can send my share (0.2 BTC as listed).
Any chance somebody has tried the GTX 580 or GTX 680? I have 2 of each, which I might slap on a system if worthwhile.
I added some donation addresses to the project README on GitHub for the wallets I have atm. Copy & paste:
BTC: 1JHDKp59t1RhHFXsTw2UQpR3F9BBz3R3cs
DRK: XrHp267JNTVdw5P3dsBpqYfgTpWnzoESPQ
JPC: Jb9hFeBgakCXvM5u27rTZoYR9j13JGmuc2
VTC: VwYsZFPb6KMeWuP4voiS9H1kqxcU9kGbsw
XMR: 42uasNqYPnSaG3TwRtTeVbQ4aRY3n9jY6VXX3mfgerWt4ohDQLVaBPv3cYGKDXasTUVuLvhxetcuS16ynt85czQ48mbSrWX
I'm offering the 150 xmr pledge by Keyboard-Mash in the OP.
Overall I'm very satisfied with the program and would gladly release a partial or full bounty, pending a quick answer on why I had to edit the registry to get the program to operate. Equipoise, I will send you a few XMR outside of the bounty. Thanks a lot for providing it! I would like to hear some more from tsiv.
It is actually explained on the project's front page on GitHub, I believe.
The initial release had the entire algorithm stuffed into a single huge CUDA kernel. Doing the whole slow algorithm in one go tended to take just a bit over 2 seconds per kernel launch, and 2 seconds is the timeout after which Windows gets impatient and goes "hmmh, I haven't heard from the GPU in 2 seconds. Must've crashed, better reset the driver." The registry tweak works around the problem by increasing the time that Windows allows the GPU to be "unresponsive", aka stuck running a CUDA kernel.
This has been addressed in later releases, mainly by splitting the single huge kernel into smaller pieces and making parts of the hash faster. The slowest part is still quite slow, taking roughly 1.4 seconds with launch config 8x60 on a 750 Ti, but it should stay well within the default 2 second window.
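To put rough numbers on the timeout explanation above, here is a small Python sketch (not part of the miner). The 2 second limit and the 1.4 second figure come from the post; the 2.1 second value is only an illustrative stand-in for "a bit over 2 seconds", and the function names are made up for this example.

```python
WINDOWS_GPU_TIMEOUT_S = 2.0  # default window mentioned in the post


def fits_timeout(kernel_seconds, timeout_s=WINDOWS_GPU_TIMEOUT_S):
    """True if one kernel launch finishes before Windows would reset the driver."""
    return kernel_seconds < timeout_s


def headroom(kernel_seconds, timeout_s=WINDOWS_GPU_TIMEOUT_S):
    """Seconds left before the timeout; negative means the launch runs too long."""
    return timeout_s - kernel_seconds


single_kernel_s = 2.1         # assumed stand-in for "a bit over 2 seconds"
slowest_split_kernel_s = 1.4  # 8x60 on a 750 Ti, per the post

print(fits_timeout(single_kernel_s))         # False: driver reset without the registry tweak
print(fits_timeout(slowest_split_kernel_s))  # True: fits the default window
```

Raising the timeout via the registry changes `timeout_s`; splitting the kernel changes `kernel_seconds`. Either one makes the check pass.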
There is something more to be done about the -l MxN.
About the first number M:
"First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them."
About the second number N:
You can find it by gradually increasing it until your card stops working (it then shows an impossible hash rate such as 3474958.52 H/s). After that a restart is needed for maximum performance (but not for testing), because without a restart my hash rate falls to about half of what the same options gave before the crash.
The "magical numbers" for the 650M seem to be -l 128x5
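The two rules of thumb above can be sketched in Python. This is only an illustration of the procedure, not something the miner provides: `measure` is a hypothetical callback that would run the miner with -l MxN and return H/s, `sane_limit_hs` is an assumed cutoff for "impossible" rates, and `fake_measure` returns made-up numbers (apart from the bogus 3474958.52 H/s reading quoted in the post).

```python
WARP_SIZE = 32  # threads per warp, as in the quote above


def issued_threads(block_size, warp_size=WARP_SIZE):
    """Threads the GPU actually schedules for one block (whole warps only)."""
    warps = -(-block_size // warp_size)  # ceiling division
    return warps * warp_size


def last_working_n(measure, m, n_max=64, sane_limit_hs=100_000.0):
    """Raise N step by step and return the last N that gave a believable rate.

    measure(m, n) is a hypothetical callback returning H/s for -l MxN.
    """
    last_good = None
    for n in range(1, n_max + 1):
        rate = measure(m, n)
        if rate <= 0 or rate > sane_limit_hs:
            break  # card has stopped working; stop probing
        last_good = n
    return last_good


def fake_measure(m, n):
    """Made-up stand-in for a real benchmark run: breaks above N = 5."""
    return 10.0 * n if n <= 5 else 3474958.52


print(issued_threads(50))                 # 64, so 14 threads are wasted
print(last_working_n(fake_measure, 128))  # 5
```

On real hardware the loop cannot simply carry on after the bogus reading, since the post notes that a restart is needed before the card performs normally again.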
I realize the 8x60 or 8x40 make absolutely no sense, they're something I ran into while trying out different values. The reasonable values would be based on the number of SMM/SMX on the GPU and 32 or 64 threads per block would make a lot of sense. I can't tell exactly why performance takes a dive if you try 64x5 for example, it should be a very good value to start at. Might have something to do with the huge amount of random global memory access in the second major loop of the algo, trying to do more work in parallel bottlenecks at the memory access?
Good news is that I've since modified 2 of the 3 main loops to use 8 parallel threads per hash as opposed to the original 1 thread per hash. So essentially 8x60 leads to running 64 threads per block for those two loops. Still working on the last loop; it does seem a fair bit harder to make it more parallel.
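The arithmetic behind "8x60 leads to 64 threads per block" can be written out as a short Python sketch. It assumes, as the posts above suggest, that the first number in -l MxN is the per-block count and the second is the number of blocks; the function names are made up for illustration.

```python
def parse_launch(config):
    """Split a launch config string like '8x60' into (M, N)."""
    m, n = config.lower().split("x")
    return int(m), int(n)


def threads_per_block(config, threads_per_hash=8):
    """Actual CUDA threads per block when each hash uses several threads."""
    m, _ = parse_launch(config)
    return m * threads_per_hash


def hashes_in_flight(config):
    """Hashes worked on at once, assuming M hashes per block and N blocks."""
    m, n = parse_launch(config)
    return m * n


print(threads_per_block("8x60"))     # 64, two full warps
print(threads_per_block("8x60", 1))  # 8, the original 1 thread per hash
print(hashes_in_flight("8x60"))      # 480
```

Seen this way, 8x60 with 8 threads per hash lands on a multiple of 32 for the two reworked loops, which may be part of why the odd-looking value works.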