See my git, same permutation (starting with bmw 80).
Before:
[2017-01-26 19:16:55] CPU #1: 1.26 kH/s
[2017-01-26 19:16:55] CPU #2: 1.33 kH/s
[2017-01-26 19:16:55] CPU #0: 1.27 kH/s
[2017-01-26 19:17:28] timetravel block 387151, diff 0.419
Now:
[2017-01-26 19:20:33] CPU #2: 78.91 kH/s
[2017-01-26 19:20:33] CPU #3: 76.86 kH/s
[2017-01-26 19:20:33] CPU #1: 74.50 kH/s
[2017-01-26 19:20:36] accepted: 2/2 (diff 0.007), 308.84 kH/s yes!
With a few lines changed, on Linux with an i5 4440 (with bmw512 as the first algo), that's 2x the speed of bitbandi's miner.
Nice find. Calculating the next permutation on every hash is redundant; it only needs to be done when new work is received.
It could probably be moved up another level or two. Does it have to be thread specific? It seems each thread will
calculate the same chain. Maybe the stratum thread can do it when new work is received. I will follow up.
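A minimal sketch of the caching idea, not the actual cpuminer code: keep the derived function order alongside the timestamp it was computed for, and only recompute when the timestamp changes. All names here (get_hash_order, s_ntime, s_order) are hypothetical, the cache is per-thread only as one option (the stratum thread could own it instead), and the placeholder shuffle stands in for the real timetravel derivation from ntime.

#include <stdint.h>
#include <stdio.h>

#define TT_FUNC_COUNT 8   /* timetravel chains 8 hash functions */

/* Placeholder for the real derivation of the function order from ntime;
   a stand-in shuffle so the example runs, the real algo differs. */
static void get_hash_order( uint32_t ntime, uint8_t order[TT_FUNC_COUNT] )
{
   uint32_t seed = ntime;
   for ( int i = 0; i < TT_FUNC_COUNT; i++ ) order[i] = (uint8_t)i;
   for ( int i = TT_FUNC_COUNT - 1; i > 0; i-- )
   {
      seed = seed * 1103515245u + 12345u;
      int j = (int)( seed % (uint32_t)( i + 1 ) );
      uint8_t t = order[i]; order[i] = order[j]; order[j] = t;
   }
}

/* Per-thread cache: recompute only when new work (new ntime) arrives,
   instead of once per nonce in the scan loop. */
static __thread uint32_t s_ntime = 0xffffffffu;
static __thread uint8_t  s_order[TT_FUNC_COUNT];

static const uint8_t* hash_order_for_work( uint32_t ntime )
{
   if ( ntime != s_ntime )
   {
      get_hash_order( ntime, s_order );
      s_ntime = ntime;
   }
   return s_order;   /* the scan loop reuses this for every nonce */
}

int main()
{
   /* second call with the same ntime hits the cache, no recompute */
   const uint8_t *o1 = hash_order_for_work( 1485454655u );
   const uint8_t *o2 = hash_order_for_work( 1485454655u );
   for ( int i = 0; i < TT_FUNC_COUNT; i++ ) printf( "%d ", o1[i] );
   printf( "\ncached: %s\n", ( o1 == o2 && s_ntime == 1485454655u ) ? "yes" : "no" );
   return 0;
}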
The hashrate is now about what I expect from an unoptimized 8-function chain, i.e. faster than x11.
And it's easy to see how all that fixed overhead would overwhelm the rest of the algo. It's starting
to make sense.
I'll now try the drop-in opts to see if they now behave as expected.
On a related note, this is my favorite swap routine. It doesn't need a temp, and being a macro it works with
almost any integer type and can modify both args without pointers.
#define swap_vars(a,b) do { a ^= b; b ^= a; a ^= b; } while (0)
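A hypothetical usage example, with the usual XOR-swap caveats: it only works on integer types, and if both args refer to the same object the value gets zeroed, so guard against aliasing where that can happen.

#include <stdint.h>
#include <stdio.h>

/* same macro as above, wrapped so it acts as a single statement */
#define swap_vars(a,b) do { a ^= b; b ^= a; a ^= b; } while (0)

int main()
{
   uint32_t x = 0x1234, y = 0xabcd;
   swap_vars( x, y );
   printf( "x=%x y=%x\n", x, y );   /* prints x=abcd y=1234 */
   return 0;
}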