What I don't get is why the AMD miner is so much worse than the Nvidia one, and it has problems too.
I can't really judge it (yet), but it's probably badly implemented.
It's largely SPH code... really, really bad. About the same as the original darkcoin-mod.
EDIT: Idea! What's the padding made out of? Maybe I can shortcut the memory usage!
The padding is constructed starting from a copy of the 32 bytes of the previous block hash (hashPrevBlock, as it is called in the code). Those bytes get one bitwise OR applied to them, and then there are tens of thousands of multiplications while we fill the space, moving backwards through the block.
But we basically start with just those 32 bytes, and everything is derived from them.
Padding starts here:
https://github.com/spreadcoin/spreadcoin/blob/master/src/main.cpp#L1511

    while (BlockData.size() % 4 != 0)
        BlockData << uint8_t(7);

    // Fill rest of the buffer to ensure that there is no incentive to mine small blocks without transactions.
    uint32_t *pFillBegin = (uint32_t*)&BlockData[BlockData.size()];
    uint32_t *pFillEnd = (uint32_t*)&BlockData[MAX_BLOCK_SIZE];
    uint32_t *pFillFooter = std::max(pFillBegin, pFillEnd - 8);

    memcpy(pFillFooter, &hashPrevBlock, (pFillEnd - pFillFooter)*4);
    for (uint32_t *pI = pFillFooter; pI < pFillEnd; pI++)
        *pI |= 1;

    for (uint32_t *pI = pFillFooter - 1; pI >= pFillBegin; pI--)
        pI[0] = pI[3]*pI[7];

    BlockData.forsed_resize(MAX_BLOCK_SIZE);
The first thing we do is append a few 0x07 bytes right after the tx section (only if necessary), so that the current size of the block data (header + txs) is exactly divisible by 4.
If the size is already a multiple of 4, no such bytes are added.
Then we define the pointers pFillBegin, pFillEnd and pFillFooter, copy hashPrevBlock to the end (the last 32 bytes) of this MAX_BLOCK_SIZE buffer, and fill all the empty bytes in between, moving backwards 4 bytes per iteration.
Oh, and before we start we also make these 32 bytes (or rather the 8 x 4-byte integers they consist of) odd, by doing the *pI |= 1 operation on each of them, so that they are not divisible by 2 anymore.
Then we just iterate backwards in 4-byte steps, always taking pI[3]*pI[7] and writing the product into pI[0], until we reach the transaction section (or those 0x07 bytes we created earlier, if they were necessary).
That's about it.
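If it helps, here is a standalone sketch of the same steps, working on a plain vector of 32-bit words instead of pointers into BlockData. The names padding_word_count and make_padding are mine, not from the SpreadCoin source. I am also assuming the hash is handed in as 8 already-decoded 32-bit words (the original just memcpy's the raw bytes, so on a little-endian machine that is the same thing) and that the maximum block size is a multiple of 4.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: how many 32-bit padding words fit after the block data,
// once the data has been rounded up to a multiple of 4 with 0x07 bytes.
std::size_t padding_word_count(std::size_t data_bytes, std::size_t max_block_size)
{
    assert(max_block_size % 4 == 0);              // assumption, see above
    std::size_t aligned = (data_bytes + 3) / 4 * 4;
    assert(aligned <= max_block_size);
    return (max_block_size - aligned) / 4;
}

// Hypothetical helper: returns the padding in block order. Index 0 is the word
// right after the (aligned) block data, the last 8 words are the seeded footer.
std::vector<std::uint32_t> make_padding(const std::array<std::uint32_t, 8>& prev_hash_words,
                                        std::size_t n_words)
{
    std::vector<std::uint32_t> pad(n_words, 0);

    // Footer: the last 8 words, or fewer if less than 32 bytes are free.
    std::size_t footer = n_words >= 8 ? n_words - 8 : 0;
    std::copy_n(prev_hash_words.begin(), n_words - footer, pad.begin() + footer);

    // Force every footer word to be odd.
    for (std::size_t i = footer; i < n_words; ++i)
        pad[i] |= 1;

    // Walk backwards from footer - 1 down to 0; each word is the product of the
    // words 3 and 7 positions ahead of it (unsigned, so it wraps modulo 2^32).
    for (std::size_t i = footer; i-- > 0; )
        pad[i] = pad[i + 3] * pad[i + 7];

    return pad;
}
```

To use it you would call padding_word_count(header_plus_tx_bytes, MAX_BLOCK_SIZE) and pass the result to make_padding together with the hash words. One thing the sketch makes easy to check: since every footer word is odd and odd times odd stays odd even after wrapping, every single padding word comes out odd, so the products never collapse to zero. I assume that is the point of the |= 1 step, but that is my reading, not something the source comments on.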
So this large padding section doesn't have any regularity or repetition, if you were expecting that.
It's pretty messy & chaotic data.
Mr. Spread really wants your GPU to double-SHA256 a 200 KB data structure. All the time.