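// Initial SHA512 over the input range; pblank is a dummy pointer for an empty range (length 0).
SHA512((pbegin == pend ? pblank : (unsigned char*)&pbegin[0]), (pend - pbegin) * sizeof(pbegin[0]), (unsigned char*)&csoh);
// 12 outer rounds, each starting with one scrypt pass at the larger parameter (12, 1, 1).
for (int i = 0; i < 12; i++) {
    scrypt(csoh, HASHLENGTH, csoh, HASHLENGTH, 12, 1, 1, (unsigned char*)&csoh, HASHLENGTH);
    // 10 inner rounds: scrypt at the smaller parameter (8, 1, 1), then SHA512.
    for (int j = 0; j < 10; j++) {
        scrypt(csoh, HASHLENGTH, csoh, HASHLENGTH, 8, 1, 1, (unsigned char*)&csoh, HASHLENGTH);
        SHA512(csoh, HASHLENGTH, (unsigned char*)&csoh);
    }
}
// One final scrypt pass with parameters (10, 8, 1).
scrypt(csoh, HASHLENGTH, csoh, HASHLENGTH, 10, 8, 1, (unsigned char*)&csoh, HASHLENGTH);
// Compress the result to a 256-bit hash.
uint256 hash2;
SHA256((unsigned char*)&csoh, sizeof(csoh), (unsigned char*)&hash2);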
This should still be easy to implement on a GPU; you would just need to test and tune the load balancing. With the TMTO (time-memory trade-off) trick, you can reduce the memory consumption of the larger-memory hashes at the cost of taking longer to compute them. So you run a large number of those threads alongside a small number of threads doing the small-memory hashes, and you should still get a pretty solid GPU advantage.
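To make the TMTO point concrete, here is a rough sketch of the idea on a toy ROMix-style loop. Everything in it is made up for illustration: toy_mix is a non-cryptographic stand-in for scrypt's real block mixing, and romix_full / romix_tmto are hypothetical names, not functions from any scrypt library. The second version keeps only every stride-th table entry and recomputes the missing ones on demand, so it uses roughly 1/stride of the memory and does more mixing work, while producing the same output.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for scrypt's block mixing step. NOT cryptographic;
// it only needs to be deterministic for this illustration.
static uint64_t toy_mix(uint64_t x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// Full-memory version: stores all n intermediate values.
uint64_t romix_full(uint64_t x, std::size_t n) {
    std::vector<uint64_t> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = x;
        x = toy_mix(x);
    }
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t j = x % n;   // data-dependent table lookup
        x = toy_mix(x ^ v[j]);
    }
    return x;
}

// TMTO version: stores only every stride-th value (about n/stride entries)
// and recomputes the skipped ones when they are looked up.
uint64_t romix_tmto(uint64_t x, std::size_t n, std::size_t stride) {
    std::vector<uint64_t> v((n + stride - 1) / stride);
    for (std::size_t i = 0; i < n; ++i) {
        if (i % stride == 0) v[i / stride] = x;
        x = toy_mix(x);
    }
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t j = x % n;
        uint64_t vj = v[j / stride];          // nearest stored checkpoint
        for (std::size_t k = 0; k < j % stride; ++k) {
            vj = toy_mix(vj);                 // walk forward to entry j
        }
        x = toy_mix(x ^ vj);
    }
    return x;
}
```

With stride = 1 the two functions do exactly the same work. Larger strides trade memory for extra toy_mix calls per lookup, which is the knob you would tune when balancing the large-memory and small-memory threads on the GPU.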