@d3m0n1q_733rz, is that different from the existing "atom" asm code in cgminer? Does it need specific CPU support? If so, I'll need to add it as a separate, optional assembly miner.
New Windows build:
http://ck.kolivas.org/apps/cgminer-1.2.4-win32.zip
New source tarball:
http://ck.kolivas.org/apps/cgminer-1.2.4-1.tar.bz2
Both include the new dynamic feature. Disable it for dedicated mining!
Discussed TurdHurdur's other problem (off the forum), and it turns out the kernel file was missing because he had run "make install", which doesn't really work properly unless you run from the directory you installed to. The files should all be together in the same directory.
Actually, the use of movntdqa does make this require SSE4.1 as far as I can tell. There's always the off chance that the movntdqa will be translated to movdqa if SSE4.1 isn't available, but I can't seem to test that beyond leaving out -msse4.1 from the CFLAGS which appears to work.
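Since the question is whether this needs specific CPU support, here is a rough sketch of how a runtime check could gate the SSE4.1 path. It assumes GCC or Clang on x86 and uses their `__builtin_cpu_supports` builtin. The function names and the `sha256_backend` enum are made up for illustration and are not cgminer's actual code.

```c
/* Hypothetical backend identifiers, for illustration only. */
enum sha256_backend {
    BACKEND_FALLBACK = 0, /* any path that does not need SSE4.1 */
    BACKEND_SSE4_64 = 1   /* the movntdqa-based asm path */
};

/* Returns 1 if the running CPU reports SSE4.1, 0 otherwise.
 * On non-x86 targets or other compilers it conservatively returns 0. */
static int cpu_has_sse41(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.1") ? 1 : 0;
#else
    return 0;
#endif
}

/* Pick the SSE4.1 path only when the CPU actually reports the feature. */
static enum sha256_backend pick_sha256_backend(int has_sse41)
{
    return has_sse41 ? BACKEND_SSE4_64 : BACKEND_FALLBACK;
}
```

Calling `pick_sha256_backend(cpu_has_sse41())` once at startup would be enough to decide whether the optional miner can be offered on a given machine.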
As for how it differs from the existing "atom" asm code: I've reordered the instructions to take better advantage of how the hardware prefetch works, and changed the loads from memory into the registers to avoid caching and increase throughput (movntdqa). So the code itself is still pretty much the same, with just a few small optimizations added to take advantage of the hardware's capabilities. I recommend calling it SSE4_64_atom or simply SSE4_64.
Now, I want to point out that there is another (and possibly better) way of coding the math calculations to take advantage of SSE3's (and SSSE3's) horizontal math and do more addition calculations all at once. But I suck at coding myself and really only excel in debugging and modifying at the smaller scale. But you guys are free to figure it out. And the only reason I haven't played around with YMM (256) registers is because only the absolute newest processors seem to support them.
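For anyone who wants to try the horizontal math idea, this is a plain C model of what a single packed horizontal add on 32-bit integers does (the SSSE3 `phaddd` instruction). It is only a scalar sketch to show the lane layout, not a proposed replacement for the asm, and the function name is made up.

```c
#include <stdint.h>

/* Scalar model of one packed horizontal add of 32-bit lanes (phaddd).
 * Adjacent pairs from a fill the low two result lanes, adjacent pairs
 * from b fill the high two. Sums wrap modulo 2^32, like the hardware.
 * out must not alias a or b in this model. */
static void hadd_epi32_model(const uint32_t a[4], const uint32_t b[4],
                             uint32_t out[4])
{
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}
```

The catch is that the instruction adds neighbours within a register, so the data has to be laid out so that the values to be summed sit next to each other before it pays off.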
Try out the modified code and let me know if it works for anyone else besides myself.