We could change the randomize-input routine to take the multiplier as its last argument. That would be a trivial change, but it would allow us to compute a "midstate" that is retained across all iterations. Each iteration would then only have to finish the "remainder" of the MD5 hash calculation, starting from the midstate, to account for its unique multiplier.
I think that should help, also because we would need fresh random ints far less frequently, and I'm sure the algorithm I have there is slow. The built-in random function is worthless anyway.
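As a rough sketch of the midstate idea, assuming an OpenSSL-style MD5 API (the function name hash_with_midstate and the argument layout are illustrative, not the actual randomize-input routine):

#include <stddef.h>
#include <stdint.h>
#include <openssl/md5.h>

void hash_with_midstate(const uint8_t *fixed_input, size_t fixed_len,
                        uint32_t first_mult, uint32_t iterations,
                        unsigned char digest[MD5_DIGEST_LENGTH])
{
	MD5_CTX midstate;
	MD5_Init(&midstate);
	MD5_Update(&midstate, fixed_input, fixed_len);   /* absorb the fixed part once */

	for (uint32_t i = 0; i < iterations; i++) {
		uint32_t mult = first_mult + i;
		MD5_CTX ctx = midstate;                  /* reuse the retained midstate */
		MD5_Update(&ctx, &mult, sizeof(mult));   /* finish only the "remainder" */
		MD5_Final(digest, &ctx);
		/* ... evaluate the work package with this digest ... */
	}
}

The point is simply that everything except the multiplier is absorbed once, so each iteration only pays for the final block of the hash.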
All of the following can be found in this branch: https://github.com/OrdinaryDude/xel_miner/tree/fast-optimization.
xel_miner.c has a few TODO/FIXME markers where I changed the logic so that no PoW or bounties are submitted at all (they are found too quickly). Note: the meanings of rc=1 and rc=2 have been swapped.
Testing can be done with:
./xel_miner -k 19fafc1fa028af61d4bb603e1f9f06eca0bea765114115ad503b532588fbc83d --test-miner work.json --threads 1
(Note: it does not do anything useful yet ... it's just hacky, with a statically compiled work package ... no on-the-fly compiling yet.)
I have managed to boost the speed a lot with the linked library approach:
[09:49:30] DEBUG: Running ElasticPL Parser
[09:49:35] CPU0: 784.88 kEval/s
We are at almost 800,000 evaluations per second.
All of that on one thread, WITH pseudo-random ints and the PoW SHA256 check ... two or more threads slow everything down!
Optimizations: Most importantly, as the profiler showed me, gen_rand_32 was the no. 1 bottleneck! It now just increments the "multiplicator" instead of filling the last two ints with random input on every iteration; the random fill happens only once now, like this:
static int scanhash(int thr_id, struct work *work, long *hashes_done) {
	...
	mult32[6] = genrand_int32();
	mult32[7] = genrand_int32();

	while (1) {
		// Check If New Work Is Available
		if (work_restart[thr_id].restart) {
			applog(LOG_DEBUG, "CPU%d: New work detected", thr_id);
			return 0;
		}

		// Increment mult32
		mult32[7] = mult32[7] + 1;
		if (mult32[7] == INT32_MAX) {
			mult32[7] = 0;
			mult32[6] = mult32[6] + 1;
		}
		...
...then...
C Flags:
-Ofast -msse -msse2 -msse3 -mmmx -m3dnow -fext-numeric-literals
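For context, where the linked-library approach is heading: the generated work-package C gets compiled with flags like the above into native code instead of being interpreted, and once on-the-fly compiling exists, its entry points would be resolved at runtime. The following is only a hedged sketch of that idea; the file name work.so and the dlopen wiring are assumptions on my part, since the branch currently links the work package statically:

#include <stdio.h>
#include <dlfcn.h>   /* link with -ldl; build the package e.g. with the flags above plus -shared -fPIC */

int main(void)
{
	void *handle = dlopen("./work.so", RTLD_NOW);
	if (!handle) {
		fprintf(stderr, "dlopen failed: %s\n", dlerror());
		return 1;
	}

	/* Resolve the two entry points the work package exposes. */
	int (*fill_ints_fn)(int *) = (int (*)(int *))dlsym(handle, "fill_ints");
	int (*execute_fn)(void)    = (int (*)(void))dlsym(handle, "execute");
	if (!fill_ints_fn || !execute_fn) {
		fprintf(stderr, "missing symbol: %s\n", dlerror());
		return 1;
	}

	int input[12] = {0};
	fill_ints_fn(input);
	printf("execute() returned %d\n", execute_fn());

	dlclose(handle);
	return 0;
}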
Mangle State now works on 32-bit integers only, avoiding the costly 64-bit rotation:
uint32_t vm_state1 = 0;
uint32_t vm_state2 = 0;
uint32_t vm_state3 = 0;
uint32_t vm_state4 = 0;

static const unsigned int mask32 = (CHAR_BIT*sizeof(uint32_t)-1);

static inline uint32_t rotl32 (uint32_t x, unsigned int n)
{
	n &= mask32;  // avoid undef behaviour with NDEBUG. 0 overhead for most types / compilers
	return (x<<n) | (x>>( (-n)&mask32 ));
}

static inline uint32_t rotr32 (uint32_t x, unsigned int n)
{
	n &= mask32;  // avoid undef behaviour with NDEBUG. 0 overhead for most types / compilers
	return (x>>n) | (x<<( (-n)&mask32 ));
}
static int m(int x) {
	int mod = x % 64;
	int leaf = mod % 4;
	if (leaf == 0) {
		vm_state1 = rotl32(vm_state1, mod);
		vm_state1 = vm_state1 ^ x;
	}
	else if (leaf == 1) {
		vm_state2 = rotl32(vm_state2, mod);
		vm_state2 = vm_state2 ^ x;
	}
	else if (leaf == 2) {
		vm_state3 = rotl32(vm_state3, mod);
		vm_state3 = vm_state3 ^ x;
	}
	else {
		vm_state4 = rotr32(vm_state4, mod);
		vm_state4 = vm_state4 ^ x;
	}
	return x;
}
fill_ints uses memset to clear the memory:
int fill_ints(int input[]){
	memset((char*)mem, 0, 64000*sizeof(char));  /* clears the first 64000 bytes of mem */
	for(int i=0;i<12;++i)
		mem[i] = input[i];
	vm_state1=0;
	vm_state2=0;
}
The complete "linked" program looks like this. fill_ints is called from xel_miner.c, and the vm_state variables are accessed directly to verify the PoW; the same goes for mem[] in the case of a bounty.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>
#include "miner.h"
int32_t mem[64000];
uint32_t vm_state1 = 0;
uint32_t vm_state2 = 0;
uint32_t vm_state3 = 0;
uint32_t vm_state4 = 0;
static const unsigned int mask32 = (CHAR_BIT*sizeof(uint32_t)-1);
static inline uint32_t rotl32 (uint32_t x, unsigned int n)
{
	n &= mask32;  // avoid undef behaviour with NDEBUG. 0 overhead for most types / compilers
	return (x<<n) | (x>>( (-n)&mask32 ));
}

static inline uint32_t rotr32 (uint32_t x, unsigned int n)
{
	n &= mask32;  // avoid undef behaviour with NDEBUG. 0 overhead for most types / compilers
	return (x>>n) | (x<<( (-n)&mask32 ));
}
static int m(int x) {
	int mod = x % 64;
	int leaf = mod % 4;
	if (leaf == 0) {
		vm_state1 = rotl32(vm_state1, mod);
		vm_state1 = vm_state1 ^ x;
	}
	else if (leaf == 1) {
		vm_state2 = rotl32(vm_state2, mod);
		vm_state2 = vm_state2 ^ x;
	}
	else if (leaf == 2) {
		vm_state3 = rotl32(vm_state3, mod);
		vm_state3 = vm_state3 ^ x;
	}
	else {
		vm_state4 = rotr32(vm_state4, mod);
		vm_state4 = vm_state4 ^ x;
	}
	return x;
}
int fill_ints(int input[]){
	memset((char*)mem, 0, 64000*sizeof(char));
	for(int i=0;i<12;++i)
		mem[i] = input[i];
	vm_state1=0;
	vm_state2=0;
	vm_state3=0;
	vm_state4=0;
}

int execute(){
	mem[m(6)] = m(m((7) ^ (m((4) * (m(rotr32((5),(m((m((mem[3]) * (2))) + (2)))%32)))))));
	mem[m(2)] = m(m((m(((mem[2]) == (mem[m((1) + (1))]))?1:0)) << (6)));
	mem[m(1)] = m(m(((mem[2]) != 0)?((mem[1]) * (mem[2])):0));
	mem[m(2)] = m(m((mem[1]) - (mem[0])));
	mem[m(3)] = m(m(rotl32((m(((mem[0]) != 0)?((mem[1]) % (mem[0])):0)),(m((m(((1) > (0))?1:0)) + (3)))%32)));
	return m((m(((mem[0]) == (m((0) - (mem[2]))))?1:0))!=0?1:0);
}
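For illustration, the caller side described above could look roughly like this. This is a hedged sketch, not the actual code in xel_miner.c: fill_ints() is called with the 12 inputs, execute() runs the work package, and the vm_state globals (plus mem[] for a bounty) are then read directly. meets_pow_target() is a hypothetical stand-in for the real SHA256-vs-target comparison.

#include <stdint.h>

extern int32_t mem[64000];
extern uint32_t vm_state1, vm_state2, vm_state3, vm_state4;
int fill_ints(int input[]);
int execute(void);

/* Stub only; the miner hashes the state and compares it against the PoW target (SHA256). */
static int meets_pow_target(const uint32_t state[4])
{
	(void)state;
	return 0;
}

void evaluate_candidate(int input[12])
{
	fill_ints(input);          /* load the inputs, reset the VM state */
	int bounty = execute();    /* nonzero if the bounty condition was hit */

	uint32_t state[4] = { vm_state1, vm_state2, vm_state3, vm_state4 };
	if (meets_pow_target(state)) {
		/* would submit a PoW here */
	}
	if (bounty) {
		/* would submit a bounty here; the solution sits in mem[] */
	}
}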
Now we are left with the memset inefficiency; the rest looks fine.
EDIT! I am even faster now, at 1 million executions per second, after dropping memset() and doing a single aligned allocation instead! I see no point in nulling the memory at all!
#ifdef _WIN32
#define ALLOC_ALIGNED_BUFFER(_numBytes) ((int *)_aligned_malloc (_numBytes, 64))
#define FREE_ALIGNED_BUFFER(_buffer) _aligned_free(_buffer)
#elif __SSE__
// allocate memory aligned to 64-bytes memory boundary
#define ALLOC_ALIGNED_BUFFER(_numBytes) (int *) _mm_malloc(_numBytes, 64)
#define FREE_ALIGNED_BUFFER(_buffer) _mm_free(_buffer)
#else
// NOTE(mhroth): valloc seems to work well, but is deprecated!
#define ALLOC_ALIGNED_BUFFER(_numBytes) (int *) valloc(_numBytes)
#define FREE_ALIGNED_BUFFER(_buffer) free(_buffer)
#endif
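/* Assumption: with on-demand allocation, mem is now a pointer instead of
 * the static array from above, e.g.: */
int32_t *mem = 0;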
int fill_ints(int input[]){
	if(mem==0)
		mem=ALLOC_ALIGNED_BUFFER(64000*sizeof(int));
	for(int i=0;i<12;++i)
		mem[i] = input[i];
	vm_state1=0;
	vm_state2=0;
}