Author

Topic: [ANN][XEL] Elastic Project - The Decentralized Supercomputer - page 232. (Read 450549 times)

legendary
Activity: 1260
Merit: 1168
It becomes even better:

Multithreaded version is now working solid. On a notebook I get

Code:
[17:03:18] Attempting to start 4 miner threads
[17:03:18] 4 mining threads started
[17:03:24] CPU2: 840.81 kEval/s
[17:03:24] CPU0: 840.78 kEval/s
[17:03:24] CPU1: 838.86 kEval/s
[17:03:24] CPU3: 845.26 kEval/s
[17:02:52] CPU0: ***** POW Accepted! *****

which is

3.363.000+ evals per second!

Bounties and POW are all correctly found and submitted ;-)

On my desktop rig I get 7M+.  Grin
The miner is highly experimental, and needs GCC for the dlmopen().

legendary
Activity: 1775
Merit: 1032
Value will be measured in sats
Now we're talking (desktop rig)

Code:
[12:17:03] DEBUG: Running ElasticPL Parser
[12:17:09] CPU0: 1901.46 kEval/s
[12:17:15] CPU0: 1919.21 kEval/s

Nice job bro!  Shocked
sr. member
Activity: 464
Merit: 260
Now we're talking (desktop rig)

Code:
[12:17:03] DEBUG: Running ElasticPL Parser
[12:17:09] CPU0: 1901.46 kEval/s
[12:17:15] CPU0: 1919.21 kEval/s

Great job EK!  I've got a busy week this week, but as I get time I'll incorporate your changes and see if I can replicate the logic for Windows.  These new speeds will crush the retargetting mechnism once and for all...maybe next on the todo list  Smiley
copper member
Activity: 2352
Merit: 1348
I have created a little Forum for Elastic (Childboard)



http://bitsend.info/forums/index.php#c8

or

http://bitsend.info/forums/index.php?board=21.0

Best Regards Christian

P.S. I can add a mod from here.

Push
legendary
Activity: 1260
Merit: 1168
Now we're talking (desktop rig)

Code:
[12:17:03] DEBUG: Running ElasticPL Parser
[12:17:09] CPU0: 1901.46 kEval/s
[12:17:15] CPU0: 1919.21 kEval/s
legendary
Activity: 1260
Merit: 1168
Coralreefer: Update!

Right now only suitable for linux, the C miner generates C code on the fly, compiles it and dynamically loads it into its memory to execute.
It uses a dirty hack with the ElasticToCCompiler.jar because your AST compiler is not yet ready. But in the long term it should be swapped! Also, linking in LLVM to avoid the gcc call using system() should be done!

Works flawlessly, but far from being cross-platform!

Code:
#include "miner.h"
#include
#include
#include
#include
void *hndl = 0;
bool compile_and_link(char* source_code){
FILE* tempfile = fopen("work_lib.spl", "w+");
fprintf ( tempfile , "%s", source_code );
fclose(tempfile);

FILE* tempfile2 = fopen("work_lib.c", "w+");
char partialcode[600000];
FILE *f = popen("java -jar ElasticToCCompiler.jar work_lib.spl", "r");
if(f==0){
printf("Cannot open spl file");
exit(1);
}
while (fgets(partialcode, 600000, f) != NULL) {
fprintf ( tempfile2 , "%s", partialcode );
}
fclose(tempfile2);

pclose(f);

system("sh ./build_shared.sh");

hndl = dlopen("./work_lib.so", RTLD_NOW);
if (!hndl) {
fprintf(stderr, "%sn", dlerror());
exit(EXIT_FAILURE);
}
fill_ints = dlsym(hndl, "fill_ints");
execute = dlsym(hndl, "execute");
vm_state1 = dlsym(hndl, "vm_state1");
vm_state2 = dlsym(hndl, "vm_state2");
vm_state3 = dlsym(hndl, "vm_state3");
vm_state4 = dlsym(hndl, "vm_state4");

return true;
}


void free_compiler(){
if(hndl!=0){
dlclose(hndl);
free(hndl);
}
}

https://github.com/OrdinaryDude/xel_miner/tree/fast-optimization

use:

Code:
./xel_miner -k 19fafc1fa028af61d4bb603e1f9f06eca0bea765114115ad503b532588fbc83d --test-miner work.json --threads 1
legendary
Activity: 1260
Merit: 1168
I think we will get a very fast and efficient miner here ;-)
The "60k" from the Java version starts looking more and more like a child's toy.
legendary
Activity: 1260
Merit: 1168
We could change the randomize-input routine to take the multiplicator as the last argument. This would be trivial but allow for the calculation of a "midstate" which is retained over all iterations. Every iteration would just have to finish the "remainder" of the MD5 hash calculation, beginning at the midstate, to account for the unique multiplier.

I think that should help...also because we only need to get random ints far less frequently, and I'm sure the algo I have there is slow.  But the built in random function is worthless.


All the following can be found here in this branch: https://github.com/OrdinaryDude/xel_miner/tree/fast-optimization.
xel_miner.c has some TODO FIXME's, where I changed the logic to not submit any PoW or Bounties (as they are found too fast). Note: the meaning of rc=1 and rc=2 has been swapped Grin


Testing can be done with:
Code:
./xel_miner -k 19fafc1fa028af61d4bb603e1f9f06eca0bea765114115ad503b532588fbc83d --test-miner work.json --threads 1
(Note, it does not yet do anything useful ... its just hacky with statically compiled work package ... no on the fly compiling yet)

I have managed to boost the speed a lot with the linked library approach:

Code:
[09:49:30] DEBUG: Running ElasticPL Parser
[09:49:35] CPU0: 784.88 kEval/s

We are at almost 800000 Evals per second


All on one thread WITH pseudorandomInts and PoW SHA256 check ... 2 or more threads slow everything down!

Optimizations:

Most importantly, as the profiler showed me, gen_rand_32 is the no.1 bottleneck! It does now just increment the "multiplicator" instead of filling the last two ints with random input in every iteration. It just does that once, now!


like this:

Code:
static int scanhash(int thr_id, struct work *work, long *hashes_done) {
...
mult32[6] = genrand_int32();
mult32[7] = genrand_int32();

while (1) {
// Check If New Work Is Available
if (work_restart[thr_id].restart) {
applog(LOG_DEBUG, "CPU%d: New work detected", thr_id);
return 0;
}

// Increment mult32
mult32[7]=mult32[7]+1;
if(mult32[7] == INT32_MAX){
mult32[7] = 0;
mult32[6] = mult32[6] + 1;
}...

...then...

C Flags:
Code:
-Ofast -msse -msse2 -msse3 -mmmx -m3dnow -fext-numeric-literals

Mangle State now works on 32 bit integers only, avoiding the costly 64bit rotation:
Code:
uint32_t vm_state1 = 0;
uint32_t vm_state2 = 0;
uint32_t vm_state3 = 0;
uint32_t vm_state4 = 0;
static const unsigned int mask32 = (CHAR_BIT*sizeof(uint32_t)-1);
static inline uint32_t rotl32 (uint32_t x, unsigned int n)
{
  n &= mask32;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<>( (-n)&mask32 ));
}
static inline uint32_t rotr32 (uint32_t x, unsigned int n)
{
  n &= mask32;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x>>n) | (x<<( (-n)&mask32 ));
}
static int m(int x) {
   int mod = x % 64;
   int leaf = mod % 4;
   if (leaf == 0) {
       vm_state1 = rotl32(vm_state1, mod);
       vm_state1 = vm_state1 ^ x;
   }
   else if (leaf == 1) {
       vm_state2 = rotl32(vm_state2, mod);
       vm_state2 = vm_state2 ^ x;
   }
   else if (leaf == 2) {
       vm_state3 = rotl32(vm_state3, mod);
       vm_state3 = vm_state3 ^ x;
   }
   else {
       vm_state4 = rotr32(vm_state4, mod);
       vm_state4 = vm_state4 ^ x;
   }
    return x;
}

Fill Integers uses memset to clear the memory
Code:
int fill_ints(int input[]){
   memset((char*)mem, 0, 64000*sizeof(char));
   for(int i=0;i<12;++i)
    mem[i] = input[i];
  vm_state1=0;
  vm_state2=0;
}

The complete "linked" program looks like this:
Fill ints is called from xel_miner.c, and the vm_state's are accessed directly to verify POW. So does mem[] in case of a bounty.

Code:
#include
#include
#include
#include
#include
#include "miner.h"

int32_t mem[64000];

uint32_t vm_state1 = 0;
uint32_t vm_state2 = 0;
uint32_t vm_state3 = 0;
uint32_t vm_state4 = 0;
static const unsigned int mask32 = (CHAR_BIT*sizeof(uint32_t)-1);
static inline uint32_t rotl32 (uint32_t x, unsigned int n)
{
  n &= mask32;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<>( (-n)&mask32 ));
}
static inline uint32_t rotr32 (uint32_t x, unsigned int n)
{
  n &= mask32;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x>>n) | (x<<( (-n)&mask32 ));
}
static int m(int x) {
   int mod = x % 64;
   int leaf = mod % 4;
   if (leaf == 0) {
       vm_state1 = rotl32(vm_state1, mod);
       vm_state1 = vm_state1 ^ x;
   }
   else if (leaf == 1) {
       vm_state2 = rotl32(vm_state2, mod);
       vm_state2 = vm_state2 ^ x;
   }
   else if (leaf == 2) {
       vm_state3 = rotl32(vm_state3, mod);
       vm_state3 = vm_state3 ^ x;
   }
   else {
       vm_state4 = rotr32(vm_state4, mod);
       vm_state4 = vm_state4 ^ x;
   }
    return x;
}

int fill_ints(int input[]){
   memset((char*)mem, 0, 64000*sizeof(char));
   for(int i=0;i<12;++i)
    mem[i] = input[i];
  vm_state1=0;
  vm_state2=0;
  vm_state3=0;
  vm_state4=0;
}

int execute(){

mem[m(6)] = m(m((7) ^ (m((4) * (m(rotr32((5),(m((m((mem[3]) * (2))) + (2)))%32)))))));
mem[m(2)] = m(m((m(((mem[2]) == (mem[m((1) + (1))]))?1:0)) << (6)));
mem[m(1)] = m(m(((mem[2]) != 0)?((mem[1]) * (mem[2])):0));
mem[m(2)] = m(m((mem[1]) - (mem[0])));
mem[m(3)] = m(m(rotl32((m(((mem[0]) != 0)?((mem[1]) % (mem[0])):0)),(m((m(((1) > (0))?1:0)) + (3)))%32)));
return m((m(((mem[0]) == (m((0) - (mem[2]))))?1:0))!=0?1:0);
}


Now we are left with the memset inefficiency!
The rest is looking fine:




EDIT!

I am even faster now with 1 Million executions per second by dropping memset() and using aligned_alloc once! I see no point in nulling the memory at all!

Code:
#ifdef _WIN32

#define ALLOC_ALIGNED_BUFFER(_numBytes) ((int *)_aligned_malloc (_numBytes, 64))
#define FREE_ALIGNED_BUFFER(_buffer) _aligned_free(_buffer)

#elif __SSE__
// allocate memory aligned to 64-bytes memory boundary
#define ALLOC_ALIGNED_BUFFER(_numBytes) (int *) _mm_malloc(_numBytes, 64)
#define FREE_ALIGNED_BUFFER(_buffer) _mm_free(_buffer)
#else
// NOTE(mhroth): valloc seems to work well, but is deprecated!
#define ALLOC_ALIGNED_BUFFER(_numBytes) (int *) valloc(_numBytes)
#define FREE_ALIGNED_BUFFER(_buffer) free(_buffer)
#endif


int fill_ints(int input[]){
  if(mem==0)
  mem=ALLOC_ALIGNED_BUFFER(64000*sizeof(int));
   for(int i=0;i<12;++i)
    mem[i] = input[i];
  vm_state1=0;
  vm_state2=0;
}
sr. member
Activity: 464
Merit: 260
We could change the randomize-input routine to take the multiplicator as the last argument. This would be trivial but allow for the calculation of a "midstate" which is retained over all iterations. Every iteration would just have to finish the "remainder" of the MD5 hash calculation, beginning at the midstate, to account for the unique multiplier.

I think that should help...also because we only need to get random ints far less frequently, and I'm sure the algo I have there is slow.  But the built in random function is worthless.
legendary
Activity: 1260
Merit: 1168
Thanks EK.  I don't have anything to send right now as I'm not familiar w/ working with libraries outside of VC, and what I did for this test within VC is a complete hack.

However, I commented out everything in scanhash except the running of the code in the DLL and I saw 4.5 MEval/s.  So I don't think its the linked library.  It looks like the randomizing of the inputs and the POW calcs still have a pretty big penalty.  But that can get cleaned up over time...it seems like the approach is still viable.


We could change the randomize-input routine to take the multiplicator as the last argument. This would be trivial but allow for the calculation of a "midstate" which is retained over all iterations. Every iteration would just have to finish the "remainder" of the MD5 hash calculation, beginning at the midstate, to account for the unique multiplier.

Not sure how much we save with this?

We need to check a profiler ... sometimes the bottleneck is where you would least expect it.

At least, when I run a SHA256 based bitcoin miner on my cmputer I get roughly 10M hashes per second. This example here is not significantly "more complicated". So there is some room yet to be exploited ;-)
sr. member
Activity: 464
Merit: 260
Thanks EK.  I don't have anything to send right now as I'm not familiar w/ working with libraries outside of VC, and what I did for this test within VC is a complete hack.

However, I commented out everything in scanhash except the running of the code in the DLL and I saw 4.5 MEval/s.  So I don't think its the linked library.  It looks like the randomizing of the inputs and the POW calcs still have a pretty big penalty.  But that can get cleaned up over time...it seems like the approach is still viable.
legendary
Activity: 1260
Merit: 1168
By the way, coralreefer: you are the author of that part of Elastic, that makes the whole thing efficient in the first place! Thank you very much for all your hard work so far! I am sure Lannister will be grateful as well, won't he?  Wink

Amazing stuff!
legendary
Activity: 1260
Merit: 1168
EK, I created a DLL for the code above and called it instead of the internal interpreter.  With POW calcs running I got 160kEval/s and when I turned POW off I got 400kEval/s.  I'm sure you're looking for quite a bit more, but at least its a step in the right direction.

First of all ... awesome work!

The speed, however, is strange ;-) I am on a Thinkpad T460p (I admit it has the i7 quadcore and 32GB ram, but it's still just a notebook) and I am executing that code with aroung 6 million evals per second.

Do you have something I could quickly compile on my linux machine ... maybe the VC compiler is a bit inefficient?
Or the DLL calling is too costly ... maybe a static library is better here? I would quickly check.

EDIT: I could get the cachegrind profiler up and running quickly to tell you where the bottleneck lies.
sr. member
Activity: 464
Merit: 260
EK, I created a DLL for the code above and called it instead of the internal interpreter.  With POW calcs running I got 160kEval/s and when I turned POW off I got 400kEval/s.  I'm sure you're looking for quite a bit more, but at least its a step in the right direction.
legendary
Activity: 1260
Merit: 1168
I fixed the order of mangle in the Java interpreter. Everything is fine now!


Code:
m[6]=4*(5>>>(m[3]*2+2));
verify 1==1;

Code:
beavis@methusalem ~/Development/elastic-pl (git)-[master] % java -jar ElasticToCCompiler.jar fuck.spl; gcc -O0 fuck.spl.c; ./a.out
Elastic-to-C Compiler Version 0.1:  Reading from file fuck.spl . . .
Elastic-to-C Compiler Version 0.1:  OK.
MANGLE 6
MANGLE 0
MANGLE 2
MANGLE 1073741825
MANGLE 4
MANGLE 4
MANGLE 1
MANGLE 1
MANGLE STATE: 6724 -4611686018158952447.

beavis@methusalem ~/Development/elastic-pl (git)-[master] % java -jar ElasticPL.jar fuck.spl
Elastic Programming Language Interpreter Version 0.1:  Reading from file fuck.spl . . .
[!] Stack usage exceeded: false
[!] AST depth: 8
[!] Worst case execution time: 11
MANGLE: 6
MANGLE: 0
MANGLE: 2
MANGLE: 1073741825
MANGLE: 4
MANGLE: 4
m[6]: 4
[!] Exit Stack Pointer: 0
MANGLE: 1
MANGLE: 1
[!] Mangle State: 6724 -4611686018158952447
[!] Bounty requirement met: true
beavis@methusalem ~/Development/elastic-pl (git)-[master] %

For testing purpose, the full C code as it was generated by the current ElasticToCCompiler.jar

Code:
#include
#include
#include
#include
#include

int32_t mem[64000];
uint64_t vm_state1 = 0;
uint64_t vm_state2 = 0;

uint64_t rotl64 (uint64_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
  n &= mask;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<>( (-n)&mask ));
}
uint64_t rotr64 (uint64_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
  n &= mask;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x>>n) | (x<<( (-n)&mask ));
}
uint32_t rotl32 (uint32_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
  n &= mask;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<>( (-n)&mask ));
}
uint32_t rotr32 (uint32_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);
  n &= mask;  // avoid undef behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x>>n) | (x<<( (-n)&mask ));
}
int m(int x) {
  printf("MANGLE %d\n",x);
   int mod = x % 64;
   if (x % 2 == 0) {
       vm_state1 = rotl64(vm_state1, mod);
       vm_state1 = vm_state1 ^ x;
   }
   else {
       vm_state2 = rotr64(vm_state2, mod);
       vm_state2 = vm_state2 ^ x;
   }
    return x;
}

int execute();
int main(){
  execute();
  printf("MANGLE STATE: %lld %lld.\n",vm_state1,vm_state2);
}

int execute(){
vm_state1=0;
vm_state2=0;
mem[m(6)] = m(m((4) * (m(rotr32((5),(m((m((mem[3]) * (2))) + (2)))%32)))));
return m((m(((1) == (1))?1:0))!=0?1:0);
}

sr. member
Activity: 464
Merit: 260
Maybe a simple fix would be to run the mangle_state after each statement instead of each operator.  This would be easy to incorporate into the C parser...how hard would it be to do on the java side.
sr. member
Activity: 464
Merit: 260
We also have a slight problem in the "order" the expressions are performed.
In Java they are processed in a different order. This little program:

Code:
m[6]=4*3;
verify 1==1;

Mangles in this order:
6, 12, 12, 1, 1

When the code is compiled into C, this order applies:
12, 6, 12, 1, 1

The C function by the way is the following, when m() describes the mangle_function:

Code:
mem[m(6)] = m(m((4) * (3)));
return m((m(((1) == (1))?1:0))!=0?1:0);

Actually, I would expect that in C m(6) will be the first operation to be executed ... NO! Unfortunately not!
So we have to rethink the mangle function in the C-Compiler after all.


I agree that m[6] should have run first.  The interpreter should start with the lowest left node, then the right one, then moves up, so I'm not clear why the order is wrong.

          =
        /    \
    m[6]   *
            /   \
          3      4  

so, I would expect the order to be m[6], 3, 4, *, =

legendary
Activity: 1260
Merit: 1168
Sorry, you're perfectly right! My mistake ... its been a 16 hour working day already  Wink
sr. member
Activity: 464
Merit: 260

Code:
rotl32:
        movl    %esi, %ecx
        movl    %edi, %eax
        shll    %cl, %eax
        negl    %ecx
        shrl    %cl, %edi
        orl     %eax, %edi
        movl    %edi, %eax
        ret

My disassembly looks similar to this, but I say leave it as written in the blog.  I checked that it compiles fine using mingw so I'm good with it.

However, I believe rotl32 and rotr32 should be returning uint32_t not uint64_t.
legendary
Activity: 1260
Merit: 1168
We also have a slight problem in the "order" the expressions are performed.
In Java they are processed in a different order. This little program:

Code:
m[6]=4*3;
verify 1==1;

Mangles in this order:
12, 6, 12, 1, 1

When the code is compiled into C, this order applies:
6, 12, 12, 1, 1

The C function by the way is the following, when m() describes the mangle_function:

Code:
mem[m(6)] = m(m((4) * (3)));
return m((m(((1) == (1))?1:0))!=0?1:0);

Actually, I would expect that in C m(12) will be the first operation to be executed (deepest operation first) ... NO! Unfortunately not!
So we have to rethink the mangle function in the C-Compiler after all.
Jump to: