Author

Topic: Cuckoo Cycle proof-of-work: $5000 in bounties (Read 1802 times)

legendary
Activity: 990
Merit: 1108
There is a (v)movntdq(a) instruction (sse 4.1+ I think) that bypasses cache operations for faster mem writing if the data involved aren't going to be cached in any useful way. You might want to check whether it's useful for this case.

I spend most time updating random bits, which I access at word (32 bit) granularity.
I tried using the corresponding intrinsic _mm_stream_si32, but it is a 45% slowdown...
legendary
Activity: 1708
Merit: 1049
I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28:

perf stat -d -d -d ./cuckoo28

Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP
(data obtained by running perf top while ./cuckoo28 was running).

Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption?

Indeed, for single threaded use (-t 1, which is the default) one should run the singlethreaded
./cuckoo28st instead. That one doesn't use atomic updates like the cmpxchg.

For multi-threaded use, increasing the number of threads also increases the likelihood of overlooking cycles
due to data-races. It's not clear though at how many threads the use of atomics becomes well-advised.

Aha, ok, yes much better now. I haven't got an avx2 cpu so it breaks down when I try to compile everything and I had just some of the files to check out. Btw, I see that there are some versions with very large ram windows, which obviously exceed cache levels. There is a (v)movntdq(a) instruction (sse 4.1+ I think) that bypasses cache operations for faster mem writing if the data involved aren't going to be cached in any useful way. You might want to check whether it's useful for this case.
legendary
Activity: 990
Merit: 1108
the recent boost in bitcoin value allows me to increase the bounties by 50%:
This seems to imply you are holding your bounties in bitcoin.
Since this is targeted at the bitcoin community, I assume you are also paying them in bitcoin.
Why, then, do you advertise the bounty amounts in dollars instead of in bitcoins?

Two reasons:
1) I want to limit my financial liability, and if people keep claiming bounties while bitcoin explodes in value, I might need to buy bitcoin at exorbitant prices.
2) I thought the dollar amounts sound more impressive:-)

But, on request, and while supplies last, I'd be happy to make a payout in bitcoin.
legendary
Activity: 990
Merit: 1108
I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28:

perf stat -d -d -d ./cuckoo28

Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP
(data obtained by running perf top while ./cuckoo28 was running).

Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption?

Indeed, for single threaded use (-t 1, which is the default) one should run the singlethreaded
./cuckoo28st instead. That one doesn't use atomic updates like the cmpxchg.

For multi-threaded use, increasing the number of threads also increases the likelihood of overlooking cycles
due to data-races. It's not clear though at how many threads the use of atomics becomes well-advised.
full member
Activity: 224
Merit: 117
▲ Portable backup power source for mining.
the recent boost in bitcoin value allows me to increase the bounties by 50%:
This seems to imply you are holding your bounties in bitcoin.
Since this is targeted at the bitcoin community, I assume you are also paying them in bitcoin.
Why, then, do you advertise the bounty amounts in dollars instead of in bitcoins?
legendary
Activity: 1708
Merit: 1049
I tried to compile this but failed due to non avx2 use, so it did compile some of the files... so I run cuckoo28:

perf stat -d -d -d ./cuckoo28

Looking for 42-cycle on cuckoo28("",0) with 50% edges, 7 trims, 1 threads
Using 16MB edge and 32MB node memory, 1-way siphash, and 4-byte counters
initial load 3200%
round  1 partition loads U0 2022% V0  947%
round  2 partition loads U0  560% V0  373%
round  3 partition loads U0  267% V0  201%
round  4 partition loads U0  157% V0  126%
round  5 partition loads U0  103% V0   86%
round  6 partition loads U0   73% V0   63%
round  7 partition loads U0   55% V0   48%
nonce 0: 7 trims completed  final load 48%
   4-cycle found at 0:67%
0 total solutions

 Performance counter stats for './cuckoo28':

      79535.505140      task-clock (msec)         #    0.998 CPUs utilized          
             1,567      context-switches          #    0.020 K/sec                  
                 2      cpu-migrations            #    0.000 K/sec                  
             5,366      page-faults               #    0.067 K/sec                  
   146,892,319,199      cycles                    #    1.847 GHz                      (6.66%)
   104,304,503,009      instructions              #    0.71  insn per cycle          (13.35%)
     4,551,218,923      branches                  #   57.222 M/sec                    (13.35%)
       348,290,653      branch-misses             #    7.65% of all branches          (6.67%)
    16,992,792,470      L1-dcache-loads           #  213.650 M/sec                    (6.68%)
     1,043,305,676      L1-dcache-load-misses     #    6.14% of all L1-dcache hits    (6.67%)
     1,034,482,209      LLC-loads                 #   13.007 M/sec                    (6.67%)
       698,460,299      LLC-load-misses           #    0.95% of all LL-cache hits     (6.67%)
   145,872,608,950      L1-icache-loads           # 1834.056 M/sec                    (6.67%)
        11,774,481      L1-icache-load-misses                                         (6.66%)
    16,954,141,954      dTLB-loads                #  213.164 M/sec                    (6.67%)
       384,883,546      dTLB-load-misses          #    2.27% of all dTLB cache hits   (6.67%)
   104,269,293,009      iTLB-loads                # 1310.978 M/sec                    (13.34%)
            17,157      iTLB-load-misses          #    0.00% of all iTLB cache hits   (13.34%)
         9,380,542      L1-dcache-prefetches      #    0.118 M/sec                    (6.67%)
        L1-dcache-prefetch-misses                                  

      79.674461632 seconds time elapsed


(that's on a q8200 running @ 1.86 GHz). Ins/cycle are seemingly quite low - which is normal-ish for ram tasks, but still).

Further perf stats indicate a lot of overhead here (the cmpxchg stalls the jump?): http://imgur.com/a/zwEmP
(data obtained by running perf top while ./cuckoo28 was running).

Given that this was on a single thread, couldn't this be done with plain unlocked moves? Also, in multithreaded applications, would it be possible to gamble faster unlocked moves trading data accuracy in order to produce a faster version with possible data corruption?

Just some thoughts...
legendary
Activity: 990
Merit: 1108
Since the bounties are (partially) backed by bitcoin in the Cuckoo Cycle Bounty Fund,
the recent boost in bitcoin value allows me to increase the bounties by 50%:

performance improvement   4x           2x           sqrt(2)x
CPU                                    $6000     $3000     $1500
TMTO                                  $6000     $3000     $1500
GPU                                    $3000     $1500     $ 750

Happy bounty hunting...
legendary
Activity: 990
Merit: 1108
November 09, 2016, 03:52:37 PM
#1
Cuckoo Cycle is an instantly verifiable memory bound Proof-of-Work,
whose run time is dominated by memory latency.

I invite anyone to try claim one of the following bounties for performance doubling
of my C++/CUDA solvers:

CPU   $2000
TMTO $2000
GPU   $1000

or half the above bounties for a sqrt(2) performance improvement. Details at

https://github.com/tromp/cuckoo

Happy bounty hunting...
Jump to: