Cuckoo Cycle Speed Challenge; $2500 in bounties - page 3.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: tromp on July 29, 2014, 11:03:58 AM

Quote from: dga on July 29, 2014, 09:59:47 AM

Could you specify the exact program / command line arguments that must be compared against for each of those categories?
And how you expect the time to be measured, etc., etc.?

I will add relevant bounty Makefile targets that encompasses all the required runs, and reports total running time
for both my cuckoo_miner source and the alternate bounty_miner source.

The latest Makefile includes targets

Code:

bounty28:       cuckoo.h bounty_miner.h bounty_miner.cpp Makefile
        $(GPP) -o bounty28 -DSIZESHIFT=28 bounty_miner.cpp

bounty30:       cuckoo.h bounty_miner.h bounty_miner.cpp Makefile
        $(GPP) -o bounty30 -DSIZESHIFT=30 bounty_miner.cpp

bounty32:       cuckoo.h bounty_miner.h bounty_miner.cpp Makefile
        $(GPP) -o bounty32 -DSIZESHIFT=32 bounty_miner.cpp

cuda28: cuda_miner.cu Makefile
        nvcc -o cuda28 -DSIZESHIFT=28 -arch sm_20 cuda_miner.cu -lcrypto

cuda30: cuda_miner.cu Makefile
        nvcc -o cuda30 -DSIZESHIFT=30 -arch sm_20 cuda_miner.cu -lcrypto

cuda32: cuda_miner.cu Makefile
        nvcc -o cuda32 -DSIZESHIFT=32 -arch sm_20 cuda_miner.cu -lcrypto

cpubounty:      cuckoo28 bounty28 cuckoo30 bounty30 cuckoo32 bounty32 Makefile
        for c in 28 30 32; do for t in 1 2 4 8; do time for h in {0..9}; do ./cuckoo$$c -t $$t -h $$h; done; time for h in {0..9}; do ./bounty$$c -t $$t -h $$h; done;done; done

gpubounty:      cuckoo28 cuda28 cuckoo30 cuda30 cuckoo32 cuda32 Makefile
        for c in 28 30 32; do time for h in {0..9}; do ./cuckoo$$c -t 16 -h $$h; done; time for h in {0..9}; do ./cuda$$c -h $$h; done; done

Target cpubounty runs the required 3x4=12 time comparison tests for the first two bounties while target gpubounty runs the required 3 time comparison tests for the third bounty.

The output should indicate what cycles are found. Long cycles (length > 64) may be ignored. If your program doesn't find all short cycles, then the short cycle finding rate should be determined somehow and used to discount the speedup.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: tromp on July 29, 2014, 11:03:58 AM

I specified Xeon's because Core i7 doesn't offer that many threads. But those Xeons are rather prohibitively
priced, so I want to keep dollars out of that equation. However, we can consider your metric if a factor of 2x
or more is demonstrated for a GPU compared to a price/performance leading Core i7 running at 8 threads.

This news story

http://www.kitguru.net/components/cpu/anton-shilov/intel-core-i7-5960x-haswell-e-de-lidded-interesting-discoveries-found/

suggests that 8+ core Core i7 CPUs are not far off.
Even with an expected initial price premium, they are bound to be way more perfomance/price competitive than Xeons.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: dga on July 29, 2014, 09:59:47 AM

Could you specify the exact program / command line arguments that must be compared against for each of those categories?
And how you expect the time to be measured, etc., etc.?

I will add relevant bounty Makefile targets that encompasses all the required runs, and reports total running time
for both my cuckoo_miner source and the alternate bounty_miner source.
Please allow me a day or so to update the repository.

Quote

What operating system will the test be conducted on? What CPU(s) are acceptable for demonstrating the speedup?
Is a 2x speedup on some but not all CPUs adequate?

Tests are to be conducted on a common Linux distribution. Recent Intel Core i7 and Xeons are acceptable.
A 2x speedup on either of those is adequate.

Quote

For GPUs, I'd suggest a way of putting it that allows a little more flexibility in the hardware.
Perhaps cycles / second / hardware dollar?

I specified Xeon's because Core i7 doesn't offer that many threads. But those Xeons are rather prohibitively
priced, so I want to keep dollars out of that equation. However, we can consider your metric if a factor of 2x
or more is demonstrated for a GPU compared to a price/performance leading Core i7 running at 8 threads.

Quote

Also - am I correct that you've updated your algorithm to use the faster:

Code:

  2*nonce + {left or right side}  % HALFSIZE

where HALFSIZE is a power of two (expressed as a mask) instead of the previous %N0 and %N1 where N0 and N1 were not powers of two?

The main reason for that change is to avoid limiting Cuckoo Cycle to sizes of 2^32.
It is now well-defined for sizes up to 2^64.
I also wanted to get rid of the modulo operations which are not essential, cause some architectural bias,
slow down the critical path, and make the bipartite graph imperfectly balanced.
The goal is to make everything as simple and uniform as possible.

dga

hero member

Activity: 737

Merit: 511

Could you specify the exact program / command line arguments that must be compared against for each of those categories?
And how you expect the time to be measured, etc., etc.?
What operating system will the test be conducted on? What CPU(s) are acceptable for demonstrating the speedup?
Is a 2x speedup on some but not all CPUs adequate?

For GPUs, I'd suggest a way of putting it that allows a little more flexibility in the hardware.
Perhaps cycles / second / hardware dollar?

Also - am I correct that you've updated your algorithm to use the faster:

Code:

  2*nonce + {left or right side}  % HALFSIZE

where HALFSIZE is a power of two (expressed as a mask) instead of the previous %N0 and %N1 where N0 and N1 were not powers of two?

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: tromp on July 23, 2014, 10:49:32 PM

Speedup Bounty:
===========
$500 for an open source implementation that finds 42-cycles twice as fast (disregarding memory use).

I should add that the current default settings

#define LOGNBUCKETS 8
#define BUCKETSIZE 256

in cuckoo_miner.h are chosen to maximize performance on an Intel Xeon.
For running optimally on a Core i7, these may have to be lowered a bit...

tromp

legendary

Activity: 1000

Merit: 1120

Cuckoo Cycle is the first graph-theoretic proof-of-work.

Proofs take the form of a length 42-cycle in a bipartite graph
with N nodes and N/2 edges, with N scalable from millions to billions and beyond.
This makes verification trivial: compute the 42*2 edge endpoints
with one initialising sha256 and 84 siphash24 computations, check that
each endpoint occurs twice, and that you come back to the
starting point only after traversing 42 edges.
A final sha256 can check whether the 42-cycle meets a difficulty target.

With siphash24 being a very lightweight hash function, this makes for
practically instant verification.

This is implemented in just 157 lines of C code (files cuckoo.h and cuckoo.c) at
https://github.com/tromp/cuckoo, where you can also find a whitepaper.

From this point of view, Cuckoo Cycle is a rather simple PoW.

Finding a 42-cycle, on the other hand, is far from trivial,
requiring considerable resources, and some luck
(for a given header, the odds of its graph having a 42-cycle are about 2.5%).

The algorithm implemented in cuckoo_miner.h runs in time linear in N.
(Note that running in sub-linear time is out of the question, as you could
only compute a fraction of all edges, and the odds of all 42 edges of a cycle
occurring in this fraction are astronomically small).
Memory-wise, it uses N/2 bits to maintain a subset of all edges (potential cycle edges)
and N additional bits (or N/2^k bits with corresponding slowdown)
to trim the subset in a series of edge trimming rounds.
Once the subset is small enough, an algorithm inspired by Cuckoo Hashing
is used to recognise all cycles, and recover those of the right length.

I'd like to claim that this implementation is a reasonably optimal Cuckoo miner,
and that trading off memory for running time, as implemented in tomato_miner.h,
incurs at least one order of magnitude extra slowdown.
I'd further like to claim that GPUs cannot achieve speed parity with server CPUs.

To that end, I offer the following bounties:

Speedup Bounty:

$500 for an open source implementation that finds 42-cycles twice as fast (disregarding memory use).

Linear Time-Memory Trade-Off Bounty:

$500 for an open source implementation that uses at most N/k bits while running up to 15*k times slower, for any k>=2.

Both of these bounties require N ranging over {2^28,2^30,2^32} and #threads ranging over {1,2,4,8},
and further assume a high-end Intel Core i7 or Xeon and recent gcc compiler with regular flags as in my Makefile.

GPU Speed Parity Bounty:

$1000 for an open source implementation for an AMD R9 280X or nVidia GeForce GTX 770 (or similar high-end GPU)
that is as fast as a high-end Intel Xeon running 16 threads. Again with N ranging over {2^28,2^30,2^32}.

Note that there is already a cuda_miner.cu, my attempted port of the edge trimming part
of cuckoo_miner.c to CUDA, but while it seems to run ok for medium graph sizes,
it crashes my computer at larger sizes, and I haven't had the chance to debug the issue yet
(I prefer to let my computer work on solving 8x8 connect-4:-)
Anyway, this could make a good starting point.

These bounties are to expire at the end of this year. They are admittedly modest in size, but then
claiming them might only require one or two insightful tweaks to my existing implementations.

I invite anyone who'd like to see my claims refuted to extend any of these bounties
with amounts of your crypto-currency of choice.

(I don't have any crypto-currencies to offer myself, but if need be, I might be able to convert
a bounty through a trusted 3rd party)

Happy bounty hunting!

-John

Topic: Cuckoo Cycle Speed Challenge; $2500 in bounties - page 3. (Read 6524 times)