
Topic: Cuckoo Cycle: a new memory-hard proof-of-work system - page 2.

legendary
Activity: 990
Merit: 1108
Quote
You can find the motivation for choice of cycle length in the whitepaper; I devoted a section to it.
While the Cuckoo Cycle paper does devote a section to this, the section lacks any hard guarantees on whether the parameter choices are correct or optimal.
Remind me again, which other Proof-of-Works have guarantees of parameter optimality,
and where to find proof of these?
newbie
Activity: 37
Merit: 0
Quote
You can find the motivation for choice of cycle length in the whitepaper; I devoted a section to it.
While the Cuckoo Cycle paper does devote a section to this, the section lacks any hard guarantees on whether the parameter choices are correct or optimal.  The graphs in the paper were derived from what seems like experimentation, but the problem with such an approach is that it leaves open the possibility that more efficient and effective attacks exist.

There need to be provable formulae establishing the relationships between parameter choices, the likelihood of cycles, and in turn the theoretical bounds that any attacking implementation could achieve.  Without these, the possibility remains of novel attacks that have not yet been identified, which is no basis for a currency.

Quote
There is more literature on cycle detection in graphs than on k-SUM, and very little
on k-SUM in cyclic groups as you consider.

I'm not advocating k-SUM as I think we've established it is a dead end.  That said, while there is much literature on cycle detection in graphs, I haven't found much in terms of proofs that concretely establish the probability of such n-length cycles existing in a G(n,p) graph.  I would certainly appreciate information here if you are familiar with work in this area, and I think it would go a long way towards strengthening the fundamentals of Cuckoo Cycle as proposed.
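(The closest standard result I'm aware of: for a plain G(n,p) graph, the expected number of k-cycles is

  E[#k-cycles] = (n choose k) * (k-1)!/2 * p^k  ~  (n*p)^k / (2k)

but the Cuckoo graph is bipartite with a fixed edge count rather than independent edges, so it isn't clear how directly that carries over.)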
legendary
Activity: 990
Merit: 1108
tromp - after our discussion on k-SUM, I went back and tried to categorize why I was uncomfortable with Cuckoo Cycle in its present state.  Basically it comes down to:

- It seems like the selection of 42 for the cycle length was based on experimentation

You can find the motivation for choice of cycle length in the whitepaper; I devoted a section to it.

but there doesn't seem to be a hard proof

Proofs are for theorems, not for parameter choices.

 I was originally looking at kSUM because the difficulty of kSUM is well known.  In the case of cycle finding, I'm not sure if the difficulty is proven.
There is more literature on cycle detection in graphs than on k-SUM, and very little
on k-SUM in cyclic groups as you consider.

- One way of proving this might be to instead establish the probability of a cycle of length k occurring in a randomly generated graph, and then tweak parameters so that only one cycle is expected to be found.

My whitepaper has sufficient data on the probability of cycles occurring. You don't need
an expectation equal to 1. Any non-negligible constant less than 1 works just as well.

newbie
Activity: 37
Merit: 0
tromp - after our discussion on k-SUM, I went back and tried to categorize why I was uncomfortable with Cuckoo Cycle in its present state.  Basically it comes down to:

- It seems like the selection of 42 for the cycle length was based on experimentation, but there doesn't seem to be a hard proof for this (dga mentions this).  I was originally looking at kSUM because the difficulty of kSUM is well known.  In the case of cycle finding, I'm not sure if the difficulty is proven.

- One way of proving this might be to instead establish the probability of a cycle of length k occurring in a randomly generated graph, and then tweak parameters so that only one cycle is expected to be found.

- Separately, might it make sense to look deeper at k-cliques? Their probability seems to be much better understood, and they might provide a more provably secure basis.

Quick starting point:

Assume G(n,p) - an Erdős–Rényi random graph
let pK  = probability that k given nodes form a clique = p^(k choose 2)
let pNK = probability that no k-clique exists ≈ (1-pK)^(n choose k)   (treating the k-subsets as independent)
Probability at least 1 k-clique exists = 1 - pNK
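A tiny stand-alone check of that estimate (just my own toy sketch with made-up parameters; note the (1-pK)^(n choose k) step treats the k-subsets as independent, so it's only a first-order approximation):

#include <math.h>
#include <stdio.h>

/* log of the binomial coefficient C(n,k), via lgamma */
static double log_binom(double n, double k) {
  return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1);
}

int main(void) {
  double n = 1000, p = 0.01, k = 4;                /* toy parameters, purely illustrative */
  double pK   = pow(p, exp(log_binom(k, 2)));      /* p^(k choose 2) */
  double lnNK = exp(log_binom(n, k)) * log1p(-pK); /* ln of (1-pK)^(n choose k) */
  printf("P(at least one %.0f-clique) ~ %g\n", k, -expm1(lnNK));
  return 0;
}

(For n = 1000, p = 0.01, k = 4 this prints roughly 0.04.)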

Thoughts?
legendary
Activity: 990
Merit: 1108
sr. member
Activity: 350
Merit: 250
any coins using this algo atm?
legendary
Activity: 990
Merit: 1108
So while your statement is true, it's not necessarily the right truth.  Is Cuckoo Cycle better or worse than adaptive scrypt-N?  Is it better than a variant of Invictus's Momentum PoW that has an adaptive N factor?

Neither of us knows the answer to this question.  There is a chance that it's stronger.  But there's also a very real chance that the requirement to find a cycle, instead of a simple collision, will mean that the approach can yield to something with sublinear memory requirements and, in fact, underperform compared to Momentum in its memory requirements.

It cannot "underperform compared to Momentum in its memory requirements"
because Cuckoo Cycle generalizes Momentum.

Momentum looks for collisions of a hash function mapping 2^{26} nonces to 50-bit outputs.
Each nonce corresponds to an edge in a bi-partite graph on nodes M union L, where
M is all possible 25-most-significant bits of hash output, and
L is all possible 25-least-significant bits of hash output,
and so each collision is a 2-cycle in this graph.
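To make the correspondence concrete, here's a minimal sketch (my own illustration; hash50 below is just a placeholder mixer standing in for Momentum's actual hash):

#include <stdint.h>

/* placeholder 50-bit hash of a nonce -- stands in for Momentum's real hash */
static uint64_t hash50(uint32_t nonce) {
  uint64_t x = nonce * 0x9E3779B97F4A7C15ULL;
  x ^= x >> 29; x *= 0xBF58476D1CE4E5B9ULL; x ^= x >> 32;
  return x & ((1ULL << 50) - 1);
}

typedef struct { uint32_t m, l; } edge;            /* edge between node m in M and node l in L */

edge nonce_to_edge(uint32_t nonce) {
  uint64_t h = hash50(nonce);
  edge e = { (uint32_t)(h >> 25),                  /* 25 most significant bits -> node in M */
             (uint32_t)(h & ((1u << 25) - 1)) };   /* 25 least significant bits -> node in L */
  return e;
}

/* Two nonces collide (same 50-bit hash) exactly when their edges share both
   endpoints, i.e. they form a 2-cycle in the bipartite graph on M union L. */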

Thus, it is more likely that Momentum, being a special case of Cuckoo Cycle,
underperforms in its memory requirements.
legendary
Activity: 990
Merit: 1108
Bucketing at the size you're doing isn't about page faults, it's more about cache misses.  A cache miss causes a stalled cycle.

Also, you don't mean page faults:  You mean TLB misses.  Page faults are huge and rare.  A TLB miss, at various levels of the hierarchy, is common.

You're right about the TLBs. Bucketing reduces TLB load misses from 2.8 billion to 0.56 billion
(even while increasing TLB store misses from 1.7 million to 467 million).

My stalled cycles do not appear to be due to cache misses, though.
The bucketed version has slightly more cache misses; 3.51 billion versus 3.14 billion.

Maybe the stalled cycles are also due to the TLB misses?!
dga
hero member
Activity: 737
Merit: 511
Next on the to-do list is another of Dave's suggestions:
to bucketize the array of nodes to work on, to improve
main-memory locality. That will help the edge trimming,
which works on only one edge endpoint at a time,
but won't help the original algorithm, which needs to work
on both edge endpoints at the same time.
So that may well achieve speed-parity...

And so it did! Reducing the number of page faults is a big gain.

Actually, the bucketing doesn't affect page faults.
Profiling shows that the large speedup is due to big reductions in
stalled-cycles-frontend and stalled-cycles-backend. Somehow
alternating siphash computations with main memory accesses causes
lots of stalls, while doing a bunch of hashes followed by a bunch of
main memory accesses is much better.
Damn; performance optimization is hard...

Bucketing at the size you're doing isn't about page faults, it's more about cache misses.  A cache miss causes a stalled cycle.

Also, you don't mean page faults:  You mean TLB misses.  Page faults are huge and rare.  A TLB miss, at various levels of the hierarchy, is common.

You can eliminate the TLB misses as a factor by allocating your data structures in hugepages.  see, for example,

http://lxr.free-electrons.com/source/tools/testing/selftests/vm/map_hugetlb.c
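The core of that selftest boils down to something like this (a trimmed-down sketch, not the file verbatim; you need hugepages reserved first, e.g. via /proc/sys/vm/nr_hugepages):

#include <sys/mman.h>
#include <stdio.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000                        /* as defined in the kernel selftest */
#endif
#define LENGTH (256UL * 1024 * 1024)               /* example size: 256 MB */

int main(void) {
  void *buf = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
  /* put the node/edge arrays here so each TLB entry covers a 2 MB page */
  munmap(buf, LENGTH);
  return 0;
}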

legendary
Activity: 990
Merit: 1108
Next on the to-do list is another of Dave's suggestions:
to bucketize the array of nodes to work on, to improve
main-memory locality. That will help the edge trimming,
which works on only one edge endpoint at a time,
but won't help the original algorithm, which needs to work
on both edge endpoints at the same time.
So that may well achieve speed-parity...

And so it did! Reducing the number of page faults is a big gain.

Actually, the bucketing doesn't affect page faults.
Profiling shows that the large speedup is due to big reductions in
stalled-cycles-frontend and stalled-cycles-backend. Somehow
alternating siphash computations with main memory accesses causes
lots of stalls, while doing a bunch of hashes followed by a bunch of
main memory accesses is much better.
Damn; performance optimization is hard...
legendary
Activity: 990
Merit: 1108
The new code is up at https://github.com/tromp/cuckoo,
as Makefile targets cuckooNN (for various values of NN).
The old code is now only useful for small instances and
relative easiness > 50% (at which edge trimming fails).

Apparently this is due to the use of C++ atomic datatypes.
Experiments using plain non-atomic datatypes instead
show negligible differences, so that is now the default.

Current performance is listed in the README as

 6) running time on high end x86 is 9min/GB single-threaded, and 1min/GB for 20 threads.

The big open question is still how well a GPU can perform Cuckoo Cycle.

Most main memory accesses in (the current algorithm for) Cuckoo Cycle
are to different 64-byte segments (GPU cache line size), making latency
of prime importance. And that's where GPUs fare worse than CPUs...

legendary
Activity: 990
Merit: 1108
I haven't figured out how to achieve speed parity:-(

Suggestions for further improvement are more than welcome!

Next on the to-do list is another of Dave's suggestions:
to bucketize the array of nodes to work on, to improve
main-memory locality. That will help the edge trimming,
which works on only one edge endpoint at a time,
but won't help the original algorithm, which needs to work
on both edge endpoints at the same time.
So that may well achieve speed-parity...

And so it did! Reducing the number of page faults is a big gain.

Edge-trimming is 33% faster single-threaded, but multi-threading
speedups are very erratic: e.g. 7-11 threads are all worse than
6 threads, 12 is much faster but then performance slows down again through
19 threads, and finally more than doubles at 20 threads.

The new code is up at https://github.com/tromp/cuckoo,
as Makefile targets cuckooNN (for various values of NN).
The old code is now only useful for small instances and
relative easiness > 50% (at which edge trimming fails).
legendary
Activity: 990
Merit: 1108
I haven't figured out how to achieve speed parity:-(

Suggestions for further improvement are more than welcome!

Next on the to-do list is another of Dave's suggestions:
to bucketize the array of nodes to work on, to improve
main-memory locality. That will help the edge trimming,
which works on only one edge endpoint at a time,
but won't help the original algorithm, which needs to work
on both edge endpoints at the same time.
So that may well achieve speed-parity...
legendary
Activity: 990
Merit: 1108
At John's request, I've taken a look at this and scribbled up my initial comments.
They might be wrong!  Please take them in the spirit of continuing the discussion, not as the final word.
http://da-data.blogspot.com/2014/03/a-public-review-of-cuckoo-cycle.html
I'll post an addendum to the blog post (and here) with anything that's shown to be wrong.

Dave has shown my algorithm is not as tmto-hard as I thought it was.
Excellent work, Dave!

For now, I amended my README at https://github.com/tromp/cuckoo with the following

UPDATE: Dave Anderson proposed an alternative algorithm on his blog

http://da-data.blogspot.com/2014/03/a-public-review-of-cuckoo-cycle.html

that uses significantly less memory than mine at potentially less than an order of magnitude slowdown. I hope to implement his algorithm soon and quantify the "tomato" (his pronounceable spelling of tmto, or time-memory trade-off).

Baseline:  My code uses 10x less memory and runs 2x slower (on a laptop instead of an i7).  At 6x less memory, it should have speed parity.  Could be optimized substantially - I sent you some very hacked up code.  Happy to post it publicly if anyone's interested, but it's trashy - I just wrote it as a proof of concept for the pre-filtering before invoking the cycle finder.  But it gets the point across.

A complete working implementation of Dave's scheme is now available on my github.
Makefile targets cuckooNN still use the old memory-hungry algorithm,
while targets tomatoNN use his edge trimming (12 rounds by default).
I haven't figured out how to achieve speed parity:-(

Suggestions for further improvement are more than welcome!

-John
sr. member
Activity: 448
Merit: 250
Us boys over at MyriadCoin will keep an eye on this algorithm ;)
sr. member
Activity: 364
Merit: 250
I'm really quite sane!
I think this is really interesting, if only because I think the PoW problem hasn't been solved for good.
legendary
Activity: 990
Merit: 1108
To have 50% odds of at least one L-cycle edge, you need to sample about 1.5% of the edges;
log(n) doesn't cut it.

To quote myself back:  "Not guaranteed, but ... can always start over."  In other words, you can settle for a 20% chance if you're willing to run the sampling test a few times, if it saves you enough memory.
That's why I mentioned 50%. Settling for 20% or 50% makes little difference.
You can settle for 0.1% and log(n) still doesn't cut it.

The definition of the problem is to find a cycle, not the cycle.
Again, this makes no difference except for small constant factors, since you expect to
find only about 3 cycles on average.

Your writeup of Cuckoo Cycle, as of Mar 28 09:28, said:

"We introduce the very first proof-of-work system that is both verify-trivial and tmto-hard."

Period.  There were no qualifiers attached to that statement.  No conjectures, no conditionals, until section 10, where you note that it's a conjecture.  

Yes - if you said "We introduce the first Proof-of-Work scheme based upon graph-theoretical structures.  We hypothesize but have not proved that this PoW may lead to certain memory-hardness properties" I'd be totally fine with it, because it would be an accurate description of what you've done.

You're right. I was wrong to claim hardness without proof.
I rewrote the README and shall rewrite the paper as well when I have time,
possibly dropping the hardness hypothesis.

Create a bitvector representing the first half of the nonces.  Call it the "lower half".
Iterate on this bitvector using the full nonce state *multiple times* until you have eliminated all candidates in the "lower half" bitvector that you can relative to the full set of candidates in the lower half (some of which are being eliminated) and all candidate nonces in the "upper half" (which you're not eliminating because you don't have memory).
Then, in piecewise fashion, identify a size-appropriate subset of the specific incident edges on the lower half that are causing problems -- in other words, which nonces in the "upper half" are causing collisions in the "lower half" that prevent you from discarding some of the lower-half entries?
For that small subset (of size << 1/2N -- let's say 1/8th N), do the same process:  Eliminate them relative to your now-sparsified lower half and the complete upper half.  This should allow you to eliminate more of the lower-half edges.  Repeat until the lower-half is sufficiently sparse that you can represent it in reasonable compressed space, and move on to the upper half.

Thanks for elaborating on how to pass the incompressibility threshold.
Rephrasing, you paint half the edges red and half blue, and eliminate red leaf-edges.
Now remaining red edges are separated from the leaves by blue edges.
So you identify a small fraction of these separating blue edges that touch a remaining red edge,
and paint them purple. Identifying some leaf purple edges lets you eliminate some
more red edges. Choose different blue edges to paint purple and repeat.
If necessary, try to identify blue edges separated by two red edges from a leaf
by recursing deeper.
dga
hero member
Activity: 737
Merit: 511
Does this mean anything?  I don't know -- but it tells me *immediately* that the problem requires a different kind of analysis than what you would do for Momentum!  For example, what if you started by sampling lg(n) edges and then growing and pruning that set through 42 rounds? With some reasonable probability, you'd extend into one of the cycles.  Not guaranteed, but it doesn't have to be guaranteed - I can always start over.

To have 50% odds of at least one L-cycle edge, you need to sample about 1.5% of the edges;
log(n) doesn't cut it.

To quote myself back:  "Not guaranteed, but ... can always start over."  In other words, you can settle for a 20% chance if you're willing to run the sampling test a few times, if it saves you enough memory.

Remember, part of the game is that you have to take the entire context of the search process into account:  An attacker can change the random seed, as well, at some modest cost.  The definition of the problem is to find a cycle, not the cycle.  Again - time/memory.  There are a lot of different ways to try to exploit such a tradeoff.  This gets back to the issue I emailed you with about searching for a biased initial distribution.  Neither of us knows whether or not it would work.

Not at all.  The primes have an entirely different goal.  The point is that you're advertising cuckoo cycle as having a certain set of desirable properties.  Someone shopping for a PoW wants a certain set of properties - some people want the social/scientific good of primes.

So you're saying I mismarketed Cuckoo Cycle. I could have said:

Let's have a PoW based not on number-theoretical, but on graph-theoretical structures.
Existence of a certain small-size subgraph in a large random graph makes for an easily
verifiable PoW. We could look for cliques, or for ... cycles!

Then you'd be fine with it?!

Your writeup of Cuckoo Cycle, as of Mar 28 09:28, said:

"We introduce the very first proof-of-work system that is both verify-trivial and tmto-hard."

Period.  There were no qualifiers attached to that statement.  No conjectures, no conditionals, until section 10, where you note that it's a conjecture. 

Yes - if you said "We introduce the first Proof-of-Work scheme based upon graph-theoretical structures.  We hypothesize but have not proved that this PoW may lead to certain memory-hardness properties" I'd be totally fine with it, because it would be an accurate description of what you've done.

Yes.  It's non-trivial to "store this bit vector in a compact way" but it can be done.  See the SDSL link I provided, for example.

In the first round of full U and V trimming, you pass through a state of the bit vector
having about half 0s and half 1s. At that point it's completely incompressible.
Even with a 10% vs 90% distribution (that takes at least 2 full rounds to reach)
it would be practically infeasible to compress.


ok - one last spoonful and I'm done:

Create a bitvector representing the first half of the nonces.  Call it the "lower half".
Iterate on this bitvector using the full nonce state *multiple times* until you have eliminated all candidates in the "lower half" bitvector that you can relative to the full set of candidates in the lower half (some of which are being eliminated) and all candidate nonces in the "upper half" (which you're not eliminating because you don't have memory).

Then, in piecewise fashion, identify a size-appropriate subset of the specific incident edges on the lower half that are causing problems -- in other words, which nonces in the "upper half" are causing collisions in the "lower half" that prevent you from discarding some of the lower-half entries?

For that small subset (of size << 1/2N -- let's say 1/8th N), do the same process:  Eliminate them relative to your now-sparsified lower half and the complete upper half.  This should allow you to eliminate more of the lower-half edges.  Repeat until the lower-half is sufficiently sparse that you can represent it in reasonable compressed space, and move on to the upper half.
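Roughly, the first pass in code (a throwaway sketch only -- the recursive sub-partitioning that makes the memory saving real is omitted, and the full-size degree array below is exactly the state the real scheme would avoid holding all at once; sipnode is a stand-in for the miner's nonce-to-endpoint map):

#include <stdint.h>
#include <string.h>

#define NNONCES (1u << 26)
#define HALF    (NNONCES / 2)
#define NNODES  (1u << 25)                       /* assumed number of nodes per side */

/* placeholder for the miner's siphash-based nonce -> endpoint map */
static uint32_t sipnode(uint32_t nonce, int uorv) {
  uint64_t x = (2ULL * nonce + (unsigned)uorv) * 0x9E3779B97F4A7C15ULL;
  x ^= x >> 31; x *= 0xD6E8FEB86659FD93ULL; x ^= x >> 32;
  return (uint32_t)(x % NNODES);
}

static uint8_t live[HALF / 8];                   /* live-bits for the lower half only */
static uint8_t deg[NNODES];                      /* full-size here for clarity only */

static int is_live(uint32_t n) {                 /* upper-half nonces always count as live */
  return n >= HALF || ((live[n >> 3] >> (n & 7)) & 1);
}

/* One round: count endpoint degrees over all live edges, then clear
   lower-half nonces whose endpoint is a leaf (degree < 2). */
static void trim_round(int uorv) {
  memset(deg, 0, sizeof deg);
  for (uint32_t n = 0; n < NNONCES; n++)
    if (is_live(n)) { uint32_t v = sipnode(n, uorv); if (deg[v] < 2) deg[v]++; }
  for (uint32_t n = 0; n < HALF; n++)
    if (is_live(n) && deg[sipnode(n, uorv)] < 2)
      live[n >> 3] &= (uint8_t)~(1u << (n & 7));
}

void trim_lower_half(int rounds, int uorv) {
  memset(live, 0xff, sizeof live);
  while (rounds--) trim_round(uorv);
}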

Is it awesome?  No.  Have I thought about it for more than 5 minutes?  No.  Could an expert come up with a drastically better algorithm?  Probably.

Stop trying to defend your algorithm for a few minutes and start trying to attack it.  Your goal in this should not be to try to refute individual objections, because that process can go on forever, fruitlessly.  Refute categories of objections based upon sound reasoning and theoretical analysis of the difficulty of the underlying problem!

I'm done.
legendary
Activity: 990
Merit: 1108
Let's have a PoW based not on number-theoretical, but on graph-theoretical structures.
Existence of a certain small-size subgraph in a large random graph makes for an easily
verifiable PoW. We could look for cliques

Actually, that could be quite interesting. Since proof size grows quadratically with clique size,
you'd want to limit it to cliques of size 10 (giving proofs of 45 nonces).
One can set the desired probability of finding such a clique (in the range 1% to 50%)
and figure out what number of edges in a large random graph makes that happen.

As with Cuckoo Cycle, edge trimming could be successfully applied:
repeatedly count vertex degrees (using 4-bit counters) and remove all
edges incident to a vertex of degree < k-1, where k is the clique size.
Perhaps this even keeps up a high enough reduction rate that
you can afford to keep going until either nothing or a clique is left...
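One such round could look something like this (just a sketch, assuming a flat edge list and packing two 4-bit counters per byte):

#include <stdint.h>
#include <string.h>

#define NVERTS   (1u << 20)                      /* assumed number of vertices */
#define CLIQUE_K 10

static uint8_t cnt[NVERTS / 2];                  /* two 4-bit saturating counters per byte */

static void bump(uint32_t v) {
  unsigned shift = (v & 1) * 4;
  if (((cnt[v >> 1] >> shift) & 0xF) < 0xF)      /* saturate at 15 (>= k-1 is all we need) */
    cnt[v >> 1] += (uint8_t)(1u << shift);
}

static unsigned degree(uint32_t v) { return (cnt[v >> 1] >> ((v & 1) * 4)) & 0xF; }

/* One trimming round over an edge list; returns the number of surviving edges. */
size_t trim_round(uint32_t (*edges)[2], size_t nedges) {
  memset(cnt, 0, sizeof cnt);
  for (size_t i = 0; i < nedges; i++) { bump(edges[i][0]); bump(edges[i][1]); }
  size_t keep = 0;
  for (size_t i = 0; i < nedges; i++)
    if (degree(edges[i][0]) >= CLIQUE_K - 1 && degree(edges[i][1]) >= CLIQUE_K - 1)
      memcpy(edges[keep++], edges[i], sizeof edges[i]);
  return keep;
}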

We could make the following little table of "scientific" coins/PoWs:

                        chain-structure      cluster-structure
number-theoretic        primecoin            riecoin
graph-theoretic         cuckoo cycle         cliquecoin
legendary
Activity: 990
Merit: 1108
Given L=42 in the cuckoo cycle problem with N < 2^32,
if a graph G contains an L-cycle,
a random sample of x% of the edges in the graph will contain at least two edges in that L-cycle with probability > 999,999 / 1,000,000.
Math edit:  Replaced 10% with x% -- there is some x for which this is true. :)
You can go check the math yourself - it's just a binomial with p=x%.

You need about 33% for 1 in a million odds of less than 2 L-cycle edges in your sample.
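For reference, plugging into that binomial (sample fraction x, cycle length L = 42):

  P(fewer than 2 cycle edges in the sample) = (1-x)^42 + 42*x*(1-x)^41

At x = 0.33 this comes to about 5*10^-8 + 1.0*10^-6, i.e. roughly one in a million.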

Does this mean anything?  I don't know -- but it tells me *immediately* that the problem requires a different kind of analysis than what you would do for Momentum!  For example, what if you started by sampling lg(n) edges and then growing and pruning that set through 42 rounds? With some reasonable probability, you'd extend into one of the cycles.  Not guaranteed, but it doesn't have to be guaranteed - I can always start over.

To have 50% odds of at least one L-cycle edge, you need to sample about 1.5% of the edges;
log(n) doesn't cut it.
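(Working that out: 1 - (1-x)^42 = 0.5 gives x = 1 - 0.5^(1/42), which is about 1.6%, so roughly a 1.5% sample is needed just for even odds of catching a single cycle edge.)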

Not at all.  The primes have an entirely different goal.  The point is that you're advertising cuckoo cycle as having a certain set of desirable properties.  Someone shopping for a PoW wants a certain set of properties - some people want the social/scientific good of primes.

So you're saying I mismarketed Cuckoo Cycle. I could have said:

Let's have a PoW based not on number-theoretical, but on graph-theoretical structures.
Existence of a certain small-size subgraph in a large random graph makes for an easily
verifiable PoW. We could look for cliques, or for ... cycles!

Then you'd be fine with it?!

Yes.  It's non-trivial to "store this bit vector in a compact way" but it can be done.  See the SDSL link I provided, for example.

In the first round of full U and V trimming, you pass through a state of the bit vector
having about half 0s and half 1s. At that point it's completely incompressible.
Even with a 10% vs 90% distribution (that takes at least 2 full rounds to reach)
it may be too impractical to compress.
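As a rough bound: a bit vector with a fraction q of ones needs about H(q) = -q*log2(q) - (1-q)*log2(1-q) bits per position to encode losslessly. H(0.5) = 1, so the half-and-half state admits no compression at all, and even at 10% vs 90%, H(0.1) is about 0.47, i.e. barely a 2x saving before any decoding overhead.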

