
Topic: Zcash Equihash PoW Released - page 2. (Read 7767 times)

hero member
Activity: 672
Merit: 500
April 18, 2016, 04:30:33 AM
#15
The GPU uses some coalescing to minimize latency and maximize utilization of memory bandwidth. Isn't that likely to merge accesses and thus amortize?

Excuse me, we are getting into an area of hardware that I have not researched well.
Coalescing patterns are variable across implementations. A common point across most hardware is the occupancy ratio, i.e. hardware-based, context-switch-free multithreading of a sort. Intel calls it Hyper-Threading®, and it's guaranteed to work on two threads, whereas GPUs accommodate as many threads as they can fit. In my experience it is considerably more consistent and easier to use than exploiting coalescing patterns.
sr. member
Activity: 420
Merit: 262
April 18, 2016, 02:26:37 AM
#14
What about multiple processors sharing the same memory to amortize the memory electrical cost?

If they share the same memory bank then they likely cause more row switches and there is negligible amortization.

The GPU uses some coalescing to minimize latency and maximize utilization of memory bandwidth. Isn't that likely to merge accesses and thus amortize?

Excuse me, we are getting into an area of hardware that I have not researched well.
legendary
Activity: 990
Merit: 1108
April 17, 2016, 08:56:29 PM
#13
Cuckoo spends only 1/3 of its time doing hash computations, which take little power compared to DRAM switching rows.
So the possible gain by ASIC hashing is less than 33%.

Sorry, not trying to make more work for you, but would you happen to have a reference for that claim? I am curious how you measured it. I remember you told smooth and me that you don't have a Kill-A-Watt meter to measure GPU power consumption.

I didn't measure that. I estimated that doing 90 64-bit operations (one siphash call) is cheaper than operating the (typically 16k)
sense amplifiers to read a new bitline. But I could be wrong about that...
I'm assuming a Cuckoo size much larger than 16k of course.
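Whether that estimate holds depends entirely on the per-operation energies. A back-of-envelope sketch of the comparison, where every picojoule constant is a purely illustrative assumption (nothing below is measured):

```python
# Back-of-envelope: energy of one siphash call vs. one DRAM row activation.
# All per-op energy constants are illustrative assumptions, not measurements.

ALU_OP_PJ = 1.0             # assumed picojoules per 64-bit ALU operation
SENSE_AMP_PJ = 0.05         # assumed picojoules per sense amplifier fired
ROW_WIDTH_BITS = 16 * 1024  # typical row width (16k sense amps), per the post

siphash_pj = 90 * ALU_OP_PJ                      # ~90 64-bit ops per siphash
row_activation_pj = ROW_WIDTH_BITS * SENSE_AMP_PJ

print(f"siphash call  : {siphash_pj:7.1f} pJ")
print(f"row activation: {row_activation_pj:7.1f} pJ")
print(f"ratio         : {row_activation_pj / siphash_pj:.1f}x")
```

With these made-up constants a row activation costs several times a siphash call, which is the direction of tromp's estimate; different constants could easily flip the conclusion, which is his caveat.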

Quote
What about multiple processors sharing the same memory to amortize the memory electrical cost?

If they share the same memory bank then they likely cause more row switches and there is negligible amortization.

Quote
I thought Wagner's algorithm sorted. I'll have to reread the paper more carefully.

That puzzled me too. I don't see a need for sorting, just for binning to find collisions on the next chunk of n/(k+1) bits,
and I will only believe sorting is more performant when I see it (i.e. when optimised code for that is made public).

Quote
If lookup is in the static RAM cache, then ratio of computation to memory power consumption should increase.

Their stated goal is a 1GB memory footprint, so we can rule out fitting in cache.
legendary
Activity: 1456
Merit: 1000
April 17, 2016, 06:51:43 PM
#12
I'm interested to give this a test drive and get this working on some rigs.
sr. member
Activity: 420
Merit: 262
April 17, 2016, 01:49:58 PM
#11
1. You didn't address my electrical efficiency point, which I think is probably the most damning. I very much doubt that the memory power consumption will be even 1/10 of the computational power consumption. Thus an ASIC will be at least 10 times more power efficient. Don't I remember pointing out the same issue with your Cuckoo hash? What was your retort again?

Cuckoo spends only 1/3 of its time doing hash computations, which take little power compared to DRAM switching rows.
So the possible gain by ASIC hashing is less than 33%.

Sorry, not trying to make more work for you, but would you happen to have a reference for that claim? I am curious how you measured it. I remember you told smooth and me that you don't have a Kill-A-Watt meter to measure GPU power consumption.

What about multiple processors sharing the same memory to amortize the memory electrical cost?

2. I also very much doubt it will be latency bound (even after the 10X speed up of the computation), because optimized sorting doesn't have to be random access.

Again, we must wait to see the final parameters and optimized implementations. It seems likely they'll pick
n=144 and k=5, which means they need to find collisions on 16 bits at a time. That is just a lookup in a 64K entry table;
I don't expect to see any real sorting there. It's less clear how the actual items are represented, and if they are moved
themselves, or by pointer.

I thought Wagner's algorithm sorted. I'll have to reread the paper more carefully.

If lookup is in the static RAM cache, then ratio of computation to memory power consumption should increase.

Yeah we need to analyze the specifics, which is why I am thinking their white paper is myopic. They didn't present any of this analysis afaics on quick perusal. Perhaps they just assumed a memory bandwidth bound.
legendary
Activity: 990
Merit: 1108
April 17, 2016, 09:47:39 AM
#10
1. You didn't address my electrical efficiency point, which I think is probably the most damning. I very much doubt that the memory power consumption will be even 1/10 of the computational power consumption. Thus an ASIC will be at least 10 times more power efficient. Don't I remember pointing out the same issue with your Cuckoo hash? What was your retort again?

Cuckoo spends only 1/3 of its time doing hash computations, which take little power compared to DRAM switching rows.
So the possible gain by ASIC hashing is less than 33%.
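That bound is just Amdahl's law applied to the hashing fraction. A minimal sketch (my framing, not code from the thread):

```python
def time_reduction(hash_fraction, asic_speedup):
    """Amdahl's law: only the hashing part of the runtime benefits
    from ASIC acceleration; the DRAM part is untouched."""
    new_time = (1 - hash_fraction) + hash_fraction / asic_speedup
    return 1 - new_time  # fraction of total runtime saved

# Even an infinitely fast hasher cannot save more than the 1/3
# of the time Cuckoo spends hashing:
print(f"{time_reduction(1/3, 1e12):.3f}")
```

So an ASIC that made hashing free would still leave the 2/3 of the runtime spent on memory untouched, capping the gain below 33%.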

Quote
2. I also very much doubt it will be latency bound (even after the 10X speed up of the computation), because optimized sorting doesn't have to be random access.

Again, we must wait to see the final parameters and optimized implementations. It seems likely they'll pick
n=144 and k=5, which means they need to find collisions on 16 bits at a time. That is just a lookup in a 64K entry table;
I don't expect to see any real sorting there. It's less clear how the actual items are represented, and if they are moved
themselves, or by pointer.
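The table-lookup approach tromp describes can be sketched in pure Python; the item representation below is hypothetical (a plain list of hash-derived integers), not the zcash implementation:

```python
import hashlib

CHUNK_BITS = 16                      # chunk width from the post (64K buckets)
MASK = (1 << CHUNK_BITS) - 1

def find_collisions(values):
    """Bucket values by their low CHUNK_BITS bits; any two values in
    the same bucket collide on that chunk. No sorting required."""
    table = [[] for _ in range(1 << CHUNK_BITS)]   # the 64K entry table
    for idx, v in enumerate(values):
        table[v & MASK].append(idx)
    pairs = []
    for bucket in table:
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                pairs.append((bucket[i], bucket[j]))
    return pairs

# Toy input: 64-bit hash-derived values (blake2b here is just a stand-in).
values = [int.from_bytes(hashlib.blake2b(i.to_bytes(2, "little"),
                                         digest_size=8).digest(), "little")
          for i in range(5000)]
collisions = find_collisions(values)
```

Each round of Wagner-style collision finding would then repeat this binning pass on the XORs of the colliding pairs, moving either the items themselves or pointers to them.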
sr. member
Activity: 420
Merit: 262
April 17, 2016, 03:20:02 AM
#9
Additionally, I was thinking that Equihash can be trivially sped up on an ASIC, because the algorithm is not memory latency bound and thus is bound by the sorting and computation speed and/or the memory bandwidth, which can be optimized with specialized hardware.

There is no publicly available optimized code for Equihash, so we don't really know to what extent it
is bandwidth or latency bound. There may be subtle trade offs between the two. We really need to see the memory behaviour of actual running optimized code.

1. You didn't address my electrical efficiency point, which I think is probably the most damning. I very much doubt that the memory power consumption will be even 1/10 of the computational power consumption. Thus an ASIC will be at least 10 times more power efficient. Don't I remember pointing out the same issue with your Cuckoo hash? What was your retort again?

2. I also very much doubt it will be latency bound (even after the 10X speed up of the computation), because optimized sorting doesn't have to be random access.

For any memory bandwidth bound, there is this:

legendary
Activity: 3444
Merit: 1061
April 17, 2016, 03:02:46 AM
#8
There is no publicly available optimized code for Equihash, so we don't really know to what extent it
is bandwidth or latency bound. There may be subtle trade offs between the two. We really need to see the memory behaviour of actual running optimized code.

Even once zcash gets around to publishing more optimized cpu and gpu code,
it won't be clear for a while whether further optimizations are possible, considering that the mining
algorithm is rather nontrivial.

I'm somewhat worried by this statement on their optimization page at https://github.com/zcash/zcash/issues/857

Quote
We don't have to do every possible optimization, but enough that it's feasible to run a miner with the right parameters.

Leaving optimizations on the table means that some miners will be able to run more efficiently than others.

Calling all miner developers in the house.
legendary
Activity: 990
Merit: 1108
April 16, 2016, 04:47:54 PM
#7
Re: Zcash Equihash PoW Released
If you look at page 115 of Bernstein's Post Quantum Cryptography, it confirms a sqrt(N) speedup for Wagner's solution to the Generalized Birthday Problem on a hypothetical quantum computer.

You seem to have misinterpreted the book. Page 115 says
Quote
If both lists have the same size sqrt(N), this means that the merging can be done in sqrt(N) operations instead of N, which is the same speed-up that can be achieved by Grover's algorithm. Thus, even with a quantum computer we can not expect to get attacks for FSB or CFS

That says there is *no* known quantum speedup for the Generalized Birthday Problem.

Quote
Additionally, I was thinking that Equihash can be trivially sped up on an ASIC, because the algorithm is not memory latency bound and thus is bound by the sorting and computation speed and/or the memory bandwidth, which can be optimized with specialized hardware.

There is no publicly available optimized code for Equihash, so we don't really know to what extent it
is bandwidth or latency bound. There may be subtle trade offs between the two. We really need to see the memory behaviour of actual running optimized code.

Even once zcash gets around to publishing more optimized cpu and gpu code,
it won't be clear for a while whether further optimizations are possible, considering that the mining
algorithm is rather nontrivial.

I'm somewhat worried by this statement on their optimization page at https://github.com/zcash/zcash/issues/857

Quote
We don't have to do every possible optimization, but enough that it's feasible to run a miner with the right parameters.

Leaving optimizations on the table means that some miners will be able to run more efficiently than others.
sr. member
Activity: 420
Merit: 262
April 16, 2016, 12:53:56 PM
#6
Re: Zcash Equihash PoW Released

https://z.cash/blog/why-equihash.html

Well, some Monero folks seem to be very interested, so I decided to take a look, because I suddenly remembered something from Iota's whitepaper:

http://188.138.57.93/tangle.pdf#page=20 (4.3 Resistance to quantum computations)

If you look at page 115 of Bernstein's Post Quantum Cryptography, it confirms a sqrt(N) speedup for Wagner's solution to the Generalized Birthday Problem on a hypothetical quantum computer. So if Equihash is using a large list of values to be sorted, then the speedup could be so great that a quantum computer could rewrite the entire block chain quite easily by redoing the past proof-of-work quadratically faster than it was originally done.

It appears that a memory-hard algorithm such as CryptoNight would not have this problem.

I am just rushing and have not reread the Equihash white paper carefully, so I may have a mistake in my analysis.

Additionally, I was thinking that Equihash can be trivially sped up on an ASIC, because the algorithm is not memory latency bound and thus is bound by the sorting and computation speed and/or the memory bandwidth, which can be optimized with specialized hardware.

Have I missed something in my haste? Surely the Zcash folks are not this myopic.

The following assumption from the Equihash white paper seems to be wrong:

Quote
For fixed memory size, memory chips with bandwidth significantly higher than that of typical desktop memory (such as DDR3) are rare. Assuming that a prover does not have access to memory with bandwidth higher than certain Bw_max, we can efficiently bound the time-memory (and thus the time-area) product for such implementations.

...

To the best of our knowledge, the highest reported bandwidth in commercial products does not exceed 200 GB/s (Playstation 4 has 172 GB/s bandwidth [5]), whereas the desktop DDR3 can have as high as 17 GB/s bandwidth [3]. Thus we conclude that the highest advantage a prover can get from parallel hardware does not exceed the factor of 15. This may mean that our proof-of-work is GPU- and desktop-friendly, whereas it is also ASIC-resistant in the sense that an ASIC implementation does not yield smaller time-area product.

I am thinking faster memory bandwidth can be obtained by reading from and writing to multiple redundant, independent memory banks simultaneously, interleaved differently so that bank collisions are disjoint.

Or even if not faster, then perhaps orders of magnitude more electrically efficient, since the electrical consumption will be primarily in the computation and not in the memory. And computation can be radically optimized on an ASIC.
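The redundant-interleaving idea above can be pictured with a toy model; NUM_BANKS and the XOR swizzle are illustrative assumptions, not any real memory controller:

```python
# Toy model: two redundant copies of memory, interleaved differently,
# so two simultaneous accesses can usually be served conflict-free.

NUM_BANKS = 8

def bank_plain(addr):
    # Copy A: straightforward low-bit interleaving.
    return addr % NUM_BANKS

def bank_swizzled(addr):
    # Copy B: XOR-folded interleaving (an assumed swizzle), so addresses
    # that collide on copy A's banks tend to land on different banks here.
    return (addr ^ (addr >> 3)) % NUM_BANKS

def pick_copy(addr_a, addr_b):
    """Given two simultaneous accesses, serve them from whichever copy
    maps them to different banks, when possible."""
    if bank_plain(addr_a) != bank_plain(addr_b):
        return "A"        # no conflict on the plain copy
    if bank_swizzled(addr_a) != bank_swizzled(addr_b):
        return "B"        # the redundant copy resolves the conflict
    return "stall"        # conflict on both copies: must serialize
```

For example, addresses 0 and 8 hit the same bank on copy A but different banks on copy B, so the pair can proceed in parallel; only accesses that conflict under both interleavings stall.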

As I predicted when I wrote in the Monero thread about this the other day, it appears the white paper didn't consider electricity as the most important factor. Duh.
sr. member
Activity: 420
Merit: 262
April 15, 2016, 01:21:50 PM
#5
I can't right now. Sorry. tromp is also qualified, but he is opinionated about the value of asymmetric proof-of-work versus memory hard hash algorithms.
legendary
Activity: 1256
Merit: 1009
April 15, 2016, 01:20:17 PM
#4
Interesting..

Please elaborate.

I understand very little about PoW, and I understand my capacity enough to know that learning will be less fruitful than reading from educated observers. Wolf0, tbtb_need_war, claymore, Gregory Maxwell, etc. would all be able to give good insight.
legendary
Activity: 1414
Merit: 1001
To weird to live To rare to die
April 15, 2016, 12:44:17 PM
#3
Well, on page 6 part of the equation is infected with HIV, and at first glance, without reading everything, it seems they propose a band-aid to solve the problem.
legendary
Activity: 1582
Merit: 1001
April 15, 2016, 11:29:57 AM
#2
Interesting..

Please elaborate.