Pages:

Author

Topic: The Ethereum Paradox - page 5. (Read 99921 times)

iamnotback

sr. member

Activity: 336

Merit: 265

Quote from: BitUsher on June 17, 2016, 04:13:17 PM

Look at how they pumped this undertested and buggy shit!!!

https://twitter.com/stephantual/status/743093798302015488

"Holding so much energy, the Colossus of DAO is able to withstand all threats" -

Incredibly unethical!

The theme of overpowering and disrupting the corrupt order of the world is a very idealistic and powerful meme.

Anyone marketing a DAO should be incorporating that meme.

Why unethical? Just because Tual can't code worth a shit doesn't mean the DAO as a concept is dead.

illodin

hero member

Activity: 966

Merit: 1003

Quote from: AlexGR on June 17, 2016, 03:17:28 PM

Quote from: BitUsher on June 17, 2016, 02:09:51 PM

https://twitter.com/el33th4xor/status/743881413355773952

Ouch. The "Stalker Attack" is real. I'm told that the DAO hacker voted YES on every split proposal, enabling him to raid children DAOs.

BitUsher

legendary

Activity: 994

Merit: 1035

Look at how they pumped this undertested and buggy shit!!!

https://twitter.com/stephantual/status/743093798302015488

"Holding so much energy, the Colossus of DAO is able to withstand all threats" -

Incredibly unethical!

AlexGR

legendary

Activity: 1708

Merit: 1049

Quote from: BitUsher on June 17, 2016, 02:09:51 PM

https://twitter.com/el33th4xor/status/743881413355773952

Ouch. The "Stalker Attack" is real. I'm told that the DAO hacker voted YES on every split proposal, enabling him to raid children DAOs.

BitUsher

legendary

Activity: 994

Merit: 1035

https://twitter.com/el33th4xor/status/743881413355773952

Ouch. The "Stalker Attack" is real. I'm told that the DAO hacker voted YES on every split proposal, enabling him to raid children DAOs.

r0ach

legendary

Activity: 1260

Merit: 1000

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: iamnotback on June 11, 2016, 01:11:04 AM

I forgot to thank tromp for his very valuable interaction. Thanks John.

You're welcome, Shelby.

hv_

legendary

Activity: 2548

Merit: 1055

Clean Code and Scale

Quote from: iamnotback on June 11, 2016, 01:11:04 AM

I forgot to thank tromp for his very valuable interaction. Thanks John.

Thx to you both. Always great readings!

iamnotback

sr. member

Activity: 336

Merit: 265

I forgot to thank tromp for his very valuable interaction. Thanks John.

iamnotback

sr. member

Activity: 336

Merit: 265

Quote from: tromp on June 10, 2016, 10:25:38 AM

Quote from: iamnotback on June 10, 2016, 09:15:52 AM

we could also consider enumerate multiple nonces and discarding those that don't match a subset of page rows, i.e. process the cuckoo table in chunks with multiple passes.

As mentioned before, this is exactly what higher values of PART_BITS achieve, as you can see in cuckoo_miner.h and cuda_miner.cu

Each trimming round uses 2^PART_BITS passes with the number of counters reduced by the same factor. With sufficient reduction, you end up using only one row in each memory bank.

What I meant before is that it isn't the same tradeoff as the h/w threads because as you said it increases the number hashes computed:

Quote from: iamnotback on June 09, 2016, 02:41:57 AM

Quote from: tromp on June 09, 2016, 02:09:40 AM

Alternatively, the algorithm parameter PART_BITS allows for a reduction in the number of counters in use at the same time, which is what your proposal essentially amounts to. Setting this to 21 will require only 2^8 counters, one per memory bank. But now your hash computations have increased by a factor 2^21, over 2 million.

No that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row.

The point is the ASIC can balance some of the two strategies to find the optimum. The CPU can't effectively leverage either strategy.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: iamnotback on June 10, 2016, 09:15:52 AM

we could also consider enumerate multiple nonces and discarding those that don't match a subset of page rows, i.e. process the cuckoo table in chunks with multiple passes.

As mentioned before, this is exactly what higher values of PART_BITS achieve,
as you can see in cuckoo_miner.h and cuda_miner.cu

Each trimming round uses 2^PART_BITS passes with the number of counters reduced by the same factor.
With sufficient reduction, you end up using only one row in each memory bank.

iamnotback

sr. member

Activity: 336

Merit: 265

Quote from: tromp on June 10, 2016, 08:28:47 AM

Quote from: iamnotback on June 10, 2016, 07:30:00 AM

The ASIC can choose to use DRAM instead and amortize the power consumption over the row buffer.

Here's a possible concrete design of what you are hinting at:

The ASIC will have as many (pipelined) siphash circuits as are needed to max out the memory
bandwidth of all the DRAM you're gonna hook up to it.
For each row on each memory bank, the ASIC has a queue of in-row counter-indices.
Each entry in the queue thus represents a "stalled thread".
The siphash circuits fill up these queues at a mostly uniform rate.
Once a queue fills up it is flushed to one of the memory controllers on the ASIC that
will perform the corresponding atomic updates on the memory row.

This looks pretty efficient, but all those queues together actually take up a lot of memory,
which the ASIC needs to access randomly.

So it seems we've just recreated the problem we were trying to solve.
But we can aim for the total queue memory to be maybe 1% of the total DRAM.

There's a lot of parameters here and many different ways to optimize with
combinations of speed, energy cost, and hardware cost...

Right we can't queue any where near 2^14 counter indices per row, else nothing gained. Good point.

If the compute units are much more efficient and faster than we can fully leverage within the memory bandwidth and optimal counter indices queue limits, we could also consider enumerate multiple nonces and discarding those that don't match a subset of page rows, i.e. process the cuckoo table in chunks with multiple passes. So that decreases the effective memory size for each pass. The edge pruning would need to be done after all passes. Can't the pruning be done without random access?

Basically what I am driving at is we can't rely on the row buffer load power cost being greater than only 1 hash. We'd need wider margins to be very sure of ASIC resistance, such as 1024 row buffer loads for 1 hash. Wink

tromp you are a smart guy, so please don't go publishing my idea before I get to market. You might be able to deduce what I am up to based on the prior paragraph.

You'll receive proper credit as you deserve it. This is your invention.

Don't ruin your future fame by deducing and publishing what I am up to prematurely, given the liberal hints I have given you and then ruin the ability to leverage in the market with first mover advantage. Some shitcoin take the idea and run with it, then the impact will be diluted. Please for the love of God, don't have start having delusions that Blockstream is your friend. Do you enjoy being used.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: iamnotback on June 10, 2016, 07:30:00 AM

The ASIC can choose to use DRAM instead and amortize the power consumption over the row buffer.

iamnotback

sr. member

Activity: 336

Merit: 265

Quote from: tromp on June 10, 2016, 06:37:41 AM

Quote from: iamnotback on June 10, 2016, 05:20:05 AM

Quote from: tromp on June 10, 2016, 04:40:21 AM

You need millions to find 2^12 of them wanting to access a single row.
With 2^15 threads you expect 1 access per row.

Correct on 2^15 but not on the conclusion.

With 2^27 you expect 2^12 accesses per row. That's hundreds of millions.

That would give you 2^12 hashes per row buffer load. That isn't necessary to break ASIC resistance. 2^18 (256k threads) gives 8 hashes per row buffer load and might be enough to make it computation bound to a range of perhaps 100-to-1 improvement of power consumption on the ASIC.

Or in any case, 2^20 (million threads) gives 32 hashes per row buffer.

And remember my use case is 2^20 counters not 2^29, so in my use case the ASIC only needs 1000s of threads to make it computation bound up to the limits of the maximum efficiency advantage of the ASIC.

Quote from: tromp on June 10, 2016, 06:37:41 AM

Quote from: iamnotback on June 10, 2016, 05:20:05 AM

A CUDA gpu can apparently do 671 million threads

Not in hardware. It can only run a few thousand in hardware (simultaneously).
So the 671 have to run in thousands of batches, one after the other.

Yeah I also stated in the prior post that the GPU isn't likely optimized enough to cause major advantage. The ASIC is the big concern.

Quote from: tromp on June 10, 2016, 06:37:41 AM

Quote from: iamnotback on June 10, 2016, 05:20:05 AM

I don't think it is safe to assume that ASICs can't be designed to support millions of very efficient threads for this very customized computation

Your ASIC would require an insanely huge die and be orders of magnitude more expensive than the DRAM it tries to optimize access to.

Certainly not in my use case of 2^20 counters.

And in the 2^29 counters use case, please remember I said that you don't need millions of compute units. The compute units can be shared by the non-stalled threads and even these can be pipelined and queued. So you don't need millions of compute units circuits. The millions is only the logical threads.

Quote from: tromp on June 10, 2016, 06:37:41 AM

Quote from: iamnotback on June 10, 2016, 05:20:05 AM

and again sharing the computation transistors amongst only the threads that aren't stalled, thus not needing millions of instances of the compute units.

How would you know what to stall without doing its computation first??

That is irrelevant. Think it out. It is a statistical phenomenon. Not all the computations are computed monolithically followed all the syncs. These stages are happening in parallel for 2^8 memory banks (row buffers), so we can have many threads stalled yet some not. And even the computations can be pipelined and queued. Let's not conflate power consumption goal with speed goal.

It is very very difficult to defeat the ASIC. I think I know how to do it though.

Quote from: tromp on June 10, 2016, 06:37:41 AM

Quote from: iamnotback on June 10, 2016, 05:20:05 AM

Also remember I need for instant transactions to have a very fast proving time, which thus means at 2^20 counters it could be trivially parallelized losing ASIC resistance with only thousands of threads.

For such small instances, latency is not that relevant as it all fits in SRAM cache.

Latency is a speed bound concern. Our concern herein is power consumption bound. That is what I am getting you to focus on these being orthogonal issues.

The ASIC can choose to use DRAM instead and amortize the power consumption over the row buffer. So I am not yet sure if that would give the ASIC an advantage. Although I've read that SRAM is 10X faster and 10X more power consumption (thus being neutral relative to the DRAM), I am wondering if the ASIC can coalesce (sync) all the reads in the row buffer for DRAM, if the DRAM might not have a significant power advantage thus rendering it computation bound for power consumption.

Do you have any idea? I was planning to study the equations for memory power consumption in the reference I cited upthread.

Note I think on Android it may be possible to turn off the cache.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: iamnotback on June 10, 2016, 05:20:05 AM

Quote from: tromp on June 10, 2016, 04:32:15 AM

In fact siphash-2-4 is already overkill for my purposes, with siphash-4-8 being close to cryptographically secure. Your attack is about as conceivable to me as P being equal to NP, in which case there is no cryptographically secure hash function anyway.

Fact? Based on what published cryptanalysis.

I meant to say: In fact I believe that

Quote from: tromp on June 10, 2016, 04:40:21 AM

You need millions to find 2^12 of them wanting to access a single row.
With 2^15 threads you expect 1 access per row.

Correct on 2^15 but not on the conclusion.

With 2^27 you expect 2^12 accesses per row. That's hundreds of millions.

Quote

A CUDA gpu can apparently do 671 million threads

Not in hardware. It can only run a few thousand in hardware (simultaneously).
So the 671 have to run in thousands of batches, one after the other.

Quote

I don't think it is safe to assume that ASICs can't be designed to support millions of very efficient threads for this very customized computation

Your ASIC would require an insanely huge die and be orders of magnitude more expensive than the DRAM it tries to optimize access to.

Quote

and again sharing the computation transistors amongst only the threads that aren't stalled, thus not needing millions of instances of the compute units.

How would you know what to stall without doing its computation first??

Quote

For such small instances, latency is not that relevant as it all fits in SRAM cache.

iamnotback

sr. member

Activity: 336

Merit: 265

Quote from: tromp on June 10, 2016, 04:32:15 AM

Quote from: iamnotback on June 09, 2016, 02:09:30 PM

I am just arguing why not be safer than sorry when so much is at stake? Why would you recommend adding unnecessary risk even where you think you are omniscient and there is no risk?

I only need good statistical properties of the hash function, not cryptographic security.

Pseudorandomness ("good statistical properties") is tied into the cryptographic security:

Quote from: tromp on June 10, 2016, 04:32:15 AM

Fact? Based on what published cryptanalysis.

Quote from: tromp on June 10, 2016, 04:32:15 AM

Quote from: iamnotback on June 09, 2016, 02:09:30 PM

What I am saying is that the edge trimming eliminates most of the nodes from consideration and then you have a hash table representing a sparse array for the remainder of the nodes which aren't the pruned/trimmed leaf edges.

Yes, totally correct. But in my view the term "bucket" denotes a fixed size unordered container, and my current algorithms use no such thing.

Agreed 'bucket' would not be an ordered array element. That is a more precise use of the term 'bucket'. We programmers who started hacking since the 1970s and 1980s may be a little bit more loose (employing the general English definition) with our terminology (using 'bucket' to refer to an array element as an array bucket) as compared to someone who came up predominantly in an academic setting and especially someone employing technical English not as their native language wherein they would likely be more strict about definitions. I stand corrected on the more precise terminology.

Quote from: tromp on June 10, 2016, 04:32:15 AM

Quote from: iamnotback on June 09, 2016, 02:09:30 PM

Well my wink is about using maximum nonce counts, i.e. edge fraction M/N > 1/2.

My algorithms break down in that case, since the expected number of cycles becomes non-constant, causing the basic algorithm to overlook an increasing fraction of them, and edge-trimming fails altogether to remove a large fraction of edges.

Exactly.

In a few months, you'll know what I was referring to when it is published.

Quote from: tromp on June 10, 2016, 04:40:21 AM

Quote from: iamnotback on June 09, 2016, 06:18:37 PM

My calculation was 2^15, that isn't a million.

You need millions to find 2^12 of them wanting to access a single row.

With 2^15 threads you expect 1 access per row.

Correct on 2^15 but not on the conclusion. We both agree the number of memory banks is irrelevant for this concern. Memory banks only determine how many row buffers are active simultaneously, which impacts maximum speed and whether we are memory bandwidth or memory latency bound on speed of execution.

So for the power consumption issue, if we have 2^29 counters (2-bits each) and 2^14 counters per memory page/row, then 2^16 (65536) h/w threads means we can stall threads (perhaps not permanently so as to account for variance outliers) and be roughly assured statistically of roughly 2 accesses per row. That already doubles the hash computation per memory page row.

Thus we don't need millions of h/w threads to convert the algorithm to computation bound on power consumption, and thus making it not ASIC resistant.

At 2^18 (262144) h/w threads we have 8 times more computation per memory page row than the CPU (and the computation will be orders-of-magnitude more power efficient on the ASIC).

A CUDA gpu can apparently do 671 million threads, although this are probably not synced on memory row buffers as we would need here, although the statistical spread might be sufficient without syncing. I think if you had the Kill-A-Watt meter what you would have observed is that by increasing threads, the speed remained topped out at memory bandwidth, but the power consumption of the GPU decreases as threads are increased beyond a certain threshold (assuming GPU threads are very efficient, but that might not be the case).

I don't think it is safe to assume that ASICs can't be designed to support millions of very efficient threads for this very customized computation and again sharing the computation transistors amongst only the threads that aren't stalled, thus not needing millions of instances of the compute units. The GPU may already be doing this, although maybe not since it is probably optimized for performance and not power consumption where there is a choice, although once it has hit memory bandwidth bound one might hope the GPU would be optimizing for minimizing power consumption by maximizing row buffer coalescing, but this is perhaps too costly to implement in the general case of GPU (but I am arguing not in the case of an ASIC with a specific custom computation circuit). Apparently GPU memory coalescing is not powerful enough to do what we require.

Also remember I need for instant transactions to have a very fast proving time, which thus means at 2^20 counters it could be trivially parallelized losing ASIC resistance with only thousands of threads.

Actually how ironic that the 2-bit counters instead of the original basic algorithm, makes the ASIC resistance worse, because more counters fit in a row. Remember I wrote this about "bit array" in my 2013 rough draft paper on memory hard proof-of-work.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: iamnotback on June 09, 2016, 06:18:37 PM

My calculation was 2^15, that isn't a million.

You need millions to find 2^12 of them wanting to access a single row.

With 2^15 threads you expect 1 access per row.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: iamnotback on June 09, 2016, 02:09:30 PM

I am just arguing why not be safer than sorry when so much is at stake? Why would you recommend adding unnecessary risk even where you think you are omniscient and there is no risk?

I only need good statistical properties of the hash function, not cryptographic security.
In fact siphash-2-4 is already overkill for my purposes, with siphash-4-8 being
close to cryptographically secure.
Your attack is about as conceivable to me as P being equal to NP,
in which case there is no cryptographically secure hash function anyway.

Quote

Yes, totally correct. But in my view the term "bucket" denotes a fixed size unordered container,
and my current algorithms use no such thing.

Quote

Well my wink is about using maximum nonce counts, i.e. edge fraction M/N > 1/2.

My algorithms break down in that case, since the expected number of cycles becomes non-constant,
causing the basic algorithm to overlook an increasing fraction of them,
and edge-trimming fails altogether to remove a large fraction of edges.

iamnotback

sr. member

Activity: 336

Merit: 265

Quote from: tromp on June 09, 2016, 05:55:11 PM

Quote from: iamnotback on June 09, 2016, 12:33:30 PM

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bounded on memory latency rather on memory bandwidth and moreover that there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus actually your algorithm is computation bound and will be 1000s times more power efficient on the ASIC.

I stated earlier

Quote

To avoid latency, you could increase the number of memory banks by a factor of 2^7,
but this similarly increases die area, hardware costs, and energy usage.

Your above proposal of simultaneously running millions of hardware threads and needing to sync them
incurs much worse increases in die area (to the point of not fitting on any single die),
hardware costs, and energy usage.

My calculation was 2^15, that isn't a million.

Did you really design these threads already. I bet they can use a minimal number of transistors each by reusing some components for threads that are stalled.

Multiple CPUs can access the same memory. Perhaps the same could be done if multiple dies are needed.

Afaik, the hardware cost is rather insignificant relative to the relative power consumption efficiency cost when it comes to mining economics for ASIC farms.

tromp

legendary

Activity: 1000

Merit: 1120

Quote from: iamnotback on June 09, 2016, 12:33:30 PM

I stated earlier

Quote

To avoid latency, you could increase the number of memory banks by a factor of 2^7,
but this similarly increases die area, hardware costs, and energy usage.