Pages:
Author

Topic: Proof of Storage to make distributed resource consumption costly. (Read 24007 times)

hero member
Activity: 543
Merit: 501
How do you use that method to tell that a host is storing say, 2^12 plots (1GB)? There's no way for you to verify that the value presented by the host is the optimal value within the set of data that they are storing, except for you to somehow incentivize their finding of the optimal value. And then you have to do probabilistic analysis of repeated solutions to determine if they're scanning over the whole set each time.

As a method of consensus, this seems vulnerable because H(previous-block | block height) is something that the miner can control. If the miner is controlling 10% of the plots, but has enough time each time a block is found to grind over 100 potential values for `previous-block`, then the miner is very likely to find at least one value of `previous-block` that results in a scoop in one of the miners plots having an unexpectedly low time to produce the next block. Depending on how long it takes to scan the plots for the lowest value scoop, a miner could potentially try an enormous set of values for `previous-block`, until something is found that heavily favors the miner's plots.

The miner is then able to artificially inflate the number of plots that it appears to be storing, but only for blocks where it controls the value of `previous-block`. This means that a miner never has any incentive to use a chain produced by someone else, because they will not be able to grind over the value of `previous-block` and will be mining at a significant disadvantage compared to mining on a history composed of exclusively their blocks.
newbie
Activity: 21
Merit: 0
This seems like a great idea. Has there been any work towards implementing it in core or elsewhere?

One implication though is that honest services which provide monitoring and statistics of the network as a whole will become much more costly. Solutions might be public key authentication of such monitoring services or else for servers to accept "read only" connections (could be http-like or normal tcp socket that is never read from).
legendary
Activity: 1232
Merit: 1084
It looks like the burst system is resistant to the Bloom filter attack?

They generate "plots".  These are 256kB blocks of data generated using hash(account-id | plot-nonce) as seed and hashing it over and over to expand to the 256kB.  They also mix the array with the last 32 bytes of the array.

Then, they are broken down into 4096 "scoops" of 32 bytes each.

Based on the hash(previous-block | block height), a scoop index is selected (0 - 4095).

The miner has to scan through all scoops of the same index.  I assume they store all scoops with the same index together.

The scoop is run through hash(previous-block | scoop) and then divided by the target.

This gives the number of seconds to wait until the next block (lower is better).  If you get a result of 30 seconds, then you can create a block 30 seconds after the previous block.

The more plots you have, the more likely you will get the lowest result.

This means that miners have to read 1/4096 of the data stored on the disk in order to extract the scoops.

There is no way to Bloom filter, since all scoops concatenated with info from the previous block before being passed through a hash function.  This means the value of each scoop changes on a per-block basis.
sr. member
Activity: 462
Merit: 250
FWIW, I asked the burstcoin dev about this thread, and he said he hadn't seen it before.
newbie
Activity: 25
Merit: 0
Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

The proposal here is that proof of storage allows peers to prove that they have expended some resources on the current connection.  It isn't like POW for determining which fork has the most support.

Each full node only accepts a maximum number of incoming connections.  If each node limited the number of incoming connections to 100, then a node with 100 unique IP addresses could consume all the spare connections for every node.  Each node would see 100 connection attempts from 100 different IPs and would accept them all.

Proof of storage means that connecting to a node requires that you allocate drive space to every connection.  The random data you have to store is different for each connection, so you can't reuse it.

It doesn't even need to be that much.  If an attacker wants to connect to 10,000 open nodes and make 100 connections to each, then it needs 1 million times the per-connection storage requirement.

thx, very specific.
hero member
Activity: 543
Merit: 501
An idea for a publicly verifiable way to do this:

Start with a random seed, and build a large set by taking H(index | seed) until the set is the target size. Have the client sort and store the set.

Instead of presenting a hash and requiring an index, you present H(new random seed), and have the client present the index that is numerically closest to the presented hash.

The server can easily verify that the hash is in the set, but has no way of knowing if the client has produced the index that's actually numerically closest.

But, statistically, the server can tell if the client is close to what would be expected. If you repeat the challenge a dozen or a hundred times (just hash the new random number repeatedly to get more challenges), the server can do statistical analysis and get a very good idea of whether the client is checking close to 100% of the values or not, with it being extremely unlikely that an honest client will be a victim of this statistical analysis due to bad luck. IE, if you are storing 100 indices and the search space is 1000, you'd expect to be about 10 off with each result, and you can put bounds on a standard deviation. Of course here the search space is 256 bits or whatever the hash size is.

Then, if the client is doing massive hashing instead of storing the file, you can make the challenge sequential. The client finds the closest value, takes H(challenge|closest hash) and uses that as the next target value. If you're not storing the hashing, you have to recompute the table again for each challenge. Still doesn't stop you from using massive parallelism to compute the table, but at least now you have to compute the table multiple times per challenge.

edit: haven't discussed where these random seeds are coming from, if you can't get a secure source of random numbers then someone could potential manipulate the seeds such that they need significantly less computation or storage to pass the statistical analysis checks. Or, if they are trying to make an honest client fail, they could manipulate the random numbers such that the client is guaranteed to get a statistically unlikely result.


===========


edit2: I'm attempting to explain this process more clearly.

The goal is to force the client to store some volume of data in a publicly verifiable way, such that the client's responses can be verified with a trivial amount of computation.

This method assumes access to secure public random numbers, which can be considered a weakness.

The client creates a data set d_i = H(i|data_seed) for i...N, such that the data set is of the desired size. (1 GB for example). The client then sorts the data set by the numerical value of the hashes.
The server can then create a publicly verifiable challenge c = H(query_seed). The client is to return the data d_i which is numerically closest to c.

The server and the third party cannot verify that the client has returned the d_i which is actually numerically closet to c, however they can verify that the presented d_i is valid, and they can calculate how numerically close the solution is to c.
Statistically, c has an expected distance from the closest d_i.
Over the course of many challenges, the average distance of the client's responses should converge to this expected distance.
If the client is not storing the entire data set, or is recalulating only a small portion of the data set each query, the average distance of its responses will converge to a distance greater than the expected distance, and this can be used to detect dishonesty.

So, a client will always be able to provide a response, however without checking the entire search space d_0..N before each response, the client will have responses that are not as good as they should be. A third party can use statistical analysis to detect this after a calculatable number of queries (within a calculatable certainty).
staff
Activity: 4200
Merit: 8441
Wait...  I thought that was exactly what we were talking about.  
Could you explain in more detail what you mean by a 'publicly verifiable version'?
It was, you're not crazy.

The reason I mentioned it is because someone recently contacted me about using this kind of technique to make things like having tor hidden service ID's concurrently costly... e.g. you could still connect to as many peers as you want, but you'd still need to use one identity across all of them. It's a different application, but would work with the original proposal just by using a constant value for the peer challenge.

That also breaks the basic version too.
Yes, thats not news though. All of these so far allow for a choice of time/memory tradeoff, it's still maximally fast if you keep everything. The invocation of the sequential function was to try to diminish the available level of choice.
legendary
Activity: 1232
Merit: 1084
Here's how it works; each band of the table stores a bloom filter that selects hashes which result from 't' in that band, plus the starting (t, R) pair.  When Bob gets H(R), he goes to see which bloom filter it matches, then starts on that band with the starting (t, R) pair and does successive squares until he gets R such that H(R) matches - at which point he knows the corresponding t.   

That also breaks the basic version too.

Hash(seed | index) could also be spread over bands.

A tree of Bloom filters would be pretty fast.  The first Bloom filter would tell you which group of Bloom filters to check.

That means that you need only a few bits per entry.
legendary
Activity: 924
Merit: 1129

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  

You're absolutely right.   Instead of having Alice give Bob R and ask for t, you want her to give him H(R) and ask for t.  When Bob doesn't have R he can't start successively squaring R -  therefore he has no successor function for R, thus no way to construct a rainbow table.


Nope, darn.  There's still a Bloom filter shortcut.  Bob can still construct a rainbow table if he uses Bloom filters rather than successor functions to select the band.

Here's how it works; each band of the table stores a bloom filter that selects hashes which result from 't' in that band, plus the starting (t, R) pair.  When Bob gets H(R), he goes to see which bloom filter it matches, then starts on that band with the starting (t, R) pair and does successive squares until he gets R such that H(R) matches - at which point he knows the corresponding t.   

legendary
Activity: 924
Merit: 1129
Downside of this is you can't do a publicly verifiable version, e.g. where you have just one piece of storage per identity but one piece per identity,peer pair (which was what I wanted in the motivation for this post; but there are other applications, e.g. like making a hidden service identity costly).


Wait...  I thought that was exactly what we were talking about. 

Could you explain in more detail what you mean by a 'publicly verifiable version'?

I was thinking of a system where each peer has to keep one table in memory per peer that it's connected to.  The table, in turn, is constructed from a 'seed' which lasts as long as that peer doesn't change its key.  Now, I had assumed that Alice's key for each connection would be different, but it doesn't have to be...  Alice doesn't have to use a different key for each connection; would it suit your purposes if she didn't?  In principle, Alice could publish 'n' while keeping its prime factors secret.  People would still have to construct an 'Alice' table in order to bring up a connection to the 'Alice' node. 

legendary
Activity: 924
Merit: 1129
If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  

You're absolutely right.   Instead of having Alice give Bob R and ask for t, you want her to give him H(R) and ask for t.  When Bob doesn't have R he can't start successively squaring R -  therefore he has no successor function for R, thus no way to construct a rainbow table.


You could still extract some parallelism by first computing T sequentially and then saving way-points along the way, though that requires you to store elements of n which are large compared to the hashes... but probably reasonable enough for extracting some parallelism. Hm.


I see it.  Say you've got a nice GPU that lets you do a thousand operations in parallel.  So instead of storing the _whole_ table you store a thousand widely separated starting points for your search.  Then instead of searching a million-element table when you get a request, you have a thousand threads that each work on building a thousand-element table until one of them finds the answer.   It uses a thousand times the processing power of a straight rainbow-table implementation, but it can be just as fast as a thousand-element rainbow table. 

But here is a good way to defeat that.

We're dealing with arbitrary 't', right? What if Alice starts out by saying that she will never ask for any 't' except a 't' that's evenly divisible by some factor X?  You can skip storing the result of all but one out of every X squaring operations. The result is that constructing the million-element table now costs a factor of 'X' more nonparallelizable operations.  Say X is 4300.  Now Bob can save a thousand starting points in his million-element table, but each thread is going to have to do 4300 squarings to get the 'next' element in its thousand-element subtable.  You can favor lookups over table construction by essentially any work factor. 

Of course this makes it compute-expensive to initiate a connection as well as storage-expensive to keep one going.  But if the connections last longer than an hour, that probably isn't a problem.

staff
Activity: 4200
Merit: 8441
If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.
Right. The way to do that is to basically use the idea here.

You don't ask them for (2^(2^t)) (mod n),  you give them H((2^(2^t)) (mod n)) and ask them for t.  I don't see how to construct a rainbow table out of the hashed form because there is no useful function H(R) -> R  that maps to R's along your solution path. You could still extract some parallelism by first computing T sequentially and then saving way-points along the way, though that requires you to store elements of n which are large compared to the hashes... but probably reasonable enough for extracting some parallelism. Hm.

Downside of this is you can't do a publicly verifiable version, e.g. where you have just one piece of storage per identity but one piece per identity,peer pair (which was what I wanted in the motivation for this post; but there are other applications, e.g. like making a hidden service identity costly).
legendary
Activity: 924
Merit: 1129
Use Rivest's time lock puzzle.  It's asymmetric in that the person with the key can random-access it but the person without the key has to compute it linearly.  Here's a linky:

http://people.csail.mit.edu/rivest/lcs35-puzzle-description.txt

Mathematically, it works like this:  You're trying to compute R=(2^(2^t)) mod n for some arbitrary 't' and an 'n' which is a product of 2 large primes.  If you actually know the primes, you can do this quickly; if you don't, the fastest way to get the result is to either (a) compute 't' successive squarings modulo n, or (b) look it up in a dictionary because you already computed those successive squarings.  This is important because the successive-squarings operation cannot be done in parallel. It is strictly linear computation.

Your storage-requirement then looks like this:  Alice transmits 'n' to Bob.  Bob builds a table of (t, R) pairs and an index by 'R'.  Whenever Alice wants proof that Bob is still holding this table, she picks a 't' at random, calculates the corresponding 'R' which takes her constant time, then transmits 'R' and asks Bob what 't' it corresponds to.  Bob must either look up the 't' value in his table, or recompute the table until he hits it.

With that description played straight, Bob could build a rainbow table to trade off space for computation: if he stores every millionth (t, R) pair, he can just start squaring when he gets Alice's 'R' and within a million squaring operations he'll get a (t, R) pair that's in his table, whereupon he can subtract the number of squaring operations he did to find the 't' that Alice started with.  You can reduce Bob's benefit from rainbow tables by asking for many different 't', but  the annoying part of this is that the rainbow table approach *is* parallelizable.  If Alice is asking for fifteen different 't', Bob can have fifteen different threads all being simultaneously useful in searches for a 't'.  

If you don't play it straight, there is probably a way to use a hash function or 'salt' to defeat rainbow tables; I'll have to think about it.
legendary
Activity: 1232
Merit: 1084
Constant, not linear.

Right.

Quote
The reason to reduce parallelism in the non-precomputed search is, of course, based on the notion that any ridiculously fast hardware for bypassing this would likely work by searching most of the space concurrently.

Ah, I see.  It is still burning server resources though, so isn't completely ineffective.

Quote
An interesting question is can it be changed in the other direction— rather than simplified (thanks!), slightly complexified to decrease the parallelism in the client further?— I don't think the log() behavior in the server hurts one way or another, though I agree if its not doing anything important simpler is better!

Maybe the reply could include multiple indexes.

The server sends a value and then the client has to find the matching exact index, but also some inexact matches.  The client would sort the values into bins which all have the same 16 bit prefix.  The server sends the full value and the client replies with the set of matches and also the index that has the exact match.

The average of the set size would have to be kept higher than some threshold.

Each set could be stored in a different file.  This would favour disk storage, since they could all be read quickly as a unit.
staff
Activity: 4200
Merit: 8441
I am not clear why a tree is required.  "index maps to hash(seed | index)" seems to work too?  That is fast and just as irreversible and it allows the server to compute entries in linear time, rather than log time.
Constant, not linear.

I'm not sure now if there was a reason I thought I needed the tree-structured PRNG, or if I'd just worked myself into a design corner before coming up with the working idea. If there is, I don't see it now. My original goal had been to find a function that was linear-time and mostly-sequential (e.g. not fast on PRAM) the first time for the client, but fast the second time... and always fast (log or better) for the server. The tree imposes some upper bounds on the non-precomputed search parallelism, but fairly weak ones— the non-precomputed search must do 2x the work and that work has geometrically increasing parallelism.

The reason to reduce parallelism in the non-precomputed search is, of course, based on the notion that any ridiculously fast hardware for bypassing this would likely work by searching most of the space concurrently.

An interesting question is can it be changed in the other direction— rather than simplified (thanks!), slightly complexified to decrease the parallelism in the client further?— I don't think the log() behavior in the server hurts one way or another, though I agree if its not doing anything important simpler is better!
legendary
Activity: 1232
Merit: 1084
I am not clear why a tree is required.  "index maps to hash(seed | index)" seems to work too?  That is fast and just as irreversible and it allows the server to compute entries in linear time, rather than log time.

The client then stores a value to index map.  Each sub-file would contain values that have the same prefix.

It wouldn't even necessarily have to do sorting, since the files could just act as bins.

The process could be







// If the server doesn't disconnect, then it defeats the purpose, since an attacker can stall at this point
// The server should also disconnect the client if it takes more than a few seconds to send S2









Periodically




The key is to make it so that it is faster to read the hard drive than to re-compute the table for each request.

This means that the hash function should take a while, but that converts it into a proof of work system.

The timeout between requesting a lookup and the reply would have to be tweaked.  

The more often the lookup is requested by the server, the higher the incentive to not just recompute.

The server might give a "well behaved" client a few chances before disconnecting.
legendary
Activity: 1232
Merit: 1084
Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

The proposal here is that proof of storage allows peers to prove that they have expended some resources on the current connection.  It isn't like POW for determining which fork has the most support.

Each full node only accepts a maximum number of incoming connections.  If each node limited the number of incoming connections to 100, then a node with 100 unique IP addresses could consume all the spare connections for every node.  Each node would see 100 connection attempts from 100 different IPs and would accept them all.

Proof of storage means that connecting to a node requires that you allocate drive space to every connection.  The random data you have to store is different for each connection, so you can't reuse it.

It doesn't even need to be that much.  If an attacker wants to connect to 10,000 open nodes and make 100 connections to each, then it needs 1 million times the per-connection storage requirement.
hero member
Activity: 686
Merit: 501
Stephen Reed
Could someone contrast Proof of Storage with Proof of Stake, in particular with regard to physical resource consumption?

For Bitcoin, I am becoming interested in coding, testing, and documenting various alternatives to Proof of Work, which I believe is unsustainable, e.g. ultimately in conflict with the Energy Charter Treaty's Article 19. http://en.wikipedia.org/wiki/Energy_Charter_Treaty#Energy_efficiency
full member
Activity: 126
Merit: 108
Andrew Miller
I implemented this in a few lines of python, mostly to convince myself it works https://gist.github.com/amiller/7131876
The test runs in codepad: http://codepad.org/Nzp2Vdsk
Pages:
Jump to: