
Topic: Using a DHT to reduce the resource requirements of full nodes. (Read 2614 times)

staff
Activity: 4284
Merit: 8808
Whenever you hear the term DHT what you should be hearing is "total farce of attack resistance failure"— existent DHT systems and proposals are trivially vulnerable to attack in a multitude of ways and are fragile enough that they tend to perform poorly or fail outright with alarming frequency even when not attacked. They are generally an overcomplicated, under-performing solution which gets invoked in ignorance at every distributed systems problem because they're the first distributed systems tool people have heard of (sadly, "blockchain" seems to be stealing this role), much as "neural network" has infested lay understanding of machine learning, or perhaps as in other times "XML" was treated as a magical solution for inter-working serialization in places where it made little sense.

The few DHTs that exist which are proposed to be attack resistant in any serious way— things like CJDNS's routing or Freenet— work by imposing a 'social' network link topology on the network which is required (by the security assumptions) to be largely sybil-proof.  ... A pretty strong requirement.

Fortunately, this is neither here nor there because the requirements of the Bitcoin system are almost, but not completely, unlike the services provided by a DHT. DHTs provide users with random access to unordered data. In Bitcoin there is no access pattern which resembles a hash table lookup.

To verify a block we must confirm that the inputs for the transactions it contains are spendable— that they were previously created in the same chain and have not (yet) been spent— and for this all nodes require the same data, not random data. We do not even require the full prior transactions, just the TXouts (potentially reducing tens of kilobytes of data to tens of bytes of data).

If we do not have this data, but could verify it if it were handed to us (e.g. if we'd been tracking a committed UTXO set root hash) our peer could provide it for us along with the block.  So long as we have _any_ peer willing to give us the block we'd have a guaranteed way to obtain the required data— immediately eliminating most of the DHT attack weaknesses (in the literature sometimes systems with properties like this are called D1HTs).

Unfortunately, obtaining just-in-time data comes with a large bandwidth overhead: if you're storing nothing at all then any data you receive must come with hash-tree fragments proving its membership. With conventional hash trees each txin requires on the order of 768 bytes of proof data... and with current technology bandwidth is far more limited than storage, so this may not be a great tradeoff. One possibility here is that with some minor changes nodes could randomly verify fractions of blocks (and use information theoretic PIR to hide from peers which parts they are verifying), and circulate fraud notices (the extracted data needed to prove to yourself that a block is bad) if they find problems. This may be a good option to reduce bandwidth usage for edge clients which currently verify nothing, but it's not helpful overall (since one hop back from the edge the host must have the full block)... I'd say it would be a no-brainer, but getting the rarely executed fraud proof codepaths correct may be too much of an engineering challenge (considering the level of failure alt implementations have had with the consensus rules in Bitcoin as is).
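For anyone checking the 768-byte figure, a back-of-envelope sketch (assuming a plain binary hash tree and an illustrative set size; neither is a committed design):

Code:
import math

# Back-of-envelope: in a plain binary hash tree, a membership proof
# carries one 32-byte sibling hash per level of the path to the root.
def merkle_proof_bytes(num_leaves, hash_bytes=32):
    return math.ceil(math.log2(num_leaves)) * hash_bytes

# A committed TXout set in the tens of millions of entries gives a path
# ~24 levels deep, i.e. ~768 bytes of proof data per txin.
print(merkle_proof_bytes(16_000_000))  # 24 * 32 = 768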

Managing storage also does not have any need for any kind of sophisticated DHT— since 0.8 the reference client separates storage of the UTXO and block data. The UTXO, stored in the chainstate directory, is the exclusive data structure used for verifying new blocks. The blocks themselves are used only for reorganizations and for feeding new nodes that request them— there is no random access used or required to transaction data. If you remove the test for block data in init.cpp the node will happily start up with all the old blocks deleted, and will work more or less correctly until you call a getblock/etc RPC that reads block data or until a newly initializing peer requests old block data from you. The chainstate data is currently about 450MB on disk, so with that plus some recent blocks for reorganization you can already run a full verifying node.

The task of initializing a new node requires verifying the historic blocks, so to accommodate that in a world with many pruned nodes we'd want to add some data to the addr messages, but a couple of ranges of served blocks is sufficient— no need for routing or other elaboration. And block ranges match the locality of access, so there is no overhead (beyond a couple bytes in the addr messages).
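A minimal sketch of what such an addr advertisement could look like (all names are hypothetical; this is not an actual protocol extension):

Code:
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical addr extension: a pruned node advertises the contiguous
# block ranges it still serves. Two ranges cost only a few bytes and
# match the locality of access during sync.
@dataclass
class AddrWithRanges:
    ip: str
    port: int
    served_ranges: List[Tuple[int, int]]  # inclusive (start, end) heights

    def serves(self, height: int) -> bool:
        return any(lo <= height <= hi for lo, hi in self.served_ranges)

# e.g. a node keeping one random historical slice plus the recent tip:
peer = AddrWithRanges("203.0.113.5", 8333, [(50_000, 60_000), (305_000, 307_394)])
print(peer.serves(55_000), peer.serves(150_000))  # True False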

The change to the reference client to separate out the UTXO in this way in the "ultraprune" patchset was no accident— it was a massive engineering effort specifically done to facilitate the transition to a place where no single node was required to store all the historical data without any compromise to the security model. You can see this all discussed in detail on the bitcoin-development list on and off going back years.

I don't think that any of this is actually incompatible with what you were _actually_ thinking— you're no fool and your intuition of what would actually work seems more or less the same as mine. But the fuzzy armwave about just scattering block data to peers with no mind to locality or authentication is a common trope of people who have no clue about the security model and who are proposing something which very much won't actually work, so I think it's important to be clear about it: keeping pruned validation data locally and not storing blocks historically is already the plan of record, it's not a DHT, and the DHT proposals are usually completely misguided.

Quote
First, for efficiency reasons, it would make sense for all DHT nodes to retain a "full local" cache of recent blocks. How "recent" is recent probably needs some research, but the goal would be to reduce the overhead in the event of reorgs. I think keeping "one block-day" (144 blocks) of the tip of the blockchain would be a good compromise between storage and efficiency.

Sipa's analysis about a year ago was that block access frequencies follow an exponentially decaying pattern back to roughly the most recent ~2000 blocks, and beyond that have uniform access probability (further back than 2000 the requesting host is fetching all of them in order), so it would be prudent for hosts to store as much of the most recent 2000 or so as they can, and then use any additional space for a random range or two in the history. I'd consider 144 to be a sane minimum because with less than that, long reorganizations may cause severe problems.
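A sketch of what that storage policy could look like (the constants and the uniform-random selection are illustrative, not a committed design):

Code:
import random

# Illustrative policy following the access pattern above: always keep
# the most recent ~2000 blocks (exponentially decaying access) and spend
# any leftover budget on a random contiguous historical range (uniform
# access probability).
RECENT_WINDOW = 2000
MIN_KEEP = 144  # sane minimum; long reorganizations need at least this

def choose_stored_ranges(tip_height, historical_budget):
    recent_start = max(0, tip_height - max(RECENT_WINDOW, MIN_KEEP) + 1)
    start = random.randrange(0, max(1, recent_start - historical_budget))
    return [(start, start + historical_budget - 1), (recent_start, tip_height)]

print(choose_stored_ranges(307_394, 10_000))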
donator
Activity: 1218
Merit: 1079
Gerald Davis
The major problem of running a full node is bandwidth, not storage. 23.8GB is nothing for modern hard drives, and is affordable even with an SSD.

Agreed, and maybe this would be a non-issue if full nodes just had better peer selection and bandwidth throttling features. While a DHT can't reduce the total bandwidth requirements (they would actually increase by some percentage), the bandwidth requirement of each node could be reduced if there are more nodes. Today you have a very binary choice: you either run a full node at a cost of ~25GB plus potentially higher bandwidth requirements, or you use a lite client and don't support the network at all. My assumption is there are some users who would be willing to support the network to a limited degree.

One thing which may make it clearer is to look at an abstract, simplified view of syncing peers. Let's put all nodes into one of two categories: they are either currently up to date or they are new nodes which need to bootstrap from the genesis block. For a group of "current" peers, the cost per node to remain current does not depend on the number of peers; it depends on the rate of new information (new txns) and the protocol's efficiency in distributing that new information. The cost to bootstrap a single new node depends on the number of peers that can assist and on the protocol efficiency.

So let's ignore protocol efficiency for a second and look at the raw data which needs to be propagated. A new 1MB block every 600 seconds is about 13 kbps. Excluding protocol overhead, the network is adding 13 kilobits of new information every second. The average node will need to both receive and send that new information, so we are looking at ~26 kbps plus protocol overhead. That is just the per-node cost for a group of nodes to remain current. Now say one new node joins the group. The new node will need to receive 20GB of information. To do that in one day requires the new node to receive about 230 KB/s (roughly 1.9 Mbps). That load is amortized over the number of nodes which can assist in the process. Depending on the ratio between bootstrapping nodes and nodes which can assist, it could potentially be higher than the cost to remain synced.
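The arithmetic, spelled out with the same round numbers:

Code:
# Reproducing the arithmetic above.
BLOCK_BITS = 1_000_000 * 8       # one 1MB block...
BLOCK_INTERVAL_S = 600           # ...every 600 seconds

steady = BLOCK_BITS / BLOCK_INTERVAL_S / 1000
print(f"receive: {steady:.1f} kbps; receive + send: {2 * steady:.1f} kbps")
# receive: 13.3 kbps; receive + send: 26.7 kbps

# Bootstrapping: 20GB in one day, amortized over the assisting peers.
rate_bytes_s = 20e9 / 86_400
print(f"bootstrap: {rate_bytes_s / 1000:.0f} KB/s "
      f"({rate_bytes_s * 8 / 1e6:.1f} Mbps)")
# bootstrap: 231 KB/s (1.9 Mbps)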

You are right that if a DHT were used and it didn't increase the number of nodes at all, then it would increase, not decrease, the per-node bandwidth cost. However, if the use of a DHT leads to more nodes that can assist in bootstrapping new nodes, then the per-node bandwidth cost would be reduced.

So it boils down to this question:
Would reducing the storage requirements of a node by 80% to 90%, smarter load balancing of peers, and bandwidth-limiting options for nodes significantly increase the number of nodes? If the answer is no then nothing is gained. If the answer is yes then security can be improved and the per-node bandwidth cost can be reduced.
legendary
Activity: 1792
Merit: 1111
The major problem of running a full node is bandwidth, not storage. 23.8GB is nothing for modern hard drives, and is affordable even with an SSD.
donator
Activity: 1218
Merit: 1079
Gerald Davis
DHT nodes would store the block header & set of txn hashes. As originally proposed this would require a new message for relaying a reduced block. An alternative which may work better would be to have a new inventory type, merkletree. This would allow DHT nodes to use the existing getheaders message and then request the txn hashes with a subsequent message (getmerkletrees) after the longest chain has been downloaded and verified. Support for this does not need to be limited to DHT nodes. Full nodes could be updated to support it as well, and it would be optimal if they did. This would have the added advantage of reducing new block propagation time.
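To be clear about why the txn-hash sets are trust-free once the header chain is verified: they must hash up to the merkle root committed in each block header. A sketch using the standard Bitcoin merkle construction (the getmerkletrees framing around it is the proposal here, not existing code):

Code:
import hashlib

def dsha256(b: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def merkle_root(txids):
    # Bitcoin-style merkle root: pairwise double-SHA256, duplicating the
    # last entry at odd-sized levels.
    level = list(txids)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [dsha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def txid_list_is_valid(header_merkle_root, txids):
    # A peer serving a bogus txid list for a block would fail this check.
    return merkle_root(txids) == header_merkle_root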

Ultimately the distinction of "full node" could be deprecated. Full nodes and DHT nodes would simply be "nodes" which identify what subset of the txn set they retain. Some nodes would retain <100% of the "archive txn set", some would retain 100%, and some (SPV+ nodes) would retain 0% (but they would retain block headers and potentially merkle trees). However, for a bootstrapping peer there is little difference between a single node which has the full txn set and a set of nodes which individually do not but combined do. To avoid overloading a single node, bootstrapping peers would optimally use multiple nodes anyway, even when they have access to a node which has all the necessary information.
donator
Activity: 1218
Merit: 1079
Gerald Davis
I don't think you can enforce that on a protocol level. What if many nodes just decide(!) to keep the UTXO and throw away the rest?
This could lead to archive nodes which would get paid by new companies (miners, "banks"?) trying to enter the market - and shut everybody else out.

That could happen right now. Nothing requires a bitcoin user to run a full node. Exchanges, eWallets, lite clients, and SPV nodes all make it possible for a user to not support the network. Still, forming a cartel around freely available information is unlikely. As long as there is a single copy there is a free alternative to the cartel.

Say today there are 10,000 full nodes. That means there are 10,000 independent copies of the txn set. Now imagine those 10,000 nodes were replaced with x DHT nodes that combined had 10,000 complete copies of the txn set. From a decentralization standpoint nothing has changed. I would argue that lowering the requirements for helping the network will increase the number of users willing to help.

Today you have two options
a) run a full node (and not a KB less)
OR
b) use a SPV or other method which doesn't support the network at all

Under a DHT model you would still have the exact same two options plus
c) run a DHT node and maintain an independent copy of some portion (>0% and <100%) of the txn set, plus a complete copy of the UTXO and a complete copy of the reduced blocks.

Quote
but I don't like phrases like "I don't believe there ever will be no copy of the blockchain" ;)
Well you PERSONALLY can ensure that is always true by always maintaining a full copy. A DHT network doesn't change that. If someone under a DHT model is unwilling to support the network by carrying part of the blockchain, they are probably even less likely to support the network by running a full node. The same risk applies today, and if you feel it is a credible risk then I wouldn't be holding any Bitcoins, as the protocol will fail if there is ever a point in time where no complete (decentralized or otherwise) copy of the blockchain exists.
sr. member
Activity: 278
Merit: 254
Have you analyzed the impact on required bandwidth including the overhead of the added mechanisms?

I found the cost of storing the complete block chain to be less than $2.00 in disk storage, and this is a one-time cost that's good for the life of a hard drive. Unfortunately I pay many times that much each month for DSL Internet service that has marginal uplink bandwidth to run a full node. For me, at least, disk space is effectively free compared to bandwidth.

donator
Activity: 1218
Merit: 1079
Gerald Davis
I have a suspicion that transactions are not uniformly queried; especially when new blocks are published there is a good chance the DHT nodes holding these transactions might be DDoSed...

Good points to consider but I believe they can be circumvented.

First, for efficiency reasons, it would make sense for all DHT nodes to retain a "full local" cache of recent blocks. How "recent" is recent probably needs some research, but the goal would be to reduce the overhead in the event of reorgs. I think keeping "one block-day" (144 blocks) of the tip of the blockchain would be a good compromise between storage and efficiency. For similar reasons DHT nodes would also retain the full UTXO and the full memory pool. It would not be possible to DDoS the propagation of new blocks (or at least it would be no easier than doing so currently with full nodes), as all DHT nodes would have a copy of the UTXO and memory pool, meaning that DHT nodes would not need to rely on the DHT to validate new blocks.

In theory one could DDoS the bootstrapping of new nodes by blocking access to a subset of the txn set. One would need to DDoS all nodes which hold that subset plus all full nodes. Remember full nodes could operate transparently on the DHT network by simply being nodes whose DHT share is 100%. One way to look at it is using a BitTorrent definition. BitTorrent measures the availability of a torrent by the number of complete copies available (counting all partial copies combined), so 12.8 would mean that combined all seeds and peers have 12.8 copies. Bitcoin right now only has "seeds", so the availability is SUM(1.00 * num_full_nodes). Under a DHT system the availability would be SUM(average_dht_share * num_DHT_nodes). While the frequency at which an individual txn is queried will vary, the point of a DHT is to distribute that load uniformly. A given txn may be queried at a higher frequency, but as an example, txns with a TxId beginning with an "e" are probably not queried significantly more or less than txns beginning with an "f".
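The availability figure, spelled out (node counts and shares are invented for illustration):

Code:
# BitTorrent-style "availability" of the txn set: 1.0 == one complete
# copy (a "seed"); partial shares add up.
def availability(shares):
    return sum(shares)

print(availability([1.0] * 10_000))                           # today: 10000.0
print(round(availability([1.0] * 2_000 + [0.05] * 160_000)))  # DHT mix: 10000
# Same number of complete copies, spread over many more machines.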

Quote
Also this could have privacy implications (DHT nodes knowing who queries for which TXIDs).

For full nodes bootstrapping from the DHT there is no privacy concern. For SPV nodes there are potential privacy concerns; however, there are already privacy concerns with SPV nodes today.
sr. member
Activity: 266
Merit: 250
I really like that idea.
But how do you ensure that really old transactions are available forever?
They are only needed by new upcoming nodes, which should mean most nodes don't even bother storing them anymore.

The DHT protocol would assign nodes a subset based on the TxId, and thus old txns would be just as likely to have coverage as newer ones. So the question becomes "how can you be sure the DHT protocol will always retain a copy of any individual txn (regardless of age)?" In reality, as a fallback, it would make sense for some nodes to still be "full nodes". They could be transparently compatible with the DHT nodes in that they use the same protocol, but their local cache is always 100%.


I don't think you can enforce that on a protocol level. What if many nodes just decide(!) to keep the UTXO and throw away the rest?
This could lead to archive nodes which would get paid by new companies (miners, "banks"?) trying to enter the market - and shut everybody else out.

There will always be a desire for some users to maintain a full copy of the blockchain. I don't think there will ever be a case where no full copy of the blockchain exists. The issue is that SPV nodes are (to borrow BitTorrent terminology) leechers, and they aren't even leechers which still contribute; they are download-only leechers. This creates a perverse set of disincentives. As the number of full nodes (as a % of total users) falls, the load on each full node increases, leading to fewer full nodes, which further increases the load on the remaining full nodes. The use of a TxDHT would hopefully increase the number of "almost full" (probably needs a better name) nodes. This provides robust support for SPV nodes.


I do understand your point and absolutely feel the same. A blockchain with only TxIds has some charm.

but I don't like phrases like "I don't believe there ever will be no copy of the blockchain" ;)


Speaking of SPV nodes, this type of structure would also allow SPV nodes to support the network on a limited basis. SPV nodes would simply maintain a smaller DHT share, and they would only maintain txns which are "deep enough" in the blockchain (as they can't fully verify blocks). Remember the security of a txn comes from its hash, and if that is broken, full nodes are compromised as well. This (with some other protocol enhancements) could even lead to SPV+ type nodes which maintain only the UTXO and can provide even more support to the network.

The raw spent txns make up the majority of the blockchain, and beyond initial bootstrapping even full nodes rarely "use" many of these txns more than once. There is very little need for all nodes to continually maintain a full copy of all txns, BUT it is also important that a node be able to obtain a copy of a given txn if needed. That makes spent txns a perfect use case for a DHT, but it could be expanded to include the UTXO as well. As the UTXO grows, individual outputs could be scored on the probability they will be used in the near future (based on output value, dust limit, current fees, and age). Older, low-scoring outputs would be dropped from the local cache of some DHT nodes. In the event that a 5-year-old, 1-satoshi output is used in a future block, those nodes could trustlessly request a copy of it from the DHT.

++1
donator
Activity: 1218
Merit: 1079
Gerald Davis
The tx hash is a large proportion of the size of the tx itself, and many tx must be gathered from far and wide to assemble a single block for initial verification. Why not split up storage by block hashes?

That is an alternative; however, a single verified block is of little use. A bootstrapping node will need to verify all txns before it is synced with the network. Optimally the node would download all block headers & txn hashes. It would then request entire ranges of txns from DHT peers (i.e. request from a particular DHT peer all txns whose hashes fall in the prefix range 0x00e0 to 0x00ef). This would allow the node to use multiple peers (potentially dozens or hundreds if the bootstrapping peer has sufficient resources) to download the txn set in parallel.
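A sketch of the range assignment (the prefix width and peer count are illustrative; peer selection and retry handling are omitted):

Code:
# Carve the TxId keyspace into equal prefix ranges and assign one range
# per peer, so a bootstrapping node can pull the txn set in parallel.
def prefix_ranges(num_peers, bits=16):
    span = (1 << bits) // num_peers
    return [(i * span, (i + 1) * span - 1) for i in range(num_peers)]

for lo, hi in prefix_ranges(8):
    print(f"peer gets TxIds with 16-bit prefix 0x{lo:04x}..0x{hi:04x}")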

I do agree that txn hashes add some overhead. If we look at the full blockchain, the average txn is 461 bytes. This means a txn hash set is ~7% of the size of the full blockchain, i.e. ~7% additional overhead. It isn't necessary for txn hashes and block hashes to have the same length. A hash collision for bitcoin txns is not useful; an attacker would need a (second-)preimage attack. This means even 160-bit or 128-bit hashes would provide sufficient security, which would reduce the overhead to 3-4%. I doubt Bitcoin will be changing to support smaller txn hashes, but it could be done in a backwards-compatible manner. Still, this is something for altcoins to keep in mind.
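The overhead numbers, spelled out (average txn size from this thread):

Code:
# Overhead of a txn-hash set relative to the full chain, for the hash
# lengths discussed above.
AVG_TXN_BYTES = 461
for bits in (256, 160, 128):
    print(f"{bits}-bit hashes: {bits / 8 / AVG_TXN_BYTES:.1%} overhead")
# 256-bit: 6.9%, 160-bit: 4.3%, 128-bit: 3.5%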
legendary
Activity: 2618
Merit: 1007
I have a suspicion that transactions are not uniformly queried; especially when new blocks are published there is a good chance the DHT nodes holding these transactions might be DDoSed...

Also this could have privacy implications (DHT nodes knowing who queries for which TXIDs).
full member
Activity: 144
Merit: 100
The tx hash is a large proportion of the size of the tx itself, and many tx must be gathered from far and wide to assemble a single block for initial verification. Why not split up storage by block hashes?
donator
Activity: 1218
Merit: 1079
Gerald Davis
I really like that idea.
But how do you ensure that really old transactions are available forever?
They are only needed by new upcoming nodes, which should mean most nodes don't even bother storing them anymore.

The DHT protocol would assign nodes a subset based on the TxId, and thus old txns would be just as likely to have coverage as newer ones. So the question becomes "how can you be sure the DHT protocol will always retain a copy of any individual txn (regardless of age)?" In reality, as a fallback, it would make sense for some nodes to still be "full nodes". They could be transparently compatible with the DHT nodes in that they use the same protocol, but their local cache is always 100%.

There will always be a desire for some users to maintain a full copy of the blockchain. I don't think there will ever be a case where no full copy of the blockchain exists. The issue is that SPV nodes are (to borrow BitTorrent terminology) leechers, and they aren't even leechers which still contribute; they are download-only leechers. This creates a perverse set of disincentives. As the number of full nodes (as a % of total users) falls, the load on each full node increases, leading to fewer full nodes, which further increases the load on the remaining full nodes. The use of a TxDHT would hopefully increase the number of "almost full" (probably needs a better name) nodes. This provides robust support for SPV nodes.

Speaking of SPV nodes, this type of structure would also allow SPV nodes to support the network on a limited basis. SPV nodes would simply maintain a smaller DHT share, and they would only maintain txns which are "deep enough" in the blockchain (as they can't fully verify blocks). Remember the security of a txn comes from its hash, and if that is broken, full nodes are compromised as well. This (with some other protocol enhancements) could even lead to SPV+ type nodes which maintain only the UTXO and can provide even more support to the network.

The raw spent txns make up the majority of the blockchain, and beyond initial bootstrapping even full nodes rarely "use" many of these txns more than once. There is very little need for all nodes to continually maintain a full copy of all txns, BUT it is also important that a node be able to obtain a copy of a given txn if needed. That makes spent txns a perfect use case for a DHT, but it could be expanded to include the UTXO as well. As the UTXO grows, individual outputs could be scored on the probability they will be used in the near future (based on output value, dust limit, current fees, and age). Older, low-scoring outputs would be dropped from the local cache of some DHT nodes. In the event that a 5-year-old, 1-satoshi output is used in a future block, those nodes could trustlessly request a copy of it from the DHT.
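A hypothetical scoring function along those lines (the weights, half-life, and fee figure are invented for illustration; only the inputs come from the paragraph above):

Code:
# Score UTXO entries by likelihood of near-term spend.
def spend_score(value_satoshi, age_blocks, dust_limit=546, fee_per_kb=10_000):
    if value_satoshi <= dust_limit:
        return 0.0                                  # uneconomical to spend
    value_factor = min(value_satoshi / (100 * fee_per_kb), 1.0)
    age_factor = 1.0 / (1.0 + age_blocks / 52_560)  # ~one year of blocks
    return value_factor * age_factor

# A 5-year-old 1-satoshi output scores 0 and may be dropped locally;
# if it ever turns up in a block it can be fetched from the DHT.
print(spend_score(1, 5 * 52_560), spend_score(10_000_000, 10))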
sr. member
Activity: 266
Merit: 250
i really like that idea.
but how to ensure that really old transactions are available forever?
they are only needed be new upcoming nodes which should mean most nodes dont even bother with storing anymore
donator
Activity: 1218
Merit: 1079
Gerald Davis
Many times in the past people have suggested using a DHT (either by name or by reinventing the concept) to store the entire blockchain, but this breaks the trust-free nature of blockchain verification: to verify blocks a node must have the full block history. With the TxIds from the longest chain, however, you cannot be spoofed if you ask an arbitrary node for the details of a txn. To create a different txn with the same TxId would require a (second-)preimage attack on the hash, and that is considered infeasible as long as SHA-256 is cryptographically strong. This would allow txns to be stored in a distributed, trust-free manner using a DHT (Distributed Hash Table). The primary reason would be to reduce the minimum storage requirements of each node and allow nodes to support the network on a partial basis. Currently there is an all-or-nothing requirement: you are either a full node (cost is ~25GB) or you do not help the network in any way (SPV nodes do not improve the security of the network). Call it a DHT of transactions: a Distributed Transaction Table (DTT). The required changes would not be significant and could be done in a backwards-compatible manner.

1) The block structure for DHT nodes would be changed such that it contains just the block header and txn hashes (TxIds).
2) A new block message would be created which allows relaying this "reduced block" (also useful in reducing the orphan cost of blocks).    
3) The full transactions would be stored in the DTT. Individual nodes would store a subset of the DTT.
4) Current "full nodes" could be converted into DHT nodes by simply maintaining a 100% share of the DHT.
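A minimal sketch of points 1-4 (the types and the bucketing rule are illustrative, not a wire format):

Code:
from dataclasses import dataclass
from typing import List

# A "reduced block" carries only the header and TxIds; a node's DHT
# share is determined by TxId prefix.
@dataclass
class ReducedBlock:
    header: bytes        # the 80-byte block header
    txids: List[bytes]   # 32-byte txn hashes

def in_my_share(txid: bytes, my_bucket: int, num_buckets: int) -> bool:
    # A node holding k of num_buckets buckets stores a k/num_buckets
    # share; a "full node" is simply the case num_buckets == 1.
    return int.from_bytes(txid[:4], "big") % num_buckets == my_bucket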

The naive solution would be for individual nodes to drop all txns which aren't in their share and use the DTT as needed. This would be rather inefficient, as the probability of a node needing a particular txn (either internally or because it was requested by a peer) is not the same for all txns. Certain txns would always be maintained in the local cache; those would include the following (a sketch combining these rules appears after the list):
a) Txns in "recent" blocks.  In the event of a reorg updating the UTXO requires knowing the details of the txns of the block being orphaned.
b) Txns in the UTXO. These are txns which have at least one unspent output. The UTXO is needed to validate new txns and new blocks.
c) Txns in the memory pool. This is the set of txns which are valid and known to the node but not yet included in a block. If block messages include txn hashes only, the memory pool will be needed to validate a block.
d) Txns involving the user's keys.  This isn't a requirement but a user has a vested interest in ensuring copies of his txns are maintained.  If nothing else this may be psychologically useful to improve trust in the DHT concept.
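Here is the sketch combining rules a)-d) with the DHT share (the index objects and the my_share predicate are assumed to exist; all names are illustrative):

Code:
RECENT_BLOCKS = 144  # one block-day, per the reorg rationale above

def keep_locally(txid, height, tip_height, utxo, mempool, wallet, my_share):
    recent = height is not None and tip_height - height < RECENT_BLOCKS
    return (recent                  # a) txns in recent blocks
            or txid in utxo         # b) at least one unspent output
            or txid in mempool      # c) valid but not yet in a block
            or txid in wallet       # d) involves the user's own keys
            or my_share(txid))      # otherwise, only our DHT share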

Understand, however, that the integrity of txns comes from their txn hash (TxId), so a "badly optimized" DHT would have suboptimal performance but not reduced security.

Some rough numbers on the current storage requirements:
Full raw blockchain = ~18.9 GB
Undo chain = ~2.5 GB (can this be reduced? seems undo information past say the last 144 blocks would be of little value)
Blockchain Indexes = ~2.0 GB (I haven't decoded these but find it interesting that they are so large relative to the blocks)
Chainstate (UTXO) = ~400 MB (compressed format contains unspent outputs & txn metadata only)
Memory Pool = <1MB (unconfirmed txns, 538 at time of post)
Total: ~23.8 GB

The current blockchain stats
Number of blocks: 307,394
Number of transactions: ~41,182,000
Number of unspent txns: 3,347,562 (txns with at least one unspent output)
Number of unspent outputs: 11,648,626

Breaking this down
Size of hash-only blocks: 1,320 MB (includes headers & txn hashes)
Size of the UTXO: 400 MB (unspent outputs only)
Size of the Unspent Txns in entirety: ~1,500 MB

So the local cache storage requirement for a DHT node would be ~1.7 GB (or ~2.8 GB if the full txns of UTXO elements are retained). If we assume the average node maintains a 5% DHT share of the remaining txns (the bulk of the block bodies), that would be another ~1GB. DHT nodes would also keep a local copy of the full memory pool, but that is a rounding error on storage requirements.
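The arithmetic, for anyone checking (all sizes in MB, taken from this post):

Code:
headers_and_hashes = 1_320     # hash-only blocks
utxo_compressed    = 400       # chainstate-style UTXO
unspent_txns_full  = 1_500     # full txns having an unspent output
spent_txn_bulk     = 18_900 - unspent_txns_full   # remaining block bodies

base      = headers_and_hashes + utxo_compressed     # ~1.7 GB
base_full = headers_and_hashes + unspent_txns_full   # ~2.8 GB
share     = 0.05 * spent_txn_bulk                    # ~870 MB, call it 1 GB
print(base, base_full, round(share))                 # 1720 2820 870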

This wouldn't reduce the bootstrap time (or the amount of bandwidth used) for bootstrapping full nodes. As a full node you would still need to download 100% of the txns of 100% of the blocks to verify the longest chain is the longest valid chain. However, once bootstrapped, the ongoing resource requirements would be reduced. This would work best if the protocol contained a block message which sent just the block header & txn hashes. Currently, when a full node receives and verifies a new block which extends the longest chain, it stores the full block and removes the now-spent txns from the UTXO. DHT nodes would instead record the reduced block (header & hashes only), save their DHT share of the spent txns, and discard the rest. To reduce the overhead from reorgs and provide better coverage for syncing nodes needing only the tip of the blockchain, it may make sense for DHT nodes to retain all txns for the most recent x blocks (144? one block-day?).

If a structure like this were used for all nodes then "full nodes" would simply be nodes participating in the txn DHT that retain a 100% copy of the full txn set. The best comparison would be to the status of a "seeder" in the BitTorrent protocol. Since all DHT nodes would retain a full copy of the UTXO, these nodes could still support SPV clients. SPV clients could actually support the network by retaining a portion of the txn set. Retaining older txns would be especially beneficial in that it could reduce the load on full nodes when the network bootstraps new full nodes.