Why not just store blocks at regular intervals? What is the benefit of making it (pseudo) random?
I have X gigabytes today that I want to contribute to the Bitcoin network. I'd like it to use all of that X as soon as possible.
But I don't want to be in a situation where, as the network grows, the distribution of stored blocks becomes non-uniform. E.g. because I used up my space and then more blocks came into existence, I'm over-heavy in old blocks... or my locality becomes poor (creating connection overhead for peers that can only fetch a few blocks per connection).
I don't want to have to rely on measuring my peers to find out which sections are missing and need storing, since an attacker could falsely make some areas of the chain appear over-represented in order to leave them actually under-represented.
Eventually, if sparse nodes are to be useful to nodes doing their IBD, you'll want to learn what ranges a node serves before you try connecting to it. So I don't want to have to signal more than a few bytes of data to indicate which ranges I'm storing, or have to continually keep updating peers about what I'm storing as time goes on... as that would require a lot of bandwidth.
I want to be able to increase or decrease my storage without totally changing the blocks I'm storing or biasing the selection. (though obviously that would require some communication to indicate the change).
So those are at least the motivations I've mostly been thinking about.
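To make that concrete, here's one possible rule that would satisfy those properties. This is just an illustration, not a concrete proposal: the window size, the hashing, and the function names are all arbitrary choices for the example. The point is that a node's stored blocks follow deterministically from a small random seed plus a storage fraction, the runs are contiguous so locality stays good, and anyone who knows the (seed, fraction) pair can reproduce the selection.

    import hashlib

    WINDOW = 10_000  # blocks per window (arbitrary for the example)

    def run_start(seed: bytes, window_index: int) -> int:
        """Pseudo-random but reproducible start offset inside a window."""
        h = hashlib.sha256(seed + window_index.to_bytes(4, "little")).digest()
        return int.from_bytes(h[:4], "little") % WINDOW

    def stores_block(seed: bytes, fraction: float, height: int) -> bool:
        """Would a node advertising (seed, fraction) keep the block at `height`?"""
        window, offset = divmod(height, WINDOW)
        start = run_start(seed, window)
        length = int(fraction * WINDOW)
        # One contiguous run per window, wrapping at the window boundary.
        return (offset - start) % WINDOW < length

Growing the fraction only lengthens each run, so nothing already stored gets invalidated and the advertised seed never changes.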
Do you know whether someone also intends to work on the latter ideas you mention below? I don't want to duplicate any work, but I'm interested in working on this set of features, since I believe it is both very useful and also interesting.
I don't think anyone is coding on it. There has been a bunch of discussion in the past, so at least for my part I have an idea of what I want to see done eventually (the stuff I outlined).
Pieter posted on bitcoin-development about service bits for this... though that wasn't talking about sparse block support, just bits indicating that you serve some number of the most recent blocks. (He also did some measurements which showed that the request probability became pretty uniform past about 2000 blocks deep; obviously recent blocks were requested much more frequently, due to nodes having been offline for just a few days/weeks.)
Yes, I know that the caching idea has lots of problems. I also thought about randomly removing blocks so that all nodes together would still have the full history available even if each node only has a fraction, but I didn't go as far as you did with the "predictable" block ranges. I like this idea very much! If you are trying to bootstrap and currently have no connected peers which have a particular block, would you randomly reconnect to new peers until you get one? Or implement a method to explicitly ask the network "who has this block range"?
My thinking is that initially you'll have to connect to a node to find out what ranges it has (e.g. we'll just add a handshake for it to the p2p protocol), and you'd just disconnect peers that weren't useful to you while also learning who has what (e.g. if you need blocks 1000-2000 but find 100000-110000, you'll know where to go later). But later the address message should be revised so that it can carry this data with it, so your peers will tell you which nodes are likely to have the blocks you need.
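Just to give a sense of scale (this is a made-up encoding, not a proposed message format, and the field layout is purely illustrative): the whole advertisement could be on the order of a dozen bytes, which is small enough to eventually ride along in addr-style gossip.

    import struct

    def encode_sparse_ad(seed: bytes, start_height: int, block_count: int) -> bytes:
        """Pack a hypothetical 'what I store' advertisement into 12 bytes."""
        assert len(seed) == 4
        return seed + struct.pack("<II", start_height, block_count)

    def decode_sparse_ad(payload: bytes):
        seed = payload[:4]
        start_height, block_count = struct.unpack("<II", payload[4:12])
        return seed, start_height, block_count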
My opinion is that the undo files are a good starting point - they are only used for chain reorgs, so the networking issues are not present with them. If you keep just the last couple of blocks in them (or to the last checkpoint), you should have exactly the same functionality as currently, right? So why not remove them before the last checkpoint without turning off the "network" service flag, or even do this always (not only when enabled explicitly)? This seems like the logical first step to me (unless there are problems I'm missing with that approach).
We (at least Sipa and I) are planning on removing checkpoints almost completely after headers-first is merged. They are really damaging to people's reasoning about the security model, and headers-first removes practically all actual use of them. (Instead they'd be converted into a "chainwork sufficiency" number, i.e. how much work the best chain should have, plus something to put the wallet into safe mode and throw an error if a large reorg happens.)
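Very roughly, and only as a sketch with placeholder numbers (nothing here is implemented, and the names and thresholds are made up):

    MIN_CHAIN_WORK = 2**70        # "sufficiency": total work the best chain should have
    SAFE_MODE_REORG_DEPTH = 100   # depth that trips the wallet's safe mode

    def chain_work_is_sufficient(total_work: int) -> bool:
        return total_work >= MIN_CHAIN_WORK

    def should_enter_safe_mode(reorg_depth: int) -> bool:
        # Note: this doesn't reject the reorg (see the consensus-split concern
        # below); it only halts the wallet and raises an error for the operator.
        return reorg_depth > SAFE_MODE_REORG_DEPTH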
Undo files take up comparatively little space, all of them right now amount to only about two and a half gigs... and unlike the blocks, if you do want them again you can't just go fetch them.
In general I'm concerned about prescribing any particular depth beyond which a node should refuse to reorg: any at all, or at least any that is remotely short. The risk is that if someone does create a reorg at that limit they can totally split the consensus (e.g. let half the nodes get one block that crosses the threshold, then announce the reorg). Doubly so when you consider corner cases like the initial block download, e.g. where you get a node stuck on a bogus chain (though guarding against that would be part of the idea behind a sufficiency test).
Is this the O(n) approach you mention? I don't think that O(n) with n being the block height is too bad.
Yep, that's basically it. The only reason I don't really like O(n) is that perhaps your addr database has 100,000 entries, 3/4 of which are bogus garbage put out by broken or malicious nodes. Now you need to do a fair amount of work just to figure out which peers you should be trying.
However, I think that it is in theory possible to choose the ranges for Ni as well as the values for Si in such a way that you get O(log n):
Ni randomly in [2^i, 2^(i+1)),
Si = 2^i
By tuning the base of the exponential or introducing some factors, it should be possible to tune the stored fraction of blocks (I've not done the math, this is just a rough idea). I don't know whether exponentially rising range sizes have any detrimental effect on the statistical properties of how many copies of each block the network keeps. Maybe something more clever is also possible.
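Concretely, it would look something like this. I'm using a hash of a per-node seed in place of "randomly" so the choice is reproducible; that part, and the function names, are just for illustration.

    import hashlib

    def N(seed: bytes, i: int) -> int:
        """Reproducible choice of N_i in [2^i, 2^(i+1))."""
        h = hashlib.sha256(seed + i.to_bytes(4, "little")).digest()
        return 2**i + int.from_bytes(h[:8], "little") % 2**i

    def S(i: int) -> int:
        """S_i = 2^i."""
        return 2**i

    def stores_block(seed: bytes, height: int) -> bool:
        """Range j covers [N_j, N_j + S_j), which lies inside [2^j, 3 * 2^j),
        so a block at `height` can only fall in range i or i-1, where
        i = floor(log2(height)): an O(1) check per node."""
        if height < 1:
            return False
        i = height.bit_length() - 1  # floor(log2(height))
        return any(N(seed, j) <= height < N(seed, j) + S(j)
                   for j in (i, i - 1) if j >= 0)

With S_i = 2^i each node keeps roughly half of every doubling interval, and only about log2(n) ranges exist up to height n, so describing (or checking) a node's holdings stays cheap; shrinking S_i relative to the interval width is where the tunable fraction would come in.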
Oh interesting. I got so stuck trying to come up with something O(1) that I didn't think about log-shaped solutions. I'll give that some thought...
I think in general we want to be able to claim that, if the node seeds are uniformly distributed, the block distribution approaches uniform at all times as the node count approaches infinity, regardless of how the size is set. In practice I expect (and hope) that there will be a fair number of nodes with really large size settings (e.g. all the blocks) which will help even out any non-uniformity. Really, in my mind, most of the reason for supporting sparse nodes is to make sure everyone can contribute at whatever level they want, and to make sure that those who do choose to contribute a lot of storage aren't totally saturated with requests.
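For what it's worth, a quick simulation along those lines (reusing the toy window rule from my sketch above, so purely illustrative and not a proof) does show the per-block copy count tightening around its mean as the node count grows:

    import os
    import hashlib

    WINDOW = 1_000
    CHAIN_HEIGHT = 10_000   # ten windows
    FRACTION = 0.1          # each node keeps 10% of every window
    RUN = int(FRACTION * WINDOW)

    def run_start(seed: bytes, window_index: int) -> int:
        h = hashlib.sha256(seed + window_index.to_bytes(4, "little")).digest()
        return int.from_bytes(h[:4], "little") % WINDOW

    for n_nodes in (100, 1_000, 10_000):
        copies = [0] * CHAIN_HEIGHT
        for _ in range(n_nodes):
            seed = os.urandom(4)   # uniformly distributed node seeds
            for w in range(CHAIN_HEIGHT // WINDOW):
                start = run_start(seed, w)
                for k in range(RUN):
                    copies[w * WINDOW + (start + k) % WINDOW] += 1
        mean = sum(copies) / len(copies)
        spread = (max(copies) - min(copies)) / mean
        print(f"{n_nodes:>6} nodes: ~{mean:.0f} copies/block, relative spread {spread:.2f}")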