The state of anonymous file-sharing (and anonymous Web hosting) is very poor. The most commonly-used solution is Tor hidden services, but those have terrible security: they are vulnerable to intersection, timing, and DoS attacks. Plus, Tor is fundamentally centralized, relying on a fixed set of directory authorities to manage the network. I have no doubt whatsoever that the NSA & friends could easily find the true IP address of any Tor hidden service. I think that they only hold off on doing so in most cases because they like to build a false sense of security while holding that tool in reserve.
The ultimate solution to this is IMO to switch from a network architecture of "point-to-point" to a network architecture of "distributed data-store". Instead of having clients talk to a server somewhere (even behind 7 proxies), you should have the "server" upload their data to some "anonymous cloud", and then have clients download the data from that cloud, without ever needing to have any sort of connection to the server machine. This nicely addresses the most serious attacks against Tor: intersection & timing attacks against the server are much more difficult, since the server does not need to be online or sending data at the same time as the client, and DoS attacks are handled by the system itself.
Freenet and GNUnet are distributed data-store systems. Freenet even has a number of websites and social networks which function on the data-store model. It is possible to redo nearly every website under this model, though it is a major change.
But one major problem with Freenet and GNUnet is that their security (especially in Freenet's case) is
ad hoc: they basically jam the system with a bunch of obfuscation and hope that it works. I have no confidence whatsoever in their security as a result. They're both probably especially vulnerable to sybil attacks when used in their opennet modes. They're also very slow, and they would probably fail to provide censorship-resistance if seriously challenged.
What I propose instead is something like the following system. (Note however that this is only a half-baked idea...)
Data stores

There are a handful of data-store servers, each internally centralized. The job of one of these data stores is to maintain a key-value store and to provide it for people to download, either in full (via something like rsync) or via a
private information retrieval (PIR) scheme. When PIR is used, it allows clients to download one or more keys from the server without giving the server any information about what keys were downloaded, providing the client with perfect anonymity even when the entire connection is observed by an attacker.
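To make the PIR idea concrete, here is a toy two-server information-theoretic PIR in Python. This is an illustration of the general technique, not the scheme a real data-store would use, and the record/query format is made up: the client sends each server a random subset of indices, the two subsets differing only at the target index, and each server returns the XOR of the selected records.

```python
import secrets

def server_answer(db, query_bits):
    # A server XORs together every record whose query bit is set.
    acc = bytes(len(db[0]))
    for record, bit in zip(db, query_bits):
        if bit:
            acc = bytes(a ^ b for a, b in zip(acc, record))
    return acc

def pir_fetch(db, i):
    # Client: send server 1 a uniformly random index subset, and server 2
    # the same subset with the target index i flipped.  Each query alone
    # is uniformly random, so neither server learns anything about i.
    n = len(db)
    q1 = [secrets.randbits(1) for _ in range(n)]
    q2 = list(q1)
    q2[i] ^= 1
    a1 = server_answer(db, q1)
    a2 = server_answer(db, q2)
    # The two answers differ exactly by record i.
    return bytes(x ^ y for x, y in zip(a1, a2))

# Toy database of fixed-size records (both "servers" see the same db here)
db = [b"alpha---", b"bravo---", b"charlie-"]
```

The privacy guarantee holds only if the two servers don't collude; real schemes trade this off against computational-PIR variants that work with a single server.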
Data store descriptors

Clients will download "data-store descriptors" describing a number of data-stores. Eg:
Data-store alpha
Public key: xxx
IPs: a.b.c.d, e.f.g.h
Download-Cost: 1 mSatoshi/B
Upload-Cost: 5 mSatoshi/B
Data-store beta
...
It is not important that clients have some
particular combination of data-stores. They can download as many of these descriptors as they want, whenever they come across them. The core software for this system might come with some built-in, but more could be added by the user.
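A minimal sketch of how client software might represent and accumulate these descriptors (the field names and the dict-of-descriptors shape are assumptions, not a wire format):

```python
from dataclasses import dataclass

@dataclass
class StoreDescriptor:
    name: str
    public_key: str
    ips: list
    download_cost_msat_per_byte: float  # mSatoshi per byte
    upload_cost_msat_per_byte: float

# Clients just accumulate whatever descriptors they come across,
# keyed by store name; a newer descriptor replaces an older one.
known_stores: dict = {}

def add_descriptor(d: StoreDescriptor) -> None:
    known_stores[d.name] = d

add_descriptor(StoreDescriptor(
    name="alpha",
    public_key="xxx",
    ips=["a.b.c.d", "e.f.g.h"],
    download_cost_msat_per_byte=1.0,
    upload_cost_msat_per_byte=5.0,
))
```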
Data-stores can charge for uploads and downloads. This can be done perfectly anonymously using blinded bearer certificates, or less-perfectly via eg. Bitcoin-Lightning.
Uploading data

You want to upload song.mp3.
1. Encrypt it with a random key.
2. Break it into fixed-size chunks, say 16kB in size.
3. Choose at least 3, but maybe more, data-stores that you know about.
4. Download all or a large random selection of recently-uploaded data on each of the chosen data-stores.
5. For each of your chosen data-stores, randomly classify each as either Original or Derived, but at least one must be Original.
6. Assume that you're using exactly 3 data-stores. Let your data block be D, and the components at the data-stores be X, Y, and Z. Between 0 and 2 of X, Y, and Z (those at stores you classified as Derived) will already exist as previously-uploaded data. Randomly select the not-yet-known values so that D = X+Y+Z. For example, if you chose Y as Derived and X & Z as Original, you would randomly choose X & Z such that X+Z = D-Y. Prepare to upload the new data block(s) to the Original data-store(s). (You can use any reversible operation to combine the data; maybe addition isn't ideal.)
7. Repeat steps 5-6 for each block of data.
8. Create and prepare to upload your metadata block, which will have a table like:
Block# Store1_Key Store2_Key Store3_Key
1 xyz abc def
2 123 456 789
...
If your table is more than the block size, you can put a pointer to a continuation block at the end of it. (Or structure it as a tree.)
Finally, you should upload all of the blocks that you have prepared to upload, but you should do it in a random order and spread out over time. The more time you put between each block, the more difficult it will be to connect the blocks together.
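The splitting in steps 5-7 can be sketched as follows, using XOR as the reversible combining operation (as step 6 notes, addition may not be ideal; XOR is its natural bytewise analogue). `split_block` and its interface are hypothetical:

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_block(block: bytes, derived: list) -> list:
    """Split one block across 3 stores.  `derived` holds the 0-2
    components reused from data already on the Derived stores; the
    returned list holds the fresh Original components to upload."""
    n_new = 3 - len(derived)
    assert 1 <= n_new <= 3, "at least one store must be Original"
    # Fold the reused components into the target: the new components
    # must XOR to (block ^ derived components).
    target = block
    for d in derived:
        target = xor(target, d)
    # All but one new component are uniformly random...
    new = [secrets.token_bytes(len(block)) for _ in range(n_new - 1)]
    # ...and the last one is whatever makes the XOR come out right.
    last = target
    for r in new:
        last = xor(last, r)
    return new + [last]
```

Since every component except the last is uniformly random, and the last is the XOR of the block with uniformly random data, each stored component on its own is indistinguishable from random bytes.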
Then you'll get a CHK URI that you can give to people which looks something like:
CHK@store1+store2+store3,key1,key2,key3,decryption_key
eg. CHK@alpha+beta+gamma,SVbD9~HM,nzf3AX45,yFCBc-A4,bA7qLNJR7IXRKn6uS5PAySjIM6azPFvK~18kSi6bbNQ
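Parsing a URI of this shape is straightforward; a sketch (the exact grammar here is an assumption):

```python
def parse_chk(uri: str):
    # CHK@<store>+<store>+...,<one key per store>...,<decryption key>
    assert uri.startswith("CHK@")
    parts = uri[len("CHK@"):].split(",")
    stores = parts[0].split("+")
    keys = parts[1:-1]
    decryption_key = parts[-1]
    assert len(keys) == len(stores), "one key per store expected"
    return stores, keys, decryption_key
```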
PIR schemes don't give anonymous uploading natively, so there will need to be some onion routing thing between you and the server. But higher latency is OK here, and there are alternatives to Tor's naïve onion routing such as Riffle, so I think that this can be made very anonymous.
Downloading data

You were given a URI like the one above which leads to song.mp3.
1. You need to have previously downloaded descriptors for all of the data-stores in the URI.
2. From each data-store, download the listed keys using the anonymous PIR scheme, and add the data together. This will get you the metadata block, which lists all of the others.
3. Download all of the other blocks in the same way.
4. Once you have all of the blocks, concatenate them together and decrypt them with the decryption key in the URI.
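The download steps above can be sketched as follows. `pir_fetch` and `decrypt` are stand-ins for the real PIR client and cipher, and `block_keys` is the metadata table (one row of per-store keys per logical block):

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def fetch_file(stores, block_keys, decryption_key, pir_fetch, decrypt):
    """Reassemble a file: for each logical block, PIR-download one
    component per store and XOR the components back together, then
    concatenate the blocks and decrypt the result."""
    blocks = []
    for row in block_keys:
        shares = [pir_fetch(s, k) for s, k in zip(stores, row)]
        blk = shares[0]
        for sh in shares[1:]:
            blk = xor(blk, sh)
        blocks.append(blk)
    return decrypt(b"".join(blocks), decryption_key)
```

In a toy run you can simulate each store as a plain dict and pass `lambda s, k: s[k]` as the PIR fetch and an identity function as the cipher, just to check the reassembly logic.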
Plausible deniability and censorship-resistance for data-stores

The key advantage of this scheme compared to things like Freenet is the plausible deniability and censorship-resistance for the ones storing the data. On Freenet, if you're running a node and someone gives you a CHK that they say is a copyright violation or whatever, it is technically possible for you to expunge that CHK from your node, and so maybe you could be forced to do so. Same for Tor hidden-service DHT participants.
But for a data-store in this scheme, if someone gives you a CHK that they demand be removed, they can say that some data in your data-store is being
used by that CHK, but they can't say whether that data
belongs to that CHK. The data may have been uploaded by someone else entirely, and if you delete it, you may break the original CHK which is totally legitimate, as well as any others which subsequently used that data. It's like creating new content by pasting together words cut out from a newspaper. I suspect that this aspect will make the system totally immune to DMCA takedowns and similar.
Because of this plausible deniability and censorship-resistance, an increased level of centralization can be accepted. You can more reasonably have a few dozen extremely fast, powerful data-stores rather than thousands of nodes on home Internet connections. This eliminates sybil attacks (on nodes) and improves the speed of the system. And while there are
few data-stores, they are not an integral part of the system as a whole (ie. they don't "vote" or anything), and they can be fairly easily replaced if necessary.
Extra thoughts

It's maybe not necessary for each block's components to be stored on
separate data-stores.
A CHK will stop working if any of its data-stores goes down. I wonder if, instead of addition, you could use an error-recovery scheme such that you only need 3 of 4 components of each block, or something like that.
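The simplest such scheme is a single XOR parity piece, which gives exactly "3 of 4": any three pieces reconstruct the fourth. (One caveat, if this were applied naively to the components X, Y, Z themselves: their parity X^Y^Z is the encrypted block D, so the coding would probably have to happen at a different layer of the scheme.) A sketch:

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_3_of_4(c1: bytes, c2: bytes, c3: bytes) -> list:
    # Append a single parity piece; any 3 of the 4 determine the 4th.
    return [c1, c2, c3, xor(xor(c1, c2), c3)]

def recover(pieces: list) -> list:
    """`pieces` has exactly one entry replaced by None (the lost piece);
    the XOR of the remaining three reconstructs it."""
    missing = pieces.index(None)
    acc = None
    for i, p in enumerate(pieces):
        if i != missing:
            acc = p if acc is None else xor(acc, p)
    out = list(pieces)
    out[missing] = acc
    return out
```

Tolerating more than one lost component, or other k-of-n thresholds, would need a real erasure code (eg. Reed-Solomon) rather than plain parity.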
Data-store blocks might have an expiration, but it should be either uniform across the data-store or very coarse-grained.
Data-store keys should be short, do not need to be unpredictable, and do not need to be user-definable. Data-stores might assign sequential keys starting at 0, and fill in gaps as blocks expire.