Hey everyone,
I would love to hear your feedback on this idea for using Bitcoin as the basis for a file library.
I put a pretty version of this document as a gist:
https://gist.github.com/paulkernfeld/4278533bf83887f6f0ee67765c66d54d.
I am well aware that some people disagree with using Bitcoin as a data storage medium. I'm not particularly interested in having that discussion in this thread, though.
Exandria: A decentralized public file library
There are many great file repositories online: WikiLeaks and The Internet Archive are two prominent examples. Even these two notable sites are centralized, however. This means that censorship-happy governments can and do find ways to attack these sites by going after their founders, servers, or DNS records.
This article describes Exandria, a design for a file library with [decentralized ownership](paulkernfeld.com/2016/02/19/world-writable.html). Any owner can vote to add or remove content, making Exandria extremely difficult to censor.
Properties:
- Searching: users can search filenames
- Downloading: the ability to download files
- Censorship-resistant: it's hard to censor files
- Spam-resistant: it's hard to spam the system
- Zero-configuration: a new user can join the network instantly, without an invite
Many systems have some of these properties, but not all of them. If you know of another system that does or tries to do this, I would love to hear about it!
Overview
--------
* The Bitcoin blockchain is used as the canonical data source, containing links to file metadata.
* One writes to the library by burning bitcoins.
* Anyone can read from the library.
* All data is transferred via P2P networks.
Data Model
==========
The file library is represented as a mutable weighted set of files.
* Set: an unordered collection of files.
* Mutable: the library can change over time (i.e. files can be added and removed).
* Weighted: each file has a weight, to represent that some files are more important than others.
Each file has a title, an extension, and a reference to the content of the file.
Here's an example of what this set might look like. Note that the "magnet" field contains a hash of the content of the file, which doubles as a way to locate the file.
```json
{
"name": "Moby Dick",
"ext": "epub",
"magnet": "magnet:?xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C",
},
{
"name": "Scott Joplin - The Entertainer",
"ext": "mp3",
"magnet": "magnet:?xt=urn:sha1:NAE52SJUQCZO5CYNCKHTQCWBTRNJIV4W",
},
...
```
Delta Encoding
--------------
Delta encoding is a way to represent a set using an append-only log. Changes to the set are represented as additions and removals.
A metadata stream "add" entry might look like this:
```json
{
"op": "add",
"value": {
"etx": "epub",
"magnet": "magnet:?xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C",
"title": "Moby Dick"
}
}
```
Each add entry will have a hash, which will be a hash of the serialized JSON of `value`. Multiple identical add entries will have the same hash, although it's possible for two add entries to be functionally identical but to have different hashes, e.g. if the JSON has different whitespace.
A metadata "remove" entry simply refers to the hash of an entry that has been added:
```json
{
"op": "remove",
"hash": "c3f0fe269c05c3438d57c40526bd016d"
}
```
Weights
-------
Since this is a weighted multi-set, we need some way to make weights work with delta encoding. We can say that each "add" operation increments the weight of an element and each "remove" operation decrements the weight of the element. An element is only a member of the final set if it has a positive weight.
This delta encoding:
```json
{
"op": "add",
"value": "cat",
"weight": 5
},
{
"op": "add",
"value": "cat",
"weight": 8
},
{
"op": "add",
"value": "hat"
"weight": 4
},
{
"op": "remove",
"hash": "80c7384a6339a053baee278cb13e578c", // the hash of "hat"
"weight": 6
}
```
...results in this set ("hat" has weight -2, so it's excluded):
```
{
"value": "cat",
"weight": 13
}
```
Architecture
============
[diagram 1]
The protocol consists of three layers:
1. A layer which stores identities and commitments in the Bitcoin blockchain
2. A layer for storing streams of file metadata
3. A layer that can retrieve a file given a magnet link, using BitTorrent
Identities and commitments
==========================
Unlike in many other systems, identities are not stored in a centralized database. Instead, they are stored on the Bitcoin blockchain.
Each identity is just a cryptographic key pair. Identity public keys are recorded onto the Bitcoin blockchain.
You only need an identity to write to Exandria; you can read from it without an identity.
There are no moderators or administrators; all users are of the same type.
Identity weights
----------------
In order to become a contributor to the library, the user must [burn](
https://en.bitcoin.it/wiki/Proof_of_burn) some bitcoins to prove their commitment. This prevents spamming.
Each identity has an associated "weight," which is the total amount of bitcoins burned by that identity. This is used to relatively prioritize the contributions of each identity; the files posted by an identity with a higher weight will be ranked higher in search results.
Writing
-------
In order to create a new identity or increase the commitment of an existing identity, the user burns some bitcoins.
Reading
-------
In order to download a list of all identities, a client can look through the Bitcoin blockchain. This can be optimized to work with [simplified payment verification](
https://en.bitcoin.it/wiki/Thin_Client_Security).
Implementation
--------------
Burns will use a [burn stream](
https://github.com/paulkernfeld/burn-stream).
Each message will be a single commitment. It will include a version byte (`0x00`) and then a public key.
Metadata Streams
================
A file library requires storing a lot of data, but storing data on the Bitcoin blockchain is very expensive. Therefore, only identities are stored on the blockchain, and file metadata is stored separately, in metadata streams.
Each identity has exactly one metadata stream, which is a distributed append-only log. The stream is identified and discovered by the public key of the identity, and the contents of the log must be signed by the identity's private key. This paradigm is already used in [secure-scuttlebutt](
https://github.com/ssbc/secure-scuttlebutt) and [ppspp](
https://tools.ietf.org/html/rfc7574).
A metadata stream contains the add and remove operations allowing the user to make modifications to the global set of files.
The "weight" of an add or remove entry is measured in bitcoins. It is the weight of the identity divided by the total number of entries in the identity's metadata stream.
Searching
---------
In order to search, the client will need to first download and index all metadata entries from all identities. Then, the client can perform the search locally. As new entries come in, the client will need to update its search index.
Retrieving Files
================
Each metadata entry will contain the a magnet link to a file, as well as the extension of the file. Given this, the file can be retrieved using BitTorrent and saved to the user's file system.
Issues
======
Incentives
----------
People with a legitimate interest in the store must be willing to burn Bitcoins, so this scheme relies on the goodwill of people.
Here are some parties who might participate and affect the content of the library:
- Individuals spending money on their hobbies
- Non-profits donating money to fight censorship
- Governments spending money to censor information
- Corporations or lobby groups spending money on advertising
Censorship is inevitable
------------------------
If we don't allow items to be removed from the file library, will we have a library without censorship?
Not really. Even without removal, content can be effectively censored by creating spam entries with the correct metadata but incorrect magnet links.
Therefore, instead of trying to eliminate censorship, the goal of this design is to create a library with a good signal:noise ratio.
Future upgrades
===============
Eigentrust
----------
An [Eigentrust](
https://en.wikipedia.org/wiki/EigenTrust)-style trust graph would allow identities to testify for or against each other. The trust graph will make it so that control over content is given to a majority.
Without a trust graph, a single bad actor can cause a lot of disruption by deleting popular content. With a trust graph, the community can vote to silence bad actors. Of course, this is a type of censorship, but censorship is inevitable and community censorship is probably the best kind of censorship.
Each node in the trust graph is an identity. Each edge is directed and has a weight between -1 and 1. The starting weight for each node is the node's weight from burning bitcoins. The end weight of each node is computed by taking the eigenvalue of the graph.
[diagram 2]
Subjective trust filters
------------------------
Users could add custom client-side trust filters to improve the signal:noise ratio. For example, I could make a filter that downweights content from known scammers, and upweights content from sources that I find most interesting.
Since these filters will be subjective, it's possible for many such filters to exist. Trust filters would be shared out-of-band.
Search scalability
------------------
If each metadata JSON is limited to 1000 bytes, then we'll be able to store one million search records in 1 GB. This is reasonable for a personal computer, but at some point, the size of the metadata itself is going to become prohibitive.
Eventually, a client should be able to search the metadata without storing all metadata locally. This is challenging, because the client must quickly retrieve data from other computers and verify that that data is valid.
One potential solution would be to maintain a distributed and easy-to-verify trie of search terms, much like Bitcoin's [ultimate [Suspicious link removed]pression](
https://bitcointalksearch.org/topic/ultimate-blockchain-compression-w-trust-free-lite-nodes-88208) proposal. This would let nodes download information from the trie when performing a search.
Another potential solution would be to have nodes only download identities (not file metadata) on startup, giving them information about which identities are trustworthy. Then, nodes could perform searches by querying trusted identities, using an RPC-style setup.
Additional metadata
-------------------
Many types of metadata could be added to files, e.g. book author, film/song length, or image resolution. This could add a lot of complexity, since not all types of file have well-standardized metadata specifications.