I just did a crawl through Blockchair's data dump repository at https://gz.blockchair.com. The total size of all the data on the site (blocks, transactions, outputs, inputs, etc. from all the chains) is, as of today, about 2.7 terabytes. Here is the command I used to measure it in bytes:
    wget --mirror --no-host-directories -e robots=off --reject html -l 0 --spider https://gz.blockchair.com 2>&1 | grep -E -o 'Length: [0-9]+' | awk '{sum += $2} END {print sum}'

(--spider makes wget request only the headers, so nothing is actually downloaded; the grep/awk part just sums up the Length: fields.) It only takes a few hours to run.
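If you want that figure broken down per chain, the same spider log can go through a small awk script instead of grep (this assumes wget's usual log layout, where each "Length:" line follows the request line for its URL):

    wget --mirror --no-host-directories -e robots=off --reject html -l 0 --spider \
         https://gz.blockchair.com 2>&1 \
      | awk '/^--/             { split($NF, p, "/"); chain = p[4] }   # remember the chain from the URL path
             /^Length: [0-9]+/ { sum[chain] += $2 }                   # attribute the size to that chain
             END               { for (c in sum) print c, sum[c] }'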
The dumps seem like a better alternative to the Blockchair API proper, which apparently just bans IP addresses at random unless you have a paid API key.
Now, since the data files are all CSV-style, just with the fields separated by tabs (i.e. TSV), I was wondering: what is the best way to compress all this data, at least per chain? I know TSV is a very inefficient representation - there are already many megabytes of tab characters alone, and there's no reason to store those as-is - so simply running everything through XZ or LZMA is probably not the best solution.
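To give an idea of the direction I'm thinking in (the directory layout, chain names, and *.tsv.gz extension below are just placeholders for however the dumps end up organised locally): stream each chain's dumps through a single long-window compressor, so repeated addresses, scripts, and the separators themselves get matched across files rather than compressed file by file. A rough sketch:

    # placeholder layout: one directory per chain containing its daily *.tsv.gz dumps
    for chain in bitcoin ethereum dogecoin; do
        # stream-decompress every dump for the chain, in name order, into one
        # zstd archive; --long=31 gives a 2 GB match window so values repeated
        # across days deduplicate (decompressing later also needs --long=31)
        find "$chain" -name '*.tsv.gz' -print0 | sort -z | xargs -0 zcat \
          | zstd -19 --long=31 -T0 -o "${chain}.tsv.zst"
    done

Since everything is piped, the decompressed TSV never has to hit the disk - only the final per-chain archive does.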
Nevertheless, it looks like all this stuff could be distributed at a reasonable size via BitTorrent - and could even be used to accelerate crypto applications, so they only need to fetch today's data online - provided the dumps are compressed well enough (per chain; I don't want to mix different chains' data together). I would have liked to try it myself, but unfortunately this project needs about 2x the disk space I have available right now.
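And for the "fetch today's data" part, I imagine something as simple as this per chain and table (the filename pattern is just my guess from the directory listing, and a given day's dump presumably only shows up after that UTC day is over):

    # pull the previous UTC day's dump for one chain/table; names and URL
    # pattern are assumptions - adjust to whatever the listing actually shows
    chain=bitcoin
    table=transactions
    day=$(date -u -d yesterday +%Y%m%d)   # GNU date
    wget "https://gz.blockchair.com/${chain}/${table}/blockchair_${chain}_${table}_${day}.tsv.gz"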