
Topic: gz.blockchair.com data dumps - better way to store it?

legendary
Activity: 2856
Merit: 7410
Crypto Swap Exchange
Quote
Now, since the data files are all in CSV format, just with the fields separated by tabs instead of commas, I was wondering what the best way is to compress all this data, at least per chain. I know that CSV is a very inefficient representation: there are already megabytes of TAB characters alone, and there's no reason to store those, so simply compressing it all with XZ or LZMA is not necessarily the best solution.

Looking at this experiment (https://softwarerecs.stackexchange.com/a/49216), which compresses 219 GB of JSON down to somewhere between 6.8 GB and 32 GB depending on the algorithm, a common compression algorithm seems good enough. But the method mentioned by @witcher_sense seems to be the better option when you need to query the data.
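
For a quick local test, here is a rough sketch that compresses each per-chain dump with Python's built-in lzma module (the dumps/<chain>/*.tsv layout is just an assumption about how the files were mirrored):

Code:
# Compress every per-chain .tsv dump with LZMA (xz), keeping chains separate.
# The dumps/<chain>/*.tsv layout is an assumption; adjust paths to your mirror.
import lzma
import shutil
from pathlib import Path

DUMP_ROOT = Path("dumps")  # hypothetical local mirror of gz.blockchair.com

for tsv_file in DUMP_ROOT.glob("*/*.tsv"):
    xz_file = tsv_file.with_suffix(".tsv.xz")
    with open(tsv_file, "rb") as src, lzma.open(xz_file, "wb", preset=9) as dst:
        shutil.copyfileobj(src, dst)
    print(f"{tsv_file} -> {xz_file}")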
legendary
Activity: 2310
Merit: 4313
🔐BitcoinMessage.Tools🔑
Use Apache Parquet instead of .csv or .tsv files: https://www.databricks.com/glossary/what-is-parquet

Quote
Characteristics of Parquet

Free and open source file format.
Language agnostic.
Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.
Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
Highly efficient data compression and decompression.
Supports complex data types and advanced nested data structures.

Here is a Python script to convert TSV to Parquet with pandas: https://stackoverflow.com/questions/26124417/how-to-convert-a-csv-file-to-parquet
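
A minimal sketch of that conversion, assuming pandas and pyarrow are installed (the file and column names below are only placeholders):

Code:
# Convert a tab-separated Blockchair dump to Parquet with pandas + pyarrow.
# File names are placeholders; pyarrow must be installed for to_parquet().
import pandas as pd

df = pd.read_csv("blockchair_bitcoin_blocks_example.tsv", sep="\t")
df.to_parquet("blockchair_bitcoin_blocks_example.parquet",
              engine="pyarrow", compression="zstd", index=False)

# Reading back only the columns you need is cheap because Parquet is column-oriented.
subset = pd.read_parquet("blockchair_bitcoin_blocks_example.parquet",
                         columns=["id", "time"])

The columnar layout is what makes analytics-style queries fast: you can pull a couple of columns out of a multi-gigabyte dump without touching the rest.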
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
I just did a crawl through Blockchair's data dump repository at https://gz.blockchair.com. The total size of all the data on the site, including blocks, txs, outputs, inputs, etc. from all the chains, is (as of today) about 2.7 terabytes. Here is the command I used to measure it in bytes:

Code:
wget --mirror --no-host-directories -e robots=off --reject html -l 0 --spider https://gz.blockchair.com 2>&1 | grep -E -o 'Length: [0-9]+' | awk '{sum += $2} END {print sum}'

It only takes a few hours to run.

It seems to be a better alternative to using the Blockchair API proper, which apparently just randomly bans IP addresses unless you have a paid API key.

Now, since the data files are all in CSV format, just with the fields separated by tabs instead of commas, I was wondering what the best way is to compress all this data, at least per chain. I know that CSV is a very inefficient representation: there are already megabytes of TAB characters alone, and there's no reason to store those, so simply compressing it all with XZ or LZMA is not necessarily the best solution.

Nevertheless, it looks like all this stuff can be distributed at a reasonable size via BitTorrent, and it could even be used to accelerate crypto applications so that they only need to fetch today's data online, provided the dumps are compressed enough (per chain; I don't want to mix up data from different chains). I would have liked to try it myself, but unfortunately this project needs twice the disk space I have available right now.
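
If anyone wants to experiment with the "fetch today's data" part, here is a rough sketch; the exact file-naming pattern on gz.blockchair.com is my assumption from browsing the repository and may need adjusting per table:

Code:
# Download and unpack the latest daily per-chain dump from gz.blockchair.com.
# The URL/file-name pattern below is an assumption and may need adjusting.
import gzip
import shutil
import urllib.request
from datetime import date, timedelta

chain, table = "bitcoin", "blocks"
day = (date.today() - timedelta(days=1)).strftime("%Y%m%d")  # dumps lag ~1 day
name = f"blockchair_{chain}_{table}_{day}.tsv.gz"
url = f"https://gz.blockchair.com/{chain}/{table}/{name}"

with urllib.request.urlopen(url) as resp, open(name, "wb") as out:
    shutil.copyfileobj(resp, out)

# Decompress to a plain .tsv for whatever tooling processes it next.
with gzip.open(name, "rb") as src, open(name[:-3], "wb") as dst:
    shutil.copyfileobj(src, dst)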