Now that you have an established site and a use case for AWS services, I can get you $1,000 in AWS credits for your own account, if you are interested. I'm in training to be certified as a cloud consultant.
Old process:
time cat <(gunzip -c addresses_sorted.txt.gz) daily_updates/*.txt | sort -uS80% | gzip > test1.txt.gz
real 90m2.883s
Faster new process:
time sort -mu <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) | gzip > test2.txt.gz
real 51m26.730s
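A minimal sketch of why the merge version does less work, assuming the big list is already sorted in the same locale as the freshly sorted updates. The file names and sample lines here are made up for illustration, and temp files stand in for the process substitutions used above:

```shell
# Hypothetical sample data; big_sorted.txt and daily.txt are placeholders.
merge_demo() {
  dir=$(mktemp -d)
  printf 'a\nc\ne\n' > "$dir/big_sorted.txt"   # stands in for the large, already sorted list
  printf 'd\nb\nc\n' > "$dir/daily.txt"        # stands in for the unsorted daily updates
  # Only the small input gets a real sort.
  LC_ALL=C sort -u "$dir/daily.txt" > "$dir/daily_sorted.txt"
  # -m merges the two sorted inputs in one pass; -u drops the duplicate "c".
  LC_ALL=C sort -mu "$dir/big_sorted.txt" "$dir/daily_sorted.txt"
  rm -rf "$dir"
}

merge_demo   # prints a, b, c, d, e on separate lines
```

The large file is only read once and never re-sorted, which is presumably where the time saving comes from.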
Old code:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 | gzip > newchronological.txt.gz
real 194m24.456s
New:
time comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt) > newaddresses.txt
real 8m4.045s
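For anyone unfamiliar with comm, here is a small sketch of what the -13 step selects, assuming both inputs are sorted the same way. The file names and contents are invented for the example:

```shell
# Hypothetical sample data; known_sorted.txt and daily.txt are placeholders.
comm_demo() {
  dir=$(mktemp -d)
  printf 'a\nc\ne\n' > "$dir/known_sorted.txt"   # addresses already in the sorted list
  printf 'f\nc\nb\nf\n' > "$dir/daily.txt"       # daily updates, unsorted, with repeats
  LC_ALL=C sort -u "$dir/daily.txt" > "$dir/daily_sorted.txt"
  # -1 hides lines found only in the first file, -3 hides lines found in both,
  # so only lines unique to the second file (the new addresses) remain.
  LC_ALL=C comm -13 "$dir/known_sorted.txt" "$dir/daily_sorted.txt"
  rm -rf "$dir"
}

comm_demo   # prints b and f
```

Note that the result comes out in sorted order, not chronological order, which is why the later steps are still needed.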
time cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2 > all_daily_addresses_chronological_order.txt
real 1m14.593s
cat all_daily_addresses_chronological_order.txt newaddresses.txt | nl -nln | sort -k2 -S80% > test.txt
real 0m36.948s
cat test.txt | uniq -df1 | sort -nk1 -S80% | cut -f2 > test2.txt
real 0m7.721s
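The last two steps can be tried on a toy example. This is only a sketch with invented file names and single-letter "addresses", small enough that every line number from nl has the same width:

```shell
# $1: daily addresses in chronological order (duplicates already removed)
# $2: sorted list of addresses that are new
new_in_chrono_order() {
  # Number every line, group equal addresses next to each other, keep only
  # addresses that occur twice (once in each file), then restore the original
  # order by line number and strip the numbers again.
  cat "$1" "$2" | nl -nln | sort -k2 | uniq -df1 | sort -nk1 | cut -f2
}

chrono_demo() {
  dir=$(mktemp -d)
  printf 'c\nd\na\nb\n' > "$dir/daily_chrono.txt"   # hypothetical chronological list
  printf 'a\nd\n' > "$dir/new_sorted.txt"           # hypothetical new addresses, sorted
  new_in_chrono_order "$dir/daily_chrono.txt" "$dir/new_sorted.txt"
  rm -rf "$dir"
}

chrono_demo   # prints d, then a
```

The new addresses a and d come back as d, a, which is the order they had in the chronological list.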
Combined:
time cat <(cat <(cat ../daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c ../addresses_sorted.txt.gz) <(sort -u ../daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2 > newaddresses_chronological.txt
real 9m45.163s
Even more combined:
time cat <(gunzip -c addresses_in_order_of_first_appearance.txt.gz) <(cat <(cat <(cat daily_updates/*.txt | nl | sort -uk2 -S40% | sort -nk1 -S40% | cut -f2) <(comm -13 <(gunzip -c addresses_sorted.txt.gz) <(sort -u daily_updates/*.txt)) | nl -nln | sort -k2 -S80%) | uniq -df1 | sort -nk1 -S80% | cut -f2) > new.alladdresses_chronological.txt
real 19m34.926s
-m, --merge
merge already sorted files; do not sort
5
3
7
2
9
4
3
5
3
7
2
9
4
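Assuming the two number lists above are a sample input (with 3 appearing twice) and its de-duplicated result, the nl/sort/cut chain from the earlier commands reproduces it. The function name is made up for this sketch, and it assumes the input contains no empty lines, since nl does not number those by default:

```shell
# Keep only the first appearance of each line, preserving the original order.
first_seen() {
  # nl prefixes a line number, sort -uk2 keeps one line per value (the first one
  # seen), sort -nk1 restores the original order, cut -f2 drops the numbers.
  nl | sort -uk2 | sort -nk1 | cut -f2
}

printf '5\n3\n7\n2\n9\n4\n3\n' | first_seen   # prints 5 3 7 2 9 4, one per line
```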