We can read n from disk line by line and compare it to the current position in k.
Yes. In fact, just 2 days ago (on another forum) I was pointed at the existence of "sort -mu":
-m, --merge
merge already sorted files; do not sort
This does exactly what you described. I haven't tested it yet, but I assume it's much faster than a "regular" sort.

Update: I'm testing this now.
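The plan, then, is to keep a sorted copy of the full list and merge each pre-sorted daily list into it. A minimal sketch, assuming hypothetical filenames total.sorted and daily.sorted, each already sorted:

    # merge two already-sorted files and drop duplicates (-u);
    # sort -m never does a full sort, just a single linear merge pass
    sort -mu total.sorted daily.sorted > total.sorted.new
    mv total.sorted.new total.sorted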
However, the bigger problem remains: updating 1.5 billion unique addresses in chronological order. Those lists are unsorted, so for example (placeholder addresses):

Existing long list with 12 years of data:

    addr-C
    addr-A
    addr-B

New daily list:

    addr-B
    addr-D
    addr-A

The end result should be:

    addr-C
    addr-A
    addr-B
    addr-D
It can be done with awk '!a[$0]++', but I don't have that kind of RAM. I'm also not sure how efficient this is for large datasets; it might run into the problem of having to read 30 quadrillion bytes. Either way, I can't test it due to lack of RAM.
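For reference, applied here it would look something like this (hypothetical filenames). awk remembers every line it has seen in the array a and prints a line only on its first occurrence, which preserves chronological order; that array is exactly where the RAM goes:

    # a[$0]++ is 0 (false) the first time a line appears, so
    # !a[$0]++ prints each unique line once, in input order
    awk '!a[$0]++' total.txt daily.txt > total.new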
I ended up with sort -uk2 | sort -nk1 | cut -f2.
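That assumes each line was first prefixed with its line number (e.g. with cat -n), so field 1 is the original position and field 2 is the address: the first sort dedupes on the address, the second restores chronological order, and cut strips the numbers again. A sketch with a hypothetical filename:

    # cat -n prepends "<line number><TAB>" to every line;
    # GNU sort with -u disables the last-resort comparison, so the
    # earliest-numbered occurrence of each address is the one kept
    cat -n total_plus_daily.txt | sort -uk2 | sort -nk1 | cut -f2 > total.new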
I can think of another option that might work: if I use the sorted list to find the new addresses, I can pull those out of the daily update while keeping the chronological order. That way I only have to deal with two ~20 MB files, which is easy. After that, all I have to do is append them to the total file.
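A sketch of that idea, with hypothetical filenames, assuming total.sorted is the sorted copy of the big list and the daily file has no internal duplicates:

    sort -u daily.txt > daily.sorted
    # comm -13 prints lines that appear only in the second file,
    # i.e. addresses not yet in the total
    comm -13 total.sorted daily.sorted > new.sorted
    # pull those addresses out of the daily file in its original
    # (chronological) order and append them to the total
    grep -Fxf new.sorted daily.txt >> total.txt
    # keep the sorted copy in sync for the next run
    sort -mu total.sorted new.sorted > total.sorted.new
    mv total.sorted.new total.sorted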
I don't remember if I offered you this before, but I can host this data for you if it's not too big.
You did (more or less):
If network bandwidth is a problem, I'm able to host this on my hardware if you like.
So I guess you missed my reply too:
I'm more in need of more disk space for sorting this data, but I haven't decided yet where to host it.
(I can throw up to 300 GB for this project.) I can also set up an rsync cron job to pull updates from your temporary location.
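(For what it's worth, that cron job could be as small as a one-line crontab entry; the host and paths below are hypothetical:)

    # pull new data every night at 04:00, compressed, preserving attributes
    0 4 * * * rsync -az user@temp-host.example.com:/srv/addrlists/ /data/addrlists/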
It is a good offer, but currently disk space isn't the problem. I am looking for a web host that allows me to abuse the disk continuously for a few hours once in a while. Most VPS providers aren't that happy when I do that, and my (sponsored) AWS server starts throttling I/O when I do.
I'm (again) short on time to test everything, but after some discussion on another forum I created a RamNode account. It looks promising so far. If I can pull that off by automating everything, it's not that expensive to use it for only a couple of hours per month.