You make the argument that your input size is sufficiently small such that having exponential complexity is okay, and you may have a point.
Since I got no response to my question above, I'll go with 2 versions:
- All addresses ever used, without duplicates, in order of first appearance.
- All addresses ever used, without duplicates, sorted.
I don't see how sorting would be exponential for any of these lists.
All addresses ever used, without duplicates, sorted.
- We already have a list of all the addresses ever used, sorted by address (length n).
- We have a list of (potentially) new addresses (length k).
- We sort the list of new items in O(k log k).
- We check for duplicates in the new addresses in O(k).
- We then read the big list line by line while simultaneously walking through the sorted list of new addresses, comparing values as we go, in O(n + k). The merged file can be written to disk line by line; only the list of new addresses needs to be kept in memory.
Resulting in O(n + k log k + 2k). In this particular case one might even argue that n > k log k + 2k, so the whole thing is O(2n) = O(n). However, it's late here and I don't feel like arguing.
You only need enough memory to hold the new addresses and enough disk space to keep both the old and new versions on disk at the same time.
The 'All addresses ever used, without duplicates, in order of first appearance' list could be created in pretty much the same way.
I'll see if I can whip some code together.
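Here's a rough first sketch of the sorted-merge version in Python (untested; it assumes plain text files with one address per line, and the file and function names are just placeholders):

```python
def merge_sorted(old_path, new_addresses, out_path):
    """Merge a sorted on-disk address list with a small batch of new addresses.

    old_path:       existing file, one address per line, sorted, no duplicates
    new_addresses:  iterable of (potentially) new addresses
    out_path:       destination for the merged, sorted, duplicate-free list
    """
    # O(k log k): sort the new addresses and drop duplicates within the batch.
    pending = sorted(set(new_addresses))

    i = 0  # index into the sorted new addresses
    with open(old_path, "r") as old, open(out_path, "w") as out:
        for line in old:                      # O(n): stream the big list
            addr = line.rstrip("\n")
            # Write any new addresses that sort before the current old one.
            while i < len(pending) and pending[i] < addr:
                out.write(pending[i] + "\n")
                i += 1
            # Skip new addresses that are already in the old list.
            if i < len(pending) and pending[i] == addr:
                i += 1
            out.write(addr + "\n")
        # Flush whatever new addresses remain after the old list is exhausted.
        while i < len(pending):
            out.write(pending[i] + "\n")
            i += 1
```

Only the new addresses are held in memory; the old and merged lists are streamed from and to disk. The first-appearance variant would presumably use the same comparison pass to find which addresses are genuinely new and then append them to that file in the order they were first seen.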
File hosting
Have you considered releasing the big files as torrents with a webseed? This would allow downloaders to still download from your server and then (hopefully) continue to seed for a while, taking some strain off your server.
You might even publish it in an RSS feed so that some contributors could automatically add it to their torrent clients, downloading at e.g. max 1 Mb/s and uploading at >1 Mb/s. This would quickly spread the files over the peers and further move downloads away from your server.