That's not really what's happening. My script is scraping the website and sending thousands of requests, so to a website operator it looks like a DDoS attack. It may even have been Cloudflare blocking the IP automatically. If they were blocking archiving, they would just block incoming requests from archive.today, which probably uses multiple IP addresses.
tl;dr They are blocking my scraping of the site, not the archiving. You can see all the archived URLs at
https://archive.today/https://hashtalk.org/* (notice how many are from the last hour? That's a script for you.)
Run them through a caching web proxy: Apache/Squid/etc. on the client side, Opera's stuff on the server side.
And maybe add some short pauses to slow down the requests. DDoS speed is not a good idea, although you probably don't have the bandwidth to bother a real website. Do you?
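For the pauses, if the script is just a loop over a list of URLs, something like this would do (a plain-Python sketch; the file name and the 2-second pause are placeholders):

```python
import time
import requests

# hypothetical input file: one URL per line
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    response = requests.get(url, timeout=30)
    # ... do whatever the script does with response.text here ...
    time.sleep(2)  # a couple of seconds between requests keeps the traffic polite
```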
The bandwidth I have is from a single EC2 instance on the Amazon free tier.
The problem I have is that a lot of URLs are really the same page under different addresses. (For example,
https://hashtalk.org/topic/11695/zencloud-maintenance-mode/ and
https://hashtalk.org/topic/11695/zencloud-maintenance-mode/578 and
https://hashtalk.org/topic/11695/zencloud-maintenance-mode/63?page=34.) I need all the posts, and each URL only gets 10 or so, but there's a lot of overlap, which complicates things. What I need now is: 1) a script that strips the URLs to take away the numbers and page parameter at the end, and 2) a script that removes duplicate lines from a file. I'd also want a script that turns a single URL into a list of URLs (e.g. given
https://hashtalk.org/topic/11695/zencloud-maintenance-mode/ as input, it would return
https://hashtalk.org/topic/11695/zencloud-maintenance-mode/1?page=1,
https://hashtalk.org/topic/11695/zencloud-maintenance-mode/1?page=2, etc.). There seem to be around 15 posts per page, so perhaps get the total number of posts with curl and divide by 15 to get the number of pages needed, then add 5 to be sure. (A rough sketch of all three is below.)
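Something like this, maybe (a quick Python sketch; the regex, the 15-posts-per-page figure and the +5 margin are guesses from the examples above, not tested against the real site):

```python
import math
import re

# Matches the thread part of a hashtalk URL, dropping any trailing post
# number or ?page= query. The pattern is an assumption based on the
# example URLs above.
THREAD_RE = re.compile(r"^(https://hashtalk\.org/topic/\d+/[^/?#]+)")

def strip_url(url):
    """Reduce any post/page variant to the bare thread URL."""
    m = THREAD_RE.match(url)
    return m.group(1) + "/" if m else url

def dedupe_lines(lines):
    """Remove duplicate lines while keeping the original order."""
    seen = set()
    return [line for line in lines if not (line in seen or seen.add(line))]

def expand_thread(url, total_posts, posts_per_page=15, margin=5):
    """Turn one thread URL into per-page URLs covering every post."""
    base = strip_url(url).rstrip("/")
    pages = math.ceil(total_posts / posts_per_page) + margin
    return [f"{base}/1?page={n}" for n in range(1, pages + 1)]

if __name__ == "__main__":
    urls = [
        "https://hashtalk.org/topic/11695/zencloud-maintenance-mode/578",
        "https://hashtalk.org/topic/11695/zencloud-maintenance-mode/63?page=34",
    ]
    threads = dedupe_lines([strip_url(u) for u in urls])
    print(threads)                                      # one line per thread
    print(expand_thread(threads[0], total_posts=500))   # all page URLs for it
```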
If I had those, I would actually be able to archive only what's needed instead of overloading archive.today's servers. (I noticed them down before, and now my instance can't access the site. Time to change IPs again.)
Anyone with moderate hacking skills want to help out?
As for the specific advice offered so far: I don't think I'm bad enough to bother hashtalk, and so far my second IP address hasn't been banned, even though it's been running longer than the first one did. Right now I have too many requests to slow them down; if I had the right scripts to minimize the number of requests needed, I could do that. I gave a pretty good description of what is needed above, and I'll say more about it soon. I don't want a caching web proxy, because if a page changes and has a new link, I want to catch that. Perhaps I can set it up to cache for only 5 minutes? Then when Scrapy asks for a page it already fetched within the last 5 minutes, it would get the cached copy instead of hitting the site again. How could I set that up?
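If it stays inside Scrapy, its built-in HTTP cache middleware seems like the easiest way to get roughly that, e.g. in settings.py (a sketch; the 300 seconds just matches the 5 minutes above, and the delay values are arbitrary):

```python
# settings.py -- cache pages for 5 minutes so a re-visit within that window
# is served from disk instead of hitting hashtalk again
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 300   # re-download anything older than 5 minutes
HTTPCACHE_DIR = "httpcache"       # stored under the project's .scrapy directory

# slowing the crawl down a bit at the same time can't hurt
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
```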
What I really need is to fix the Scrapy script to not include duplicates. I want each thread to show up only once in Scrapy, and have my own script generate all the URLs needed for each thread (or have the Scrapy script generate them). There's no need for Scrapy to fetch more than one page per thread; all the info needed is on each page. What seems to happen is that Scrapy wastes its time on a single long thread and never gets through the whole site. Is there a way I can tell Scrapy not to spend too much time on hashtalk.org/topic/somenumbers? For any two URLs with the same prefix up to the numbers, Scrapy should only fetch once an hour.
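One way to get most of that (not a per-hour limit, but each thread fetched only once per crawl) is to canonicalize the thread links before they reach Scrapy's built-in duplicate filter, so every post/page variant collapses to the same URL. A sketch of that idea; the spider name, start URL and regex are assumptions:

```python
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Strip any trailing post number / ?page= so every variant of a thread
# collapses to one URL before the duplicate filter sees it.
THREAD_RE = re.compile(r"^(https://hashtalk\.org/topic/\d+/[^/?#]+)")

def canonicalize(url):
    m = THREAD_RE.match(url)
    return m.group(1) + "/" if m else url

class HashtalkSpider(CrawlSpider):
    name = "hashtalk"                      # hypothetical spider name
    allowed_domains = ["hashtalk.org"]
    start_urls = ["https://hashtalk.org/"]

    rules = (
        # Thread links get canonicalized, so each thread is requested once per crawl
        Rule(LinkExtractor(allow=r"/topic/\d+/", process_value=canonicalize),
             callback="parse_thread", follow=True),
    )

    def parse_thread(self, response):
        # Generate the per-page URLs separately (see the earlier sketch) and
        # feed those to the archiver instead of letting the crawler walk them.
        yield {"thread": response.url}
```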
Anyone who can help, I'll give you 1 XPY in a month.
(This will of course be edited out as soon as someone helps.)