Those numbers still don't add up. Less than one post per thread?
Sorry, I keep instinctively writing 'post' instead of 'thread'. Just %s/post/thread/g if you know what I mean
So basically, my scraper is built in terms of topics. It takes a range of topic numbers to scrape, and works through each of them one by one, continuously clicking on the Next Page button in the process until it's on the last page.
Like this, I can scrape 20 posts at once per page load. Though I really wish I could display more of them - that would make the process considerably faster.
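A minimal sketch of that loop, for anyone who wants to picture it. This is not the actual scraper code: `fetch_page`, `scrape_topic`, and `scrape_range` are hypothetical names, and the page is modeled as a plain tuple rather than a real browser session.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page model: (posts on this page, whether a Next Page button exists).
# None stands for a topic that does not exist, was deleted, quarantined, or nuked.
Page = Optional[Tuple[List, bool]]


def scrape_topic(topic_id: int, fetch_page: Callable[[int, int], Page]) -> List:
    """Collect every post in one topic by following Next Page to the last page."""
    posts: List = []
    page_no = 1
    while True:
        page = fetch_page(topic_id, page_no)
        if page is None:
            # Missing topic: costs a single page load and yields nothing.
            break
        page_posts, has_next = page
        posts.extend(page_posts)
        if not has_next:
            # No Next Page button, so this was the last page.
            break
        page_no += 1
    return posts


def scrape_range(
    first: int, last: int, fetch_page: Callable[[int, int], Page]
) -> Iterator[Tuple[int, List]]:
    """Work through a range of topic numbers one by one (inclusive on both ends)."""
    for topic_id in range(first, last + 1):
        yield topic_id, scrape_topic(topic_id, fetch_page)
```

Passing `fetch_page` in as a parameter keeps the loop testable without touching the real site; the actual page loading and rate limiting would live inside that function.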
About 1/3 or so of the topics I scraped don't exist, are deleted, quarantined, or nuked for whatever reason, so the parser runs very fast in those cases. But the ones that do exist usually have only 1-2 pages per topic, and that is where the scraper spends the majority of its running time.
Occasionally I come across 10+ or 50+ page topics. But just having a lot of pages is not going to pull the average number of scraped topics/day down unless there are hundreds of pages in the topic. Due to rate limiting constraints, the scraper can only navigate to about 2.5k pages per day, including the 'Not Found' pages.
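To put rough numbers on that, here is a back-of-the-envelope estimate. The 2.5k page budget and the 1/3 missing fraction come from above; the 1.5 pages per existing topic is just an assumed midpoint of "1-2 pages", and `topics_per_day` is a made-up helper name.

```python
def topics_per_day(page_budget: float, missing_fraction: float,
                   avg_pages_existing: float) -> float:
    """Estimate topics scraped per day from a daily page-load budget.

    A missing topic costs one page load (the 'Not Found' page); an existing
    topic costs avg_pages_existing page loads on average.
    """
    pages_per_topic = (missing_fraction * 1
                       + (1 - missing_fraction) * avg_pages_existing)
    return page_budget / pages_per_topic


# Assumed inputs: 2500 pages/day, 1/3 of topics missing, 1.5 pages per real topic.
estimate = topics_per_day(2500, 1 / 3, 1.5)
print(round(estimate))  # about 1875 topics/day under these assumptions
```

Under the same assumptions, a single 50-page topic uses up 50 of the 2,500 daily page loads, which is about 2% of the budget. That is why an occasional long topic barely moves the daily average, while one with hundreds of pages does.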
Here is a typical scrape in progress (this is only the log file; I actually built a whole user interface around it):