This will create a shitload of RSS files, so be sure to run it in its own subdirectory, e.g. ~/hashtalkscambullshit/
Then paste in the following:
# If your name is Homero, better hire the nation's smartest developers and most feared lawyers to help you with this... we all know you have trouble with computers.
import os, time

def getstart():
    # int() keeps this working on Python 3 too, where input() returns a string
    startfrom = int(input('What post to start from? enter 0 to start from beginning: '))
    endon = int(input('What post to finish on? enter something like 40000 to get the whole site: '))
    runloop(startfrom, endon)

def runloop(i, x):
    currentlyat = 0
    pauseat = 10
    # the loop below includes both endpoints, hence the + 1
    total = x - i + 1
    while i <= x:
        if currentlyat == pauseat:
            # every 10 threads: show progress and pause for a second
            os.system('clear')
            print("Always Profitable! Hang on, grabbing threads %s - %s.... Hopefully PayCoin doesn't reach $0 before we're finished!" % (i, i + 10))
            pauseat = pauseat + 10
            time.sleep(1)
        else:
            # one %s, one value (passing (i, i) here raises a TypeError)
            os.system('wget -b -a htwgetlog --no-check-certificate -q --show-progress https://hashtalk.org/topic/%s.rss > /dev/null' % i)
            i = i + 1
            currentlyat = currentlyat + 1
    os.system('clear')
    print("%s threads downloaded." % total)
    print("We're done! Enjoy searching the scam database.")

getstart()
Then execute it.
It'll ask you which post to start from (0 if you're starting from the beginning, or another number if you're resuming a previous scrape).
Then it'll ask which post to stop at. I don't know what the highest known post currently is; I haven't bothered to check.
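If you're resuming and can't remember where you left off, something like this might help. It's only a rough sketch: it assumes wget saved each thread under the last part of its URL (so 123.rss for topic 123), and next_start is a name I made up, not part of the script above.

```python
import os
import re

def next_start(directory="."):
    """Guess the post number to resume from, based on the N.rss files present."""
    pattern = re.compile(r"^(\d+)\.rss$")  # ignores the wget log and copies like 17.rss.1
    numbers = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            numbers.append(int(match.group(1)))
    # Nothing downloaded yet: start from 0, same as the prompt suggests.
    return max(numbers) + 1 if numbers else 0
```

Run print(next_start()) from inside the scrape directory and feed that number to the first prompt. Since wget runs in the background, the last few files may be incomplete, so starting a little earlier than the number it reports is probably safer.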
Oh, what's that you say? You're only interested in Homero's sales pitches? No problem!
wget https://hashtalk.org/user/mrceo/topics.rss
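Once you have topics.rss, a quick way to list what's in it is sketched below. This assumes the feed is ordinary RSS 2.0 with item, title and link elements, which I haven't verified for this site. list_topics and the SAMPLE feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

def list_topics(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    topics = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        topics.append((title.strip(), link.strip()))
    return topics

# Made-up sample standing in for topics.rss; the real feed's contents will differ.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>demo</title>
<item><title>First pitch</title><link>https://example.invalid/topic/1</link></item>
<item><title>Second pitch</title><link>https://example.invalid/topic/2</link></item>
</channel></rss>"""

for title, link in list_topics(SAMPLE):
    print("%s  %s" % (title, link))
```

For the real file, pass in open("topics.rss", "rb").read() instead of SAMPLE.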
bump...
Updated to reflect that there are only about 37000 posts total, so set your max to ~38000 to catch everything.
TypeError: not all arguments converted during string formatting
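For anyone else hitting that: Python raises this error when a format string has fewer %s placeholders than values handed to it, for example one %s with a two-item tuple like (i, i). A minimal reproduction, with the template and i chosen just for the demo:

```python
template = "https://hashtalk.org/topic/%s.rss"
i = 7

try:
    template % (i, i)   # two values for a single %s
except TypeError as err:
    message = str(err)
    print("TypeError: %s" % message)

# One placeholder, one value: this formats cleanly.
fixed = template % i
print(fixed)
```

So the fix is to pass exactly as many values as there are %s in the string.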