Bitcointalk.org HTTP timeout problem

Nagle

legendary

Activity: 1204

Merit: 1002

Quote from: theymos on October 15, 2014, 06:44:54 PM

You're not allowed to access the forum more frequently than once per second on average.

I think that each time you send a request to https://bitcointalk.org/, you also download all of the image/CSS assets. Doing this several times puts you over the burst limit. I see this sort of limit being reached by your IP in the logs.

All it takes is two requests to be blacklisted for a minute or so. This is only true for HTTP requests. You can do HTTPS requests rapidly without penalty, so it's a pointless feature. All you get from an HTTP request for this site is a redirect to the HTTPS URL, anyway. Whatever generates the redirects is doing the useless blacklisting.
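Since the HTTP URL only returns a redirect to the HTTPS one, a client can skip the plain-HTTP request entirely by rewriting the scheme first. A minimal Python 3 sketch of that idea follows; `to_https` is a hypothetical helper name, not part of the crawler discussed here, and it does not handle URLs with an explicit port.

```python
from urllib.parse import urlsplit, urlunsplit

def to_https(url):
    """Return url with an "http" scheme rewritten to "https".

    Hypothetical helper for illustration. Other schemes are left alone,
    and an explicit port such as ":80" is NOT adjusted.
    """
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)

print(to_https("http://bitcointalk.org"))
```

This only sidesteps the limiter on the HTTP side; it assumes, as described above, that HTTPS requests are not penalized.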

"wget" won't download the image and CSS assets at all. It just fetches a file; it doesn't parse it like a browser.

The crawler program does this because it tries "bitcointalk.org" and "www.bitcointalk.org", both with HTTP and HTTPS requests, to figure out what redirects to what. It doesn't read all of the page; it just opens the URL, reads the HTTP header, and closes. Once it's done that, it reads the home page in its entirety. That fails, because of the strange blacklisting mechanism. None of this reads the CSS or images.
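To make the probing pattern concrete, here is a rough Python 3 sketch of the four URLs such a check would try. `candidate_urls` is a made-up name for illustration, and the ordering is an assumption, not the crawler's actual order.

```python
from itertools import product

def candidate_urls(domain):
    """List the four front-door variants: http/https crossed with bare/www host.

    Hypothetical helper; the real crawler's probe order is not given in the thread.
    """
    hosts = (domain, "www." + domain)
    schemes = ("http", "https")
    return ["%s://%s" % (scheme, host) for scheme, host in product(schemes, hosts)]

urls = candidate_urls("bitcointalk.org")
# Count how many of the probes go over plain HTTP.
http_count = sum(1 for u in urls if u.startswith("http://"))
print(urls, http_count)
```

Two of the four probes are plain HTTP, so if both count against the same limit, the probing step alone would use up the two requests described above before the home page is fetched.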

What's doing that? Simple Machines? Some firewall? If this is a generic problem with other sites, I'd like to know about it.

Vortex20000

hero member

Activity: 504

Merit: 500

sucker got hacked and screwed --Toad

@OP, maybe just do requests once per minute?

theymos

administrator

Activity: 5222

Merit: 13032

You're not allowed to access the forum more frequently than once per second on average.

I think that each time you send a request to https://bitcointalk.org/, you also download all of the image/CSS assets. Doing this several times puts you over the burst limit. I see this sort of limit being reached by your IP in the logs.
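For anyone scripting against the forum, one way to stay under a once-per-second limit is to space requests on the client side. Below is a minimal Python 3 sketch, assuming a strict minimum gap between calls. `Throttle` is a hypothetical helper, and strict spacing is more conservative than an "on average" limit with a burst allowance, whose exact size is not stated here.

```python
import time

class Throttle:
    """Space calls at least min_interval seconds apart (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last = None        # time at which the previous wait() returned

    def wait(self):
        """Block until min_interval has elapsed since the previous wait()."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Usage would be to create one `Throttle(1.0)` per site and call `wait()` immediately before each request to that site.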

Nagle

legendary

Activity: 1204

Merit: 1002

Here's an interesting bug, which may be related to some DDOS-prevention tool on "bitcointalk.org". Our SiteTruth site rating system keeps reporting that "bitcointalk.org" has no web site. This is because, if you make certain HTTP requests more than twice to "bitcointalk.org", the site blocks you for a minute. At the bottom of this post is a Python 2.7 program you can use to demonstrate this. The output of the program looks like this:


>\python27\python timeoutbugtest2.py
Try 0:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Opened OK.
Try 1:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Opened OK.
Try 2:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Open FAILED: ('http://bitcointalk.org', u'HTTP error - timed out.')
Waiting 60 seconds before retry.
Try 3:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Opened OK.
Try 4:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Opened OK.
Try 5:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Open FAILED: ('http://bitcointalk.org', u'HTTP error - timed out.')
Waiting 60 seconds before retry.
Try 6:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Opened OK.
Try 7:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Opened OK.
Try 8:
MyURLOpener opening request: http://bitcointalk.org [('Accept', '*/*'), ('User-agent', 'SiteTruth.com site rating system')]
 Open FAILED: ('http://bitcointalk.org', u'HTTP error - timed out.')
Waiting 60 seconds before retry.

This continues indefinitely - two successful opens, then a timeout, wait 1 minute, repeat.

It's not clear what sets this off. Browsers don't seem to trigger it. Our site rating system does, though. When it starts rating a site, it makes a few requests ("example.com", "www.example.com", an HTTPS request, etc., checking for redirects and trying to find the front door to the site.) That's enough to trigger this.

Anyone associated with the site know what's going on, and what's in the path to the site? This could be some load-balancer or firewall problem.

How do you contact the people behind "bitcointalk", anyway?

The code:


#
#   Test for SiteTruth URL timeout bug.
#
import socket                       # needed for socket.gaierror raised in open()
import urllib2
import time

kuseragent = "SiteTruth.com site rating system"    # USER-AGENT sent when crawling
kdefaultsockettimeout = 15.0        # socket timeout, in seconds

#   Class InfoException  --  used for exceptions related to a page or URL
#
#   Usage:   InfoException(url, message)
#
class InfoException(Exception) :
   "Information from external website was not as expected"
   def __init__(self, *args) :            # Initializer
      self.url = args[0]               # save troubled URL
      self.errmsg = unicode(args[1])      # save problem
      Exception.__init__(self,args)      # initialize parent

   def __unicode__(self) :               # convert to string
      msg = u'Problem with page "%s": %s.' % (self.url, self.errmsg)
      return(msg)


def open(purl) :
        try:                                        # catch only "Unicode error" in URL
            headers = { "User-agent" : kuseragent }  # set our user agent
            req = urllib2.Request(purl, None, headers)      # build request
            #    Workaround for Coyote Point load-balancer bug.
            #    If the last field is User-agent, and it ends with "m" but doesn't otherwise contain "m",
            #    a Coyote Point load balancer will drop the packet.  So we add an extra header
            #    that really isn't necessary.
            req.add_header('Accept', '*/*')         # add unnecessary header
            print("MyURLOpener opening request: %s %s" % (purl, repr(req.header_items())))   ## ***TEMP***
            result = urllib2.urlopen(req, None, kdefaultsockettimeout)        # do the open
        except UnicodeError:                        # bad domain name syntax in Unicode format
            raise socket.gaierror("Syntax error in domain name")    # treat as get-address-error error
        except urllib2.HTTPError as message :
            raise InfoException(purl, u'HTTP error - %s.' % (unicode(message.code)))
        except urllib2.URLError as message :
            message = getattr(message,'reason',message)        # use "reason" attribute if available
            raise InfoException(purl, 'HTTP error - %s.' % (unicode(message)))
        return(result)                              # return result of open
        
#
#   Main program
#   
def main() :
    retrydelay = 60                                 # wait 60 seconds before retry
    for tries in range(100) :
        print("Try %d:" % (tries,))
        try :
            fd = open("http://bitcointalk.org")                  # URL causing problem
            print(" Opened OK.")
            fd.close()
        except (InfoException,EnvironmentError,) as message:
            print(" Open FAILED: %s " % (message,))
            print("Waiting %d seconds before retry." % (retrydelay,))
            time.sleep(retrydelay)
    
main()
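
The program above is Python 2.7 (`urllib2`, `unicode`). For anyone who wants to repeat the test on Python 3, here is a rough sketch of the same loop. It is not the original SiteTruth code: `fetch` and `probe` are hypothetical names, and the opener and sleep function are passed in as parameters so the loop can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request

USER_AGENT = "SiteTruth.com site rating system"   # same User-agent string as the original
SOCKET_TIMEOUT = 15.0                             # seconds, as in the original

def fetch(url):
    """Open url, return the HTTP status, and close without reading the body."""
    headers = {"User-agent": USER_AGENT, "Accept": "*/*"}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=SOCKET_TIMEOUT) as resp:
        return resp.status

def probe(url, tries, retry_delay=60, fetch=fetch, sleep=time.sleep):
    """Call fetch(url) `tries` times, sleeping retry_delay seconds after each failure.

    Returns a list of booleans, True for each successful open.
    """
    results = []
    for n in range(tries):
        try:
            fetch(url)
            results.append(True)
        except OSError as err:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            print("Try %d FAILED: %s" % (n, err))
            results.append(False)
            sleep(retry_delay)
    return results
```

Calling `probe("http://bitcointalk.org", 9)` would repeat the experiment. If the behavior described in this thread still applies, the result would follow the same two-successes-then-timeout pattern, but that is not something this sketch can confirm.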
