Hi,
It was a long weekend last week here in Australia, and I was too broke from buying Bitcoins to go out, so I embarked on a project to create an unofficial
Bitcointalk.org REST API.
I will give a short description of the API, then there are some links you can visit which show the API in action.
The API is JSON only, and currently lets you list categories, boards, topics, and posts. Future work may include adding user profiles also. The API only lets you
read from Bitcoin Talk; there is currently no way to post, edit posts and so on. The API is hosted on Google App Engine (GAE). NB: Occasionally an HTTP request to GAE apps will take a long time to respond if it has to spin up a new instance.
If you browse the API you may discover boards and topics which don't appear to have any content - that's due to my 'lazy scraping' pattern detailed below.
A small note about viewing the API from your web browser: I'm using a little trick I learned from some of my old colleagues while developing APIs - the API checks the "Accept" header to detect whether you are requesting the content from your browser or from JavaScript. If you are in your browser then the API returns pretty-printed HTML and turns relevant properties into links, so you can follow them. If you request from JavaScript then plain JSON objects are returned, with no pretty-printing and no links. This lets you browse the API easily without a REST client.
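To make the "Accept" header trick concrete, here is a minimal sketch in Python (the real API is written in Java, so this is not the actual code). The function names, the example resource, and the rule that any string property starting with "/" is a link are all assumptions made for illustration.

```python
import html
import json
import re

# Assumption: link-valued properties are strings that start with "/".
# After html.escape(), JSON double quotes appear as &quot;.
LINK_VALUE = re.compile(r'&quot;(/[^&\s]*)&quot;')


def wants_html(accept_header):
    """Return True if the Accept header explicitly lists text/html."""
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    return "text/html" in media_types


def render(resource, accept_header):
    """Return (content_type, body) for a resource given as a dict."""
    if wants_html(accept_header):
        # Browser: pretty-print, escape, then turn link values into anchors.
        pretty = html.escape(json.dumps(resource, indent=2))
        linked = LINK_VALUE.sub(r'&quot;<a href="\1">\1</a>&quot;', pretty)
        return "text/html", "<pre>" + linked + "</pre>"
    # Script client: compact JSON, no links.
    return "application/json", json.dumps(resource, separators=(",", ":"))


if __name__ == "__main__":
    board = {"id": 1, "topics": "/boards/1/topics"}  # hypothetical resource
    print(render(board, "text/html,application/xhtml+xml,*/*;q=0.8")[1])
    print(render(board, "application/json")[1])
```

A browser navigating to a URL normally sends an Accept header that lists text/html, while a script asking for application/json (or just */*) falls through to the compact JSON branch.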
The API works by using a screen scraper on the WAP version of the forums, as suggested by theymos in
this post. The screen scraper is in its own project, separate from the API itself.
The overall design is to avoid scraping old topics, and instead focus on the newest content. The scraper can’t download the entire forum at once, so it uses a ‘lazy download’ or ‘lazy refresh’ approach to scraping content - if a topic or board is requested and it was last scraped more than a certain amount of time ago*, then a task will be added to a task queue for it to be re-scraped. That means if nobody is using the API then no tasks go in the queue, and if lots of people are using the API then lots of tasks go in the queue. The task queue will then eventually be filled up by many requests to scrape many boards and topics, and I can adjust the rate of the execution of these tasks based on how much theymos yells at me. If an extremely large topic is encountered (like the Wall Observer thread in Speculation) then only the first two pages and the last two pages will be downloaded, to avoid generating too many requests.
* Current ‘freshness’ of boards is 5 minutes and freshness of topics is 1 minute. If you’re making a client then I recommend simply hard-coding the list of categories and boards since it’s not likely to change that much.
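The lazy refresh idea can be sketched roughly as follows, in Python rather than the Java and GAE task queue the real scraper uses. The class and function names are hypothetical, the "don't queue the same item twice" check is my own assumption, and so is the page count at which a topic counts as "extremely large". Only the 5 minute / 1 minute freshness values and the "first two and last two pages" rule come from the description above.

```python
import time
from collections import deque

BOARD_FRESHNESS_SECONDS = 5 * 60  # from the post: boards stay fresh for 5 minutes
TOPIC_FRESHNESS_SECONDS = 60      # from the post: topics stay fresh for 1 minute


class LazyScrapeScheduler:
    """Queues a re-scrape only when somebody requests stale content."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.last_scraped = {}  # (kind, item_id) -> time of last scrape
        self.queue = deque()    # pending scrape tasks, oldest first
        self.pending = set()    # assumption: avoid queueing duplicates

    def on_request(self, kind, item_id):
        """Call on every API request; returns True if a task was queued."""
        key = (kind, item_id)
        max_age = (BOARD_FRESHNESS_SECONDS if kind == "board"
                   else TOPIC_FRESHNESS_SECONDS)
        last = self.last_scraped.get(key)
        # Never-scraped items count as stale too.
        stale = last is None or self.clock() - last > max_age
        if stale and key not in self.pending:
            self.queue.append(key)
            self.pending.add(key)
            return True
        return False

    def run_next(self, scrape):
        """Worker side: run one queued task using the scrape callback."""
        if not self.queue:
            return False
        key = self.queue.popleft()
        scrape(*key)
        self.pending.discard(key)
        self.last_scraped[key] = self.clock()
        return True


def pages_to_scrape(total_pages, edge_pages=2, large_topic_pages=50):
    """Pick page numbers; large_topic_pages is a made-up threshold."""
    if total_pages <= large_topic_pages:
        return list(range(1, total_pages + 1))
    first = list(range(1, edge_pages + 1))
    last = list(range(total_pages - edge_pages + 1, total_pages + 1))
    return first + last
```

The rate at which something calls run_next is the knob for how hard the forum gets hit: requests only add tasks, and the worker drains them at whatever pace is acceptable.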
Limitations:
- Fixed number of posts in a page (20, same as these forums)
- Currently JSON only
- There isn't any time/date information on posts - that's because I'm using the WAP version of the forums and the date isn't displayed.
- Similar to the point above, the formatting for the BBCode (quote) feature is lost and it just comes out as plain text like 'Quote: ...'
Code is open source at GitHub (Java):
BitcoinTalkScraper and
BitcoinTalkAPI
Stay tuned this week for news on my unofficial Bitcointalk mobile/tablet forum client.
I am doing this totally for the love of Bitcoin and these forums, and everything is free, but if you are feeling really generous then please tip me at 17SbWcyRoZd7u1tZeJtjzm834a3gAHdf2A. Thanks!