Pages:
Author

Topic: Discussion for MtGox trade data downloader - page 6. (Read 14422 times)

member
Activity: 70
Merit: 10
Nitrous & MagicalTux - I really appreciate your effort in getting this data source available on big query.

I ran a query to pull all trades since 1 Jan 2013 and it looks complete up until 23 May 2013. However, I've also observed some gaps in the data where periods of hours or days do not have any trades (eg between 1367430878865520 & 1367547410438010). Yet there are 60K trades during that period in the data source, so my code has tripped up somewhere.

I just wanted to mention it in case it relates to the WRITE_TRUNCATE method to derive the sorted table. FWIW I queried the trades_raw table as a trades_sorted table wasn't available yet. I imagine that once you do have the sorted table then trades since 23 May will start being uploaded?

In the meantime I will use the python tool from Nitrous to get the complete dataset.
sr. member
Activity: 246
Merit: 250
I have 2 weeks of exams coming up followed by a family holiday, so I won't be able to put in too much effort.

Enjoy your family holiday  Smiley

Thank you Smiley



I just heard back from Jordan that copying does respect the order, so option 1 should definitely be plausible. Creating a table in a single operation should also result in stable order access, so option 2 is also possible. The only uncertainty I have is whether the table needs to be deleted and recreated to 'reset' the coalesce operation, or whether just using a WRITE_TRUNCATE will be sufficient. Either way we should be able to do this quite easily now Smiley

Update: Perfect! WRITE_TRUNCATE will be sufficient, so it should be quite easy to get this going.
sr. member
Activity: 378
Merit: 250
Born to chew bubble gum and kick ass
I have 2 weeks of exams coming up followed by a family holiday, so I won't be able to put in too much effort.

Enjoy your family holiday  Smiley
sr. member
Activity: 246
Merit: 250
There are a couple of solutions I've already thought of, each with pros/cons, so I think it should be resolved soon.

Can you share your thoughts on possible solutions or is it too early?

Update: added fifth option (see the second row, using destination table)
Update 2: assuming that option 2 is feasible, MagicalTux has confirmed he will implement it Smiley

Possible solutionAssumptionsProsCons
There are two tables, trades1 and trades2. When it is time to update, MtGox appends to trades1, then copies this table to trades2. Next time, they append to trades2 and copy to trades1. Both tables are identical, and the tool can download from trades1 every 10 minutesThat copy operations preserve row order and prevent coalesce operationsFree, simple, quick, tool continues to work as it does now -- basically the ideal solutionNone
There are two tables, trades and trades_sorted. When it is time to update, MtGox appends to trades then performs a "SELECT * FROM trades ORDER BY Money_Trade__ ASC" using trades_sorted as the destination table (and using WRITE_TRUNCATE mode)That the sorted table will be guaranteed to remain sorted, and won't be coalescedTool continues to work as usual (switched to download from trades_sorted)MtGox must pay up to $60/month to implement this
Hybrid solution. The tool checks whether the trades table has been coalesced since last updated. If yes, then it downloads by SQL query, otherwise it downloads using the normal tabledata:list methodThat it is possible to check for coalescence, that checking is freeSince coalescence only occurs about every 2 days at most frequent, the usage should remain free, downloading from a query operation could be slightly quickerMore complex tool (only slightly though), query operations are slow (can take ~100s to sort the entire table)
Google makes the coalesce operation order-preservingThat changing the coalesce behaviour is easy and quickNo changes to toolAssuming that Google will change the way their service works just for our benefit
Google implements automatic table sorting for tables, and MtGox selects "Money_Trade__ ASC, Primary DESC" as the sorting orderThat Google would implement an entirely new functionalityNo changes to the tool, functionality could be useful for many people and other bigquery usersA lot to ask from Google, moreso than the last solution

As you can see, the first two rely on possible pre-existing behaviour, and the last two rely on Google implementing new behaviour, so the first two are more ideal. Unfortunately, however, Jordan and my timezones intersect at awkward times so correspondence can be a bit slow.



Update to the development:

The basic underlying tool mechanism is pretty ready, todos are:
  • Make the tool more reliable and not at risk of corruption should bq return unexpected results
  • Create a graphical interface
  • Create documentation (if necessary)
  • Test out different python packagers for creating apps/exes to see which works with the necessary dependencies, and release self-contained apps for different platforms (windows, mac), as well as an archived distribution for linux/mac
  • Take feedback and make any necessary improvements

I have 2 weeks of exams coming up followed by a family holiday, so I won't be able to put in too much effort. Any python programmers though can feel free to fork my repo and work on these if you want Smiley
sr. member
Activity: 378
Merit: 250
Born to chew bubble gum and kick ass
There are a couple of solutions I've already thought of, each with pros/cons, so I think it should be resolved soon.

Can you share your thoughts on possible solutions or is it too early?
sr. member
Activity: 246
Merit: 250
Hi Loozik - the bq database is not updated every time a transaction occurs. Instead, MtGox will upload all new trades that occurred between the last update and then, about every 10mins-1hour. If there are trades occurring 300 times per second, then these updates would insert each of the new 180k-1.8m trades, so there's no problem there. Also note that MtGox does use microsecond timestamps (see the Money_Trade__ field).

If there is no resolution to the problem, then up to $60/month would be incurred by the users of the service, but this is only if the user were to update their local dump every 10 minutes. If they did so every 4 hours or less, it would be within the Google bq monthly quota, and would be free. I understand the usual forex policy for API access, but I'd like to avoid any changes that would charge for access.

There are a couple of solutions I've already thought of, each with pros/cons, so I think it should be resolved soon.
sr. member
Activity: 378
Merit: 250
Born to chew bubble gum and kick ass
After talking with Jordan Tigani from Google's bigquery project, it seems that this tool may stop working at some point after MtGox starts regular updates, because Google occasionally performs coalesce operations that don't respect the order of rows in a table.

I read your discussion with Jordan Tigani (I do not understand it very well)

Otherwise it will end up costing around $60/month to keep local copies regularly updated because `order by` queries will need to be run for each update instead of a simple (and free) tabledata:list. Running SQL queries would also slow down the tool speed.

Is $60/month the cost to be incurred by MtGox / you or by every single user of this service?

If bigquery doesn't implement some kind of row order permanence, then I'll ask MagicalTux to consider doing this officially.

Couldn't MtGox just invest in servers / rent a data centre and pay a programmer like you to make a dedicated service / tailor made service instead of spending many hours talking to Google (without a guarantee for a success) - time is money.

Then MtGox can simply provide data API service to customers who will pay e.g. BTC 0.5 per month - it is a standard practice in financial industry that data API is provided to users (i) for a fee or (ii) in exchange for maintaining a certain balance in the account (e.g. if you want API from Dukascopy forex broker you must have a balance of at least $100k)
sr. member
Activity: 246
Merit: 250
Maybe there is an installation doc I missed?

Hi again, I just updated the pykrete docs to be a lot more complete, and they now include some simple installation instructions Smiley
sr. member
Activity: 246
Merit: 250
After talking with Jordan Tigani from Google's bigquery project, it seems that this tool may stop working at some point after MtGox starts regular updates, because Google occasionally performs coalesce operations that don't respect the order of rows in a table. The initial download should always work properly, but if a coalesce occurs, then future updates may corrupt the local dump. Bigqeury do have plans to implement some way of respecting order, but it is not yet on the tables. Perhaps if we made a case to Google about the importance of this we could reach some kind of solution? Otherwise it will end up costing around $60/month to keep local copies regularly updated because `order by` queries will need to be run for each update instead of a simple (and free) tabledata:list. Running SQL queries would also slow down the tool speed.

The only other option I can envisage is someone performing a `SELECT * FROM [mt-gox:mtgox.trades] ORDER BY Money_Trade__ ASC` query every time the database is updated, saving it to a table, and then the tool would download from this table instead. If bigquery doesn't implement some kind of row order permanence, then I'll ask MagicalTux to consider doing this officially.
sr. member
Activity: 246
Merit: 250
Awesome work.

I am currently doing some research on how to get historical data from different exchanges to enable backtesting features. I want to use your tool to download everything into a SQLite db so that my bot can read this data. However I am running into trouble installing your tool:

  • I have installed python 2.7 (requirement of Google API client library)
  • I have downloaded the Google API client library and updated the clients_secrets and its in folder A
  • I have downloaded bq and pykrete in folder B and C

Do I need to point bq to pykrete and the Google library or do I need to put certain files at certain places? Maybe there is an installation doc I missed?

Sorry I wasn't more clear on this. To install a python package, you need to run
Code:
sudo python ./setup.py install
from the directory using the console/terminal/command prompt. I'm assuming you're on linux or mac here, if you're on windows omit the `sudo` part. Once you've installed pykrete and the google library, you should then be able to run mtgox.py and it should start the download Smiley It should also be able to resume previous downloads if you cancel it.

EDIT: When the tool is more complete and is ready to be packaged up into a self-contained GUI app, all these dependency issues should be gone as they will be built-in Smiley My first quick attempt to do this using pyinstaller didn't work, which is slightly ominous, but hopefully other tools such as py2exe and py2app will be able to handle the google library properly.
sr. member
Activity: 287
Merit: 250
Awesome work.

I am currently doing some research on how to get historical data from different exchanges to enable backtesting features. I want to use your tool to download everything into a SQLite db so that my bot can read this data. However I am running into trouble installing your tool:

  • I have installed python 2.7 (requirement of Google API client library)
  • I have downloaded the Google API client library and updated the clients_secrets and its in folder A
  • I have downloaded bq and pykrete in folder B and C

Do I need to point bq to pykrete and the Google library or do I need to put certain files at certain places? Maybe there is an installation doc I missed?
sr. member
Activity: 378
Merit: 250
Born to chew bubble gum and kick ass
I vote for ''download a complete dump then process''

Thanks
sr. member
Activity: 246
Merit: 250
N.B. On my ~8meg connection, the full download took around 2.5 hours, and resulted in a 457mb db
sr. member
Activity: 246
Merit: 250
Hello,

As you may know, I am developing a tool to download data from MtGox's bigquery database (see here for info on this db). The tool is being written in python, but hopefully I will be able to also release it as a multi-platform self-contained app for non-programmers as well. I also plan on supporting several different formats, e.g. phantomcircuit's sqlite, csv, full sqlite dump, among others (it will also be easy for anyone who can program to implement their own).

The problem is that bigquery can be quite slow to sort this dataset, and downloading (and maintaining) the data using these ordered queries can quickly use up the free bigquery limits, so it is only really feasible to obtain the raw table data. This is good because it's quick and uses up 0 bytes of processing quota, however it also comes in unordered. For database formats this doesn't matter, but for formats like CSV, it's not really desirable to have records all over the place.

  • To solve this, I propose that the tool will download a full and complete sqlite dump of the data, which can then be used on its own or transcribed to other formats (in order), e.g. CSV, phantomcircuit's format, etc. Unfortunately, however, this will result in more memory usage as you will probably have at least 2 copies of the data. I estimate that the full dump will be about 460 mb, and other formats will depend on what data is  included and in what format. This does have the benefit, however, that you will always have an up-to-date copy of the bq database, and so can generate whatever format you need, even if you hadn't anticipated it.
  • The alternative is to process it while it downloads, however this will mean that you will be locked into the format you specify, and it will be in the order of the bigquery database, so the CSV format, for example, would be difficult to make effective use of.

As you can see, there are two options with both advantages and disadvantages. Please select which option you would prefer in the poll above. It would also be good to hear your thoughts/discussion on this tool.

In the meantime, you can download the partially complete tool which is capable of downloading a complete dump to a sqlite3 database here: https://bitbucket.org/nitrous/bq. Its dependencies are my sqlite class (pykrete, here) and the google python client library here
Pages:
Jump to: