Topic: Discussion for MtGox trade data downloader - page 2. (Read 14411 times)

sr. member
Activity: 246
Merit: 250
December 29, 2013, 04:09:07 PM
#94
The (Windows version of the) tool crashes after downloading about 100 MB of data, and the crash occurs every time at the same spot.

I have tried several times; it doesn't matter if I'm downloading a new dump or just trying to resume one.

Code:
Rows downloaded: 1024000
Latest TID: 1366547913512370
Data up to: 2013-04-21 12:38:33

Update in progress - 5074502 rows to download


A few other people seem to be having this problem (see the last couple of pages of this thread). I'm not entirely sure why it's happening, but I think Google has subtly changed their protocols and it has broken the tool. As a result, I've created some full dumps, including data all the way up to mid-December 2013, here.
sr. member
Activity: 458
Merit: 250
beast at work
December 29, 2013, 03:12:52 PM
#93
The (Windows version of the) tool crashes after downloading about 100 MB of data, and the crash occurs every time at the same spot.

I have tried several times; it doesn't matter if I'm downloading a new dump or just trying to resume one.

Code:
Rows downloaded: 1024000
Latest TID: 1366547913512370
Data up to: 2013-04-21 12:38:33

Update in progress - 5074502 rows to download
sr. member
Activity: 246
Merit: 250
December 22, 2013, 08:32:06 PM
#92
Quote
The RFC should really standardise the CSV number format, as this wasn't very obvious. Anyway, I'm glad you've fixed the problem now, and hopefully this will be useful for other people

Another thing that is logical in hindsight but a bit confusing: Excel automatically changes the "." to a "," as the decimal separator when you do it as described above, because of the country setting (in Europe "," is used to separate decimals). If you now save the file again as .csv, Excel does not turn the "," back into "."; it keeps the ",". This leads to the problem you mentioned, that Excel could confuse columns, but it is smart enough to switch the column separator to ";" instead, which is well thought out by MS. Still, one can be confused when importing into the charting software and think, "But why isn't the comma working? I saved this as Comma Separated..."

Now we have been through all of it with Excel, eh? Wink
It should all be good now, but I don't have time to check it out right now, since I have to pack my stuff quickly for "Driving home for Christmas", and there I will probably be pretty busy...

Which brings me to the point: a merry Christmas to you, nitrous, and to everyone on this board

Till later!

Thanks, Merry Christmas to you too Cheesy
BNO
full member
Activity: 157
Merit: 100
December 22, 2013, 08:46:12 AM
#91
Quote
The RFC should really standardise the CSV number format, as this wasn't very obvious. Anyway, I'm glad you've fixed the problem now, and hopefully this will be useful for other people

Another thing that is logical in hindsight but a bit confusing: Excel automatically changes the "." to a "," as the decimal separator when you do it as described above, because of the country setting (in Europe "," is used to separate decimals). If you now save the file again as .csv, Excel does not turn the "," back into "."; it keeps the ",". This leads to the problem you mentioned, that Excel could confuse columns, but it is smart enough to switch the column separator to ";" instead, which is well thought out by MS. Still, one can be confused when importing into the charting software and think, "But why isn't the comma working? I saved this as Comma Separated..."

Now we have been through all of it with Excel, eh? Wink
It should all be good now, but I don't have time to check it out right now, since I have to pack my stuff quickly for "Driving home for Christmas", and there I will probably be pretty busy...

Which brings me to the point: a merry Christmas to you, nitrous, and to everyone on this board

Till later!
sr. member
Activity: 246
Merit: 250
December 22, 2013, 07:50:53 AM
#90
Hi,

I played with this. It was my fault, sorry if I caused extra work. The explanation of what exactly the problem was might be interesting for people from Europe (like me, I'm from Germany) importing this into Excel.

The settings here have to be like this.
...

The "decimal separator" (that's how its called in German Version translated to english) has to be "."

In my defence I have to say that it wasn't obvious to me that this could have caused the error, and this was the real trick: the combination of the "1000s separator" setting and the "decimal separator" led to the "weird" effect that all numbers below 1 were shown correctly but all numbers above 1 were not. Finally realising this led me to the point where I thought: "Hmm, OK, it shouldn't be anything to do with the decimal separator since the numbers at the beginning are right, but let's check it again."

Sorry for wasting your time...

Greetings.

Hi BNO,

Haha, no, don't worry about wasting my time, that was really confusing! That makes a lot of sense now, although I'm not sure how Excel managed to distinguish between the columns considering it uses a comma for the column separator... The RFC should really standardise the CSV number format, as this wasn't very obvious. Anyway, I'm glad you've fixed the problem now, and hopefully this will be useful for other people Smiley
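
For anyone hitting the same locale issue, here is a minimal Python sketch of importing the CSV with explicit separator settings via pandas (the filenames and column names are placeholders, and using pandas at all is my own assumption, not part of the tool):

Code:
import pandas as pd

# Original export from the tool: ',' between columns, '.' as the decimal mark.
trades = pd.read_csv("mtgox_export.csv",
                     names=["date", "time", "price", "volume"],
                     sep=",", decimal=".")

# The same file after a European-locale Excel re-saves it:
# ';' between columns, ',' as the decimal mark.
trades_eu = pd.read_csv("mtgox_export_excel.csv",
                        names=["date", "time", "price", "volume"],
                        sep=";", decimal=",")

print(trades.dtypes)  # price and volume should both parse as float64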
BNO
full member
Activity: 157
Merit: 100
December 22, 2013, 07:43:08 AM
#89
Hi,

I played with this. It was my fault, sorry if I caused extra work. The explanation of what exactly the problem was might be interesting for people from Europe (like me, I'm from Germany) importing this into Excel.

The settings here have to be like this.


The "decimal separator" (that's how its called in German Version translated to english) has to be "."

In my defence I have to say that it wasn't obvious to me that this could have caused the error, and this was the real trick: the combination of the "1000s separator" setting and the "decimal separator" led to the "weird" effect that all numbers below 1 were shown correctly but all numbers above 1 were not. Finally realising this led me to the point where I thought: "Hmm, OK, it shouldn't be anything to do with the decimal separator since the numbers at the beginning are right, but let's check it again."

Sorry for wasting your time...

Greetings.
sr. member
Activity: 246
Merit: 250
December 21, 2013, 03:02:01 PM
#88
I just found this resource and I thought it would be useful to some people Smiley -- http://api.bitcoincharts.com/v1/csv/

I can't believe I didn't find it before. Anyway, it has regularly updated (every 15 minutes?) CSV files with {unix timestamp (1s resolution), price (float), volume (float)} fields for many different exchanges and currencies.
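
Here is a minimal Python sketch of reading one of those files (the filename mtgoxUSD.csv is only an example - check the directory listing for the exchange/currency you want, and decompress first if you grab a gzipped copy):

Code:
import csv
from datetime import datetime, timezone

# Each row: unix timestamp (1s resolution), price (float), volume (float).
with open("mtgoxUSD.csv", newline="") as f:
    for ts, price, volume in csv.reader(f):
        when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
        print(when, float(price), float(volume))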
sr. member
Activity: 246
Merit: 250
December 21, 2013, 02:58:27 PM
#87
Hi Nitrous,

I wanted to write to you yesterday but didn't find the time to do so...

One problem occurred in the price field when I looked at the data:
...

I remember that in the BigQuery table you had to divide the price field by 10,000. Here something seems to have gone wrong -- do you know what?

I divided the Volume field by 100,000,000 to adjust to full coins; might something similar have happened in this field too?

Bye

Edit: I just saw that the .csv your script created is fine, so it must have happened during the import into Excel. Do you have an idea what might have caused this? Huh

Hi BNO, Excel gets stranger and stranger!  Huh

I just did a test CSV export myself and the CSV is definitely fine; I'm not sure how or why Excel is mangling the data like this. Dividing one column by 1e8 should not affect any other column, especially not as inconsistently as the price column seems to be affected :S What is the exact procedure by which you imported the data into Excel?
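
For what it's worth, the scaling itself is simple enough to sanity-check outside Excel. A minimal Python sketch using the divisors mentioned in this thread (10,000 for the BigQuery price field, 100,000,000 for volume) - treat both as assumptions and verify them against a few known prices:

Code:
PRICE_DIVISOR = 10_000        # divisor for the raw integer price field, per this thread
VOLUME_DIVISOR = 100_000_000  # satoshis per full coin

def scale_row(price_int, volume_int):
    """Convert raw integer price/volume fields to floats."""
    return price_int / PRICE_DIVISOR, volume_int / VOLUME_DIVISOR

# Example: a raw volume of 2,000,000,000 satoshis is 20.0 BTC.
print(scale_row(1_170_000, 2_000_000_000))  # (117.0, 20.0)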
BNO
full member
Activity: 157
Merit: 100
December 20, 2013, 08:01:51 PM
#86
Hi Nitrous,

I wanted to write to you yesterday but didn't find the time to do so...

One problem occurred in the price field when I looked at the data:


I remember that in the BigQuery table you had to divide the price field by 10,000. Here something seems to have gone wrong -- do you know what?

I divided the Volume field by 100,000,000 to adjust to full coins; might something similar have happened in this field too?

Bye

Edit: I just saw that the .csv your script created is fine, so it must have happened during the import into Excel. Do you have an idea what might have caused this? Huh
sr. member
Activity: 246
Merit: 250
December 18, 2013, 06:07:54 PM
#85
Hi nitrous,

Since my charting software didn't handle the Date and Time as two separate columns very well, I recalculated them in another column as Date+Time, and then saved this as .csv. At first glance it looks good. But is there any problem with this approach? Should I beware of anything (like leap years and other things that could mess up the calculation)?

Quote
It doesn't make a difference to the data though; all ticks should still be present, it's just a curiosity. Of course, this discontinuity will probably have some effect on the prices around that time, so you might want to exclude a few days either side of that point, as it might mess up your backtesting.

I don't know if I got you right: you mean the data is correct, but the closing of the exchange for almost a week will have had an impact on the prices as people got scared, yet the data itself is all good?

I'm off to bed, see you tomorrow...

Hi BNO,

I don't think so -- if you calculated the date+time as just the concatenation of the two, it should be fine, as I created the columns directly from the original unix timestamp (in dump #1) using a decent conversion function. How did you compute the column? I guess, considering the weirdness from Excel before, it is possible the Excel function you used might not work well - hopefully it will be consistent though.

Yeah, anytime an exchange stops or starts, even for a few hours, the data either side will probably be a bit skewed -- both for emotional reasons (like being scared), but also mainly just as the price readjusts through arbitrage to the market price elsewhere (or at least, this is what I would expect).

Cya
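
For reference, a minimal pandas sketch of the Date+Time recombination (assuming the four-column CSV export and a placeholder filename); since the parsing starts from the text values themselves, leap years and the like are not an issue:

Code:
import pandas as pd

cols = ["date", "time", "price", "volume"]
trades = pd.read_csv("mtgox_export.csv", names=cols)

# Concatenate the two text columns and let pandas parse the result, e.g.
# "2010-07-17" + " " + "23:09:17" -> Timestamp("2010-07-17 23:09:17")
trades["datetime"] = pd.to_datetime(trades["date"] + " " + trades["time"],
                                    format="%Y-%m-%d %H:%M:%S")
print(trades[["datetime", "price", "volume"]].head())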
BNO
full member
Activity: 157
Merit: 100
December 18, 2013, 05:17:06 PM
#84
Hi nitrous,

Since my charting software didn't handle the Date and Time as two separate columns very well, I recalculated them in another column as Date+Time, and then saved this as .csv. At first glance it looks good. But is there any problem with this approach? Should I beware of anything (like leap years and other things that could mess up the calculation)?

Quote
It doesn't make a difference to the data though; all ticks should still be present, it's just a curiosity. Of course, this discontinuity will probably have some effect on the prices around that time, so you might want to exclude a few days either side of that point, as it might mess up your backtesting.

I don't know if I got you right: you mean the data is correct, but the closing of the exchange for almost a week will have had an impact on the prices as people got scared, yet the data itself is all good?

I'm off to bed, see you tomorrow...
sr. member
Activity: 246
Merit: 250
December 18, 2013, 12:39:49 PM
#83
...

Hmm, some really strange stuff happening... Firstly, the file showing up as 1.07 GB - I just realised that the tool will create a duplicate index; I should have thought about that. It's not really a problem, except that it takes a few minutes the first time you load the file up and increases the file size more than necessary; otherwise it shouldn't affect your usage.

Yeah, sorry about that - the format I was asked to support only went down to second resolution, so the microsecond resolution isn't present. In fact, microsecond resolution isn't even available for the first 218868 ticks. Unfortunately there are so many different possible formats I could export to, so I picked a few and stuck with them (although obviously the dump contains all the raw data unfiltered). If there's one you really want then I could release a new version of the tool with that supported, but if you want to manipulate the data into more formats, or do more than you can in Excel, consider playing around with Python and seeing what you can do with it Smiley

Excel is being very weird - if you notice, it's taking the minutes and seconds and converting them into a new millisecond value for some very strange reason, such as 17:48:56 -> 17:48:56:4856. There's no reason for it to do this.

Yes, the first data source is the Google BigQuery database, the second is the MtGox HTTP API, and the third is the MtGox Socket API -- basically, the socket API is used just to collect the last few trades in real time. If you're going to cut them out, then you can cut out the last few minutes of the data (half an hour to be safe), and just the one day on May 23rd (although really there shouldn't be any discrepancy, the data should be exact).

Yes, the large jump is because MtGox changed its recording format -- Money_Trade__, otherwise known as the TID/Trade ID, used to be a (mostly) sequential integer, and then became a microsecond timestamp afterwards (coinciding with that closure, I believe). It doesn't make a difference to the data though; all ticks should still be present, it's just a curiosity. Of course, this discontinuity will probably have some effect on the prices around that time, so you might want to exclude a few days either side of that point, as it might mess up your backtesting.
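
To illustrate the two TID regimes, here is a minimal Python sketch that treats new-style TIDs as microsecond unix timestamps (the 1e12 cut-off is my own heuristic, not something from the tool):

Code:
from datetime import datetime, timezone

def tid_to_time(tid):
    """Interpret a new-style TID as a microsecond unix timestamp.
    Old-style (mostly sequential) TIDs are returned unchanged."""
    if tid < 10**12:  # heuristic cut-off between the two regimes
        return tid
    secs, micros = divmod(tid, 1_000_000)
    return datetime.fromtimestamp(secs, tz=timezone.utc).replace(microsecond=micros)

print(tid_to_time(218868))            # old-style sequential ID, left as-is
print(tid_to_time(1366547913512370))  # 2013-04-21 12:38:33.512370+00:00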
BNO
full member
Activity: 157
Merit: 100
December 18, 2013, 12:09:10 PM
#82
I downloaded the 2nd file (899.2 MiB, but on my computer it shows as 1.07 GiB) and exported it to a .csv file - that worked. Great! Today I was playing a little with the data and reading your manual (very well written, kudos), but many things are still very unclear to me:

What really confuses me is the whole complex of issues surrounding the timestamps. When I convert the file to a .csv, for example, the last entry looks like this when I open the .csv in Notepad:
 2010-07-17,23:09:17,0.049510,2000000000

When I import the same .csv into Excel and change the formatting of the cells to hh:mm:ss:ms (I could not find smaller units than milliseconds in Excel), Excel shows me this as the timestamp: 23:09:17:917

Several things are not clear to me here:
1. How can Excel display something which isn't even contained in the .csv file? Did Excel just "make this up"?
2. The thread says that the data has microsecond accuracy, but when I look at the file in an editor it seems to be seconds. For example, the last 3 lines of the file are:

2013-12-17,15:47:30,715.700000,1210000
2013-12-17,15:47:30,715.700000,1000000
2013-12-17,15:47:30,715.700000,780000

3. Not so important, but maybe someone here knows / has experience with Excel displaying timestamps: as far as I can tell, Excel (I use 2010) is not able to display a resolution better than milliseconds. But some of the timestamps in my sheet have 4 digits after the seconds, e.g. 18.07.2013 17:48:56:4856
How is that even possible?

Quote
It is also possible that there are some oddities around May 23rd 2013 and around today, December 17th, purely because I collected the data from 3 sources, and those were the boundaries - I'm fairly sure that there shouldn't be a problem, but if you want to be really safe then you can avoid those two days.

If I understand you right, this file is a combination of 3 data sources. The 3 data sources are not mashed up in a day-by-day fashion but more like this:
from 07/17/2010 : Datasource1 (Mark Karpeles?)
05/23/2013: Datasource2 (API from mtgox? Bitcoincharts?)
12/17/2013: Datasource3?

So should I cut out just May 23rd and Dec 17th, or also a few days before and after?

Quote
There are a couple oddities with the Money_Trade__ values, but I don't think this will be particularly relevant to you. Otherwise, I think the data is relatively accurate.

Before I posted here I downloaded the files from Google BigQuery. I noticed then that there were some quite large jumps in the trade IDs. Are you referring to that, or could it be that things got messed up with the change of the "primary key" of the database after trade ID 218868? Probably not, right? I mean, they closed the exchange for 6 days back then to set everything up properly...
sr. member
Activity: 246
Merit: 250
December 17, 2013, 08:57:34 PM
#81
Hi,

Thanks a lot, I'm already downloading and will post when I'm done.

Some things I wanted to ask you, since you have a lot of experience with this data now.
How would you rate the data quality in terms of accuracy and "correctness"? Any known bugs or oddities? Does this data actually reflect all the trades as they went through on MtGox? Anything to consider if I want to use this data to develop a strategy (I need to read up first, to get better at this)?

The data should be an exact representation of what happened in real time, or at least a microsecond-resolution approximation. By that I mean that it is possible multiple trades were executed in the same tick, but for all intents and purposes this shouldn't affect your use of the data for strategy development.

There are a couple oddities with the Money_Trade__ values, but I don't think this will be particularly relevant to you. Otherwise, I think the data is relatively accurate.

It is also possible that there are some oddities around May 23rd 2013 and around today, December 17th, purely because I collected the data from 3 sources, and those were the boundaries - I'm fairly sure that there shouldn't be a problem, but if you want to be really safe then you can avoid those two days.

Lastly, the Primary column -- some trades are duplicated into other orderbooks, and their duplicates will be marked with their Primary field as false. It should be easy to exclude this data from your exports though (and it should be excluded by default). Alternatively, you might want to include it as I think these non-primary trades can influence other currencies, but I'm not too sure about this -- if you're not sure, just exclude it.
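
If you do end up with the non-primary rows in an export, here is a minimal Python sketch of filtering them out again (it assumes the export has a header row and a Primary column holding true/false; the filenames are placeholders):

Code:
import csv

with open("mtgox_trades.csv", newline="") as src, \
     open("mtgox_trades_primary.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Keep only rows whose Primary field is true.
        if row["Primary"].strip().lower() == "true":
            writer.writerow(row)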
BNO
full member
Activity: 157
Merit: 100
December 17, 2013, 11:54:10 AM
#80
Hi,

Thanks a lot, I'm already downloading and will post when I'm done.

Some things I wanted to ask you, since you have a lot of experience with this data now.
How would you rate the data quality in terms of accuracy and "correctness"? Any known bugs or oddities? Does this data actually reflect all the trades as they went through on MtGox? Anything to consider if I want to use this data to develop a strategy (I need to read up first, to get better at this)?
sr. member
Activity: 246
Merit: 250
December 17, 2013, 11:47:54 AM
#79
Hi nitrous,

I was just stopping by. Wanted to thank you for your answer first. I've been googling for the last 2 hours to get a bit smarter about databases and how to somehow get this done (didn't get far though Huh)

Do I understand you right that I could download the 2nd file and convert it with your tool? That would be awesome... Smiley



Hi, yep, you should be able to just load it in and export Smiley
BNO
full member
Activity: 157
Merit: 100
December 17, 2013, 11:38:12 AM
#78
Hi nitrous,

I was just stopping by. Wanted to thank you for your answer first. I've been googling for the last 2 hours to get a bit smarter about databases and how to somehow get this done (didn't get far though Huh)

Do I understand you right that I could download the 2nd file and convert it with your tool? That would be awesome... Smiley

sr. member
Activity: 246
Merit: 250
December 17, 2013, 11:10:53 AM
#77
I just tried the latest Windows binary (without status bar) and can confirm it has successfully exported 3600-second candles. I'll try some other values as well and report if I encounter any errors. Win7/64.


That's good Smiley (although I was sure that the dump I uploaded would be incompatible...) Anyway, in case it still is, I've just finished another dump, up to date as of 2013-12-17 15:48:26.308886, which should be fully compatible (the difference from the first one is that the date column is now in ISO format rather than a unix timestamp, and the primary column is [true, false] instead of [1, 0]).

Here are the two dumps:
1) https://docs.google.com/file/d/0B3hexlKVFpMpYmdoUUhXckRrT2s (2013-12-12 22:13:40.483107; 665.1 MB)
2) https://docs.google.com/file/d/0B3hexlKVFpMpdVhxd1NvQ0Fqanc (2013-12-17 15:48:26.308886; 899.2 MB) <--- fully compatible with my tool for export (don't try to update it though)

The size discrepancy is because the ISO date format (2) is stored as text while the unix timestamp (1) can be stored more compactly as an integer.
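
For anyone scripting against dump #1, here is a minimal Python sketch of mapping its values onto dump #2's conventions, based on the differences described above (the exact field names in the dumps are not shown here, so treat this as illustrative):

Code:
from datetime import datetime, timezone

def to_dump2_style(unix_ts, primary_int):
    """Unix timestamp -> ISO date string; 1/0 -> 'true'/'false'."""
    iso = datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return iso, "true" if primary_int == 1 else "false"

# 1279408157 is the unix timestamp for the 2010-07-17 23:09:17 row quoted earlier.
print(to_dump2_style(1279408157, 1))  # ('2010-07-17 23:09:17', 'true')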
newbie
Activity: 45
Merit: 0
December 17, 2013, 10:10:38 AM
#76
I just tried the latest Windows binary (without status bar) and can confirm it has successfully exported 3600-second candles. I'll try some other values as well and report if I encounter any errors. Win7/64.
sr. member
Activity: 246
Merit: 250
December 16, 2013, 06:42:25 PM
#75
The tool from poikkeus has one file, history.dat. Can I download this and - same question - somehow get it into a .csv file?

I haven't looked at his source code yet, but looking at the contents of history.dat it appears to be in a binary format, which means you'll need a script to convert it. Perhaps poikkeus could add CSV conversion to his tool? If not, I'm sure someone else could quickly hack up a conversion script, assuming the format is simple enough (it looks like a Java object dump).