Topic: How can I reduce time wasted on data cleaning? (Read 160 times)

hero member
Activity: 2660
Merit: 630
Vave.com - Crypto Casino
I think the time you set aside for this should vary depending on your daily life. For example, if your profession is mainly related to this work, or if it is your main source of income, spending 8 hours a week would be very normal; if that doesn't apply to you, 2-4 hours a week should be enough. So this will vary from person to person and should be arranged around your daily routine.

I also think you should minimize visiting sites that are not useful to you, to reduce unnecessary data wastage. Some people use more data than they need. It also helps to take time to adjust the amount of daily usage on the phone.
sr. member
Activity: 2828
Merit: 344
win lambo...

The second problem is the gaps. Some data sets (for example's sake, imagine candles) can have gaps a couple of days long in them. Most sources tell me to just average them out, but I don't like that, as I'm worried it influences the models. Filling the gaps from another data source rarely works either, because the timestamps are usually not in sync, so it becomes almost impossible to fit them in retrospectively.

I'm now using only a couple of exchanges and a handful of pairs. I'd like to increase the number of markets, but I'm worried about what kinds of issues there will be. I'm already spending so much time fixing these things that I don't know if I can manage more markets. Any ideas/help would be really appreciated!
Try other exchanges. If the gaps keep happening, there might be a problem with your browser or your internet connection; you could try another browser instead. I don't believe the problem is with the site itself, because they would probably already know about it from complaints and would surely be fixing it, but in your case it still happens.

The last option is to check your computer or try a new one, then figure out what the difference is, because it's possible there is some problem with the installed programs or with the machine's capability.
sr. member
Activity: 980
Merit: 260
I'm fairly new to the trading scene; I started in 2019 with TA and this year got more into quant-based strategies. My problem is that my models constantly run into issues because timestamps are not synchronized (so I can't interlink data sets) or because a data set has gaps. I've noticed that for every 10 hours I work on modelling, I spend 5 just cleaning the data.

This topic is a bit general, but as I'm quite new to this I'd like to know how to make it faster, and whether it is normal to spend this much time preparing the data. Any suggestions are welcome. I use Python/Jupyter, and I usually find data from free online sources and/or fetch it from exchanges.

Dude, that sounds excessive. By the time you're done with the data modelling and cleaning, the market has shifted and most of the effort is wasted. It's best to develop something more time-efficient, something automated, or find a way to get data that isn't so difficult to unravel.
hero member
Activity: 1666
Merit: 629
I think the time you set aside for this should vary depending on your daily life. For example, if your profession is mainly related to this work, or if it is your main source of income, spending 8 hours a week would be very normal; if that doesn't apply to you, 2-4 hours a week should be enough. So this will vary from person to person and should be arranged around your daily routine.
copper member
Activity: 2856
Merit: 3071
https://bit.ly/387FXHi lightning theory
For the first problem, you're trying to measure the lag from the server to you. The only feasible way I can think of is to place an order and see how long it takes for that order to appear in the book (but it probably only works if you've got that level of aggregation: either pick a quiet time to do it, or have an order ID you can search for while collecting and timestamping the data).
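
Something like this sketch is what I mean, using ccxt from Python (the exchange, symbol, price and size here are all placeholders; an off-market price keeps the order resting in the book, and you want a quiet market where nobody else sits at that price):

Code:
import time
import ccxt  # pip install ccxt

exchange = ccxt.kraken({"apiKey": "...", "secret": "..."})
symbol = "BTC/USD"
price, size = 10000.0, 0.001  # far below the market so the order just rests

t_sent = time.monotonic()
order = exchange.create_limit_buy_order(symbol, size, price)

# Poll the public book until our price level shows up; the delta is a rough
# round-trip lag figure (it includes the polling interval, so keep it small).
while True:
    book = exchange.fetch_order_book(symbol)
    if any(abs(level[0] - price) < 1e-9 for level in book["bids"]):
        print("rough lag: %.3f s" % (time.monotonic() - t_sent))
        break
    time.sleep(0.05)

exchange.cancel_order(order["id"], symbol)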

If I were you I'd just ignore the gaps. It's a bit stupid to extend the candle by 3 days, and that's the only thing I'd do, because for something like the CME futures you know that over the two missing days the open was Friday's close and the close of the candle was Monday's open... It doesn't make sense to me to try to normalise the data over periods when markets are closed.

I think every exchange has had trading closed for some amount of time at one point or another, so everywhere has gaps that just stay unfilled.
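
If you do want to keep the gaps visible rather than averaging them away, a rough pandas sketch of what "just ignore them" could look like (assumes candles in a DataFrame indexed by open time, with OHLC columns):

Code:
import pandas as pd

# Reindex candles onto a full time grid and flag the gaps instead of
# filling them, so a model can skip or mask those rows explicitly.
def mark_gaps(candles: pd.DataFrame, freq: str = "1h") -> pd.DataFrame:
    full = pd.date_range(candles.index.min(), candles.index.max(), freq=freq)
    out = candles.reindex(full)
    out["is_gap"] = out["close"].isna()
    return out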
newbie
Activity: 2
Merit: 0
To clarify: you're trying to build models from the data, but want things to be synchronised between the different datasets?

I may not have gone further than a fairly high timeframe myself, but is it not possible to just aggregate the data to a point where both sets are in sync, rather than trying to fill gaps with models etc...?

Could you give an example of the stuff you're trying to do, or a simple application others would use that's similar to yours, without giving away what you're trading on?

I've encountered two types of problems so far. The first is that synchronizing timestamps is really difficult. Not all events have timestamps based on when they happened in the market, so I'm not sure how to sync them. As an example, imagine that I receive events from two markets (like order book updates) and want to see which market moves first. The problem is that one of the markets does not give a timestamp for when the event was registered in its trading engine; the only timestamp I have is the one I record on my server. I don't know how long the data is in transit from the exchange's server, so it becomes really difficult to estimate which market's orders were recorded first. Any suggestions on what I can do?
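
To illustrate, all I can do right now is stamp each message with my own clock the moment it arrives, something like this sketch (websockets library; the feed URL and message format are made up):

Code:
import asyncio
import json
import time

import websockets  # pip install websockets

async def record(url: str, out_path: str) -> None:
    # Tag every event with a local receive timestamp, so both feeds at
    # least share one (local) time base, even when the exchange sends no
    # engine-side timestamp of its own.
    async with websockets.connect(url) as ws:
        with open(out_path, "a") as out:
            async for raw in ws:
                recv_ns = time.time_ns()  # local receive time, nanoseconds
                event = json.loads(raw)
                event["local_recv_ns"] = recv_ns
                out.write(json.dumps(event) + "\n")

asyncio.run(record("wss://example-exchange/stream", "events.jsonl"))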

The second problem is the gaps. Some data sets (for example's sake, imagine candles) can have gaps a couple of days long in them. Most sources tell me to just average them out, but I don't like that, as I'm worried it influences the models. Filling the gaps from another data source rarely works either, because the timestamps are usually not in sync, so it becomes almost impossible to fit them in retrospectively.

I'm now using only a couple of exchanges and a handful of pairs. I'd like to increase the number of markets, but I'm worried about what kinds of issues there will be. I'm already spending so much time fixing these things that I don't know if I can manage more markets. Any ideas/help would be really appreciated!
copper member
Activity: 2856
Merit: 3071
https://bit.ly/387FXHi lightning theory
To clarify: you're trying to build models from the data, but want things to be synchronised between the different datasets?

I may not have gone further than a fairly high timeframe myself, but is it not possible to just aggregate the data to a point where both sets are in sync, rather than trying to fill gaps with models etc...?
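
As a sketch of what I mean by aggregating, something like this in pandas (the names are made up, and both series are assumed to carry a DatetimeIndex):

Code:
import pandas as pd

# Bucket both feeds onto one coarse grid so the timestamps line up by
# construction; dropna() keeps only buckets where both markets have data.
def align(a: pd.Series, b: pd.Series, freq: str = "5min") -> pd.DataFrame:
    return pd.concat(
        {"a": a.resample(freq).last(), "b": b.resample(freq).last()},
        axis=1,
    ).dropna()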

Could you give an example of the stuff you're trying to do, or a simple application others would use that's similar to yours, without giving away what you're trading on?
newbie
Activity: 2
Merit: 0
I'm fairly new to the trading scene; I started in 2019 with TA and this year got more into quant-based strategies. My problem is that my models constantly run into issues because timestamps are not synchronized (so I can't interlink data sets) or because a data set has gaps. I've noticed that for every 10 hours I work on modelling, I spend 5 just cleaning the data.

This topic is a bit general, but as I'm quite new to this I'd like to know how to make it faster, and whether it is normal to spend this much time preparing the data. Any suggestions are welcome. I use Python/Jupyter, and I usually find data from free online sources and/or fetch it from exchanges.
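
For context, my fetching usually looks something like this sketch with ccxt (the exchange id and symbol are just examples); I normalise the millisecond timestamps to UTC straight away so every data set shares one clock convention:

Code:
import ccxt
import pandas as pd

def fetch_candles(exchange_id: str, symbol: str, timeframe: str = "1h") -> pd.DataFrame:
    exchange = getattr(ccxt, exchange_id)()
    # fetch_ohlcv returns rows of [timestamp_ms, open, high, low, close, volume]
    rows = exchange.fetch_ohlcv(symbol, timeframe=timeframe)
    df = pd.DataFrame(rows, columns=["ts_ms", "open", "high", "low", "close", "volume"])
    df.index = pd.to_datetime(df["ts_ms"], unit="ms", utc=True)
    return df.drop(columns="ts_ms")

candles = fetch_candles("kraken", "BTC/USD")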