How far should we go into the past for Forex simulations: Is old data really useless?

When you build algorithmic strategies for Forex trading you will often hear people telling you that you should or shouldn't use certain data periods for one reason or another. Some traders suggest that you should only use data from the past several years, the past year or even only the past few months, because "old data" is irrelevant to trading today. However, these traders rarely give you actual statistical reasons to back up their claims; they usually have no reason to prefer recent data beyond a belief that using older data won't bring better results. In today's post I want to discuss the issue of "old data", the statistical arguments I have for using it, and the cases where old data should definitely not be used. After reading this you should have a better idea of how past and current data differ and how this affects the data you should use depending on the system you want to trade.


So why is there a belief that past data should not be used? Traders who defend this idea use several arguments to justify using only recent data. The main one is that the market is substantially different now than it was 10 or 20 years ago. The rise of algorithmic trading and the now prevalent use of high frequency trading have made the market fundamentally different. From this point of view, designing something to trade as if it were 1995 makes little sense, since the market we saw in 1995 will never be repeated (the conditions of 1995 cannot recur, because we will never go back to the technological level we had then). This idea then tells you that if you want to succeed in today's markets you need to stay close to today's conditions; for most traders who hold this belief that means using less than 10 years of historical data, in some cases even less than 5.

The above sounds pretty reasonable, so when is data from 1995 relevant? When building algorithmic trading systems we are subject to curve-fitting bias, which arises from our lack of total knowledge about the market. We know that certain market conditions have existed in the past, but we are unsure whether we can infer general inefficiencies from them or whether the inefficiencies within this data are simply special cases unique to it. From this statistical perspective, the more data we have, the more likely we are to find something that is "more general", since our inference exercise has higher significance (we have more data). Our goal is to find some market behavior that can generate profit systematically across the entire set of data, meaning that we want something that worked in 1986 as well as in 2015. Of course there are many things that worked in 1986 that would not work today, so the idea is not to get fooled by the past but to construct something that is "ever present": a general inefficiency.
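To make the curve-fitting argument more concrete, here is a minimal sketch (my own illustration, not something from Asirikuy) of how the uncertainty around an estimated edge shrinks as the amount of history grows. The per-trade edge, volatility and trade frequency below are made-up assumptions purely for demonstration.

```python
# Minimal sketch: with more trades (more history), the confidence interval
# around the estimated edge narrows, so a fluke looks less like an edge.
# All figures here are hypothetical assumptions.
import numpy as np

rng = np.random.default_rng(42)

true_edge = 0.0002          # assumed mean return per trade (hypothetical)
noise = 0.01                # assumed per-trade volatility (hypothetical)
trades_per_year = 250       # assumed roughly one trade per trading day

for years in (2, 10, 30):
    n = years * trades_per_year
    returns = rng.normal(true_edge, noise, size=n)
    mean = returns.mean()
    stderr = returns.std(ddof=1) / np.sqrt(n)   # standard error shrinks as 1/sqrt(n)
    print(f"{years:2d} years: estimated edge = {mean:+.5f} "
          f"+/- {1.96 * stderr:.5f} (95% CI)")
```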

Even performance across the whole period is another key aspect when using long term historical data (25+ years). You can design a strategy that performs extremely well in the 1986-1996 period and then only mildly well from 1996 to the present. Such a strategy is obviously something we wouldn't want to trade live, because it is implicit within its results that it might be doing something incorrectly: for example, assuming a trading cost that was too low in the past, or assuming that it could execute trades at a speed that was simply technologically impossible under past market conditions. This means that we should definitely consider the type of strategy we are simulating when using old data.
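As a rough sketch of how this "even performance" check might look in practice (my own illustration, not the author's Asirikuy tooling; the equity curve below is synthetic), you can split a long backtest into sub-periods and verify that each one contributes:

```python
# Sketch: summarise a long daily equity curve decade by decade, so that a
# strategy whose profit is concentrated in one early block stands out.
import numpy as np
import pandas as pd

def subperiod_report(equity: pd.Series) -> pd.DataFrame:
    """Summarise a date-indexed equity curve decade by decade."""
    daily_ret = equity.pct_change().dropna()
    decade = daily_ret.index.year // 10 * 10
    grouped = daily_ret.groupby(decade)
    return pd.DataFrame({
        "total_return": grouped.apply(lambda r: (1 + r).prod() - 1),
        "ann_sharpe": grouped.apply(
            lambda r: np.sqrt(252) * r.mean() / r.std() if r.std() > 0 else np.nan),
    })

# Synthetic stand-in for a 1986-2015 backtest equity curve:
dates = pd.bdate_range("1986-01-01", "2015-12-31")
rng = np.random.default_rng(0)
equity = pd.Series((1 + rng.normal(0.0003, 0.008, len(dates))).cumprod(), index=dates)
print(subperiod_report(equity))
# A strategy that only worked in 1986-1996 would show a large total_return for
# that block and flat or negative numbers everywhere else.
```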


If you want to simulate a scalper that attempts to profit from tick-based analysis, then using data from before 2007 would make little sense, since the structure that allowed this type of retail trading simply did not exist before then. However, if you want to design a daily trend following system then you could have done that from 1986 without a problem. The disadvantage of a strategy that requires data only from 2007 onward is that you have less data available, less information, so by definition you face a larger probability of falling into curve-fitting related traps. The smaller the amount of data you use, the more easily you can wrongly infer that an inefficiency within your data is a general market characteristic. If your target strategy allows for the use of "old data" then by all means develop something that works equally well from 1986 to 2015; chances are you have found something much more general than something that only worked from 2007 to 2015.
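A minimal sketch of that decision (my own convention, purely illustrative; the cutoff dates simply restate the examples above) could look like this:

```python
# Hypothetical mapping from strategy type to the earliest history that makes
# sense to simulate: tick-based retail scalping only after 2007, daily trend
# following back to 1986.
import pandas as pd

EARLIEST_SENSIBLE_START = {
    "tick_scalper": "2007-01-01",
    "daily_trend":  "1986-01-01",
}

def usable_history(prices: pd.DataFrame, strategy_type: str) -> pd.DataFrame:
    """Trim a date-indexed price history to the span the strategy type can justify."""
    start = pd.Timestamp(EARLIEST_SENSIBLE_START[strategy_type])
    return prices.loc[prices.index >= start]
```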

If you would like to learn more about the way in which I design strategies and how you too can create strategies using long term historical data, please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.


2 Responses to “How far should we go into the past for Forex simulations: Is old data really useless?”

  1. Rob van der Houwen says:

    Hi Daniel,

    how are you doing with Asirikuy?

    When using a scalper, you are on the 1 or the 5 minute timeframe.
    When using a daily EA, you are maybe on a 60 minute timeframe.

    So, correct me if I am wrong here, a 1 year 1 minute scalper uses the same amount of data as a daily system with 60 years of history!

    best regards,

    Rob

    • admin says:

      Hi Rob,

      Thanks for writing :o) You shouldn't confuse information with the number of points in a data file. Sixty daily data points are not the same as sixty one-minute bars. The number of ticks needed to construct the daily data points is far greater than the number of ticks needed to construct 60 one-minute bars. The amount of market conditions within your data is proportional to the amount of information (the number of ticks used to construct the data points) and NOT the absolute number of data points.

      If you construct a system using information from 360 days of 1M data it will never be as robust as a system constructed using 25 years of daily data, even if the number of "data points" is far greater. The important number is really the actual amount of ticks it took to construct both series (the ticks needed to get to those OHLC values). I hope this answers your question,
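      As a rough back-of-the-envelope illustration of this point (the tick rate below is a made-up assumption, and real FX tick counts vary widely):

      ```python
      # Back-of-the-envelope comparison: number of bars vs. underlying ticks.
      # The 20 ticks/minute figure is purely a hypothetical assumption.
      TICKS_PER_MINUTE = 20
      MINUTES_PER_DAY = 1440          # FX trades around the clock on weekdays

      bars_1m_1yr = 260 * MINUTES_PER_DAY                  # ~374,400 one-minute bars
      ticks_1m_1yr = bars_1m_1yr * TICKS_PER_MINUTE        # ticks behind those bars

      bars_daily_25yr = 25 * 260                           # ~6,500 daily bars
      ticks_daily_25yr = bars_daily_25yr * MINUTES_PER_DAY * TICKS_PER_MINUTE

      print(f"1 year of 1M bars : {bars_1m_1yr:>9,} bars, {ticks_1m_1yr:>12,} ticks")
      print(f"25 years of daily : {bars_daily_25yr:>9,} bars, {ticks_daily_25yr:>12,} ticks")
      # Far fewer bars, but about 25 times the underlying tick information.
      ```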

      Best Regards,

      Daniel
