Predicting out-of-sample performance

For a quantitative trading strategy developer, the holy grail would be a method that can forecast system performance on unseen market data with some accuracy. This means you could take an algorithm with a set of historical performance metrics and, through some analysis, derive an expected performance measurement for the future that would be accurate, at least to some extent. Given this, many people have become interested in creating such methods in order to derive some form of prediction of real out-of-sample success. In today's post I want to talk about some of the attempts that have been made, some that can be made, and why and how these efforts can easily become exercises in futility.


It is definitely normal and desirable to study how system behavior changes when going from an in-sample design period to a pseudo out-of-sample or a real out-of-sample period. Most research is performed using pseudo out-of-sample periods – meaning that historical data is divided and a part is left out of the initial design phase – since real out-of-sample evaluation is inconvenient due to the time it takes to trade systems under truly unknown market conditions (new market data). Using pseudo out-of-sample periods, a quantitative trader can easily study large groups of systems and evaluate how they behave on data that was excluded from a given design phase. Of course this is not a real out-of-sample test, and the fact that the data is pseudo out-of-sample (historical data) introduces several important sources of statistical bias (read here for more info). However, there are some broad conclusions that can be reached while keeping bias to a minimum.
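
To make this concrete, here is a minimal Python sketch of a pseudo out-of-sample split, where the last portion of the historical data is held out from the design phase. The function name, the 30% hold-out fraction and the synthetic return series are my own illustrative assumptions, not part of any specific platform:

```python
import numpy as np

def pseudo_oos_split(returns, oos_fraction=0.3):
    """Split a historical return series into an in-sample design
    segment and a pseudo out-of-sample segment held out at the end."""
    n = len(returns)
    cut = round(n * (1.0 - oos_fraction))
    return returns[:cut], returns[cut:]

# Illustrative example: ~10 years of synthetic daily returns
# (roughly 252 trading days per year).
rng = np.random.default_rng(42)
daily = rng.normal(0.0003, 0.01, 252 * 10)
in_sample, pseudo_oos = pseudo_oos_split(daily, oos_fraction=0.3)
print(len(in_sample), len(pseudo_oos))
```

The design phase sees only `in_sample`; everything computed on `pseudo_oos` is the hold-out evaluation, which, as discussed above, is still historical data and therefore still biased relative to real out-of-sample trading.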

For example, you can study whether there are broad correlations between historical performance across very long testing periods. I have published several posts and articles about such exercises, and it is clear that some broad correlations do start to show up. For example, systems that are highly stable (R² > 0.95) across long periods of time (10+ years) in back-testing show a very high likelihood of performing positively across equally long subsequent periods. It is also clear that a minimum trading frequency is necessary for this to be meaningful: at trading frequencies below around 10 trades per year the above relationship tends to break down, most probably because the correlation coefficient is not statistically solid (it is determined over a small number of data points). Other statistics – things like maximum drawdown length, Sharpe ratio, winning ratio, risk-to-reward ratio, etc. – do not seem to matter much, as a high R² already seems to encompass a big part of what constitutes system quality.
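
As a rough illustration of this stability measure, the R² can be computed from a linear fit to a system's cumulative equity curve. The sketch below is my own hypothetical construction, not the exact procedure used in the studies mentioned above; it compares a smooth, steadily growing synthetic equity curve with a choppy one that has the same mean return:

```python
import numpy as np

def equity_r2(returns):
    """R² of a linear fit to the cumulative (equity) curve:
    a simple stability measure for a back-tested system."""
    equity = np.cumsum(returns)
    t = np.arange(len(equity))
    slope, intercept = np.polyfit(t, equity, 1)
    fit = slope * t + intercept
    ss_res = np.sum((equity - fit) ** 2)
    ss_tot = np.sum((equity - equity.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Two synthetic 10-year daily return streams with the same mean return
# but very different volatility around that mean.
rng = np.random.default_rng(0)
steady = rng.normal(0.0004, 0.0015, 2520)  # smooth grower
choppy = rng.normal(0.0004, 0.02, 2520)    # same mean, much noisier
print(round(equity_r2(steady), 3), round(equity_r2(choppy), 3))
```

The steady curve scores a much higher R², which is the kind of long-period stability the relationship above rewards.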


The above relationships become much weaker and tend to lose all predictive power across shorter time spans. The R² across a single year or two has almost no bearing on future profitability, even if the number of trades during the design period is very large. This is the result of curve fitting, as market conditions can fluctuate very strongly from one year to another. The future survival ability of systems tends to increase as a function of design-period length, showing that the more data is used to create a system, the better the system can capture general market behavior instead of peculiarities that might be present only under some types of market conditions. You can read some of my recent blog posts about market conditions where it becomes clear why using longer-term data is so important. Another very useful post is this one, where you can see clear evidence of why past data – even very old data – is not irrelevant.

There are, however, strong limitations to pseudo out-of-sample analysis. It might be tempting to attempt to derive complex relationships between in-sample and pseudo out-of-sample success, which leads to complex selection criteria that can manifest as system selection filters or as much more elaborate mechanisms, such as walk forward analysis (WFA). Using WFA a trader can create systems that frequently re-optimize and succeed under most if not all pseudo out-of-sample market conditions – with an apparently marvelous ability – but this tends to break down under real out-of-sample market conditions. This is a consequence of the strong bias incurred when using pseudo out-of-sample results to create complex system selection mechanisms. I have seen no concrete evidence so far that WFA-generated systems have any greater chance of success under real out-of-sample market conditions.
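
For readers unfamiliar with the mechanics, a walk-forward scheme repeatedly optimizes on a rolling design window and then trades the window immediately following it. A minimal index-generating sketch follows; the window lengths are arbitrary assumptions for illustration, not a recommendation:

```python
def walk_forward_windows(n, design_len, trade_len):
    """Return (design_start, design_end, trade_end) index triples for a
    rolling walk-forward scheme: optimize on [design_start, design_end),
    then trade on [design_end, trade_end)."""
    windows = []
    start = 0
    while start + design_len + trade_len <= n:
        d_end = start + design_len
        windows.append((start, d_end, d_end + trade_len))
        start += trade_len  # roll forward by one trading window
    return windows

# Example: 10 years of daily data, 3-year design windows,
# 1-year trading windows (~252 trading days per year).
wins = walk_forward_windows(2520, 756, 252)
print(len(wins), wins[0], wins[-1])  # → 7 (0, 756, 1008) (1512, 2268, 2520)
```

Each re-optimization gives the scheme another chance to fit the historical record, which is exactly where the selection bias described above creeps in.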


There is not a lot of published research on the matter. A recent paper by the people at Quantopian attempted to evaluate whether they could predict 6+ month real out-of-sample success for trading strategies, and they found that this was extremely hard to do. Not only were the correlations they found between in-sample variables and real out-of-sample statistics weak, but their only limited success was obtained using machine learning, which is bound to fail at predicting future out-of-sample periods due to the inherent curve-fitting characteristics of these methods. I would have been surprised if they had had any different results, since out-of-sample success over a 6-month period is generally extremely uncertain due to the potential distribution of Sharpe ratios from individual systems. A Monte Carlo simulation can easily reveal how a system could show a Sharpe ratio anywhere from -5 to 5 over such a short period, even if it proves successful in the longer term. Another problem is that most systems at Quantopian are designed with small amounts of in-sample data; based on my research I would expect most of their systems to behave randomly out-of-sample simply due to curve-fitting bias.
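
The dispersion argument is easy to check with a small Monte Carlo experiment. The sketch below assumes (my assumption, for illustration) a system whose true annualized Sharpe ratio is 1.0 with i.i.d. normal daily returns, and shows how widely the realized Sharpe can scatter over a 6-month window:

```python
import numpy as np

def simulate_short_horizon_sharpes(true_sharpe=1.0, months=6,
                                   n_sims=10000, seed=7):
    """Draw many short daily-return paths from a system whose true
    annualized Sharpe is `true_sharpe`, and return the realized
    annualized Sharpe ratio of each path."""
    days = 21 * months                      # ~21 trading days per month
    daily_mu = true_sharpe / np.sqrt(252)   # daily mean for unit daily vol
    rng = np.random.default_rng(seed)
    paths = rng.normal(daily_mu, 1.0, (n_sims, days))
    return paths.mean(axis=1) / paths.std(axis=1, ddof=1) * np.sqrt(252)

sharpes = simulate_short_horizon_sharpes()
print(round(sharpes.min(), 1), round(sharpes.max(), 1))
```

Even for this genuinely profitable system, realized 6-month Sharpe ratios span several units in either direction, which is why such a short real out-of-sample window says very little about true system quality.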

As you can see, out-of-sample predictions are no easy task. Not only are we limited in what we can do with pseudo out-of-sample data, but we also need to wait for significantly long periods to have a chance to confirm real out-of-sample predictions. However, a broad-scope analysis using large amounts of data can reveal some extremely useful general guidance: use as much data as you can to build the most stable systems you can. Of course other things also need to be taken into consideration, such as data-mining bias. If you would like to learn more about trading system design and how you too can create your own trading strategies please consider joining, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.

