Why do historically top-performing trading systems almost always underperform?

It is no secret that the top performers in historical testing rarely keep the same ranking under new market conditions. The “best systems” of the past tend to underperform, giving statistical results inferior to those we expect from their historical track record, which is why it seems extremely difficult, almost impossible, to select the best possible system without hindsight. Today I am going to talk about why this happens and why it will always happen, what the statistical reasons behind this phenomenon are, and what we can do to alleviate the problem.

If you have ever optimized a system or used system mining software and then traded the “top performer” live, you will have seen that no matter how you choose the parameters or how you mine the strategy, the “best system” almost always looks worse under live trading conditions. Looking back, the system that went on to perform best in the future usually performed only about average in the past, while the best systems of the past rank quite differently on the new data. Did the strategy somehow deteriorate? Is there a better way to pick a candidate without hindsight that ensures better performance?

The answer to these questions can be found by looking at the statistical heart of the matter. Imagine that you give a group of students a multiple-choice test that measures aptitude in a given mental skill. The test results are normally distributed. Since you want to select the most skilled students, you take the top 10% of the class. Have you done a good job? When you then watch these students actually perform the mental skill, they disappoint. You run the test again and the top 10% of students is now a different group. What happened here? Did your first test select the wrong students?

What we’re observing here is a classical statistical problem, known as regression to the mean. When you selected the “best performing” students you were selecting a group that was, to some extent, favored by random chance, and when their skill was evaluated again they tended to return toward the mean of the distribution. This is why top performers always disappoint. Whenever you select from a population, the top selections will tend to disappoint because they always contain some favoring by chance that is difficult to distinguish from skill. Separating exceptional skill from exceptional luck is an important problem, and a difficult one.
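The student example can be reproduced with a short simulation. The sketch below is only an illustration under assumed numbers: every score is modeled as a fixed skill plus fresh luck on each test, both drawn from a standard normal distribution, and the function name `simulate_retest` and its parameters are hypothetical.

```python
import random
import statistics


def simulate_retest(n_students=10_000, top_fraction=0.10, seed=42):
    """Score = fixed skill + fresh luck on every test (an assumed toy model)."""
    rng = random.Random(seed)
    skill = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    # Each test adds an independent luck term to the same underlying skill.
    test1 = [s + rng.gauss(0.0, 1.0) for s in skill]
    test2 = [s + rng.gauss(0.0, 1.0) for s in skill]

    n_top = int(n_students * top_fraction)
    by_test1 = sorted(range(n_students), key=lambda i: test1[i], reverse=True)
    by_test2 = sorted(range(n_students), key=lambda i: test2[i], reverse=True)
    top1 = by_test1[:n_top]        # students selected after the first test
    top2 = set(by_test2[:n_top])   # who would have been selected after the second

    mean_first = statistics.fmean([test1[i] for i in top1])
    mean_second = statistics.fmean([test2[i] for i in top1])
    overlap = sum(1 for i in top1 if i in top2) / n_top
    return mean_first, mean_second, overlap


if __name__ == "__main__":
    first, second, overlap = simulate_retest()
    print(f"selected group, mean score on test 1: {first:.2f}")
    print(f"selected group, mean score on test 2: {second:.2f}")
    print(f"fraction still in the top 10% on test 2: {overlap:.0%}")
```

In this toy model the selected group still scores above the population average on the second test, since their skill is real, but clearly below what the first test suggested, and only part of the group makes the top 10% again.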

In trading system creation we have a similar, yet more complex, problem. We cannot just “pick the average” because the universe of all potential trading logic sets is heavily losing, which is what we would expect from trading randomly. Instead, we must ensure that what we pick has a high probability of being the result of skill and not of luck. This is where system mining exercises on data generated by bootstrapping with replacement from the real data can play a role: they tell us the probability of finding a similar level of apparent skill through simple luck in our system mining process. This allows us to say that the results we have are, to some degree, the result of skill.
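The bootstrapping idea can be sketched in a few lines. This is a simplified illustration, not the actual mining software: the “mining process” here is an assumed toy search over ten momentum lookbacks, the score is plain summed profit, and all function names are hypothetical. The same search is repeated on return series resampled with replacement, which destroys any time structure, and the share of resampled runs that match or beat the real result estimates the probability that the result came from luck alone.

```python
import random


def momentum_system_profit(returns, lookback):
    """Toy system (assumed): long if the last `lookback` returns sum > 0, else short."""
    profit = 0.0
    for t in range(lookback, len(returns)):
        signal = 1.0 if sum(returns[t - lookback:t]) > 0 else -1.0
        profit += signal * returns[t]
    return profit


def mine_best_profit(returns, lookbacks=range(1, 11)):
    """Toy mining process: try every lookback and keep the best profit."""
    return max(momentum_system_profit(returns, k) for k in lookbacks)


def luck_probability(returns, n_bootstrap=200, seed=0):
    """Fraction of bootstrap series on which mining does at least as well as on real data."""
    rng = random.Random(seed)
    real_best = mine_best_profit(returns)
    hits = 0
    for _ in range(n_bootstrap):
        # Sampling with replacement keeps the return distribution but
        # removes any exploitable ordering in time.
        resampled = rng.choices(returns, k=len(returns))
        if mine_best_profit(resampled) >= real_best:
            hits += 1
    return hits / n_bootstrap


if __name__ == "__main__":
    rng = random.Random(123)
    noise = [rng.gauss(0.0, 1.0) for _ in range(300)]  # series with no real edge
    print("estimated luck probability on pure noise:", luck_probability(noise, 100))
```

Note that the whole mining process is rerun on each resampled series, not just the winning system, so the estimate accounts for how many candidates were tried.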

The problem, of course, is that luck is always a factor to some degree. Even if we determine that a trading strategy has less than a 1% probability of being the result of simple luck in the mining process, it can still owe part of its profitability to luck. A very low probability that the result is entirely due to luck does not rule out that some luck contributed to it. Trading system results are always the result of skill plus luck, and if you select the best of the best you are maximizing both. But how can we know how much luck a system's results contain? How can we ensure that we will not get systems that significantly underperform?

Our system mining case is like measuring skill at some ability within a student population where the large majority of students have absolutely no skill and a small group has a high degree of skill. When you run the test for the ability that interests you, a significant number of students appear to have a significant level of skill, but when you run a random test, one that measures no skill at all, the number of high-ranking students is significantly lower. This tells you that the population definitely contains some people with real skill at the ability that interests you. However, when you select the students who appear skilled, you will pick those who were both skilled and lucky, plus a very few who were only lucky, and you will not be able to determine what the expected performance of the truly skilled really is, because the two distributions are not separated.

To do this you need to put the selection to the test again, which gives you a new distribution: that of the skilled group with the luck reshuffled. The mean of that distribution is the realistically expected performance for the level of skill you can attain by selecting a skilled population from the general population. Of course, if you attempted to select the top performers again after the second test, you would only be selecting the luck of the new draw a second time. At this point you should therefore expect the mean performance, rather than the extreme, to get a realistic view of the level of skill you have actually extracted. After separating skill from a general population you need to avoid making selections based on top performers. Top performers will always disappoint because luck is just random.
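The retesting argument can also be checked numerically. The sketch below assumes a hypothetical population in which 5% of students have a fixed skill of 3.0 and everyone else has none, with standard normal luck added on every test; the function name, the cutoff and all the numbers are illustrative assumptions only.

```python
import random
import statistics


def retest_selected(n=20_000, skilled_fraction=0.05, skill_level=3.0,
                    cutoff=3.0, seed=1):
    rng = random.Random(seed)
    # Most students have zero skill; a small group shares the same real skill.
    skill = [skill_level if rng.random() < skilled_fraction else 0.0
             for _ in range(n)]

    # First test: select everyone whose score (skill + luck) clears the cutoff.
    first = [s + rng.gauss(0.0, 1.0) for s in skill]
    selected = [i for i in range(n) if first[i] > cutoff]

    # Second test on the selected group only: same skill, luck reshuffled.
    second = {i: skill[i] + rng.gauss(0.0, 1.0) for i in selected}

    # Picking the top 10% of the second test and testing them a third time.
    n_top = max(1, len(selected) // 10)
    top_again = sorted(selected, key=lambda i: second[i], reverse=True)[:n_top]
    third = {i: skill[i] + rng.gauss(0.0, 1.0) for i in top_again}

    return {
        "first_mean_of_selected": statistics.fmean([first[i] for i in selected]),
        "second_mean_of_selected": statistics.fmean(list(second.values())),
        "second_mean_of_top_again": statistics.fmean([second[i] for i in top_again]),
        "third_mean_of_top_again": statistics.fmean(list(third.values())),
    }


if __name__ == "__main__":
    for name, value in retest_selected().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the mean of the second test is the number to plan around. The top of the second test looks excellent again, but on a third test that subgroup falls back toward the same mean, because the second selection picked up nothing but the luck of that draw.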

Of course, trading is even more complicated because the market is ever-changing, which adds another source of uncertainty. It is not like trying to determine the average skill of chess players whose performance has a less than 1% chance of showing up in the general population by chance, but more like trying to assess the skill of players in a game where the rules can also change with time. We will get deeper into this in a future post. If you would like to learn more about trading system design and how you too can trade using systems created with proper consideration for luck and skill, please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.