Of the different types of statistical bias present in the algorithmic trading strategy creation process, the most difficult to understand (at least it was for me) is data mining bias, which I will call DMB from now on. This is because data mining bias does not depend on the generated system but on the process that you go through to generate that system. A bias that is tied to the procedure rather than the output is particularly hard to evaluate, and wrong assumptions about it can seriously distort any measurement you make when trying to tell whether your process has a small or large amount of DMB. In today's post I am going to talk about what data mining bias really is, why it is not a property of a single system and how different methods leading to the same system can have very different levels of DMB. In this manner we'll learn what DMB can tell us and what it cannot.


I often say that you should generate strategies that have low DMB. This sentence is misleading, as it treats data mining bias as if it were an intrinsic property of a strategy, such as its average yearly profit or its Sharpe ratio. A strategy has no "data mining bias" value of its own. Instead, a strategy is the output of a process that has a certain DMB value. Data mining bias is therefore a statistical bias related to the overall system creation process rather than to its end product. But what is this data mining bias? How can the exact same system be generated with different levels of mining bias? I will now give some examples that I hope will clearly explain these points.

Imagine that you have never seen the markets before. You sit in front of a computer to code a trading strategy, and the first strategy you randomly code generates an enormous historical profit. You then repeat the same process (generating a random strategy and testing its output) on random data that preserves the statistical return distribution characteristics of your original data, and find that the probability of obtaining this same result is 0.00001%. You have a strategy that, through the process you have carried out, generated a return that is extremely unlikely to have come from random chance. The probability that the inefficiency you have found is not real but simply the result of your mining process is extremely low, because your mining process is extremely simple (you generated only one strategy on the real data!).

Now you have a friend who likes to get serious about trading, and he has coded software that can test 100 billion different strategies on the same market that you tested. He also finds your strategy within his mining exercise, but he tells you that the probability of finding the strategy by random chance through his process is actually 20%. Since the strategy has such a high probability of not addressing a real historical inefficiency according to his mining process, your friend decides not to trade the system, while you decide that your process shows a very low DMB, so you trade the strategy. So what is the difference here? Why do we have two different decisions for the same system? It's the same system!


Let us consider the two possibilities here: either the strategy tackles a real historical inefficiency or it does not. The first process that generates the strategy tells us that it almost certainly does, but the sample size of this first process is extremely small (only one system). The probability of getting this lucky on just one system selection is incredibly small, which is what we see reflected in the DMB for this case. You have generated such a small sample that the DMB is telling you: "it is very difficult to be this lucky". The second process, which tackles a much larger sample size, is telling us that when you look hard enough, you don't have to be that lucky to find a strategy like this. The traders are therefore making decisions based on their own probabilities of having been lucky.
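The effect of sample size described above can be sketched with a small Monte Carlo experiment. This is only an illustration under assumptions of my own choosing, not the procedure used in any real mining software: the function name `chance_probability`, the "random strategy" model (a random long/short position on every bar), the bootstrap resampling used to build random data that keeps the return distribution, and the synthetic stand-in returns are all hypothetical.

```python
import numpy as np

def chance_probability(returns, target, n_strategies, n_trials=1000, seed=0):
    """Estimate how often the best of `n_strategies` random strategies
    reaches `target` total return on data with no real inefficiency."""
    rng = np.random.default_rng(seed)
    n_bars = len(returns)
    hits = 0
    for _ in range(n_trials):
        # Random data: bootstrap the returns (sampling with replacement),
        # which keeps their distribution but destroys any time structure.
        random_returns = rng.choice(returns, size=n_bars, replace=True)
        # Hypothetical random strategies: long (+1) or short (-1) on each bar.
        positions = rng.choice([-1.0, 1.0], size=(n_strategies, n_bars))
        # The mining process keeps only its best performer.
        best_total_return = (positions @ random_returns).max()
        hits += int(best_total_return >= target)
    return hits / n_trials

# Synthetic stand-in for real market returns (an assumption for this sketch).
sample_returns = np.random.default_rng(42).normal(0.0, 0.01, size=250)

# Same target result, different mining processes.
for n in (1, 10, 100):
    p = chance_probability(sample_returns, target=0.4, n_strategies=n)
    print(f"strategies tried: {n:>3}  estimated chance probability: {p:.3f}")
```

The target result is identical in every run; only the number of strategies the process tries changes, and with it the estimated probability of reaching that result by luck alone.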

What is the right choice? You can mathematically demonstrate (as I will do in my next post!) that for a given logic space containing a selected suitable system candidate, the DMB will only stay equal or get lower as you focus more and more on that region and ignore the surrounding spaces. The consequence is that evaluating smaller logic spaces gives us an erroneous sense of DMB, even if we ended up in that region just by chance, as can happen in very small generation exercises. Expanding the logic space, on the contrary, gives us a much more holistic view of how our system fits within the grand scheme of things for a given financial time series. If a DMB exercise on a large space says that the probability of finding that system by chance is 90% and an exercise on a smaller space says that it is 1%, then in all likelihood the person with the lower DMB was simply lucky in their generation exercise (they focused, without knowing it, on a small region of the logic space that contains highly suitable systems). Larger mining exercises that measure DMB and find low-DMB strategies are much more likely to be correct about their statistics (think sample size!) than those that find something through very focused exercises.

There is a wealth of interesting mathematics surrounding DMB that you can explore (and which I will get into soon). It gives us a lot of hints on how we should engineer our mining exercises so that our mining procedures are more reliable and give us more trustworthy results. If you want to learn more about data mining bias and how we evaluate it, please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.