In yesterday’s post we discussed some of the main characteristics of data-mining bias, as well as the parameters we will use to calculate an example of this value with Kantu (an exercise you can repeat at home to compare your results with mine). If you haven’t done so already, I recommend you read that post before you continue. In today’s post we are going to look at the results of this analysis and see what conclusions we can draw from them. We will see what system statistics Kantu was able to obtain on random data, for the data lengths and degrees of freedom chosen, and we will compare them with the statistics of systems generated on real data. In the end we will obtain a measure of data-mining bias, which will give us some minimal criteria for selecting and discriminating between trading strategies within the data-mining process.
The overall analysis took a bit more than 2 days to complete. It went through 120 million systems on the random data (which should cover a very large portion of the logic space, if not all of it) and generated some strategies. It is worth noting that I introduced some filters within this analysis, as I was clearly not interested in the results of strategies below a certain degree of profitability. In order to obtain a good number of systems while still keeping the total below the tens of thousands, I decided to filter out systems with fewer than 10 trades/year, an Average Annualized Return (AAR) below 1% or a linear regression correlation coefficient (R squared) below 0.8. Systems below these marks are clearly very bad systems that are barely profitable or very unstable through time, which is why I wanted to ignore them. Through the whole analysis I collected about 1500 system results, which we are going to use to determine our data-mining bias.
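The filters above were applied inside Kantu itself, but the same cut can be reproduced on an exported results table. What follows is only a minimal sketch: the column names Trades.Per.Year, AAR and R.Squared are my assumptions for illustration and will need to be renamed to match the headers of your own csv, and the rows are toy values, not real Kantu output.

```r
# Keep only systems that meet the three minimum marks described in the text.
# Column names are hypothetical; adjust them to your own csv headers.
filter_systems <- function(results, min_trades = 10, min_aar = 1, min_r2 = 0.8) {
  keep <- results$Trades.Per.Year >= min_trades &
          results$AAR >= min_aar &
          results$R.Squared >= min_r2
  results[keep, , drop = FALSE]
}

# Toy rows for illustration only (not real Kantu results)
toy <- data.frame(
  Trades.Per.Year = c(12, 8, 25, 15),
  AAR             = c(1.5, 2.0, 0.5, 2.2),   # in percent
  R.Squared       = c(0.85, 0.95, 0.90, 0.70)
)

kept <- filter_systems(toy)
print(kept)
```

Only the first toy row passes all three marks; each of the others fails exactly one of them, which makes it easy to check that every condition is doing its job.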
The strategies obtained through this process (sorted by the ratio of absolute profit per year to maximum drawdown as a percentage of initial capital, a.k.a. Profit.Max.DD) are clearly limited to Profit.Max.DD values below 0.27. Even after a very extensive amount of data mining, the Kantu system generator was unable to find any system that went above this value. The absolute profit is also limited to about 50,000 USD, which in Kantu (which uses non-compounding simulations) means that the systems would achieve an Average Annualized Return (AAR) of about 2-2.5%. The Max.DD.Length statistic also shows a clear limit near the 950 day mark, as no system with a maximum drawdown length below 950 days was produced. Linearity, however, is not a good criterion on its own for identifying spurious correlations, as highly linear systems (R^2 > 0.98) were produced as well. It is also worth noting that this analysis was repeated across 4 random data series, as well as 2 additional random data series where returns were produced using the mean of the EUR/USD return distribution (instead of 0), which yielded very similar results (statistics limited to the same values). You can use the R-scripts below to obtain the same graphs using the csv saved from the Kantu results grid (right click, save results as csv).
# Load the csv exported from the Kantu results grid (adjust the path to your own file)
dataset <- read.csv("E:/PathToResults/RESULTS.csv")
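Once the csv is loaded, a couple of histograms are enough to see where the two key statistics stop. The function below is a rough sketch and not necessarily the same graphs shown in the post. It assumes that, after read.csv, the data frame has columns named Profit.Max.DD and Max.DD.Length; the function name plot_bias_stats is my own.

```r
# Sketch: histograms of the two statistics discussed in the text.
# Assumes the data frame has columns named Profit.Max.DD and Max.DD.Length.
plot_bias_stats <- function(dataset) {
  old_par <- par(mfrow = c(1, 2))   # two plots side by side
  on.exit(par(old_par))             # restore the previous layout afterwards
  hist(dataset$Profit.Max.DD, breaks = 30,
       main = "Profit.Max.DD", xlab = "Profit.Max.DD")
  hist(dataset$Max.DD.Length, breaks = 30,
       main = "Max.DD.Length", xlab = "Max.DD.Length (days)")
  # Return the observed extremes, which are the values of interest here
  invisible(list(
    max_profit_max_dd = max(dataset$Profit.Max.DD),
    min_max_dd_length = min(dataset$Max.DD.Length)
  ))
}

# Usage, once `dataset` has been loaded with read.csv as above:
# limits <- plot_bias_stats(dataset)
# print(limits)
```

The returned list holds the highest Profit.Max.DD and the shortest Max.DD.Length in the file, so you can read the limits directly instead of estimating them from the plots.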
After this analysis it is clear that the Max.DD.Length and Profit.Max.DD statistics give us the best opportunity to distinguish a system produced by some real price inefficiency from one produced by a spurious correlation. If we look for systems with a Profit.Max.DD above 0.27 and a Max.DD.Length below 950 days, on 25 year back-tests, we should find strategies that are in all likelihood the result of some real effect within our financial time series. To be sure, we should go well beyond these values, for example to 0.4 and 750 days. Note that this doesn’t mean that these strategies will generate profit in the future, as the future is unknown, but simply that these systems are the result of some real historical difference between the real financial time series and random data. Whether this difference is permanent or will change cannot be known in advance (or we would all be rich :o)). What is known here is simply that this inefficiency constitutes a historically valid effect that could not be found on random data with the given degrees of freedom. Remember that the data-mining bias depends on your degrees of freedom in the data-mining process, so a new analysis needs to be carried out if these change.
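In code, the idea boils down to two steps: read the limits reached on random data, then flag systems that clear a stricter pair of marks. This is a minimal sketch under the assumption that the results tables have Profit.Max.DD and Max.DD.Length columns. The function names and the toy rows are mine, made up purely for illustration, while the 0.4 and 750 day defaults are the example values mentioned above.

```r
# Limits reached on random data: the best ratio and the shortest drawdown length
observed_limits <- function(random_results) {
  list(
    profit_max_dd = max(random_results$Profit.Max.DD),
    max_dd_length = min(random_results$Max.DD.Length)
  )
}

# TRUE for systems that clear both marks (defaults: the stricter 0.4 / 750 days)
beats_bias <- function(results, profit_max_dd = 0.4, max_dd_length = 750) {
  results$Profit.Max.DD > profit_max_dd & results$Max.DD.Length < max_dd_length
}

# Toy rows for illustration only (not real Kantu results)
toy_random <- data.frame(Profit.Max.DD = c(0.10, 0.27, 0.18),
                         Max.DD.Length = c(2100, 950, 1400))
toy_real   <- data.frame(Profit.Max.DD = c(0.55, 0.30, 0.45),
                         Max.DD.Length = c(600, 700, 900))

limits <- observed_limits(toy_random)
print(limits)
print(beats_bias(toy_real))
```

Requiring both conditions at once is the conservative choice here. A system with a high ratio but a very long drawdown, like the third toy row, is still rejected.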
The next step is to see whether we can generate systems on the EUR/USD that exceed these thresholds. I generated 600 systems using the same filter thresholds as on the random data. This took only a few minutes, while on the random data it took days. The systems generated also have very favorable statistics. Of these systems, 3% exceeded a Profit.Max.DD of 0.4, while 4.5% had a Max.DD.Length below the 750 day mark. Knowing this, we can further analyse these systems and discard those that have low linear regression coefficients, keeping only those that show high historical stability. This shows that the EUR/USD can produce systems that are far more profitable and have less drawdown than systems produced on a random financial time series with the same mean and standard deviation. You can see on the two images that there are some obvious differences between systems created on the EUR/USD and systems created on random data. For example, the maximum profitability for EUR/USD systems exceeds a 15% AAR, about 7 times higher than for the random data.
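If you want to compute the same kind of percentages on your own EUR/USD results, the calculation is a one-liner per statistic. This is a small sketch that assumes the exported table has Profit.Max.DD and Max.DD.Length columns. The helper name pct and the toy rows are mine and are not the 600 systems discussed above.

```r
# Percentage of systems (rows) for which a logical condition holds
pct <- function(flag) 100 * mean(flag)

# Toy values for illustration only (not the 600 systems from the post)
toy_eurusd <- data.frame(
  Profit.Max.DD = c(0.50, 0.20, 0.35, 0.45),
  Max.DD.Length = c(700, 1200, 740, 900)
)

pct_profit <- pct(toy_eurusd$Profit.Max.DD > 0.4)   # share above the 0.4 mark
pct_length <- pct(toy_eurusd$Max.DD.Length < 750)   # share below the 750 day mark
cat(pct_profit, "% above 0.4;", pct_length, "% below 750 days\n")
```

Taking the mean of a logical vector gives the fraction of TRUE values, which is why no explicit counting is needed.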
When generating systems with these degrees of freedom, I now know that I can set the Profit.Max.DD and Max.DD.Length filters in advance, as my data-mining bias, and obtain systems that are well beyond what is expected for random financial data. The next question to ask is whether this bias is the same as, or different from, what can be obtained using random variables on the real financial time series. Can I still obtain systems with results above the data-mining bias when using randomly generated variables for testing? In the next post in this series we are going to look at random variable generation on real financial time series to see if we can still find a spurious correlation that produces results above the data-mining bias.
If you would like to learn more about the data-mining process and how you too can generate your own systems using the Kantu system generator, please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading in general. I hope you enjoyed this article! :o)