Machine Learning in Forex: Data quality, broker dependency and trading systems

Although my efforts in building machine learning systems for the Forex market initially centered on the daily time frame using non-linear regression methods (mainly neural networks), during the past year I have been moving towards building machine learning systems on the lower time frames using a more varied arsenal of algorithms. This is both because I never managed to build historically profitable systems on the daily time frame for instruments besides the EUR/USD and because I wanted to explore other classification and regression techniques that might be much cheaper computationally. Since systems that constantly retrain on the lower time frames generally demand a much larger number of bars, using computationally cheaper methods makes more sense. However, during the past few months I have been hitting a roadblock in building these systems, mainly due to issues related to broker dependency. Through the rest of this post I will explain what my problems have been and how I have attempted to tackle them in order to generate robust machine learning methodologies.


A few months ago I was eager to write about building historically profitable systems that trade on the lower time frames across several Forex symbols using constantly retrained machine learning techniques. As a matter of fact, I later refined this methodology enough that I am now able to generate historically profitable results on all Forex majors across the past 25 years of data, using simple ensembles of classifiers on the 1H time frame (systems with high linearity and very decent profit to drawdown characteristics). The systems retrain their models on every new hourly candle and make use of simple trade management mechanisms (such as trailing stops) to further enhance their profitability. Once I had generated a portfolio of systems to trade across all FX majors and was ready to move them to live trading, I decided to run a final test to evaluate feed dependency, so I ran the systems on data from a separate broker (different from my 25 year data source) to see the results.

I was quite shocked to see that the results were not only completely different but that profitability was obliterated. In the graph above you can see the 2002-2012 period evaluated on both datasets (results were analyzed in R after the back-tests were finished with the F4 framework). The red set is the data on which the system was created and the black set is data for the exact same Forex symbol coming from a completely different source. The correlation between the monthly returns of the two runs is only 0.3, meaning that in practice the classifier behaves like two completely different systems on the two datasets. When building systems on the daily time frame I never faced this issue, because feed differences on daily bars are not large enough to affect the performance of machine learning systems. On the lower time frames the differences are magnified (they represent a much larger percentage of each bar), so machine learning systems trained on two different broker feeds give very different results.
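The monthly return comparison described above can be sketched in a few lines. This is only an illustration, not the R analysis used for the post: the function names and the dictionary layout (month label mapped to monthly return) are assumptions made for the example, and the correlation is a plain Pearson coefficient computed over the months present in both back-tests.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def monthly_return_correlation(returns_a, returns_b):
    """Correlate two back-tests month by month.

    returns_a, returns_b: dicts mapping a month label such as "2005-03"
    to that month's return (hypothetical layout). Only months present in
    both back-tests are compared.
    """
    months = sorted(set(returns_a) & set(returns_b))
    return pearson([returns_a[m] for m in months],
                   [returns_b[m] for m in months])
```

A value close to 1 would mean the system behaves the same way on both feeds; a value like the 0.3 reported above suggests the two back-tests share little beyond the symbol name.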


Looking at the difference between both feeds we can see two important things. The first is that the difference becomes more prominent the further back we go, and the second is that most differences are actually small (the average difference is 1.6 pips). We can also see that the differences in trading system performance are largest furthest back in time and become much less prominent during the 2008-2012 period. This means that the problem is much less important on recent data and becomes much heavier as the uncertainty surrounding the “fine grain” of the data becomes larger (further disparity between the data sets). Looking at the 1000-bar moving average of the data feed differences also reveals that differences have been steadily declining as trading has evolved, despite the fact that 2008-2009 had a very large volatility peak. If the values were adjusted for volatility, the decline would appear even more dramatic. However, since broker feed differences are probably affected mainly by liquidity (not simply by market movement ranges), we would simply expect differences between feeds to decline as the market becomes more liquid (as differences between liquidity providers should smooth out).

It is also worth noting that peaks in the feed differences do not seem to affect performance, since these machine learning systems are trained on 300-400 learning examples that may well span the last 10,000 to 15,000 hourly bars. This means that broker differences are only relevant if they are high across the whole training sample set (enough to cause wide differences) and are not so important if they only affect a few of these examples. From these graphs we can conclude that this machine learning system, along with several others I studied with very similar results, is mainly affected when the average difference over the past 1000 bars is above the 2 pip threshold. As the average has descended below this point, the differences have become tolerable enough for the systems to give adequate results. So how do we solve this problem? How do we generate results for lower time frame machine learning strategies that work across feeds that were so different historically?
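As a rough sketch of the measurement discussed above, the following computes a 1000-bar moving average of the absolute difference between two feeds and flags the bars where it exceeds 2 pips. Several details are assumptions for the example and not taken from the original analysis: the difference is measured on closing prices of already time-aligned bars, one pip is taken as 0.0001 (it would be 0.01 for JPY pairs), and the function names are made up.

```python
from collections import deque


def rolling_mean_abs_diff(closes_a, closes_b, window=1000, pip=0.0001):
    """Moving average of |close_a - close_b| in pips over `window` bars.

    Both lists are assumed to hold closes for the same, already aligned
    bars. Positions with fewer than `window` bars of history get None.
    """
    if len(closes_a) != len(closes_b):
        raise ValueError("feeds must be aligned to the same bars")
    recent = deque()
    total = 0.0
    averages = []
    for a, b in zip(closes_a, closes_b):
        diff = abs(a - b) / pip
        recent.append(diff)
        total += diff
        if len(recent) > window:
            total -= recent.popleft()  # drop the bar that left the window
        averages.append(total / window if len(recent) == window else None)
    return averages


def above_threshold(averages, threshold_pips=2.0):
    """True where the rolling average exceeds the threshold, None where undefined."""
    return [None if avg is None else avg > threshold_pips for avg in averages]
```

Bars flagged `True` would mark the stretches of history where, going by the observation above, results on the two feeds should not be expected to agree.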


There are mainly two answers to this question. The first would be to simply ignore previous data and develop systems from 2008 to 2014. Since earlier data cannot be relied on to match across feeds to the degree that machine learning strategies need, it would be a good idea to simply consider this data of low quality for machine learning and move on. Surely you may come up with strategies that have “less powerful” generalizations than if you used 25 years of data, but since the data you use is of better quality (subject to less variability) the models may come up with more useful trading methods.

The second option is to attempt to build systems that are robust to the perturbations in the past and which can, despite this, come up with historically profitable machine learning methods. There are many methods that can be used to achieve this, but perhaps the most commonly used is to add noise to the data so that the conclusions drawn from either data set become the same, as the machine learning method would only be left with enough information to see the “very global picture”. If you have an average candle difference of X you can distort all candles by a random quantity of size 2*X and then make a trading decision based on the output of an array of predictors. If you train 200 models with randomly distorted samples and all of them say that getting into a long trade is the best decision, then the answer is probably going to be the same on a different data set where the random distortion is of the same magnitude. If noise doesn’t change the conclusion then the conclusion must be “real” (there must be some underlying quality in the data more important than the added noise). This makes sure that we do not simply find patterns in the inherent noise of the time series but actually find something relevant.
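The noise idea above can be sketched as follows. This is a toy illustration under stated assumptions, not the system used in my back-tests: the nearest-centroid classifier is only a stand-in for whatever predictor you actually use, the noise is drawn uniformly from [-2*X, 2*X] (one possible reading of “a random quantity 2*X”), the query bar is distorted as well, features are assumed to be in the same price units as X, and all function names are hypothetical.

```python
import random


def distort(rows, amplitude, rng):
    """Add uniform noise in [-amplitude, amplitude] to every feature value."""
    return [[v + rng.uniform(-amplitude, amplitude) for v in row] for row in rows]


def fit_centroids(rows, labels):
    """Stand-in classifier: the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest to `row`."""
    def dist2(center):
        return sum((c - v) ** 2 for c, v in zip(center, row))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def noisy_consensus(rows, labels, query, avg_diff, n_models=200, seed=0):
    """Train n_models on independently distorted copies of the data.

    Returns the predicted label only if every model agrees, else None
    (meaning: no trade, the signal does not survive the noise).
    """
    rng = random.Random(seed)
    amplitude = 2 * avg_diff  # assumption: noise bounded by 2*X
    votes = set()
    for _ in range(n_models):
        model = fit_centroids(distort(rows, amplitude, rng), labels)
        noisy_query = distort([query], amplitude, rng)[0]
        votes.add(predict(model, noisy_query))
    return votes.pop() if len(votes) == 1 else None
```

Requiring all 200 models to agree is the strictest possible rule; a majority threshold would trade more often but would let weaker, more noise-sensitive signals through.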

As you can see the problem is quite complex and it will take me a few more blog posts to fully share with you some of my results in the above area. What would you do? Would you trade a machine learning model using data from only the past few years or would you attempt to build a model more robust to broker differences? Let me know in the comments below :o)

If you would like to learn more about machine learning models and how you too can create strategies that retrain daily using our FX trading framework please consider joining, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading in general. I hope you enjoyed this article! :o)
