Yesterday I spoke about some of my initial developments in reinforcement learning using Q-learning in Forex trading and how important it is to control curve-fitting in Q-learning to ensure you don’t get algorithms that fit the noise within the data. In today’s post I want to take a deeper look at this using the model I have developed for the daily chart, which uses binary information about bearish/bullish candles to define states and then performs a Q-learning exercise based on this information. With this example we’ll be able to see the effect of changing the number of degrees of freedom within reinforcement learning and how this affects our ability to curve-fit the system to the data. We will also see how this affects the results of pseudo out-of-sample tests and how Q-learning gives us a rather unique ability to control how well our system is able to learn from the data at hand.

In my initial Q-learning example we have a system that learns how to trade using information from past bars. We can control how much information we provide and therefore how finely the system can distinguish between different market states. Imagine that we trade a system that only sees the past two bars and whether they were both bearish, both bullish, bearish then bullish, or bullish then bearish. Our system would be able to see only 4 different market states and would have to learn from thousands of bars how to trade within those four circumstances. Of course such a system is unlikely to do well unless there is an extremely obvious relationship between the direction of the past two bars and the direction of the subsequent bar. Since this is unlikely to be the case (such an inefficiency would simply be too obvious), we expect our Q-learning to not be able to learn anything of substance.
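
As a rough illustration of this kind of state definition, the following Python sketch packs the direction of the last N candles into a single integer. The function name `encode_state`, the open/close list inputs and the choice to count a candle whose close equals its open as bearish are assumptions made for this example, not details of the actual system.

```python
from typing import Sequence


def encode_state(opens: Sequence[float], closes: Sequence[float],
                 end: int, n_bars: int) -> int:
    """Pack the directions of the n_bars candles ending at index `end`
    (inclusive) into an integer between 0 and 2**n_bars - 1.

    Each candle contributes one bit: 1 if bullish (close > open), else 0.
    Treating close == open as bearish is an assumption of this sketch.
    The oldest candle ends up in the most significant bit.
    """
    start = end - n_bars + 1
    if start < 0:
        raise ValueError("not enough bars before `end` to build a state")
    state = 0
    for i in range(start, end + 1):
        bullish = 1 if closes[i] > opens[i] else 0
        state = (state << 1) | bullish
    return state
```

With two bars this yields the four states mentioned above: 0 (both bearish), 1 (bearish then bullish), 2 (bullish then bearish) and 3 (both bullish).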

The number of possible states that our reinforcement learning algorithm can distinguish grows as 2^N, where N is the number of past bars we look at to define each state. In the above example we have 2 bars so 2²=4, if we use 3 then it’s 2³=8 and so on. Rather quickly we reach a point where we can define hundreds of states, for example 2⁹=512. When you add more and more degrees of freedom in this manner you give the algorithm the chance to distinguish two market states because of something that might not be relevant (noise), yet this difference is enough to obtain a historical profit, as the algorithm is able to derive long/short/notrade decisions from such a setup. This is where your curve-fitting starts to go wrong: you’re not learning the fundamentals of the game, you’re learning something irrelevant for the future. However, adding more degrees of freedom also means that you’re able to make finer distinctions in testing, which means higher historical profits.
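
To make the link between bars and degrees of freedom concrete, here is a minimal sketch of a Q-table with one row per state and one column per long/short/notrade decision, together with the textbook tabular Q-learning update. The learning rate `alpha`, the discount `gamma` and the function names are assumptions for illustration; the post does not specify the actual parameters or reward definition used.

```python
ACTIONS = ("long", "short", "notrade")


def make_q_table(n_bars: int) -> list:
    """One row per state (2**n_bars rows), one column per action."""
    return [[0.0] * len(ACTIONS) for _ in range(2 ** n_bars)]


def q_update(q: list, state: int, action: int, reward: float,
             next_state: int, alpha: float = 0.1, gamma: float = 0.9) -> None:
    """Standard tabular Q-learning update, applied in place.

    alpha (learning rate) and gamma (discount) are hypothetical defaults.
    """
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


# Number of learnable values grows as 3 * 2**N with the number of bars N.
for n in (2, 7, 9):
    table = make_q_table(n)
    print(n, "bars ->", len(table), "states,", len(table) * len(ACTIONS), "Q-values")
```

Every extra bar doubles the number of rows, so the same thousands of daily bars are spread over twice as many states, and each Q-value is estimated from fewer observations.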

The first image in this post shows how historical profits evolve as we use Q-learning algorithms with more bars and therefore more possible states. As you can see, under 7 bars (128 possible states) we don’t make a historical profit because we simply don’t have enough freedom to learn enough from the data. Above this point we see a sharp increase in the historical profit as we add more bars, since each extra bar significantly enhances the algorithm’s ability to fit the historical data. Since the growth in the number of states is exponential, so is the historical profit that can be achieved from them. If all you had was historical data it would be clear that we would want the best possible fit (the largest number of bars), but given that we want to test the Q-learning algorithm’s ability to really generalize we can carry out a pseudo out-of-sample test. We use the 1986-2010 data for training the algorithm (running several backtests on this data until learning converges) and then use the 2010-2016 period to run a single backtest and see how the algorithm fares.
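
The train-then-test procedure can be sketched roughly as follows. This is a simplified stand-in, not the actual system: bars are reduced to a list of per-bar returns, the reward is simply position times the next bar's return, actions are chosen greedily, and "convergence" is taken to mean that the profit of a training pass stops changing. All names (`run_pass`, `train_then_test`, `POSITIONS`) and parameter values are assumptions for illustration. In this sketch the table keeps updating during the single test pass, which is also a design assumption.

```python
POSITIONS = (1, -1, 0)  # long, short, no trade


def run_pass(returns, q, n_bars, alpha=0.1, gamma=0.9):
    """One chronological backtest that trades greedily and updates q in place.

    The state is the direction (1 = up, 0 = not up) of the last n_bars bars.
    Returns the total profit of the pass.
    """
    mask = (1 << n_bars) - 1
    state = 0
    for r in returns[:n_bars]:          # build the first state
        state = ((state << 1) | int(r > 0)) & mask
    profit = 0.0
    for r in returns[n_bars:]:
        # Greedy action: index of the largest Q-value (first one on ties).
        action = max(range(len(POSITIONS)), key=lambda i: q[state][i])
        reward = POSITIONS[action] * r
        next_state = ((state << 1) | int(r > 0)) & mask
        q[state][action] += alpha * (
            reward + gamma * max(q[next_state]) - q[state][action]
        )
        profit += reward
        state = next_state
    return profit


def train_then_test(train_returns, test_returns, n_bars,
                    max_passes=100, tol=1e-9):
    """Repeat backtests on the training data until the profit stops changing,
    then run a single backtest on the held-out data."""
    q = [[0.0] * len(POSITIONS) for _ in range(2 ** n_bars)]
    previous = None
    in_sample = 0.0
    for _ in range(max_passes):
        in_sample = run_pass(train_returns, q, n_bars)
        if previous is not None and abs(in_sample - previous) < tol:
            break
        previous = in_sample
    out_of_sample = run_pass(test_returns, q, n_bars)
    return in_sample, out_of_sample
```

Calling `train_then_test(train, test, n_bars)` for several values of `n_bars` and comparing the two returned profits, scaled by the length of each period, gives the kind of in-sample versus pseudo out-of-sample comparison described here.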

The second graph in the post reveals what happens when you do this for 7, 8 and 9 bars. It is evident that the results in the pseudo out-of-sample period are rather similar for all of them (see the underwater equity plots). This is because the Q-learning algorithm still learns while it trades through this period, but it cannot learn from the data several times, only from the information it gets as it trades. The 9-bar algorithm does perform better than the rest in some senses (for example, its volatility is lower) but it is the algorithm that disappoints the most relative to its results within the training period. If you had seen only the 1986-2010 data and tried the 7-bar algorithm you wouldn’t have been disappointed by the results in the 2010-2016 period, which appear normal, but when you go to higher complexities the historical results become much better than what we actually get in the pseudo out-of-sample. A classic case of over-fitting.

It is also important to point out that trying to eliminate this problem using only pseudo out-of-sample testing splits is somewhat of an exercise in futility, due to the danger of simply trying different state generation schemes until you find something that seems to work within your pseudo out-of-sample but wouldn’t work in a real out-of-sample: the classic multiple testing problem. There is a more elegant solution to this in Q-learning, which is related to the ability of an algorithm to learn from random data, something we will discuss in my next post. If you would like to learn more about machine learning and how you too can design your own constantly retraining strategies please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.