Our Reinforcement Learning mining and repository: Now live trading!

Machine learning has been a great passion of mine for the past several years. During last year and most of this year I have been committed to creating and improving an ML system repository based on classic supervised learning techniques, and during the past several months I have focused on bringing another machine learning vision, based on reinforcement learning (RL), to life. After a lot of hard work implementing OpenCL-based mining software, which can mine RL strategies using GPU technology, and implementing the entire F4 framework trading and cloud mining server-side infrastructure, today I am happy to announce the start of RL live trading using the first 91 systems that have been added to our repository as a result of our first low data-mining bias experiment. In this article I will talk a bit about these advances and some of the differences between RL and our other trading approaches.

Our reinforcement learning mining experiments proceed just like our price-action and machine-learning experiments have, with some small differences. The core of the process remains the same: we generate trading strategies using real data and then attempt the exact same search procedure using random data, in order to discard any process where generating a profitable system in random data is more than 1/100 as probable as generating one in real data. This simply means that we only care for systems that have a less than 1% chance of being created by the sheer strength of the data-mining process. In the RL case the creation process is more complex, however, since it involves training the reinforcement learning algorithm with 60% of the data (which involves 10 back-tests for each system), then testing within the remaining 40% and ensuring that results in the initial 60% remain coherent with those in the 40% used for testing (little deterioration in the pseudo out-of-sample, or p-OS). This exact same process is applied to real and random series. Note that we carry out this p-OS split in the case of RL because RL does not “lose information” due to having a p-OS period. This happens because it also trains through this period, although with no hindsight (it only trains once as it passes over each bar, with no ability to see into the future, just as it trains when live trading).
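To make the train/test mechanics concrete, here is a minimal sketch of the idea using a toy Q-table. It is not the Asirikuy implementation: the state definition (signs of the last two bars), the learning rate, the discount factor, the choice to update all three action values once a bar's return is known, and every function name are assumptions made only for illustration. The part that mirrors the description above is the schedule: 10 passes over the first 60% of the data, then a single pass over the remaining 40% in which the table keeps learning but each decision is taken before that bar's return is seen.

```python
ACTIONS = (-1, 0, 1)  # short, flat, long


def state_at(returns, i, lookback=2):
    """Hypothetical state: up/down signs of the `lookback` bars before bar i."""
    return tuple(int(r > 0) for r in returns[i - lookback:i])


def run_pass(returns, q, start, end, lookback=2, alpha=0.1, gamma=0.5):
    """Trade bars [start, end) once, updating the Q-table after each bar."""
    profit = 0.0
    for i in range(max(start, lookback), end):
        s = state_at(returns, i, lookback)
        values = q.setdefault(s, [0.0, 0.0, 0.0])
        # Decide using only what was learned before this bar (no hindsight).
        a = max(range(3), key=values.__getitem__)
        profit += ACTIONS[a] * returns[i]
        # Once the bar has closed its return is known, so the value of every
        # action can be updated (an assumption made for this sketch).
        s_next = state_at(returns, i + 1, lookback)
        future = gamma * max(q.get(s_next, (0.0, 0.0, 0.0)))
        for k, position in enumerate(ACTIONS):
            target = position * returns[i] + future
            values[k] += alpha * (target - values[k])
    return profit


def train_then_test(returns, train_frac=0.6, epochs=10):
    """Run `epochs` passes over the training part, then one pass over the rest."""
    split = round(len(returns) * train_frac)
    q = {}
    train_profit = 0.0
    for _ in range(epochs):
        train_profit = run_pass(returns, q, 0, split)  # keep the last pass
    test_profit = run_pass(returns, q, split, len(returns))
    return train_profit, test_profit


# Toy series with an obvious repeating pattern, used only to exercise the code.
toy_returns = [0.01, 0.01, -0.01, -0.01] * 100
train_profit, test_profit = train_then_test(toy_returns)
print(train_profit, test_profit)
```

The same `run_pass` routine serves training, p-OS testing and (conceptually) live trading, which is why the p-OS period costs the learner no information.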

To many, the rather complicated process above might seem unnecessary. If you have a pseudo out-of-sample that is already 40% of the data, then isn’t this enough of a “guarantee” that you are not falling into an excessive curve-fitting or data-mining bias trap? The answer is that the multiple testing process (the fact that you’re searching multiple times for a pseudo out-of-sample that works) makes it necessary to ensure that you’re not finding a pseudo out-of-sample that works purely by random chance. As a matter of fact, the RL mining has shown itself to be extremely good at finding systems (yes, systems where even the testing phase looks great) in cases where there is also a large propensity to find the exact same “great systems” in random series. This shows that the strength of the mining process is huge: the RL process is very good at fitting, and the chance that you also perform well in testing phases purely by random chance can be very significant. The second image in this post shows an experiment where RL finds a lot more systems across random series (orange) than it did in the real data (yellow).
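The acceptance rule can be sketched in a few lines. The following is an illustration under stated assumptions, not our mining code: random series are built here by resampling the real returns with replacement (one simple way to keep the same length and the same distribution of returns), the bias is estimated as the average number of systems found per random series divided by the number found in real data, and the function names and the number of random series are hypothetical.

```python
import random
from statistics import mean


def random_series(real_returns, rng):
    """Resample real returns with replacement: same length, same values,
    but with the original ordering destroyed."""
    return rng.choices(real_returns, k=len(real_returns))


def data_mining_bias(real_hits, random_hits):
    """Average systems found per random series, relative to real data."""
    if real_hits == 0:
        return float("inf")  # nothing found in real data: always reject
    return mean(random_hits) / real_hits


def process_is_acceptable(real_hits, random_hits, max_bias=0.01):
    """Keep a mining process only if its bias estimate is below 1/100."""
    return data_mining_bias(real_hits, random_hits) < max_bias


rng = random.Random(7)
real_returns = [0.004, -0.002, 0.001, -0.006, 0.003, 0.0, -0.001, 0.005]
surrogate = random_series(real_returns, rng)

# Illustrative counts only. The first call mimics a search that finds 91
# systems in real data and none in 100 random series; the second mimics a
# search that finds more systems in random data than in real data.
print(process_is_acceptable(real_hits=91, random_hits=[0] * 100))    # True
print(process_is_acceptable(real_hits=40, random_hits=[55, 62, 48]))  # False
```

The key point is that the same search, including the 60/40 split, is run on every random series, so a process that passes p-OS tests by luck shows up as a high count in `random_hits`.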

Up until now we have found only a single case where RL has been able to find great systems in real data while such systems have been very scarce (in fact non-existent) in random data series. This was a EURUSD experiment that was able to generate 91 uncorrelated strategies for this pair. The system shown in the first image belongs to this group, although for this back-test I used a testing period of only 2010-2016 (the system itself was generated using a 60/40 split as described above). As you can see, there is some deterioration of the Sharpe within the testing period, and the maximum drawdown happens within the testing phase, but overall at least 40% of the profit happens within the 40% testing period and overall system characteristics do remain similar. A very important point is that linearity does not deteriorate significantly, meaning that the system does not show significant signs of alpha decay within this period, which suggests that the system is indeed able to adjust as it does its online trading.
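As a rough illustration of the coherence check described above, the sketch below compares the share of total profit earned in the testing portion with the share of bars it covers, and reports a per-bar Sharpe for each portion. The function names, the unannualised Sharpe definition and the sample returns are assumptions made for the example, not the statistics our software actually reports.

```python
from statistics import mean, stdev


def sharpe(returns):
    """Per-bar Sharpe ratio (mean over standard deviation, not annualised)."""
    return mean(returns) / stdev(returns)


def pos_report(returns, train_frac=0.6):
    """Compare the training and testing portions of a back-test."""
    split = round(len(returns) * train_frac)
    train, test = returns[:split], returns[split:]
    return {
        "test_bar_share": len(test) / len(returns),
        "test_profit_share": sum(test) / sum(returns),
        "train_sharpe": sharpe(train),
        "test_sharpe": sharpe(test),
    }


# Made-up per-bar returns, only to show the shape of the output.
sample = [0.02, -0.01, 0.01, 0.0, 0.01, 0.01, -0.01, 0.02, 0.0, 0.01]
report = pos_report(sample)
print(report)
```

A testing profit share close to or above the testing bar share, together with a testing Sharpe that is not far below the training Sharpe, is the kind of "little deterioration" the paragraph above refers to.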

These systems are now being live traded in an Oanda live account using the Asirikuy Trader. Another advantage of RL systems is that they execute very fast, given that they use in-memory arrays that are accessed very efficiently and the operations carried out on each bar are extremely simple. The 91 strategies execute in a bit less than 0.4 seconds in the Asirikuy Trader, thanks also to some modifications I made during the past two weeks to greatly increase the data usage efficiency of the program (preventing unnecessary data requests and taking advantage of the fact that multiple systems might use the same symbol data). We will probably be able to execute hundreds of RL systems in the Asirikuy Trader before we run into problems. Since these RL systems use no SL or TP values, they also have the advantage of being more resistant to execution issues: they do not search for predetermined price-based exits but simply enter/exit trades at the start/end of daily bars (current systems trade on the daily timeframe).

Our reinforcement learning mining, trading system repository and live trading account are the start of a new journey in our understanding of machine learning, curve-fitting and data-mining bias. In a few months we’ll know how well reinforcement learning systems can respond to changing market conditions, how well they learn when live trading and how easy or hard it is to find RL system generation processes with low data-mining bias. If you would like to learn more about RL and how you too can live trade systems of this type, please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.

11 Responses to “Our Reinforcement Learning mining and repository: Now live trading!”

  1. Chris Rebisz says:

    Hi Daniel,

    Congratulations on this breakthrough! Do you foresee these systems being able to trade on Quri Quant with Oanda in the near future? Thank you.

    Chris

    • admin says:

      Hi Chris,

      Thanks for writing. Sure, in the future we will implement these systems in QuriQuant if the systems do perform well. Let me know if you have other questions,

      Best Regards,

      Daniel

  2. mac says:

    Hi Daniel,

    I am trying to replicate some of your findings. Could you tell me the size of the randomized data-set? Obviously you can create an infinite number of such data-sets, but when you do the comparison do you use the same length as for the training set?

    I’m finding your blog fascinating.

    Thank you!

    mac

    • admin says:

      Hi Mac,

      Thanks for writing :o). I am happy to hear you’re enjoying my blog! The random data sets have the exact same size and statistical distribution of returns as the real data set. The exact same training/testing split is applied to the real and random data sets. The simulations on random data show that you can indeed find systems that pass a pseudo out-of-sample test with a 60/40 train/test split. This demonstrates that using pseudo OS splits does not account for data-mining bias and that tests using random data are always needed to account for this bias source. Feel free to post if you have other questions,

      Best Regards,

      Daniel

      • mac says:

        Hi Daniel,

        Thanks for the answer. It all seems clearer now.

        I am currently performing some simulations using a similar method.
        This is my understanding of your process:

        You train your algorithm 10 times on 60% of the data set (10 times to get it to converge) and then confirm that the performance does not deteriorate much on the remaining 40% (only one pass, and the algorithm learns as well but without hindsight). Your split is somewhat arbitrary (is it that you use 1988 – 2005 data for training and 2006 – 2016 for testing?).
        Once you are happy with the result you perform the same test using random data (my question: do you perform this test 100 times and take some average of the performance for comparison?).
        Only if the random test comes out with a much worse result do you agree that you have found a good set of features (as defined in the csv file) and save it for live testing.
        Then you repeat the whole process with a new set of input features.
        (Question: do you go through the features combination by combination or generate them in some random way?).

        Now what happens with the live test? Do you retrain your picked systems again, this time using the whole of the historical data set (1988 – 2016)?
        And once the system is live it learns online (I understand) but it only learns once per bar (not 10 times as during training). Do you take each system down, say, every week to retrain it using all historical data (including the last week’s)? Would the system deteriorate while live trading for a long time due to too weak a learning rate?

        Sorry to wreck your head. I am repeating most of your findings on H1 bars using 5 years of data (all I have) and it looks promising.

        Thank you!

        mac

        • admin says:

          Hi Mac,

          Thanks for writing. I reserve more detailed discussion of the methodologies for our trading community forum. You can join here (https://asirikuy.com/newsite/?page_id=3742) if you wish. You’ll also get access to our historical data, programming framework and all the RL system code we use. Thanks again for writing and reading my blog,

          Best Regards,

          Daniel

        • keras says:

          Hi Daniel,

          Can you clarify how the RL is run? Assuming you are using a DQN or something similar, do you run 10 epochs on the entire training set, with the terminal state being the end of the data? How do you guarantee that the value and policy converge by epoch 10?

          • admin says:

            Hi Keras,

            Thanks for writing. We don’t guarantee convergence after 10 epochs; this is just a value we selected given that we have always seen convergence at 5-10 epochs for the Q-tables we have tested. However, it may be the case that this is not the best fit for all cases. Thanks again for writing,

            Best Regards,

            Daniel

  3. leo says:

    Hi Daniel,

    How is the live trading progressing?

    • admin says:

      Hi Leo,

      Thanks for writing. We faced a few problems in the beginning due to the volatility of the RL instances plus some issues dealing with my ISP and NTP servers so we’ve had to restart our live testing. Ask again in a few months!

      Best Regards,

      Daniel
