It is only recently that I have become seriously concerned with the performance optimization of our F4 programming framework for the backtesting of trading strategies. This has become a necessity: we are presently doing very intensive mining of machine learning strategies, and even small gains in performance translate into huge savings in time as the number of contributors to our cloud mining experiments continues to increase. Another reason why I have now become quite involved in this process is my recent acquisition of a Xeon Phi 60-core card (a post about this soon!), which I plan to use to speed up the backtesting of our trading strategies via OpenMP/MPI. In today's post I am going to talk about some of my findings with different compiler setups and backtesting speeds in the F4 framework, and comment on some of the surprises and conclusions from these experiments.
The first thing I had to do to test the speed of the F4 framework was to devise a benchmark representative of the backtests I am most interested in running quickly. I came up with a 28-year backtest of a 1H-timeframe, RSI neural-network-based trading system that is quite data intensive and a good representation of the tests I wish to speed up. I decided to optimize this on my i7 4770K Linux 64-bit system (running Linux Mint), since this is the setup generally used for machine learning system mining in the community and the setup I will be using with the Xeon Phi card. We generally generate makefiles for the F4 framework using premake4 and compile using the Codelite IDE and the gcc compiler, so it seemed quite straightforward to try different compilers and compiler optimization options and see what would give me the fastest benchmark. I decided to test gcc and icc as compilers and to try a variety of recommended optimization options for both.
The gcc compiler comes with six basic optimization levels: -O0, -Os, -O1, -O2, -O3 and -Ofast. The -O0 level means no optimization and -Os optimizes for binary size, while the numbered levels apply progressively more aggressive optimizations aimed at speeding up your code on your particular build platform. The last option, -Ofast, attempts to achieve even larger speed-ups by risking inaccuracies in floating-point operations. For this reason I consider -Ofast unsafe and did not use it within my benchmarking process (my tests with -Ofast did reveal important losses in accuracy through the testing process). There are some other available options, in particular -march=native, which enables all the optimizations supported by your host processor, and -flto and -funroll-loops, which enable link-time optimization and loop unrolling respectively. The icc compiler has similar optimization options, although gcc's -march=native is replaced by -xHost and link-time optimization is enabled with the -ipo flag instead of -flto. There is also the -no-prec-div flag in icc, which increases speed at the cost of possible precision losses in floating-point division, but since in this case there was no degradation in the accuracy of my results I did consider this flag for the benchmarking process.
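To make the comparison concrete, here is a minimal sketch of how the safe gcc levels can be tried against each other. The C program below is a toy floating-point hot loop I wrote purely for illustration; it is not part of the F4 framework:

```shell
# Compare gcc optimization levels on a toy floating-point benchmark.
# bench.c is a stand-in hot loop, NOT the F4 framework itself.
cat > bench.c <<'EOF'
#include <stdio.h>
int main(void) {
    double acc = 0.0;
    long i;
    for (i = 1; i < 50000000; i++)
        acc += 1.0 / (double)i;  /* harmonic-series hot loop */
    printf("%.6f\n", acc);
    return 0;
}
EOF
# Safe levels only; -Ofast is deliberately excluded since it relaxes
# floating-point semantics. The last entry mirrors the combined flags
# discussed above.
for flags in "-O0" "-Os" "-O2" "-O2 -march=native -flto -funroll-loops"; do
    gcc $flags bench.c -o bench
    printf '%-40s ' "$flags"
    ./bench   # printed sum should match across these safe levels
done
```

Since none of these levels enable -ffast-math-style reordering, the printed result is the same for all of them; only the wall-clock time changes.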
For the discussion of these results I will take gcc -Os as the baseline, since this was the optimization flag we had been using until now to build the F4 framework on Linux 64-bit. As you can see in the graph below (best times from 10 runs for each setup), gcc -Os was close to the worst we could do; choosing this optimization level was quite naive, and although it did generate smaller binaries, this came at the price of significantly longer execution times. The gcc -Os build takes 192 seconds to run the backtest, while -O2 already drops the time to 97 seconds, a reduction of almost exactly 50%. Other options such as -O3 and -Ofast achieve rather similar performance, with differences that are not statistically significant.
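The best-of-10 methodology can be sketched as a small timing harness like the one below. The command being timed is a placeholder (here just a sleep); substitute your own back-test invocation. It assumes GNU date with nanosecond support, as found on Linux Mint:

```shell
# Best-of-10 timing harness. CMD is a placeholder; replace it with the
# actual back-test binary and arguments. Assumes GNU date (%N).
CMD="${CMD:-sleep 0.2}"
best=
for i in 1 2 3 4 5 6 7 8 9 10; do
    start=$(date +%s%N)                  # nanoseconds since epoch
    $CMD
    end=$(date +%s%N)
    t=$(( (end - start) / 1000000 ))     # elapsed milliseconds
    if [ -z "$best" ] || [ "$t" -lt "$best" ]; then
        best=$t                          # keep the fastest run
    fi
done
echo "best of 10 runs: ${best} ms"
```

Taking the minimum rather than the mean is a deliberate choice: the fastest run is the least contaminated by background load on the machine.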
Surprisingly, the Intel compiler (icc) from Parallel Studio XE 2015 gave rather worse results, with icc -O2 being even slower than the gcc -Os option. Processor-based optimizations did not seem to make much difference on their own, although the -xHost, -no-prec-div and -ipo options did manage to improve performance to around 130 seconds, still much slower than the simple gcc -O2 setup, which achieves 97 seconds without any risk of losing accuracy in floating-point operations. Overall this disappointed me, since I had expected the icc compiler to produce faster binaries than gcc (people tend to have this experience online; although there are a few cases where gcc beats icc, they seem to be rather uncommon). Still, the icc options in setups 8 and 9 do improve execution speeds by 30% over gcc -Os, meaning that using them would still be a better option than what we have been using until now. Since using the Xeon Phi cards will require the icc compiler, I am happy that at least I could achieve better results than the gcc -Os setup.
As you can see from the above results, using compiler optimizations can indeed increase your backtesting performance significantly. By using the gcc -O2 flag I have been able to reduce execution times by 50%, which will greatly increase our ability to create machine learning strategies, since it already doubles our throughput at the same computational cost. I will keep reading about this subject and perform additional benchmarks in the hope of finding even better setups for the F4 framework. If you have any ideas about why we might have had worse results with icc than with gcc, or what other compiler options might benefit our framework, please feel free to leave a comment.
If you would like to learn more about our cloud mining efforts to build machine learning trading systems, please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.