Small computers: Getting pyOpenCL to run on the ODROID-XU4

During the past month I have been very busy with the development of our pKantuML software, which performs machine learning mining using a CPU/GPU combo. The idea is to let us find and validate machine learning strategies for Forex trading very quickly, so that we can build a fast-evolving repository of such strategies for our trading. Since several Asirikuy members, myself included, purchased ODROID-XU4 boards to test cloud mining, it was an obvious step to attempt to make pKantuML work on this cheap but very powerful credit-card-sized computer. In today’s post I want to talk about the issues I had setting up OpenCL on this board, in the hope that others attempting something similar will find my insights useful and be spared some of the sweat and tears I had to spend.

The pKantuML software uses OpenCL to perform calculations on a CPU or GPU, and to do this it makes use of the pyOpenCL library. Getting this library to work requires a proper OpenCL SDK, which is not very easy to set up on an ARM device. For this setup I decided to use the Ubuntu 14.04 LTS image for the ODROID-XU4, which you can find here. After setting up the image it was a matter of installing the needed libraries, loading an appropriate SDK and then installing the pyOpenCL library. To install all the basic libraries I advise you to use the commands shown above. The commands first install all the libraries needed to build the POCL SDK, which is then downloaded, uncompressed, configured, compiled and installed by the last few commands. After all this is done, the pyOpenCL library is compiled from source using pip.

There are some peculiarities in the above setup. For example, you may notice that I did not use the latest POCL implementation (0.13). This is mainly because the LLVM version we can easily install for Ubuntu 14.04 LTS on ARM is 3.4, which is only supported by version 0.9 of the POCL SDK. I also tried using the MALI SDK, which in theory would allow us to use the GPU within the ODROID-XU4 as well as the CPU, but I failed miserably, as the MALI SDK does not seem to support the use of an ICD, which is required by the latest version of the pyOpenCL library. I could run the examples that come with the MALI SDK, but I was unable to get it properly set up with the pyOpenCL library. In the end I got stuck on a “platform not found” error which, as far as I could see, is related to the lack of a proper ICD in the system. The MALI SDK is in the end difficult to set up and does not even come with proper installation procedures, unlike the seemingly much better POCL SDK.
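When chasing a “platform not found” error like the one described above, it can help to look at what the ICD loader actually sees. The following is only a minimal diagnostic sketch, not part of pKantuML or of either SDK. It assumes the standard Linux ICD loader, which reads `.icd` files from `/etc/OpenCL/vendors`; the helper names `list_icd_files` and `list_opencl_devices` are made up for this illustration.

```python
import glob
import os

try:
    import pyopencl as cl
except ImportError:  # pyOpenCL is not installed (or failed to load)
    cl = None

# Assumption: default search directory of the standard Linux ICD loader.
ICD_DIR = "/etc/OpenCL/vendors"


def list_icd_files(icd_dir=ICD_DIR):
    """Return the .icd files found in icd_dir, sorted by name."""
    return sorted(glob.glob(os.path.join(icd_dir, "*.icd")))


def list_opencl_devices():
    """Return (platform, device, compute units) tuples, or [] if none are usable."""
    if cl is None:
        return []
    try:
        platforms = cl.get_platforms()
    except cl.Error:
        # Typically raised when no platform is registered with the loader.
        return []
    found = []
    for platform in platforms:
        try:
            devices = platform.get_devices()
        except cl.Error:
            continue
        for device in devices:
            found.append((platform.name, device.name, device.max_compute_units))
    return found


if __name__ == "__main__":
    print("ICD files:", list_icd_files() or "none found")
    for entry in list_opencl_devices():
        print("%s | %s | %d compute units" % entry)
```

If the ICD directory is empty or no platform is listed, pyOpenCL has nothing to bind to, which would be consistent with the error above.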


Using POCL as described above I was able to get pKantuML to run on the CPU (which is recognized as a 16 core OpenCL device), making use of all available cores within the OpenCL code. The above image shows the part of the generation process that calculates the machine learning predictions (the part that runs in C/C++ and not in OpenCL), and the image below shows the rest of the process, which runs on the CPU using OpenCL. Although the timing per system on the screenshot is not accurate, as it includes the time of a very slow file generation step, we can calculate it from the wall clock time taken to run the simulation. Since it took about 21 minutes to run 241,920 system simulations, the average throughput was 192 systems per second, or about 5.2 milliseconds per system, which is around 10 times slower than what we can get on an i7-4770K. However, this is still orders of magnitude faster than what the simulations would take on the i7 if they were executed using our regular simulator and C/C++ instead of OpenCL code.
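The arithmetic behind those figures can be checked with a few lines of Python. This is just a worked calculation using the two numbers quoted above (241,920 simulations, about 21 minutes); the function name `throughput` is chosen for the illustration.

```python
def throughput(simulations, wall_clock_minutes):
    """Return (systems per second, milliseconds per system)."""
    seconds = wall_clock_minutes * 60.0
    per_second = simulations / seconds
    ms_per_system = 1000.0 / per_second
    return per_second, ms_per_system


per_second, ms_per_system = throughput(241920, 21)
print("%.0f systems/s, %.2f ms per system" % (per_second, ms_per_system))
```

Since the 21 minutes is itself an approximate wall clock reading, the result should be read as a rough estimate rather than a benchmark.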

The most demanding part of the above simulation is therefore not the OpenCL part but the generation of the prediction binary files. These are produced by our traditional simulator, because they need to execute the complex ML algorithms, and each file contains around 158K values for a 30 year 1H test. The ODROID-XU4 takes around 10 minutes to generate 8 files, which is also around 10 times slower than an i7 processor, the regular speed penalty we have come to expect from this little machine. The overall process is, however, still 100-1000x faster than performing both predictions and simulations using traditional simulation techniques.


Of course, if you don’t have access to our software you can still test your pyOpenCL setup using some simple demo code. If you’re interested in doing this I advise you to try the script mentioned here, which performs a simple addition task in parallel using pyOpenCL and the ODROID processor. After launching the script the code should ask you to choose an OpenCL device, and the 16 core OpenCL ARM processor device should show up in the list. The ODROID-XU4 is definitely a powerful little computer that will continue to be useful for our experiments. I cannot wait for the advancement of ARM computers so that we can perform even more powerful simulations using these little machines. If you would like to learn more about mixed CPU/GPU simulations and how you can use pKantuML to mine for machine learning strategies please consider joining Asirikuy, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading.
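For readers who cannot reach the linked script, a parallel addition test in pyOpenCL might look roughly like the sketch below. It is not the script referred to above, only an illustration built on the usual pyOpenCL calls. One deliberate difference: it picks the first available device instead of prompting (calling `cl.create_some_context()` would ask you to choose one). The function name `opencl_add` is made up for this example, and it returns `None` when no OpenCL runtime is usable.

```python
try:
    import numpy as np
    import pyopencl as cl
except ImportError:  # numpy or pyOpenCL missing
    np = cl = None

KERNEL_SOURCE = """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
"""


def opencl_add(a, b):
    """Add two equal-length float sequences element-wise on an OpenCL device.

    Returns a list of floats, or None when no OpenCL device can be found.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if cl is None:
        return None
    try:
        devices = [d for p in cl.get_platforms() for d in p.get_devices()]
    except cl.Error:
        return None
    if not devices:
        return None

    # Assumption for this sketch: use the first device found, no prompt.
    ctx = cl.Context([devices[0]])
    queue = cl.CommandQueue(ctx)

    a_np = np.asarray(a, dtype=np.float32)
    b_np = np.asarray(b, dtype=np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

    # One work-item per element; the runtime chooses the work-group size.
    program = cl.Program(ctx, KERNEL_SOURCE).build()
    program.add(queue, a_np.shape, None, a_buf, b_buf, out_buf)

    out_np = np.empty_like(a_np)
    cl.enqueue_copy(queue, out_np, out_buf)
    queue.finish()
    return out_np.tolist()


if __name__ == "__main__":
    print(opencl_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Comparing the result against a plain NumPy sum of the same arrays is a quick way to confirm that the device is computing correctly.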
