Using the GPU and CPU for OpenCL programs in an ODROID-XU4

In the past I have talked about how to set up POCL in order to use the CPU in OpenCL applications on the ODROID-XU4. However, I recently realized that this little device holds more computational power than the CPU alone, and that we can exploit it for an important gain. Today I want to talk about how you can install an ICD manager and use both the CPU and the GPU for your computations on an ODROID-XU4. This greatly increases the computational power available on the device, allowing you to reach a speed that rivals 2 or 3 Intel i7-4770 CPU cores. I will cover how to perform the installation using the latest Ubuntu 16.04 LTS image for the ODROID-XU4 and how to run some tests to make sure everything has been set up correctly.

With the latest Ubuntu 16.04 LTS image the ODROID-XU4 can now use the LLVM 3.8 package, which is compatible with the latest POCL version. That version brings several improvements over the 0.9 release we had to use with Ubuntu 14.04 LTS. The first thing to do on your ODROID-XU4 is therefore to install the latest POCL: install the essential build packages, LLVM and a few additional libraries, then download POCL, configure it with the cmake command and finish with the make and make install commands. In this case I used the -DKERNELLIB_HOST_CPU_VARIANTS option to request the “generic” CPU build, as I had problems when trying to build POCL for the specific ARM CPU in the ODROID-XU4. The above commands also install the OCL-ICD manager, which lets you keep several OpenCL implementations on the same computer and choose any device that any of them exposes.
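
As a rough illustration of the configure, build and install sequence described here, the sketch below assembles the three commands in Python and only runs them when asked to. The option name and the “generic” value come from the text; the out-of-tree build directory, the use of sudo for the install step, the job count and the helper names are my own assumptions, and the package installation step is left out because the exact package list is not given here.

```python
import os
import subprocess


def pocl_build_commands(source_dir, cpu_variant="generic", jobs=4):
    """Return the configure, build and install commands as argument lists.

    source_dir is a hypothetical path to the unpacked POCL sources.
    """
    return [
        # Configure step: option name and "generic" value are taken from the post.
        ["cmake", "-DKERNELLIB_HOST_CPU_VARIANTS={}".format(cpu_variant), source_dir],
        # Build step: the parallel job count is an assumption, tune it to taste.
        ["make", "-j{}".format(jobs)],
        # Install step: sudo is assumed because the default prefix is system-wide.
        ["sudo", "make", "install"],
    ]


def run_build(source_dir, build_dir, dry_run=True):
    """Print each command and, unless dry_run is set, run it inside build_dir."""
    if not dry_run:
        os.makedirs(build_dir, exist_ok=True)
    for command in pocl_build_commands(source_dir):
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command, cwd=build_dir)
```

Calling run_build with dry_run left at True only prints the commands, which is a cheap way to check them before letting the script touch the system.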

The Mali-T628 that comes with the ODROID-XU4 is a 6-core GPU that gives you a lot of additional punch for your OpenCL computations. In the past it was difficult to use this GPU because we had to build the Mali SDK from scratch, but the latest ODROID-XU4 images now ship with a Mali SDK that we can use from the get-go. However, no OpenCL ICD is installed for it, which means that after installing POCL the device stays inaccessible until we create the appropriate entry in the /etc/OpenCL/vendors folder. The “sudo nano” command above opens an editor on a “mali.icd” file, where you simply write the line “/usr/lib/arm-linux-gnueabihf/” (without the quotes). This creates an ICD entry that tells the system where the OpenCL implementation for the GPU is located.
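
For anyone who prefers scripting this step, here is a minimal sketch of the same idea in Python: an ICD entry is just a small text file in the vendors folder holding one line. The function names are made up for illustration, and the library path is deliberately left as a parameter, so pass in whatever line you would otherwise type into the editor. Writing into /etc/OpenCL/vendors needs root privileges.

```python
import os


def write_icd_entry(vendors_dir, icd_name, library_path):
    """Create a one-line ICD file inside vendors_dir and return its path."""
    icd_file = os.path.join(vendors_dir, icd_name)
    with open(icd_file, "w") as handle:
        # The ICD loader reads this single line to locate the vendor implementation.
        handle.write(library_path + "\n")
    return icd_file


def read_icd_entries(vendors_dir):
    """Map every *.icd file name in vendors_dir to the line it contains."""
    entries = {}
    for name in sorted(os.listdir(vendors_dir)):
        if name.endswith(".icd"):
            with open(os.path.join(vendors_dir, name)) as handle:
                entries[name] = handle.read().strip()
    return entries
```

Running read_icd_entries("/etc/OpenCL/vendors") afterwards is a quick way to confirm that both the POCL entry and the new Mali entry are in place.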



After this we can use pyopencl to print the available OpenCL devices with the simple Python script shown above. The script also shows the number of cores and the identifiers of the different devices within the ODROID-XU4. The image above shows its output. As you can see, there are now 2 different platforms, one for each of the two vendor ICDs we have available: in the first we see two Mali-T628 devices and in the second we see the 8-core ARM processor of the ODROID-XU4. Devices from the same platform can be used together, so in this case we can run OpenCL programs either on the 8-core ARM processor or on the two Mali GPU devices. Sadly we cannot combine the three devices in a single program since they use different OpenCL library implementations. However, you can simply run two processes, one working with the 8-core main processor and another using the GPU.
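
A minimal sketch of such a listing script follows, in case it helps. It is not necessarily the exact script used for this post: the function names are mine, while get_platforms, get_devices, name, max_compute_units and Context are regular pyopencl calls and attributes. pyopencl is imported inside the functions so that the formatting helper can be tried on any machine, and str.format is used because Ubuntu 16.04 ships a Python 3 version without f-strings.

```python
def describe_platforms(platforms):
    """Return one text line per platform and per device, with their indexes."""
    lines = []
    for p_index, platform in enumerate(platforms):
        lines.append("Platform {}: {}".format(p_index, platform.name))
        for d_index, device in enumerate(platform.get_devices()):
            lines.append("  Device {}: {} ({} compute units)".format(
                d_index, device.name, device.max_compute_units))
    return lines


def print_opencl_devices():
    """Print every platform and device that the ICD loader can see."""
    import pyopencl as cl  # imported lazily; needs pyopencl installed
    print("\n".join(describe_platforms(cl.get_platforms())))


def context_per_platform():
    """Build one context per platform, since devices from different
    platforms cannot share a context."""
    import pyopencl as cl
    return [cl.Context(devices=p.get_devices()) for p in cl.get_platforms()]
```

Calling print_opencl_devices() on the board should list the two platforms described here. context_per_platform() reflects the restriction mentioned in the text: each context holds devices from a single platform, so a CPU job and a GPU job each get their own.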

The last image within this post shows our pKantuML software using the two GPU devices to carry out trading system simulations. Performance on the GPU is around 6x slower than on an i7-4770 and around 100x slower than on a Radeon R9-290X GPU card. However, since we can run an additional process on the CPU, which is also around 6x slower than the i7-4770, in the end the ODROID-XU4 can deliver performance that rivals around 2-3 i7 cores. Of course, in the specific case of our machine learning system mining, the binary files containing the machine learning predictions are always generated on CPU cores, so this advantage only applies to the OpenCL part of the generation process. In reality we will therefore probably get only around 1.5x the performance we would have without the GPU, rather than the 2-3x we would get if we could use the GPU through the entire process.
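
The arithmetic behind these estimates can be sketched in a few lines of Python. The 6x figures are the ones quoted here. Everything else is an assumption for illustration: that the CPU and GPU processes run side by side without slowing each other down, so that adding the GPU roughly doubles the OpenCL part, and that the OpenCL part takes about two thirds of the total run time, a hypothetical fraction picked only because it reproduces the 1.5x figure.

```python
def combined_throughput(slowdowns):
    """Throughput relative to the reference machine when devices that are
    each `slowdown` times slower work side by side."""
    return sum(1.0 / s for s in slowdowns)


def overall_speedup(accelerated_fraction, part_speedup):
    """Amdahl-style estimate: only a fraction of the run time gets faster."""
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / part_speedup
    return 1.0 / remaining


# GPU and CPU are each around 6x slower than the i7-4770 (figures from the post).
relative = combined_throughput([6.0, 6.0])   # about one third of the i7-4770

# Hypothetical split: two thirds of the run time is OpenCL work, sped up 2x.
estimate = overall_speedup(2.0 / 3.0, 2.0)   # about 1.5
print(relative, estimate)
```

With a larger OpenCL share the overall gain moves towards the full 2x, which is why the benefit depends so much on how much of the pipeline actually runs through OpenCL.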


The above is a good reminder that these small ARM-based computers not only carry powerful processors (for their power consumption) but also powerful GPU devices that can be used in OpenCL applications. By leveraging both the CPU and the GPU for calculations we can definitely increase the efficiency per watt achieved by these devices. If you would like to learn more about our trading system mining and how you too can use the power of GPUs for automated trading system generation please consider joining, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading strategies.
