Loading market data from a CSV into pandas, the fast way

When you want to analyse market data or build a back-testing program in Python, one of the first problems you face is getting your data into memory. This is usually easy when the amount of data is small – simply load it using the default functions from Python libraries – but it gets tricky as the data grows into the hundreds of megabytes. Attempting to load 10 years of one-minute OHLC data using pandas' default file loading functions can take around 10 minutes – depending on the specific file characteristics – which is unacceptable if you want to iterate on your analysis quickly. In today's post I am going to share some parsing tricks with you so that you can get this kind of data loaded into a pandas DataFrame very quickly.

#!/usr/bin/python
# Fast MT4 CSV data parsing
# By Daniel Fernandez Ph.D.
# http://mechanicalforex.com
# https://asirikuy.com

from datetime import datetime

import numpy as np
import pandas as pd

def parse_datetime(dt_array, cache=None):
    # Parse MT4-style two-column data ('year.month.day', 'hour:minute'),
    # caching each unique date/time string so it is only converted once.
    if cache is None:
        cache = {}
    date_time = np.empty(dt_array.shape[0], dtype=object)
    for i, (d_str, t_str) in enumerate(dt_array):
        try:
            year, month, day = cache[d_str]
        except KeyError:
            year, month, day = [int(item) for item in d_str[:10].split('.')]
            cache[d_str] = year, month, day
        try:
            hour, minute = cache[t_str]
        except KeyError:
            hour, minute = [int(item) for item in t_str.split(':')]
            cache[t_str] = hour, minute
        date_time[i] = datetime(year, month, day, hour, minute)
    return pd.to_datetime(date_time)

def main():

    historyFilePath = "filename.csv"

    df = pd.read_csv(historyFilePath, header=None)
    df.columns = ['date', 'time', 'open', 'high', 'low', 'close', 'vol']
    df = df.set_index(parse_datetime(df.loc[:, ['date', 'time']].values, cache={}))
    # the raw date/time columns are no longer needed once the index is built
    df = df.drop(columns=['date', 'time'])
    
##################################
###            MAIN            ###
##################################

if __name__ == "__main__": main()

What makes data loading so slow? When you load a large CSV file containing plain numeric values, pandas loads it quite quickly – the read_csv function can get through millions of lines in a few minutes at most. So what is wrong with market data? Why does it take so long? The reason is that market data usually presents datetime information in a format that pandas must parse row by row. This parsing can be very inefficient – especially when special conversions are involved – so it makes data loading painfully slow and the processing of large amounts of OHLC or tick data more difficult. So how can we solve this problem? How can we make parsing faster?
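
The caching trick at the heart of the parsers in this post can be illustrated in isolation. The sketch below uses synthetic data (not a real MT4 file): it builds one full day of minute timestamps, and because every row shares the same date string, the expensive split-and-convert step runs only once for the date rather than 1440 times:

```python
from datetime import datetime

# synthetic sample: one trading day of minute bars, all on the same date
rows = [("2005.01.03", f"{h:02d}:{m:02d}") for h in range(24) for m in range(60)]

cache = {}
parsed = []
for d_str, t_str in rows:
    if d_str not in cache:  # only one split/int conversion per unique date string
        cache[d_str] = tuple(int(x) for x in d_str.split('.'))
    year, month, day = cache[d_str]
    hour, minute = (int(x) for x in t_str.split(':'))
    parsed.append(datetime(year, month, day, hour, minute))

print(len(parsed), len(cache))  # 1440 rows parsed, but only 1 date entry cached
```

The same idea applies to the time strings: on intraday data each "hour:minute" value also repeats once per day, so caching both fields pays off quickly.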

To solve this problem, load your data into pandas as if it were plain data – just use the read_csv function without specifying any index or date columns – and then parse the datetime values yourself in a much more efficient way. Since you know the exact format of your datetime data, you can write a function that parses that exact format and then calls pandas' to_datetime function on the resulting array of datetime objects to build a proper index (which is a computationally cheap step).

def parse_datetime(dt_array, cache=None):
    # Single-column variant: each row holds one 'dd/mm/yyyy hh:mm:ss' string.
    if cache is None:
        cache = {}
    date_time = np.empty(dt_array.shape[0], dtype=object)
    for i, (c_str,) in enumerate(dt_array):
        d_str = c_str[:10]
        t_str = c_str[11:]
        try:
            day, month, year = cache[d_str]
        except KeyError:
            day, month, year = [int(item) for item in d_str.split('/')]
            cache[d_str] = day, month, year
        try:
            hour, minute, second = cache[t_str]
        except KeyError:
            hour, minute, second = [int(item) for item in t_str.split(':')]
            cache[t_str] = hour, minute, second
        date_time[i] = datetime(year, month, day, hour, minute, second)
    return pd.to_datetime(date_time)

The first listing in this article shows how this can be done for the MT4 CSV export format, which expresses date and time as 'year.month.day' and 'hour:minute' values split across two separate columns – an especially hard case to process, since datetime data usually lives in a single column. To perform this processing we first load all the data in plain format into pandas and give the columns names. We then feed a custom parser function with the date and time values as a numpy array, obtained through the DataFrame's values attribute, and pass its output to the set_index function to create the index that will be assigned to our data. After this is done we drop the date and time columns – which are no longer needed – and we are left with clean data whose index was parsed almost 10x faster than with pandas' default parsing function (which in this case would not work anyway, since date and time are in two separate columns).
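
As a quick sanity check – on a hypothetical two-row sample, not real market data – the index produced by the custom two-column parser can be compared against what pandas' own parser yields when given an explicit format string:

```python
from datetime import datetime

import numpy as np
import pandas as pd

def parse_datetime(dt_array, cache=None):
    # two-column MT4 variant from the listing above ('year.month.day', 'hour:minute')
    if cache is None:
        cache = {}
    date_time = np.empty(dt_array.shape[0], dtype=object)
    for i, (d_str, t_str) in enumerate(dt_array):
        try:
            year, month, day = cache[d_str]
        except KeyError:
            year, month, day = [int(item) for item in d_str[:10].split('.')]
            cache[d_str] = year, month, day
        try:
            hour, minute = cache[t_str]
        except KeyError:
            hour, minute = [int(item) for item in t_str.split(':')]
            cache[t_str] = hour, minute
        date_time[i] = datetime(year, month, day, hour, minute)
    return pd.to_datetime(date_time)

# hypothetical sample rows in the MT4 two-column layout
sample = np.array([["2005.01.03", "00:00"], ["2005.01.03", "00:01"]], dtype=object)
fast_index = parse_datetime(sample, cache={})
# reference result from pandas' own parser with an explicit format string
slow_index = pd.to_datetime(["2005.01.03 00:00", "2005.01.03 00:01"],
                            format="%Y.%m.%d %H:%M")
print((fast_index == slow_index).all())  # True
```

Both paths produce the same DatetimeIndex; the custom parser simply gets there much faster on large files with many repeated date and time strings.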

We can use the same philosophy to process any datetime format. If the data sits within a single column we simply change our parser to handle everything within a single value, as the second function modification shown above does. This function parses data in day/month/year hour:minute:second format (like 01/01/2005 10:00:00). In this case the dataframe holds a single datetime column, so dt_array contains one string per row. We split this value manually into date and time strings, which we then parse in the same way as before (only that here we reverse the field order, add the seconds and use "/" instead of "." as separator in the date part).
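
Wiring the single-column parser into the loading pipeline looks like the sketch below. The column layout and the two sample rows are hypothetical stand-ins for a real history file, and the file is replaced by an in-memory io.StringIO so the example is self-contained:

```python
import io
from datetime import datetime

import numpy as np
import pandas as pd

def parse_datetime(dt_array, cache=None):
    # single-column variant: each row holds one 'dd/mm/yyyy hh:mm:ss' string
    if cache is None:
        cache = {}
    date_time = np.empty(dt_array.shape[0], dtype=object)
    for i, (c_str,) in enumerate(dt_array):
        d_str, t_str = c_str[:10], c_str[11:]
        try:
            day, month, year = cache[d_str]
        except KeyError:
            day, month, year = [int(item) for item in d_str.split('/')]
            cache[d_str] = day, month, year
        hour, minute, second = [int(item) for item in t_str.split(':')]
        date_time[i] = datetime(year, month, day, hour, minute, second)
    return pd.to_datetime(date_time)

# hypothetical two-row sample standing in for a real CSV history file
csv_data = io.StringIO(
    "01/01/2005 10:00:00,1.3550,1.3560,1.3540,1.3555,100\n"
    "01/01/2005 10:01:00,1.3555,1.3565,1.3545,1.3550,120\n"
)
df = pd.read_csv(csv_data, header=None)
df.columns = ['datetime', 'open', 'high', 'low', 'close', 'vol']
df = df.set_index(parse_datetime(df.loc[:, ['datetime']].values))
df = df.drop(columns=['datetime'])
print(df.index[0])  # 2005-01-01 10:00:00
```

The structure mirrors the MT4 example exactly; only the parser and the number of dropped columns change.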


With the above information you should be on your way to fast CSV processing and loading in Python/pandas. Once your data is in pandas you can use it to run back-tests, perform statistical analysis or apply data manipulations such as DST changes, timezone adjustments, etc. Of course there are even better ways to load data into pandas – you can be faster still if your data is stored in a binary format – but this is something we will discuss in a future post. If you would like to learn more about our data analysis tools and how you too can learn to back-test strategies using a powerful Python/C/C++ tester please consider joining Asirikuy.com, a website filled with educational videos, trading systems, development and a sound, honest and transparent approach towards automated trading strategies.
