Twitter historical data and financial time series predictions

In today’s world there are many possible sources of information that can be used to build strategies to trade the financial markets. Of these, perhaps none is a more vibrant representation of overall human sentiment than Twitter, which makes Twitter-derived data very interesting for the construction of trading strategies. However, this is no easy task, because historical Twitter data is very difficult to obtain. Today I am going to talk about what is required to use Twitter for financial market predictions, what the issues are with obtaining historical data and what solutions are available to work around this problem.


The use of Twitter data for trading is not new. A variety of studies (like this one and this one) show how Twitter data can be used to make successful predictions of the direction and volatility of financial time series. However, these studies share two problems: they use data gathered by the researchers themselves, usually limited to one or two years, and that data is commonly not available to people who want to reproduce the papers’ results. This means that although there might be a lot of potential in trading with Twitter data, the data needed to construct models and perform trading simulations is not readily available. Given how interesting this data is, I decided to research how it can be obtained and what the cost of doing so might be.

The Twitter website offers an API that can be used to access tweets. The problem with this API is that the number of tweets per call is very limited, the number of calls is also limited and the oldest data available only goes back around 3 weeks. This means that the API is useful if you want to record incoming data, but it is not useful for obtaining data to use in simulations. The reason is quite obvious: Twitter data is valuable, and the company sells it through its subsidiary Gnip. There you can get access to tweets dating back to the first tweet in 2006, but the cost is overwhelming. Since Twitter sells to large companies, access to data going back to 2006 costs in the tens of thousands of dollars. In fact, a single specific tweet search going back to 2006 can cost up to 15-20K USD. This article on doing Twitter research on a shoestring budget sums up the difficulties someone faces when attempting a research project using Twitter data. So far the cheapest service I have been able to find that would do the job is Infegy Atlas, which charges 4K USD/month for as many searches as you want to perform. This would be useful for pulling all the historical Twitter data you want for research in one go, after which you can start collecting your own.
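
If you do go the route of recording incoming data yourself, the bookkeeping might look something like the sketch below. It is only an illustration under assumptions of mine: each tweet is a dictionary with a unique "id" field, the archive is a local JSON-lines file, and the function name append_new_tweets is hypothetical. How the batch is actually fetched from the API is deliberately left out.

```python
import json
from pathlib import Path


def append_new_tweets(archive_path, batch):
    """Append tweets from `batch` that are not yet in the archive.

    Assumes (hypothetically) that every tweet is a dict with a unique "id".
    Returns the number of tweets that were actually added.
    """
    path = Path(archive_path)
    seen = set()
    # Load the ids already stored, so overlapping API calls do not duplicate tweets.
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    seen.add(json.loads(line)["id"])
    added = 0
    with path.open("a", encoding="utf-8") as fh:
        for tweet in batch:
            if tweet["id"] in seen:
                continue
            fh.write(json.dumps(tweet) + "\n")
            seen.add(tweet["id"])
            added += 1
    return added
```

Since each call returns only a small, possibly overlapping window of recent tweets, de-duplicating on write keeps the archive clean however often the collector runs. Re-reading the whole file on every call is fine for a sketch but would need an index for a large archive.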


Another issue with historical Twitter data sets is that they are built to be representative rather than complete. Tweets from people who do not have many followers, tweets from accounts that have since been deleted and tweets that were not retweeted significantly may be eliminated from the historical database entirely. As a result, the data you gather live through the streaming API might look different from the data you evaluated historically. This definitely needs to be accounted for when building any type of machine learning algorithm that is trained on historical Twitter data and then traded on the normal Twitter feed.
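
One possible way to account for this is to pass the live feed through a filter that imitates what the historical archive is believed to drop, so the model sees similar inputs in both settings. The sketch below is purely illustrative: the field names ("followers", "retweets", "user_deleted") and the thresholds are assumptions of mine, since the actual rules used to build the historical data sets are not known.

```python
def mimic_historical_filter(tweets, min_followers=100, min_retweets=5):
    """Keep only tweets that would plausibly survive in a 'representative' archive.

    Thresholds and field names are hypothetical, not the vendor's real rules.
    """
    kept = []
    for tweet in tweets:
        if tweet.get("user_deleted", False):
            continue  # account no longer exists
        if tweet["followers"] < min_followers:
            continue  # author has few followers
        if tweet["retweets"] < min_retweets:
            continue  # tweet was not retweeted significantly
        kept.append(tweet)
    return kept


def retention_rate(tweets, **thresholds):
    """Fraction of tweets that pass the filter (0.0 for an empty input)."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    return len(mimic_historical_filter(tweets, **thresholds)) / len(tweets)
```

Tracking retention_rate on the live feed gives a rough idea of how much of what you see in real time would never have appeared in the historical sample.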

An alternative to using historical data is to gather data yourself and then use it for research, but the problem with this is analogous to a curve-fitting problem. You would be drawing conclusions from partial data, and it is difficult to prove that those conclusions are relevant in the long term. The same issue arises when you attempt to build a strategy using just one year of historical data: your probability of failing in the future is high because your in-sample period covers only a very limited set of potential market conditions. You would need to wait 10 years before being able to do research, in which case it might be better to just bite the bullet and pay the 4K to get access to Twitter data and see whether it is actually worth it for the instrument you want to predict.
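
To put a rough number on how little one year of data gives you, the sketch below counts the out-of-sample windows available in a simple walk-forward scheme. Walk-forward testing is my choice of illustration here, not something prescribed above, and the figures of 252 trading days per year, a 189-day training window and a 63-day test window are assumptions.

```python
def walk_forward_windows(n_days, train_days, test_days):
    """Return (train_start, train_end, test_start, test_end) index tuples.

    Each window trains on `train_days` and tests on the next `test_days`;
    the start then moves forward by `test_days`. End indices are exclusive.
    """
    if train_days <= 0 or test_days <= 0:
        raise ValueError("train_days and test_days must be positive")
    windows = []
    start = 0
    while start + train_days + test_days <= n_days:
        train_end = start + train_days
        windows.append((start, train_end, train_end, train_end + test_days))
        start += test_days
    return windows


TRADING_DAYS_PER_YEAR = 252  # assumed convention

for years in (1, 10):
    n = len(walk_forward_windows(years * TRADING_DAYS_PER_YEAR, 189, 63))
    print(f"{years} year(s) of data: {n} out-of-sample window(s)")
```

Under these assumptions a single year yields just one out-of-sample window, while ten years yield 37, which is the gap between an anecdote and something you can start to draw conclusions from.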


In the end, although the use of Twitter data for financial time series predictions sounds very interesting, the difficulty of obtaining complete historical data makes this extremely hard for the retail investor. This in turn may imply that such models are very valuable, since anyone else would have a hard time reproducing the same results given the inherent difficulty of accessing the data sets. Right now I am debating whether to take a chance and purchase a one-month 4K license to see what might come out of a few sample queries. I will publish an update post about this if I go down this road.


2 Responses to “Twitter historical data and financial time series predictions”

  1. Maybe crowdfunding would be a way to raise funds to gather enough historical data for research. I know I would donate to have access to the data.

    • admin says:

      Thanks for writing. This might not be an option because it conflicts with their terms of service. Otherwise it would be great to pool resources!
