[Paper Review] Algorithmic Financial Trading with Deep Convolutional Neural Networks: Time Series to Image Conversion Approach
Of the major machine learning algorithms, the convolutional neural network (CNN) is my favourite. CNNs are among the most cherished components underpinning our company’s investment algorithms. My curiosity was therefore piqued when I came across Sezer and Ozbayoglu’s paper titled ‘Algorithmic Financial Trading with Deep Convolutional Neural Networks: Time Series to Image Conversion Approach’, in which the authors also used CNNs as the bedrock of their trading strategy, but in a different way from how we’ve used them.
To understand CNNs, you can start by looking at the picture of the elephant at the top of this page, and asking yourself: how do you know that’s an elephant? “Well,” you might say, “the creature has four legs, flappy ears and a long nose. The only animal that fits that description is an elephant.” Let’s dissect that logic. Say you didn’t see the actual picture of the animal, but were told which body parts it had - four legs, flappy ears, a long nose; perhaps a boy saw the picture, and described the body parts to you. Would you conclude the animal is an elephant? You probably would, though you wouldn’t be 100% sure. CNNs replicate this thought process - they sense the presence of parts to infer the whole.
The sensory apparatus of CNNs consists of sliding windows of numbers called ‘filters’, and the stage in which the sensing occurs is called the ‘convolutional layer’. This fantastic article offers a detailed peek inside the mechanics of filters, but let me boil down their essence using the elephant as an example.
Imagine printing off a photograph of the elephant, and taking a planchette - a wood panel with a small hole - through which you’d only see a small section of the image at a time. Place the planchette on the top left corner and note, on a scale of 0 to 10, how much the section resembles an elephant leg. Write the number down on the top left corner of a separate piece of paper. This paper will be a “map” of the image that marks the presence of legs. Slide the planchette an inch to the right and note the resemblance again in the corresponding location on the map. Keep sliding and writing until you’ve covered the entire image. Repeat all previous steps, but this time note how much each section resembles an elephant ear, writing the scores on a new piece of paper. Then do it again for an elephant trunk.
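For readers who like to see the mechanics in code, here is a minimal NumPy sketch of the planchette idea: a small filter slides across an image and records a resemblance score (a dot product) at each position. The image and filter below are random numbers standing in for the photograph and a learned ‘leg detector’.

```python
import numpy as np

def slide_filter(image, filt):
    """Slide a small filter across an image and record, at each position,
    how strongly the local patch resembles the filter (a dot product).
    One pass with one filter produces one 'map', as in the planchette analogy."""
    ih, iw = image.shape
    fh, fw = filt.shape
    feature_map = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(feature_map.shape[0]):
        for c in range(feature_map.shape[1]):
            patch = image[r:r + fh, c:c + fw]
            feature_map[r, c] = np.sum(patch * filt)   # resemblance score
    return feature_map

# Toy example: a 12 x 12 'photograph' and a 3 x 3 'leg detector' filter,
# both filled with random numbers purely for illustration.
image = np.random.rand(12, 12)
leg_filter = np.random.rand(3, 3)
leg_map = slide_filter(image, leg_filter)
print(leg_map.shape)   # (10, 10): one score per position of the sliding window
```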
We should now have three maps laying out the locations of legs, ears, and trunks. We could take all the numbers on the maps, line them up as factors, and feed them into a statistical model. But doing so presents a problem: the resulting number of factors is usually large. If each map were a 10 by 10 grid, then we’d have 10 x 10 = 100 factors from each map, for a total of 300 factors. Statistical models suffer from indigestion if you feed them too many factors. Modelers mitigate this problem by adding another stage called the ‘subsampling layer’, which groups neighbouring numbers and aggregates them, thereby reducing the number of factors.
Of the several types of subsampling layers in use today, the most popular type, called ‘max pooling’, takes the maximum number from each group. If we were to apply this layer to our elephant picture maps, we would group the numbers, say by 2 x 2 square grids, and take the maximum number from each group to form smaller maps. Max pooling has the effect of showing whether a body part was found anywhere within each 2 x 2 grid. The smaller maps formed through max pooling have 4 times fewer numbers, decreasing the number of factors in each map from 100 to 25. The new number of factors would hopefully be small enough for statistical models to digest. If not, modelers could choose bigger grids to reduce the number of factors even further, though at the cost of some loss of information.
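Continuing the sketch above, here is a minimal 2 x 2 max pooling step that keeps only the largest score in each cell, shrinking a 10 x 10 map to 5 x 5:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Group the map into size x size cells and keep only the largest score
    in each cell, e.g. shrinking a 10 x 10 map to 5 x 5 (100 factors -> 25)."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

leg_map = np.random.rand(10, 10)   # stand-in for the 'leg' map from earlier
pooled = max_pool(leg_map)
print(pooled.shape)                # (5, 5)
```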
We’ve assumed throughout our example that we knew what elephant body parts looked like. Machines lack such knowledge, and modelers wouldn’t teach it to them directly, so you might wonder how machines would know which patterns to detect. Therein lies the power of CNNs - they teach themselves, through trial and error, which patterns to look for. If you train a CNN to identify elephants, it’ll automatically learn to recognize the shapes of elephants’ legs.
I’ve used image recognition to illustrate how CNNs learn, in part because it’s the most popular use case for CNNs. But there’s nothing that limits CNNs’ use cases to images. A CNN’s power lies in its ability to detect useful patterns in data. In image recognition, those patterns consist of arrangements of pixels. But for sentiment analysis, the pattern might consist of a specific sequence of positive and negative words. For stocks, it might be a head and shoulders pattern. Unfortunately, many people hold the misconception that CNNs only work on images. In our recent hiring process for a machine learning quant, we asked candidates how they would apply CNNs to financial data. I was taken aback by the number of people, including many with PhDs, who suggested using literal images of stock charts as inputs. A stock chart is merely a pictorial representation of a 1-dimensional series of prices, so it would be more appropriate to feed the price histories themselves as inputs, and use 1-dimensional filters to detect patterns.
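To make the contrast concrete, here is a small sketch of the 1-dimensional approach: a short filter slid directly along a series of price changes. The three-number pattern is hand-written purely for illustration; a trained 1D convolutional layer would learn its own filters.

```python
import numpy as np

# A year of made-up closing prices, and a toy 3-day filter that fires on
# 'two down moves followed by a sharp up move'. The mechanics are the same
# sliding dot product as in the 2D case, just along a single dimension.
prices = np.cumsum(np.random.randn(250)) + 100.0
changes = np.diff(prices)                       # daily price changes

pattern = np.array([-1.0, -1.0, 2.0])
scores = np.correlate(changes, pattern, mode="valid")
print(scores.shape)   # (247,): one pattern-match score per position in the series
```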
While Sezer and Ozbayoglu avoided the mistake of using pixels as inputs, they didn’t completely break free of the image analysis mindset. Their inputs consisted of 15 different technical indicators, including RSI and MACD, calculated across 15 time intervals ranging from 6 to 20 days. They organized the indicators into 15 x 15 grids that they called ‘images’. They even showed pixelated representations of these datasets in their paper, which look as if someone took a camera right up against pieces of plastic and took their picture. I’m not a fan of this terminology, but I’ll adopt it throughout this article to remain consistent with their paper.
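For a sense of what one of these ‘images’ contains, here is a rough sketch that stacks two familiar indicators, computed over periods of 6 to 20 days, into rows for a single trading day. The RSI and WMA formulas below are textbook variants and the price series is random; they are my own stand-ins, not the authors’ exact definitions or data.

```python
import numpy as np
import pandas as pd

def rsi(close, period):
    """Textbook RSI (simple-average variant); a stand-in for the paper's version."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

def wma(close, period):
    """Weighted moving average with linearly increasing weights."""
    weights = np.arange(1, period + 1, dtype=float)
    return close.rolling(period).apply(lambda x: np.dot(x, weights) / weights.sum(), raw=True)

close = pd.Series(np.cumsum(np.random.randn(500)) + 100.0)   # stand-in price history
periods = range(6, 21)                  # the paper's 15 intervals, 6 to 20 days
indicators = [rsi, wma]                 # two stand-ins for the paper's 15 indicators
day = len(close) - 1                    # build the 'image' for the latest day

image = np.array([[ind(close, p).iloc[day] for p in periods] for ind in indicators])
print(image.shape)                      # (2, 15) here; (15, 15) with all 15 indicators
```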
The authors applied 3 x 3 filters to the images - that is, they looked for patterns within three neighbouring technical indicators and three neighbouring time intervals. In choosing this filter, the authors gave significance to the ordering of indicators. If indicators WMA and EMA are next to each other, their CNN will learn whether WMA and EMA have combined effects. The model will note, for example, whether it’s an especially good time to buy when both WMA and EMA emit similar numbers. If WMA and EMA are far apart, however, the CNN won’t recognize any potential combination effects. The authors acknowledged this point but accepted it as a necessary inconvenience, and tried to mitigate the issue by placing similar indicators next to each other - an implicit bet that similar indicators would combine more strongly than dissimilar ones.
But this implicit assumption is unnecessary. The authors could have configured the model so that the ordering didn’t matter. One solution would have been to apply 1D filters to each indicator separately, and then to use a fully connected layer to find combination effects between any subset of indicators. Had the authors desired a less radical departure from their original configuration, they could have used filters that were 15 rows deep. Such filters would have covered the full gamut of indicators, uncovering potential combination effects without any implicit bet on ordering.
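Here is a brief Keras sketch of that second alternative - filters 15 rows deep - to show how small the change would have been. This is my own illustration of the idea, not the authors’ configuration, and the filter count of 32 is an assumption.

```python
import tensorflow as tf

# Each filter spans every indicator at once across three neighbouring time
# intervals, so no pair of indicators is ever too far apart to interact.
full_depth = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(15, 15, 1)),               # indicators x intervals
    tf.keras.layers.Conv2D(32, kernel_size=(15, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),          # Buy / Hold / Sell
])
full_depth.summary()
```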
Speaking of grid sizes, the authors used square images with 15 rows and 15 columns. But nothing prevented them from using rectangular images. They could have, for example, used images that were 16 indicators deep and 15 time intervals wide. Perhaps the authors were aware of this point, but there was no mention of it in their paper.
The authors applied two convolutional layers to the images - one for detecting patterns among technical indicators, and the other for detecting patterns within those patterns. They then applied a max pooling layer to reduce the number of inputs, followed by a fully connected layer that relates the “patterns of patterns” to one another, finally yielding three signals: recommendations on whether to Buy, Hold or Sell.
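For concreteness, here is a minimal Keras sketch of that pipeline as I read it from the paper’s description. Treat it as an outline rather than a reproduction of the authors’ exact network; the layer widths and dropout rates (the dropout layers are discussed further below) are my assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(15, 15, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),    # patterns among indicators
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),    # patterns within those patterns
    tf.keras.layers.MaxPooling2D((2, 2)),                     # subsampling
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),            # combines the 'patterns of patterns'
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(3, activation="softmax"),           # Buy / Hold / Sell
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```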
The authors’ choice to divide the output into three signals is unusual. Most academics try to predict each stock’s performance more directly. One popular measure, for example, classifies a stock depending on whether it outperforms a benchmark. Such a measure forces models into always picking a side - a stock is either a buy or a sell, never a ‘wait and see’ - and fits well with strategies that are always 100% invested. But human traders behave differently, routinely sitting on the sidelines until they sniff out particularly good opportunities. Of the Buy, Hold and Sell labels that Sezer and Ozbayoglu assigned as the ground truths the models were trained to predict, the vast majority belonged to ‘Hold’, which indicated waiting as the appropriate action. I appreciated the authors’ choice of an output that aligns more closely with human traders’ behaviour.
The authors evaluated the model’s ability to predict the outputs for two datasets. One dataset consisted of the 30 stocks belonging to the Dow Jones index, and the other of 9 popular ETFs. The model trained on 5 years’ worth of data to make predictions for the subsequent year, then repeated the process incrementally each year - i.e. it trained on data from 2003 to 2007 to make predictions for 2008, and on data from 2004 to 2008 to make predictions for 2009.
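The rolling scheme, spelled out as a loop (the final start year below is illustrative, chosen only to show the pattern):

```python
# Walk-forward evaluation: train on five years, predict the next year,
# then slide the whole window forward by one year.
for start in range(2003, 2012):
    train_first, train_last = start, start + 4     # e.g. 2003-2007
    test_year = start + 5                          # e.g. 2008
    print(f"train on {train_first}-{train_last}  ->  predict {test_year}")
```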
My first thought upon reading this was that the model would overfit. Overfitting describes the phenomenon in which machine learning models, instead of learning general principles, memorize past outcomes. It’s like a student who, instead of learning the principle behind addition, memorizes that 2 + 2 is 4, and struggles to give an answer for 2 + 3.
Machine learning models are more prone to overfitting when they contain too many parameters. Models, unlike humans, prefer memorizing over learning general principles whenever possible, and parameters act like storage space that enables memorization. Overfitting is also easier when the dataset is noisy, as models have greater trouble distinguishing coincidences from causal relationships. Humans make similar mistakes; “old wives’ tales” originate from inferring too much from coincidences.
Financial data is very noisy, so much so that financial machine learners routinely cite overfitting as their biggest challenge. Sezer and Ozbayoglu’s model contained thousands of parameters, a large number relative to most financial models’. The authors furthermore chose to train on very small datasets, one consisting of 30 stocks and the other consisting of just 9 ETFs. Smaller datasets are easier to memorize, increasing the model’s temptation to overfit. These conditions together didn’t just tip the balance towards overfitting, they yanked it. Though the authors inserted two dropout layers as a way to mitigate overfitting, they were the equivalent of trying to stop a flood with bare hands. I was therefore surprised when I read about the model’s results.
Sezer and Ozbayoglu’s model achieved an accuracy of 58% on the Dow Jones dataset. We need some context to interpret this number, however. Of the 107,370 data points in the out-of-sample data set, only 6,446 (6% of the total) were labelled ‘Buy’, and 6,284 (6%) were labelled ‘Sell’. The rest - 94,640 (88%) data points - were labelled ‘Hold’. This meant that predicting ‘Hold’ every single time would have yielded an accuracy of 88%. Against that trivial baseline, 58% looks disappointingly low.
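The arithmetic behind that baseline, for anyone who wants to verify it from the quoted counts:

```python
# The 'always predict Hold' baseline implied by the label counts above.
total = 107_370
buys, sells = 6_446, 6_284
holds = total - buys - sells                   # 94,640
print(f"Hold share: {holds / total:.1%}")      # ~88.1% accuracy from never trading
```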
The model’s low accuracy primarily stems from its eagerness to make ‘Buy’ and ‘Sell’ predictions far more frequently than the data warrants. Of the model’s 107,370 predictions, 23,867 (22% of the total) were Buys, and 28,654 (27%) were Sells - roughly 4 times the true incidence rates. There were, in other words, discrepancies in base rates, and accuracy is not the most appropriate yardstick in such situations.
Using ‘recall’ and ‘precision’ to evaluate the model proves more useful, though we still need to keep base rate discrepancies in mind. Precision is the batting average indicating how often predictions turn out to be correct. The precision of the model’s Buy predictions was 22%, which means that if traders had bought every time the model recommended a Buy, they would have made substantial money on 22% of their trades. That may seem low at first glance, until we remember that just 6% of all data points had true Buy labels. Had we generated predictions randomly, we’d expect a precision of 6%. At 22%, the model showed more than three times the skill of the average monkey throwing darts.
Recall measures the percentage of true labels that the model managed to catch. The model’s Buy recall of 80% indicates that out of 6,446 true Buy labels, it correctly predicted 6,446 x 0.8 ≈ 5,157 of them. Since 22% of predictions were Buys, a monkey would have been expected to score a recall of 22%, so achieving 80% again indicated more than three times the monkey’s skill.
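The same two figures can be reconstructed from the counts quoted above; since the published numbers are rounded, the reconstruction is only approximate:

```python
# Buy-signal precision and recall, reconstructed from the quoted counts.
true_buys = 6_446          # data points whose true label was Buy
predicted_buys = 23_867    # times the model predicted Buy
recall = 0.80              # share of true Buys the model caught
correct_buys = round(true_buys * recall)        # ~5,157 correct Buy calls
precision = correct_buys / predicted_buys       # ~0.216, i.e. roughly 22%
print(correct_buys, f"{precision:.1%}")
```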
I’ve limited my discussion to Buy signals for the Dow Jones data set, but similar results hold for Sell signals in the Dow Jones data set, as well as for the ETFs data set as a whole. I must, at this point, risk the ire of the authors and express skepticism about their results. They are simply too good. Perhaps the CNNs were set up just right to learn the best combinations of technical indicators. Or perhaps technical indicators filter out the noise in financial data, turning them into clear signals that machine learning models can easily pick up on. But another possibility is that the results are a fluke, an example of survivorship bias in academic literature. When I asked Professor Ozbayoglu for comment, he noted that he ran the model several times and got similar statistical results each time. While this does make survivorship bias less likely, it wasn’t enough to quell my skepticism completely.
The authors backtested a strategy that traded using their model’s predictions, and the results were consistent with the statistical results above. The strategy that traded Dow Jones stocks would have generated 12.6% per year in returns, making money 71% of the time. This latter number is especially impressive - Renaissance Technologies reportedly makes money on 50.75% of its trades - and it’s one more reason why I harbour doubts. The annual return number is understated as well. The stock market went up during the backtest period, yet the strategy spent over 50% of its time holding cash. The strategy’s returns would have been higher if it had parked that cash in index funds instead of idling.
The ‘Buy and Hold’ strategy, by comparison, would have generated 10.5% per year. Note that ‘Buy and Hold’ doesn’t rebalance, resulting in a different performance to that of the Dow Jones Industrial Average. The index returned 7.4% per year during the same period.
Another aspect that caught my eye was the strategy’s lack of alpha decay. In fact, the strategy appears to have had a higher Sharpe ratio from 2012 to 2017 than from 2007 to 2012. CNNs had become very popular in the early 2010s, so there should have been ample opportunity for other traders to adopt similar models. Yet the strategy’s returns show no sign that anyone did.
Sezer and Ozbayoglu bring a lot of good ideas to this paper. They used technical indicators as inputs, which have a high likelihood of containing good signals because human traders base their decisions on them. I also appreciated the authors’ choice to differentiate between Buy, Hold and Sell actions, which mimics how human traders conduct trades. But I can’t shake the feeling that the model’s statistical results are too good to be true. We plan on replicating Sezer and Ozbayoglu’s results, and conducting experiments on various modifications to the strategy. Depending on our own results, we may add the model to our growing suite of components that feed into our umbrella strategies.