[Paper Review] Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500
from Wikimedia Commons
Although Harry Potter’s world of magic exists on the same earth as our magic-less “Muggle” world, the worlds might as well be on different planets. Each world is governed by its own sets of rules and values, and their residents hardly ever cross each others’ paths. Academia and industry similarly exist as parallel worlds. In academia, a person’s work is judged by its logical rigour. In industry, practical results reign supreme.
Because of these differences in values, the two worlds often end up painting different portraits of the same subject. Whereas efficient market hypothesis still features front and centre in academics’ debates, financial practitioners have never paid it more than scant attention. But sometimes, the differing views clash, and can shake some foundational beliefs among the losing side. Alan Greenspan, who chaired the Federal Reserve during the formation of the housing bubble, leaned on the efficient market hypothesis to justify loosening banking regulations. The financial crisis that ensued forced him to admit that his beliefs had been “flawed”. Industry practitioners don’t always come out the victor. They have the propensity to use tools without understanding their theoretical underpinning, and those tools have sometimes exploded in their faces when the theoretical rug was pulled from beneath them.
On the other hand, cross-pollinating the logical rigour of academia and the practical observations of industry can lead to useful insights and tools. Krauss et al.’s paper “Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500“ is one such cross-pollination effort in the field of finance.
The authors begin their paper by highlighting the gap between academia and industry. They note that the Journal of Finance, a premier academic journal, only contains 17 references to “neural networks” and even fewer attempts to utilize the technique. In contrast, some of the best performing hedge funds today rely heavily on advanced forms of neural networks. While the authors don’t address the reason for this discrepancy, I can hazard a guess as to why. The proper application of machine learning requires a diverse set of expertise, and those who possess them are generally lured by big paychecks into industry. Krauss et al.’s paper represents an ambitious attempt to bridge this gap. They put three popular machine learning algorithms, as well as a technique called ‘ensembling’, to the test.
The first algorithm examined in the paper is the ‘Deep Neural Network’ (DNN). DNNs are stacks of linear regressions with ‘activation functions’ wedged between. Each layer performs multiple linear regression calculations. A simple mathematical formula called the ‘activation function’ is then applied to transform these outputs, and the transformed numbers are then used as inputs to the next layer of linear regressions. The numbers cascade up the layers until the last layer produces the final set of outputs, which are the predictions.
Allow me to be pedantic before we continue further. I have an issue with the authors’ use of terminology. ‘Deep neural networks’ refer to any statistical model consisting of one or more hidden layers (i.e. stacks of linear regressions). There are many different types of DNNs. The specific structure described by the authors is called the ‘Multilayer Perceptron’ (MLP), and it is very different from other DNNs such as convolutional neural networks. I’ll nevertheless adopt the author’s nomenclature throughout this review to avoid confusion.
Let me use an example to highlight the advantages of DNN’s structure. Say we want a model, when given two locations A and B, that estimates the time to travel between them. There are no restrictions on the distance between the locations - B could be just around the corner from A, or it could be thousands of kilometers away on another continent. There are also no restrictions on the mode of transportation available to the traveller - flight and foot are both viable. The model makes use of a wide range of inputs, including driving speed, point A’s distance from the nearest airport, and the total distance between A and B, among other inputs.
Linear regression would do a comically bad job of estimating travel times. It lacks the ability to compartmentalize different scenarios, to deem an input relevant for some scenarios but not for others. If it decides the distance to the nearest airport matters, it would even factor that input for trips to the neighbourhood grocery store.
DNNs are able to avoid this folly because they can condition on different scenarios. A DNN, for example, may subsume a linear regression model that estimates the travel time by plane, as well as another linear regression model that estimates the driving time. These intermediate estimates are stored in memory banks called “hidden nodes”. An activation function could then take the minimum of these travel times to produce the final estimate.
By identifying each scenario, I may have given you the impression that the modeler needs to manually specify each scenario for DNNs. Such manual specification would involve a lot of work. But that work is thankfully unnecessary. We need only to give DNNs the capacity to learn different scenarios, and they’ll figure it out on their own. In fact, they’ll often think of scenarios that we would never think of, delivering better performance over manually specified models.
We can increase a DNN’s capacity to generate scenarios by increasing its numbers of layers and hidden nodes. While this can improve the DNN’s performance, it can also backfire. Like a conspiracy theorist, the model can connect too many data points to see patterns that don’t exist. This phenomenon is called ‘overfitting’.
Sports fans are familiar with the temptation to overfit, even if they don’t know the terminology. Say you invite a new friend named Johnny to watch the Leafs games with you. During his first game, Johnny excuses himself to go to the bathroom, and the Leafs score while he’s there. During Johnny’s second game, he again excuses himself to go to the bathroom, and the Leafs score again while he’s gone. Rational people would chalk these up as coincidences, but a statistical model is more like your irrational fan - it will believe Johnny’s trip to the bathroom bends the universe to favour the Leafs, as long as the data suggests it. Statistical models don’t believe in coincidences.
There are techniques to stop statistical models from falling into overfitted misconceptions, some of which Krauss et al. employ. But those techniques have limited powers. A sufficiently complex model will always overfit to some extent, and ultimately underperform simpler models such as linear regression models. By using DNNs to model stock prices, Krauss et al. are therefore posing a question: are stock prices buffeted by a confluence of different scenarios, which DNNs can model more effectively? Or are they propelled by a single process, which traditional quantitative models would better capture?
Another machine learning algorithm examined by Krauss et al. is the random forest. This algorithm works by repeatedly drawing random samples from the original dataset, and building decision trees that fit each sampled set. The algorithm then averages predictions from each tree to yield the final prediction.
This subsampling technique, officially known as ‘bagging’, mitigates overfitting. Recall, from the Leafs example, that algorithms overfit because of coincidental data. Coincidences tend to be rare, so removing even one or two incidences from the dataset can dissuade a model from believing there’s a pattern. Two goals in two bathroom trips can seem like a pattern, but just one occurrence is usually dismissed as a fluke. Bagging creates datasets that sometimes sheds some coincidental data, yielding models that are free of their influence. While some data subsets will contain the full regiment of coincidences, thus yielding overfitted models, their influence is diluted. In other words, their predictions even out with predictions from non-overfitted models.
Decision trees are constructed for each data subset such that each decision point divides the data most cleanly. To illustrate this concept, suppose we wish to create a model that predicts the sweetness of fresh produce found in a grocery store. Our model can make use of different types of inputs, including the produce’s colour, size and whether it’s botanically considered a fruit or a vegetable (henceforth referred to as ‘taxonomy’). The random forest algorithm will experiment with different criteria to divide the dataset. It will note that separating the produce by size doesn’t create groups with significantly differing sweetness, but that separating by taxonomy does, and select the latter as a decision criteria.
A problem arises, however, if there’s a dominant data category that always splits the trees. In this case, even with bagging, each tree then ends up looking alike. Every decision tree fitted on fresh produce, for example, may split on taxonomy. The presence of a dominant category stifles voices of less dominant categories. The size of fresh produce may have some weak, but real influence on sweetness, but a decision tree that always splits on taxonomy wouldn’t capture this effect. While a tree could split on size after splitting on taxonomy first, this would only be viable with a large enough dataset. If the dataset is too small, splitting on multiple points risks overfitting.
Random forests overcome this problem by forcing trees to consider different subsets of data categories. Some trees would be forbidden to split on taxonomy, allowing other categories to exert their influence. This technique, combined with bagging, ensures diversity among trees.
The data groups that form at the bottom of the decision chains are called ‘leaf nodes’. To make predictions of new data, the decision tree would first find the leaf node that the data would belong to, and use the average value of the members of the node as its prediction. If mango is deemed to belong in the same group as watermelons, apples and grapes, then mango would be predicted to have the same sweetness as the average of those three fruits.
Decision trees have simple structures in comparison to most other machine learning algorithms. They pose multiple yes or no questions of a dataset, and assign fixed values depending on the answers. This simplicity is a double edged sword. On the one hand, it allows humans to intuitively grasp the trees’ thought process. Simplicity also grants trees resistance against overfitting. But on the other hand, decision trees have low ceilings on accuracy because they ignore differences among members of the same leaf nodes.
To see this weakness, imagine how a decision tree would estimate the travel time for a given distance. The tree may divide the total distance several times, resulting in leaf nodes that correspond to different transportation modes. One leaf node, for instance, may capture distances of more than 2,000 kilometers that are best travelled by plane. Another leaf node may contain distances of less than 1 kilometer since they’re walking distance. But once split among these groups, the tree’s travel time estimates would remain fixed for members of the same group. It would predict the same travel time for distances of 5,000 kilometers and 3,000 kilometers.
We could mitigate this problem by allowing trees to split more often, creating more granular leaf nodes. A tree could, for instance, create one leaf node that contains distances of between 2,000 to 4,000 kilometers, and another from 4,000 to 6,000 kilometers. This more complex tree would produce divergent estimates for distances of 3,000 and 5,000 kilometers. But adding split points won’t solve the fundamental problem; distances of 2,500 kilometers and 3,500 kilometers would still fetch the same travel time estimate. While one could increase the split points indefinitely, doing so would invite overfitting. It would also be inefficient and unsatisfactory from a mathematical point of view.
Despite its drawbacks, random forests routinely outperform more sophisticated algorithms on many tasks. While lacking in pinpoint accuracy, the models usually correctly capture the main drivers of a phenomenon.
The next algorithm considered by Krauss et al. is the gradient boosted trees (GBT). GBTs work much like random forests, except that they use ‘gradient boosting’ in place of bagging to sample data subsets. Whereas bagging always uses original, unaltered data to construct trees, gradient boosting uses the 'residuals' of data points. Residuals measure the degree of misclassifcation of each data point by previously constructed trees, and they change over time as new trees are constructed. Grapefruit, for instance, may be predicted to be sweet by trees that split on taxonomy, resulting in much higher residuals than, say, watermelon which are classified correctly most of the time. But Grapefruit's residual may decrease if a new tree correctly classifies Grapefruit as being bitter, reducing the overall degree of misclassification.
In comparison to bagging, boosting will produce models that fit more tightly to the shape of the data. We can be confident that Grapefruit will be classified correctly with boosting, but not necessarily with bagging. The closer fit comes at the cost of potential overfitting, however. A boosted tree may incorrectly infer that another fruit with a similar colour and size as grapefruit, such as a cantaloupe, will be bitter. It’s hard to know if the trade is worth making until we apply boosting to real data.
The last concept that Krauss et al. explore is ‘Ensembling’. This concept involves aggregating the predictions of each base model (DNN, Random Forest, and GBT in Krauss et al.’s case) to produce one final prediction. There are many ensemble methods, but Krauss et al. chose the simplest one that averages the predictions of each of the base learners. If DNN, Random Forest and GBT models predicted values of 0.2, 0.6 and 0.7 respectively, then the ensemble’s prediction would be 0.5, which is the simple average.
Let’s now turn our attention to the data fed into the models. The inputs to each base model consist of short term momentum indicators - that is, percentage change in the price of each stock over the past ‘X’ number of days. A total of 31 such indicators were considered. One set of indicators was spaced 1 day apart, ranging from 1 day, 2 day, all the way up to 20 day spans. Indicators from the other set were spaced 20 days apart, ranging from 40 days, 60 days, up to 240 days. Krauss et al. only counted business days, so 240 days roughly equaled a year.
I feel the authors could have specified better inputs. One problem is that the inputs share the same information. 2-day momentum subsumes 1-day momentum plus one other day. 3-day momentum subsumes both 1-day and 2-day momentums. In fact, 1-day momentum is subsumed by every input, lending that data point much greater influence than, say, a stock’s performance on the 239th day prior which is only included once. Models trained on these inputs may pay too much attention to a stock’s 1-day momentum, potentially leading to overfitting.
The authors could have de-correlated the inputs. They could have, for instance, subtracted the 1-day momentum from the 2-day momentum, and worked with the residuals instead. They could have likewise subtracted the 60-day momentum from the 80-day momentum.
Another thing I would have done differently is that I would have dampened extreme momentum values. It’s well known that stock prices exhibit “fat tails” - that is, they occasionally display extreme swings outside their norm. A biotech stock may see-saw between -2% and +2% most days, then shoot up by 80% on a takeover offer, or crater by 50% on a failed clinical trial. As I mentioned in a previous article, some machine learning algorithms struggle to process data with fat tails. Random Forests and GBTs are luckily unaffected by this issue, but DNNs are heavily impacted by it.
Krauss et al. chose their target output to be an indicator that sets to ‘1’ when a stock outperforms at least half of its peers over a given day, and sets to ‘0’ otherwise. I have no issues with this choice. This scheme yields a balanced number of 1s and 0s, which spares the algorithm from issues that arise when class labels are unbalanced.
Model performances are evaluated through long/short portfolios formed using predictions from each model. The portfolios contain equal numbers of long and short positions, and all positions are drawn from members of the S&P index, ensuring ample liquidity. The authors examined various numbers of positions, from 10 positions for both long and short sides of the trade, to 200 positions. However, they concentrated their discussion to portfolios consisting of 10 positions since those appeared to perform best.
While such concentrated portfolios do well to highlight the performance of machine learning algorithms, they unfortunately lack practical relevance because no professional investor would assume such risks. A portfolio that contains 10 short positions can lose its entire value if one position goes up by 1,000% in a day. That no stock has ever risen so dramatically in one day is of secondary importance - its mere possibility is enough to steer investors clear of managing their portfolios this way.
Unfortunately, expanding the number of positions to 50 significantly reduces the performance of the machine learning models. Whereas the best 10 positions portfolio notched incredibly high daily returns of 0.45% before transaction costs (186% per year annualized), the best 50 positions portfolio generated about half those returns. If we assume transaction costs of 0.2% per day as suggested by the authors, the 50 positions portfolio ends up generating negligible returns.
One could argue, however, that the transaction costs assumed by the authors are overly punitive. We live in the era of Robinhood, where multiple brokers offer commision free trades. Even brokers that charge commissions have drastically reduced their fees over time. If we add back the transaction costs, Krauss et al.’s models may appear to offer some practical value.
Our hopes, however, are dashed upon examining the “sub-period analysis” section of the paper. The authors analyzed the efficacy of the models in four distinct time periods. The first period, spanning from Dec 1992 to Mar 2001, saw the ensemble model achieve extremely high rates of return of over 200% per year. But by the second period, which ended in Aug 2008, the rates of return dropped to just 22 percent per year.
The beginning of the second period coincides with the invention of random forest and stochastic gradient boosting algorithms. The authors believe these algorithms would have posted strong performances during the first period only because they had not been invented by then, giving no one the opportunity to utilize them. As market participants adopted these algorithms during the second period, the authors theorize, their efficacies declined.
If the authors are correct, then market participants were remarkably quick to adopt these technologies. Random Forest was invented in 2001, and gradient boosting in 2002, so significant adoption would have to have occured in less than a couple of years after their invention. I’m skeptical that adoption happened so quickly. My skepticism is supported by the performance of the models during the financial crisis from Sep 2008 to Dec 2009, during which time the ensemble model again enjoyed very high rates of return of over 400% annualized. This leads us to ask the question, did market participants decide to turn off their models during this time, despite them making money hand over fist? If not, then how were the models able to make money despite the unchanged competitive landscape?
Other circumstances may have dampened the models’ performances during the second period. The dominant trading style of the market may have shifted from the first period to the second, with investors in the latter period paying less attention to technical factors and more attention towards the underlying business, depriving momentum indicators of their predictive powers. Any algorithm that uses momentum indicators during this time, regardless of its type, would have seen its efficacy fall during the second period.
I’m more convinced, however, that the poor results of the models during the last period from 2010 to 2015 are the result of widespread adoption. Machine learning had become a hot buzzword by then, giving market participants ample incentive to experiment with algorithms like Random Forests and gradient boosting. All models examined by the authors consequently lost more than 10% per year during this period. If arbitrage is the correct reason for the poor performance, we shouldn’t expect the models to generate useful trading signals today, leaving them without practical value. But that’s not to say the paper is devoid of useful lessons.
The first lesson we can draw is that DNNs don’t work well without good data preprocessing. The paper showed that DNNs underperformed both the Random Forest and GBT by wide margins. The authors blamed DNNs’ poor performance on lack of ‘hyperparameter’ tuning (hyperparameters are configurations that determine a DNN’s learning capacity, such as the number of layers and nodes). While such tuning could have helped, I believe the bigger issue is the shape of the input data’s distributions. I know from experience that correcting this distribution can make a night and day difference to a DNN’s efficacy.
The second lesson is that boosting does more harm than good, at least in the context examined in the paper. Random forest, which uses bagging, outperformed gradient boosted trees. This finding agrees with Marcos de Prado, who argued that bagging generally outperforms boosting in financial contexts where data is very noisy.
The final lesson we can draw is the usefulness of ensembling. The ensemble model outperformed every base learner, even though by taking their simple average, we let the worst base learner and the best base learner have equal say in the final prediction. An ensemble model can only deliver such improved performance if its base learners contain diverse insights. Despite having analyzed the same input data, the distinct structure of each base learner allowed them to learn different lessons. This, more than any part of the paper, proves the power of machine learning. Each base learner, despite its relative sophistication compared to traditional statistical models, missed insight that competing base learners captured. Complex patterns are locked in the datasets, which require sophisticated keys to unlock. It seems highly unlikely that traditional statistical models would work like master keys.
How can we improve Krauss et al.'s models so that they can be useful in the real world? I have a few ideas. One, the model will need to preprocess the data better, as I have suggested before. Two, it would help to mix different types of inputs into the models. Price data is abundant and accessible, which increases the chance that other market participants have already mined most of the useful signals locked in this dataset. Fewer market participants use less accessible data, and their use, either alone or in combination with other data sets, would give the models a better chance of generating alpha today. Third, we could use more sophisticated algorithms to find deeply hidden insights that are invisible to simpler machine learning algorithms used in this paper. Krauss takes this route with a subsequent paper that uses Long-Short Term Memory (LSTM) models, which shows better recent performance.
It’s popular in some corners of the financial world to dismiss machine learning. They half-joke that many people who claim to use machine learning actually use linear regression behind the scenes. But I hope that this paper, by highlighting how machine learning models work, and by showing how different algorithms learn different insights, will help change people’s attitudes towards machine learning.