[Paper Review] The Virtue of Complexity in Return Prediction
Most financial machine learning papers add incremental value to the discipline. Some tweak existing models to gain slight improvements, while others transfer established techniques to new sets of problems. Many papers have been written, for example, on using language translation models to predict security price movements. These papers expand the body of knowledge, but they don’t question prevailing wisdom. The rare papers that do can transform our whole worldview. ‘The Virtue of Complexity in Return Prediction’, written by Kelly, Malamud, and Zhou, is one such paper.
To appreciate the paper, we first need some context. Arguably the greatest challenge of financial machine learning is the problem of overfitting - the propensity of models to memorize spurious patterns in data instead of learning general principles. An overfitted model may, for example, note the tight correlation between butter production in Bangladesh and S&P 500 returns, and use the former data to predict the latter. Machine learning models can detect many more patterns than humans are able to, but lack the ability to distinguish between coincidental and causal relationships, and so they end up latching on to many coincidences. Overfitting also generally worsens when models are given more parameters to fit. Each parameter acts like a memory bank, so more parameters allow the model to remember even more coincidences.
Finance is an especially perilous terrain on which models can fall prey to overfitting. Not only is there a paucity of financial data, but the data tends to be very noisy too, making it harder for models to distinguish between signal and noise. Financial machine learning modelers have tiptoed around this problem by being skimpy with parameters, thus guiding their models towards only the clearest patterns in the data. Many quants have gone so far as to distrust anything with more than a handful of parameters.
But what if there’s another way out of the overfitting conundrum? What if, rather than shying away from the dangers of overfitting, we could fight fire with fire through the use of vastly complex models involving huge numbers of parameters? That’s what Kelly et al. propose in their paper.
The authors first lay out the theoretical justification for such an approach through heavy use of mathematics. This section is hard for non-mathematicians to follow, so you may feel tempted to gloss over the theory and skip straight to the empirical findings. But without theoretical understanding, you won’t know how to make sense of the empirical findings in different contexts; it would be dangerous, for example, to apply a formula that prescribes the strength of a skyscraper’s pillars to the building of a bridge without understanding the reasoning behind the formula. Rather than skipping the theory, let me instead explain some of its important principles in plain language.
Kelly et al. first set up a hypothetical world in which the markets behave according to the authors’ design. By using a hypothetical world, the authors run the risk that their analysis won’t transfer to the real world, but hypothetical worlds have advantages that the real one does not. The gears that turn real markets are hidden from us, obscuring how close models come to their theoretical limits. Such analysis, however, is feasible in a hypothetical world where the exact market mechanics are known.
In the hypothetical world of Kelly et al., security prices are dictated by a number of ‘factors’, which contain clues about whether a given security will outperform. Academics have long modeled the real world using factors as well, prominent examples of which include book-to-price, 12-month momentum, and size. There is a difference, however, between real-world factors and the ones governing hypothetical worlds. Whereas real-world factors approximate reality, Kelly et al. have set their factors to be the reality. Because the authors know precisely how those factors influence security prices, they also know how well a perfect model, possessing complete knowledge of these factors, can perform.
Within this hypothetical world, the authors shed their omniscience to become mortals who build two different families of models, each starting from a different base of knowledge. For the first family of models, the authors retained knowledge of every factor that drove markets, but became ignorant of the strength and direction of each factor’s influence. For the second family, the authors didn’t know the identities of the factors; they only had clues about what they were, while also remaining ignorant of how they influenced markets.
The assumptions underpinning the first family of models are unrealistic. In the real world, it’s difficult even to conclude whether a factor is “real” - i.e. that it will continue to predict how markets behave in the future - let alone to believe that we’ve discovered the complete set of all true factors. The authors concede this caveat but conduct the analysis anyway, as it provides useful context against which we can interpret the performance of the second family of models, which is built on a more realistic set of assumptions.
Within each family of models, the authors focused on the effect that the number of parameters had on model performance. Models with few parameters are said to have low complexity, while those with many parameters are said to have high complexity. Though complexity is a spectrum, models can be sorted into three buckets. If a model has fewer parameters than the number of data points it trains on, the model is said to be ‘under-parameterized’. If the number of parameters equals the number of data points, the model is said to be ‘fully parameterized’. Lastly, ‘over-parameterized’ models have more parameters than data points.
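As a quick illustration of the terminology, here is a tiny sketch (the function name and labels are mine, not the paper’s) that assigns a model to its bucket based on its parameter count and the size of its training set:

```python
def parameterization_regime(n_params: int, n_obs: int) -> str:
    """Label a model's complexity regime relative to the size of its training set."""
    if n_params < n_obs:
        return "under-parameterized"   # fewer parameters than data points
    if n_params == n_obs:
        return "fully parameterized"   # exactly one parameter per data point
    return "over-parameterized"        # more parameters than data points

# A model with 1,000 parameters trained on 12 monthly observations:
print(parameterization_regime(1_000, 12))  # -> over-parameterized
```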
The categories are divided along this line for a reason. As mentioned earlier, parameters act as memory banks, and when a model is fully parameterized, there are just enough parameters for each to memorize one data point. Fully parameterized models therefore tend to display the most extreme behaviour. For instance, in the first family of models (where all true factors are used), fully parameterized models achieve the lowest R2*. In fact, the value is so deeply negative that the graph doesn’t show the true number, but it’s safe to say it’s below -5, as we can see from the dark blue line labeled ‘Ridgeless’ in the chart below. Such a deeply negative R2 tells us that the model’s predictions tended to be very wide of the mark. (As an aside, the other coloured lines involve something called ‘regularization’, which I won’t discuss in this article as it’s a secondary topic.)
*R2 is one of the most popular metrics for measuring a model’s predictiveness. A perfect score of 1 indicates clairvoyance - i.e. the model can predict the future with 100% accuracy. An R2 of 0, on the other hand, tells us that the model is no more predictive than a naive forecast (for stock returns, simply predicting zero every time). A negative R2 means the model’s errors are larger than even that naive forecast’s.
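For reference, the out-of-sample convention commonly used in return-prediction studies judges forecasts against that naive zero-return prediction; this is my restatement of the standard convention, not a formula quoted from the paper:

```latex
R^2_{\text{OOS}} \;=\; 1 \;-\; \frac{\sum_{t}\left(r_{t+1} - \hat{r}_{t+1}\right)^{2}}{\sum_{t} r_{t+1}^{2}}
```

Here r is the realized return and r-hat is the model’s forecast; a negative value simply means the model’s squared errors exceed those of the zero forecast.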
So why did R2 come up so negative? The authors provide us with an answer. R2 is the sum of two components, one reflecting the direction of the predictions, and the other reflecting the confidence behind them. Fully parameterized models actually tended to give directionally correct predictions, but were wildly overconfident, and this overconfidence alone drove their R2 deeply negative. If we moderate their confidence, we get the better R2 values represented by the other coloured lines on the graph.
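Here is a toy illustration of that point (my own construction, not the paper’s decomposition): the forecasts below always get the direction right, but their magnitudes are roughly ten times too large, and that alone is enough to push R2 deeply negative. Shrinking the very same forecasts restores a healthy R2.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(0, 0.05, size=1_000)   # realized returns
overconfident = 0.5 * np.sign(r)      # correct direction, but ~10x too large in magnitude
modest = 0.1 * overconfident          # identical direction, moderated confidence

def r2_oos(realized, forecast):
    """Out-of-sample R2 against a naive zero-return forecast."""
    return 1 - np.sum((realized - forecast) ** 2) / np.sum(realized ** 2)

print(r2_oos(r, overconfident))  # deeply negative despite always picking the right direction
print(r2_oos(r, modest))         # comfortably positive once the confidence is dialed down
```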
The uniqueness of fully parameterized models also shows up in other analyses. To test the performance of the models, the authors created backtests of trading strategies that relied on them. These backtests showed that, in the case where the modeler knew of the existence of all true factors (i.e. the first family of models), strategies backed by fully parameterized models achieved the highest expected return possible. Expected returns are driven by a model’s directional correctness, and as we discussed in the R2 section, fully parameterized models tend to be directionally correct. By contrast, the volatility of the strategies, which is linked to the model’s confidence, tended to be very high. Thus despite the high expected return, fully parameterized trading strategies recorded low Sharpe ratios because of the high volatility. Under-parameterized models, by comparison, exhibited lower volatility while maintaining high expected returns, and thus higher Sharpe ratios.
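To make the link between forecasts, expected return, volatility, and Sharpe ratio concrete, here is a minimal sketch of the kind of backtest being described, assuming the simplest possible timing rule in which the bet is sized in proportion to the forecast; the sizing rule and variable names are my own, not the paper’s.

```python
import numpy as np

def timing_backtest(forecasts, realized_returns):
    """Bet in proportion to each forecast and measure the resulting strategy."""
    strategy_returns = forecasts * realized_returns   # position x next-period return
    expected_return = strategy_returns.mean()         # rises with directional correctness
    volatility = strategy_returns.std()               # rises with the size (confidence) of the bets
    return expected_return, volatility, expected_return / volatility

# Toy usage: a noisy but directionally informative forecast of monthly returns
rng = np.random.default_rng(1)
r = rng.normal(0, 0.05, size=600)
forecast = 0.5 * r + rng.normal(0, 0.05, size=600)
print(timing_backtest(forecast, r))                   # (expected return, volatility, Sharpe)
```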
But what of over-parameterized models? As the dark blue line in the graph above shows, such models performed better than fully parameterized models, though worse than under-parameterized models. Let’s ignore this latter fact for a minute, and focus on the former. It appears that when we give models enough parameters to memorize the training data many times over, we get improved R2 and Sharpe ratios compared to when we give models enough parameters to memorize the data just once. The authors showed that this phenomenon occurs because, as the models become ever more over-parameterized, they become more “humble”. Let me explain using an analogy.
Suppose we want to train a model that estimates the distance a golf ball will travel. The model notes that on one windy day, the golfer swung his 5 iron and sent the ball 150 yards, and on a calm day, the same golfer, with the same club, hit the ball 160 yards. A fully parameterized model would assign one parameter to memorize each datapoint I just described. The next time the golfer chooses to hit a ball on a windy day using the 5 iron, the model will estimate that the ball will travel 150 yards. The model will be perfectly certain about this outcome because it will only recall one similar situation, and remember that the ball had traveled 150 yards in that instance. Over-parameterized models, on the other hand, will devote multiple parameters to memorize each outcome, with each parameter memorizing different aspects of the data. Some parameters may, for example, tie the distance to weather alone, and expect the ball to travel 150 yards on a windy day, and 160 yards on a calm day. Other parameters may tie the distance to the club, noting that a 5 iron will send the ball traveling 150 yards sometimes, but 160 yards at other times. The next time the golfer picks up the 5 iron on a windy day, the over-parameterized model won’t be so sure that he’ll hit 150 yards because it will recall one instance where a 5 iron had sent the ball 160 yards.
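This ‘humility’ is not just an analogy; it falls out of the kind of model the paper studies - a ‘ridgeless’ (minimum-norm) least squares fit on top of random nonlinear features, which is what the ‘Ridgeless’ label in the earlier chart refers to. The sketch below is my own simplified illustration, not the authors’ code: it holds a tiny training set fixed, grows the number of random features, and prints the size of the resulting out-of-sample prediction, which tends to be most extreme near full parameterization and to shrink as the model becomes heavily over-parameterized.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_signals = 12, 5
X_train = rng.normal(size=(n_train, n_signals))   # a tiny training set, as in return prediction
y_train = rng.normal(size=n_train)                # noisy target values
x_new = rng.normal(size=n_signals)                # a fresh observation to predict

def ridgeless_prediction(n_features):
    """Fit minimum-norm least squares on random sine features, then predict x_new."""
    W = rng.normal(size=(n_signals, n_features))            # random feature weights
    beta = np.linalg.pinv(np.sin(X_train @ W)) @ y_train    # 'ridgeless' (minimum-norm) solution
    return float(np.sin(x_new @ W) @ beta)

for p in (6, 12, 120, 1_200):    # under-, fully, and increasingly over-parameterized
    print(p, abs(ridgeless_prediction(p)))
```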
This “humility” of over-parameterized strategies expresses itself through more conservative bets, resulting in lower volatilities during backtests. Over-parameterization, however, also comes with a cost. The authors show that, within the first family of models, strategies underpinned by over-parameterized models had lower expected returns than those underpinned by under- or fully parameterized models. Their lower volatilities were enough to yield higher Sharpe ratios than fully parameterized strategies, but they nevertheless fared worse than under-parameterized models, which had both low volatility and high expected returns. These results appear to confirm conventional wisdom, which says that one should use as few parameters as possible. But this goes against the abstract of the paper, which makes the case for using over-parameterized models. Have the authors led us astray? Of course not, as we shall see.
All the analyses we’ve seen so far were made under one big assumption, namely, that modelers utilized the complete set of true factors driving the markets. This assumption, as we discussed, is unrealistic. So what happens if we loosen it? What if, instead of using true factors, the modelers used derived factors containing partial clues about the true factors? For example, let’s say a company’s free cash flow growth is hypothetically a true factor driving its stock price. If the modeler uses earnings growth instead, then although it’s not the true factor, it may still prove helpful because earnings growth is correlated with the true factor. Additionally, we allow for the possibility that there are some true factors for which modelers don’t even have derived factors; in other words, the modeler is oblivious to some true factors. For instance, insider transactions might influence security prices, but none of the factors that modelers use might be related to them. The authors reran all of their previous analyses under this new set of assumptions, and here’s what they found.
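In simulation terms, ‘derived factors’ might look something like the following toy construction (mine, not the authors’ exact specification): the modeler sees only noisy proxies for some of the true factors, and is blind to the rest.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs, n_true = 240, 10
true_factors = rng.normal(size=(n_obs, n_true))
returns = true_factors @ rng.normal(size=n_true) + rng.normal(size=n_obs)  # the true market process

# The modeler observes 6 noisy proxies correlated with the first 6 true factors
# (earnings growth standing in for free cash flow growth, say), and has nothing
# at all relating to the remaining 4 (the insider-transaction example above).
observed_factors = 0.7 * true_factors[:, :6] + 0.3 * rng.normal(size=(n_obs, 6))
```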
First, they found that R2 behaved similarly as before: it declined as more parameters were added to under-parameterized models, hit its trough with fully parameterized models, and ascended as models became more over-parameterized. As before, overconfidence was to blame for fully parameterized models’ poor results, and humility was the reason for over-parameterized models’ better performance.
The expected returns of the strategies, however, took on a drastically different shape. The expected return was lowest for under-parameterized strategies, and peaked for fully parameterized and over-parameterized strategies. This is the opposite behaviour from before, when under-parameterized strategies had higher expected returns than over-parameterized strategies.
What accounts for this difference in behavior? Let me explain using an analogy. Suppose we have a sculpture-building competition, where the aim is to build a sculpture that looks the most like Simba from the Lion King. If the contestants are given sets of pieces that look like different parts of Simba, the best strategy would be to simply glue the pieces together. Using a more complicated technique that reshapes the pieces would likely backfire. But if the contestants are given pieces that look only mildly like lion body parts, they’d need to put more effort into reshaping the pieces. The true market process is like Simba - it’s the ideal that models strive to become - and factors are the sculpture pieces. If modelers are given the true factors, then, as with having Simba’s actual body parts, it’s best to use a simple model that merely adds them together. But if the modelers are given only derived factors, then, as with the misshapen pieces, it’s better to employ a complicated model that reshapes their influence.
While expected returns thus showed a rather different relationship with the numbers of parameters, volatility’s relationship remained the same. In other words, volatility rose as models went from being under-parameterized to fully parameterized, and then fell as the models became increasingly over-parameterized.
Since expected returns are high and volatilities are low when models are over-parameterized, it comes as no surprise that Sharpe ratios are highest when the models are over-parameterized.
The conclusions are striking. The results suggest that it would be much more profitable to use extremely complex models with thousands of parameters than to use simple models with few parameters. Most quantitative models today, in academic literature as well as in practical use, are of the simple kind. But Kelly et al.’s paper shows us that we’ve been eating canned tuna when we could have been eating caviar; we should be able to improve on most published models by simply fitting vastly more complex models to the same data. Skeptics, however, may raise one objection to this conclusion: all analyses so far were conducted in the hypothetical world of the authors’ own design. Is there any evidence that over-parameterized models would perform well in the real world? The authors address this point next.
To test their theory, the authors set out to create a model that predicts the direction of the stock market. At the end of each month, the model learned from 15 factors observed over the preceding 12 months to divine whether the market would go up or down the next month. The authors tried models with parameter counts ranging from 2 to 12,000. Since each model trained on 12 months’ worth of data, it was under-parameterized when it had fewer than 12 parameters, fully parameterized when it had exactly 12 parameters, and over-parameterized when it contained more than 12.
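As I understand the paper’s design, the huge parameter counts come from expanding the 15 raw signals into many random nonlinear features before fitting. The sketch below follows that spirit with random Fourier (sine/cosine) features and a ridgeless fit on the 12-month window, but the scaling, data handling, and function names are my own simplifications rather than the authors’ implementation.

```python
import numpy as np

def month_ahead_forecast(train_signals, train_returns, current_signals, n_params, rng):
    """Fit on a 12-month window of (signals, next-month return) pairs, then forecast.

    train_signals:   (12, 15) predictor values observed in each training month
    train_returns:   (12,)    market returns realized in the following month
    current_signals: (15,)    this month's predictors, used to make the forecast
    """
    W = rng.normal(size=(train_signals.shape[1], n_params // 2))     # random projection weights
    featurize = lambda x: np.hstack([np.sin(x @ W), np.cos(x @ W)])  # random Fourier features
    beta = np.linalg.pinv(featurize(train_signals)) @ train_returns  # ridgeless (minimum-norm) fit
    return float(featurize(current_signals) @ beta)

# Toy usage with made-up data: 13 months of 15 signals, parameter counts from 2 to 12,000
rng = np.random.default_rng(0)
signals = rng.normal(size=(13, 15))
returns = rng.normal(scale=0.04, size=12)
for p in (2, 12, 1_000, 12_000):
    print(p, month_ahead_forecast(signals[:12], returns, signals[12], p, rng))
    # the sign of each forecast would decide whether the strategy goes long or short
```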
The authors found remarkable agreement between theoretical and empirical results. The Sharpe ratio of the directional market betting strategy increased as more parameters were added to the model, even after the model became severely over-parameterized. Other performance metrics, including alpha and information ratio, likewise showed consistent improvements as the number of parameters increased.
The highly over-parameterized models outperformed both buy-and-hold and linear regression models by a statistically significant gap. The superiority of over-parameterized models was especially noticeable in the backtests’ downside statistics, as over-parameterized models yielded much lower maximum drawdowns and exhibited positive skewness.
Many people have argued that machine learning will have difficulty catching on in finance. One of the principal reasons cited is the sparsity of financial data. Whereas image and speech recognition domains have trillions of data points to train on, there are only a few hundred thousand daily price data points for equities. The skeptics reason that we can’t raise the complexity of machine learning models very far before they collapse under their own weight. In fact, one of the authors of the paper, Bryan Kelly, had argued this exact point in the past. Since publishing ‘The Virtue of Complexity in Return Prediction’, however, he seems to have changed his mind.
As Kelly’s more recent work shows, data scarcity is not the fearsome foe that we thought it was. Whereas modelers had been afraid to construct complex models for fear of overfitting, we can now dare to venture into much more complex models, and that is the gift that this paper grants us.