Look-Ahead Bias, and Why Backtests Overpromise


Jin Won Choi




Sept. 20, 2021

Building Blocks


The Korean drama ‘Sisyphus’ is a story about a couple of heroes who struggle against a villain from the future. Villains need deep pockets to pull off large schemes, and in Sisyphus’ case, the villain amasses his wealth by using his knowledge of the future to make money on the stock market. In one scene, he is seen taking a massive short position on the stock market on the eve of September 11, 2001, to the horror of his brokers.

The villain doesn’t have real stock picking skills, of course. If you take him back to the timeline he came from and tell him to trade, he’d probably generate pedestrian results. His incredible track record is the result of cheating.

Unfortunately, many real trading strategy backtests were likewise created as if time travellers dictated the trades.

Financial professionals get a nagging feeling when they look at backtests. They’re aware that a trading strategy’s live performance often doesn’t live up to the expectations its backtest set. Sometimes, the discrepancies between backtests and live performance are easy to explain and guard against. The backtest may not have accounted for trading commissions, for instance, or the strategy may favour illiquid securities. But such faults are generally spotted early and stamped out. What puzzles even professionals is that backtests that pass those inspections still offer poor guidance. A more insidious culprit is ‘look-ahead bias’, which refers to sins where, like the villain in Sisyphus, the strategy utilizes knowledge of the future.

One way that modelers become guilty of this sin is by failing to separate their dataset into ‘in-sample’ and ‘out-of-sample’ datasets. In-sample refers to the dataset that statistical models “train” on - that is, it’s the dataset that models analyze to find patterns useful for trading. Out-of-sample, on the other hand, is the dataset used to evaluate the performance of the models.

Training a model is like letting it live through an experience. It transports the model to the past, where it studies the characteristics of each stock and observes which stocks ascend and which decline. The model repeatedly rewinds to the past, trialing different strategies each time and keeping the one that works best. Modelers then use backtesting to check whether the model learned general principles that will work for years to come. It’s very possible for the model to have instead learned coincidental rules that only work for the in-sample dataset it lived through.

To illustrate this latter scenario, let’s suppose we create a model that predicts each stock’s performance based purely on its ticker. We’re reasonably sure that such a model wouldn’t learn general principles. Let’s say we use the data on all US stocks from 2000 to 2020 to train our model. During training, the model would note that the ticker ‘ENRN’ (Enron) does terribly, while the ticker ‘AAPL’ (Apple) performs very well.

If we backtest this model using the same dataset that we used to train the model, we wouldn’t know that this model is bad. At the start of the backtest in 2000, the model would bet on AAPL because it had good memories of this ticker, and it would bet against ENRN because it remembers its fate. The model would, in other words, live through the same timeline during the backtest that it already lived through during training, and post good results just like the villain of Sisyphus.
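The ticker example can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the tickers are made up, the returns are pure random noise (so no real skill is possible), and the helper names `simulate_returns`, `train_ticker_model` and `backtest` are hypothetical. The model simply memorizes whether each ticker went up or down on average, and is then backtested once on the years it trained on and once on years it never saw.

```python
import random
from statistics import mean

def simulate_returns(tickers, years, seed):
    """Hypothetical data: every (ticker, year) return is pure noise."""
    rng = random.Random(seed)
    return {t: {y: rng.gauss(0.0, 0.2) for y in years} for t in tickers}

def train_ticker_model(returns, train_years):
    """'Learn' each ticker's average return over the training years."""
    return {t: mean(by_year[y] for y in train_years)
            for t, by_year in returns.items()}

def backtest(model, returns, test_years):
    """Go long tickers with good memories, short the rest; average the result."""
    yearly_pnl = []
    for y in test_years:
        positions = [(1 if model[t] > 0 else -1) * returns[t][y] for t in returns]
        yearly_pnl.append(mean(positions))
    return mean(yearly_pnl)

tickers = [f"T{i:03d}" for i in range(200)]
years = list(range(2000, 2021))
returns = simulate_returns(tickers, years, seed=42)

in_sample_years = years[:11]   # 2000 to 2010
out_sample_years = years[11:]  # 2011 to 2020

model = train_ticker_model(returns, in_sample_years)
contaminated = backtest(model, returns, in_sample_years)  # same timeline as training
clean = backtest(model, returns, out_sample_years)        # timeline the model never saw

print(f"backtest on training years: {contaminated:+.4f}")
print(f"backtest on unseen years:   {clean:+.4f}")
```

Because the model is graded on the very years it memorized, the first number is positive by construction, even though the data contains nothing worth learning. The second number has no such guarantee and should hover around zero.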

Models with more reasonable theoretical foundations fall victim to this error just as easily. Take a strategy that depends on dividend yields, as an example. Suppose we trained a model on each company’s dividend yield from 2000 to 2020, and it found that companies with high dividend yields outperformed. If we backtested this model using the same dataset, it would invest in high yielding stocks during the years between 2000 and 2020, and thus produce a good track record. But this would be cheating, since the model already knew from the beginning of the backtest how high yielding stocks would perform. Investing in high yielding stocks may or may not be a winning strategy post-2020, but we wouldn’t hazard a guess based on this tainted backtest.

We can create clean backtests by keeping the in-sample and out-of-sample datasets separate, but we must be careful how we separate them or the look-ahead bias will remain. As an example of what not to do, let’s designate the US stock market data from 2000 to 2020 as the in-sample, and the Japanese stock market data during the same time period as the out-of-sample. During training, our model would learn that US stocks performed horribly in 2008, so it decides to take a conservative stance with Japanese stocks in 2008 during the backtest. This model also benefits from foreknowledge, just indirectly this time; the financial crisis that crippled the US economy afflicted the Japanese economy too.

Academics, who should be aware of these dangers, trip on them all too often. The authors of the paper “Predicting the direction of stock market prices using random forest”, for example, trained their model on Apple’s price history and then predicted Samsung, GE and Apple’s own stock prices during the same time period. With the advantage of look-ahead bias, this model correctly predicted the directions of each stock over 90% of the time. To put this number into context, Renaissance Technologies, which is arguably the best quant hedge fund in the world, generally achieves an accuracy of 50.75%.

Though not all data from the future confers clairvoyance on models, it can be tricky to discern which data does and which doesn’t. Knowing who wins Wimbledon in 2030 probably wouldn’t help a model predict stock prices in 2030, but we can’t rule it out completely. The only way to avoid the traps of look-ahead bias is therefore to split the data by a date threshold, training solely on the data before the threshold and backtesting on the data after it. If we want an accurate picture of how a model would have performed in 2011, the model should only be allowed to see data up until 2010 during training.
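The date-threshold split described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the row layout (a dict with a `date` key) and the helper names `split_by_date` and `walk_forward_windows` are assumptions made for the example. The second helper repeats the split for each backtest year in turn, so that the model evaluated in any given year has only seen strictly earlier years.

```python
from datetime import date

def split_by_date(rows, threshold):
    """Rows dated before the threshold are in-sample; the rest are out-of-sample."""
    in_sample = [row for row in rows if row["date"] < threshold]
    out_of_sample = [row for row in rows if row["date"] >= threshold]
    return in_sample, out_of_sample

def walk_forward_windows(years, first_test_year):
    """For each test year, yield the strictly earlier years the model may train on."""
    for test_year in years:
        if test_year >= first_test_year:
            yield [y for y in years if y < test_year], test_year

# Hypothetical toy dataset: one observation per year-end, returns left at zero.
rows = [{"date": date(year, 12, 31), "ret": 0.0} for year in range(2000, 2021)]

# To judge performance in 2011, train only on data up until the end of 2010.
in_sample, out_of_sample = split_by_date(rows, date(2011, 1, 1))

windows = list(walk_forward_windows(list(range(2000, 2021)), first_test_year=2011))
for train_years, test_year in windows[:3]:
    print(f"test {test_year}: train on {train_years[0]} to {train_years[-1]}")
```

Filtering on the date alone, rather than on market, sector or ticker, is what keeps shared events like the 2008 crisis from leaking into the test set.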

Though conceptually easy to understand, blindfolding a model to the future can be tricky in practice. Should we, for example, make use of academic papers published in 2016 to backtest a model in 2011? There are no easy answers. If the authors of a paper took care to avoid look-ahead bias, it’s probably okay to incorporate their findings. But if they didn’t, relying on the paper would be like asking a time traveler from the future for stock tips. Unfortunately, we often can’t tell how careful authors were to avoid look-ahead bias.

Splitting datasets by date is also not the perfect guard rail that keeps us from falling into look-ahead bias. Let me highlight this imperfection using an example.

Suppose we took care to only use data until 2015 during a model’s training, and backtested it using data from 2016 onwards. We didn’t like how the model performed, so we tweaked the model configuration, retrained it and reran the backtest. We still weren’t satisfied, so we repeated these steps a few times until we produced a backtest we liked. Can we trust this last backtest to be an accurate guide to the future? Probably not. In choosing the model configuration to use in 2016, we relied on our knowledge of how each configuration performed from 2016 onwards. We had merely shifted the scene of our look-ahead crime from the model training stage to the backtesting stage.
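A small simulation shows how this kind of selection flatters a backtest. Everything here is an assumption made for illustration: each “configuration” is just a stream of random daily returns with zero true edge, and the counts (50 configurations, 252 trading days) are arbitrary. We pick the configuration with the best backtest, then check how it fares in a later period that played no part in the choice.

```python
import random
from statistics import mean, stdev

def sharpe(returns):
    """Per-period Sharpe ratio (mean over standard deviation), not annualized."""
    return mean(returns) / stdev(returns)

rng = random.Random(7)
n_configs = 50   # how many times we tweaked and reran the backtest
n_days = 252     # roughly one trading year per period

# Zero-edge daily returns for every configuration, in two separate periods.
backtest_period = [[rng.gauss(0.0, 0.01) for _ in range(n_days)]
                   for _ in range(n_configs)]
later_period = [[rng.gauss(0.0, 0.01) for _ in range(n_days)]
                for _ in range(n_configs)]

backtest_sharpes = [sharpe(r) for r in backtest_period]

# Keep the configuration whose backtest we liked best.
best = max(range(n_configs), key=backtest_sharpes.__getitem__)
best_backtest_sharpe = backtest_sharpes[best]
later_sharpe = sharpe(later_period[best])

print(f"average backtest Sharpe across configs: {mean(backtest_sharpes):+.3f}")
print(f"chosen config, backtest Sharpe:         {best_backtest_sharpe:+.3f}")
print(f"chosen config, later-period Sharpe:     {later_sharpe:+.3f}")
```

None of the configurations has any skill, yet the chosen one looks good in its backtest simply because it was chosen for looking good. Its later-period result carries no such advantage.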

Eliminating this type of look-ahead can be very difficult, if not impossible. When we have new ideas, we naturally want to produce new backtests to evaluate their merit. How else would we test new ideas?

One way to mitigate this type of look-ahead bias is to compute each strategy’s “Deflated Sharpe Ratio”. The method acknowledges that the best backtests are the children of both skill and luck, the latter of which was introduced through look-ahead bias. It then estimates the odds that the backtest was the result of skill rather than luck. This method, though useful, is not the perfect remedy; it doesn’t tell us how a model drained of luck is expected to perform.
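For readers who want to see the mechanics, here is a rough sketch of the Deflated Sharpe Ratio as it is usually stated, following Bailey and López de Prado’s formulation. Treat the details as assumptions to check against their paper: the function names are hypothetical, the Sharpe ratios are per-period (not annualized), and `sharpe_variance` is the variance of the Sharpe ratios across all the configurations tried, which we have to supply ourselves.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()        # standard normal distribution

def expected_max_sharpe(n_trials, sharpe_variance):
    """Sharpe ratio the luckiest of n_trials skill-less strategies is expected to post."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    return sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )

def deflated_sharpe_ratio(sharpe, n_obs, n_trials, sharpe_variance,
                          skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe ratio beats the luck-only benchmark.

    skew and kurtosis describe the strategy's returns; the defaults assume
    normally distributed returns.
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denominator = sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    return _STD_NORMAL.cdf((sharpe - benchmark) * sqrt(n_obs - 1) / denominator)

# The same backtest looks less convincing the more configurations we tried.
dsr_few = deflated_sharpe_ratio(0.1, n_obs=1250, n_trials=2, sharpe_variance=0.001)
dsr_many = deflated_sharpe_ratio(0.1, n_obs=1250, n_trials=100, sharpe_variance=0.001)
print(f"2 trials:   {dsr_few:.3f}")
print(f"100 trials: {dsr_many:.3f}")
```

The key input is the number of trials. An honest count of every configuration we backtested, including the ones we threw away, is what lets the method discount the luck we accumulated along the way.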

Look-ahead bias is not the only force that inflates backtests. Using technologies that didn’t exist in the past is another, because such a backtest doesn’t account for competitors utilizing the same technology in the future. Data availability is yet another. It’s not possible to compensate for each and every force that inflates backtests. But most of them thankfully hit like minor storms compared to the hurricane of unchecked look-ahead bias.

The good news regarding look-ahead bias is that we can take some proper precautions against it, reducing its debilitating effects to something more manageable. Policing for look-ahead is tricky, often tedious work. But it’s necessary, or you risk having your backtest lead you astray with overoptimistic expectations.
