Is Low Signal to Noise Ratio Really A Problem For Financial Machine Learning?


Jin Won Choi


Machine Learning


Aug. 10, 2020

AQR recently published a paper titled "Can Machines 'Learn' Finance?" In it, the authors discussed the feasibility of creating machine learning-driven investment strategies, and voiced skepticism that it can be done, citing several obstacles. You can read some of my thoughts on the paper in this Twitter thread.

One of the main obstacles that the authors mentioned concerned the low signal to noise ratio of financial data. Let’s talk about what this means.

Quants generally divide asset price movements into two components: signal and noise. Signal is the portion we can understand, model, and predict. For example, companies that report good earnings have historically seen their stock prices go up, and the average amount by which those stock prices have risen is considered the signal. Noise, on the other hand, is the unpredictable portion of price movements.

Because the signal component is predictable, we can build investment strategies around it. For example, if we build a model that predicts companies’ earnings with a high degree of accuracy, we could build a strategy that invests in companies that are about to report positive earnings. However, the extent to which this strategy can work is limited by the strength of the signal relative to the noise.

Let’s hypothetically say that signal accounts for 10% of a stock price’s variation, and that the stock’s price swings by roughly 40% per year. In this case, our investment strategy could earn up to 40% x 10% = 4% of alpha from this stock. If, however, the signal only accounts for 5% of the stock’s price variation, then the potential alpha we can generate drops to 40% x 5% = 2%. (Mathematicians may note that this is not how the relationship between R-squared and alpha actually works, but it’s directionally correct. I’ve simplified the math for illustrative purposes.)
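The back-of-the-envelope arithmetic above can be sketched as a tiny function. Note that this mirrors the article's deliberate simplification (alpha ≈ annual price variation times the fraction explained by signal), not the actual R-squared/alpha relationship; the function name and inputs are illustrative.

```python
def potential_alpha(annual_variation: float, signal_fraction: float) -> float:
    """Rough upper bound on alpha under the article's simplification:
    alpha ~ (annual price variation) x (fraction of variation that is signal)."""
    return annual_variation * signal_fraction

# 40% annual swings, 10% of variation explained by signal -> up to ~4% alpha
alpha_strong = potential_alpha(0.40, 0.10)
# Same swings, but only 5% explained -> potential alpha halves to ~2%
alpha_weak = potential_alpha(0.40, 0.05)

print(f"{alpha_strong:.1%}, {alpha_weak:.1%}")
```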

If we believe that the signal to noise ratio inherent in asset price data is very low, then we could conclude that traditional quant models already capture most of the signal. In that case, there would be no point in pursuing machine learning models because there isn’t enough signal left to “mine” to make the effort worthwhile.

The low signal to noise ratio, however, causes another, perhaps thornier, problem. When the signal to noise ratio is low, it becomes particularly difficult to tease apart the signal from the noise. Let’s illustrate using an example.

Let’s say that an asset’s price moves up at the rate of 3% per year on average (the signal component), and that the asset randomly fluctuates around this mean (the noise component). Let’s say that the randomness is Normally distributed with a standard deviation of 2%. In this case, it would be rather easy to tell that the signal is there because the price movements would generally appear close to 3% per year every year.

But what if the asset’s price moves up at the rate of 1% per year, and the randomness is Normally distributed with a standard deviation of 50%? In this case, it would be difficult to even tell whether the signal is above 0. To reject the null hypothesis that the signal is 0 or lower at conventional significance levels, you’d need nearly 10,000 years’ worth of data. In other words, the lower the signal to noise ratio, the more data you need in order to validate a signal.
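To see where the "nearly 10,000 years" figure comes from, here's a minimal sketch. Assuming i.i.d. Normal annual returns, a z-test finds the mean significantly above 0 roughly when mean / (sd / sqrt(n)) exceeds the critical value z, which rearranges to n >= (z * sd / mean)². Using z ≈ 1.96 (the 5% two-sided critical value; an assumption, since the article doesn't state one):

```python
import math

def years_needed(mean: float, sd: float, z: float = 1.96) -> int:
    """Smallest number of annual observations n such that the t-statistic
    mean / (sd / sqrt(n)) reaches the critical value z."""
    return math.ceil((z * sd / mean) ** 2)

# Strong signal from the text: 3% mean, 2% standard deviation
print(years_needed(0.03, 0.02))   # a couple of years of data suffice

# Weak signal from the text: 1% mean, 50% standard deviation
print(years_needed(0.01, 0.50))   # on the order of 10,000 years
```

The required sample size grows with the square of the noise-to-signal ratio, which is why the second case is so hopeless.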

Unfortunately, finance is a relatively small data field. Most data providers only offer electronic data on US stock prices going back to the 1960s, which translates to roughly 100 million daily data points. While this may sound like a lot, it pales in comparison to the amount of data used to train models in other fields, such as natural language processing. OpenAI’s GPT-3 model, for instance, was trained on a dataset of nearly 500 billion tokens.
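The "roughly 100 million" figure is easy to sanity-check. The inputs below are my own assumptions, not from the article: about 60 years of history, about 252 trading days per year, and a hypothetical average of a few thousand listed US stocks at any given time.

```python
# Back-of-the-envelope check of the daily data point count for US stocks.
years = 60                  # roughly 1960s to 2020
trading_days_per_year = 252 # typical US market calendar
avg_listed_stocks = 6000    # hypothetical average over the period

daily_points = years * trading_days_per_year * avg_listed_stocks
print(f"{daily_points:,}")  # on the order of 100 million
```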

Given the relative scarcity of financial data and its apparently low signal to noise ratio, some quants, including the AQR authors, have expressed skepticism that machine learning models can sufficiently distinguish signal from noise. This sentiment is reinforced by real-world examples. Many machine learning practitioners have built models that mistook noise for signal, and launched strategies that looked good in backtests but failed post-launch.

However, I believe the skeptics of machine learning are making a critical error in how they think about the relationship between signal and noise. The skeptics implicitly appear to believe that the ratio of signal to noise is fixed, but that is not the case.

Let’s take weather as an example. You would probably agree that weather is inherently unpredictable, but how unpredictable is it? That depends on the technologies we have at our disposal.

A millennium ago, we didn’t have satellite images, weather balloons, or other instruments that help us predict the weather. The best we could have done was to look at cloud patterns on the distant horizon. The “signal” component of our weather model would have been very small compared to the “noise” component.

However, that equation has changed with our modern inventions. Satellite images and other helpful tools have let us make much better predictions, in effect increasing the “signal” and decreasing the “noise”. But how is it possible for technology to turn noise into signal? It’s possible because the weather, like so many processes that appear random, is not really random. The weather is the result of interplays of heat, moisture, and a few other factors, which, if precisely known and understood, would allow us to predict it with perfect accuracy.

In most situations, we lack this perfect knowledge and have to make do with partial knowledge. Statistics, as a field, helps us accept the fact that we can’t practically achieve perfect knowledge, and enables us to operate with partial knowledge by treating the unknown processes as random. But as our partial knowledge grows, the randomness that stands in for what we don’t know shrinks.

This situation applies to finance. An asset price’s movement may appear random, but it’s actually the result of many investors buying and selling. If we knew exactly how each investor would trade before they traded, we would have a model that predicts asset prices with 100% accuracy.

Alternative data contains better clues about how investors will trade in the future, and machine learning teases out these clues to predict investor behaviour. In this way, the signal to noise ratio of financial data can be increased.

As more alternative data becomes available and machine learning techniques advance, the signal to noise ratio that’s theoretically achievable will continue to rise. As it does, the amount of data needed to create good machine learning investment models will decrease, and the potential gain from using those models will increase as well.

In summary, the low signal to noise ratio that’s apparent in financial data sets is not an insurmountable challenge for financial machine learning because the ratio can be increased with more data and better techniques. If you’d like to chat with us about building a machine learning model, please drop us a line at
