The Bias/Variance Trade-off and the Limits of Machine Learning Models
A Machine Learning model is a bit like a marriage - no matter how hard you try, it's never going to be perfect. The same way every marriage has its issues, so does every machine learning model. Real world data is like real world people: unpredictable. Real world events don’t conform precisely to a perfectly predictable pattern, so no matter how clever your algorithm, there will always be some irreducible error. The trick is to focus on the things you can change - the reducible error. If you’re in a marriage where there are some communication issues, that’s a reducible error. You can work on that and improve. But if you’re in a marriage where one of you is deathly allergic to nuts, and the other person is Mr. Peanut, there isn’t much you can do. That’s an irreducible error.
The trick to creating an accurate machine learning model is to identify the error that you can reduce and concentrate on that. Luckily, there are only two main sources of reducible error that we need to investigate: bias and variance.
Bias is a measure of your model’s ability to precisely fit the training data. The closer you fit your model to your training set, the lower the bias for that particular data set, and the better the accuracy of your predictions for that training data. Reducing your bias is usually achieved by adding complexity to your model, allowing it to better represent the data on which it's being trained.
Variance refers to your model’s sensitivity to the specific data on which it was trained.The outputs of a model are strongly linked to the data on which it has been trained. Therefore, when the input data is different, a high variance model’s outputs will be quite different. Hence, the results will be unpredictable and unreliable. To reduce your variance, you often need to simplify your model. This frees the model somewhat from the specific data points, allowing more robustness in its signal generation.
From the outside looking in, some people might think you could just lower the bias and the variance in your machine learning model and then live out the rest of your days in wealth and happiness. If these people had ever been married, they’d know that nothing is that simple. The relationship between bias and variance is far more complicated. There is a real give and take between them that must be understood in order to achieve the best results possible. It takes some complex math to fully describe the interplay of bias and variance, but the concept is surprisingly intuitive. In fact, the tradeoff is analogous to something most of us are quite familiar with: throwing a ball.
Think of yourself as a baseball pitcher trying to throw a fastball past a hitter at the plate. You want your pitch to be as fast as possible: the faster the pitch, the harder it is to hit. But, you also want your pitch to be accurate. If you throw it outside the strike zone, it's a ball. And, if you throw a pitch down the middle of the plate, the batter is likely to hit a homerun - your entire middle school team will laugh at you, and you won't get to kiss anyone for quite some time (I know from experience). Clearly, the stakes are high, so you need your pitch to be as accurate as possible, as well as fast.
However, as most people are aware, the harder you try to throw something, the harder it is to be accurate with it. The inner workings of your body are such that in order to improve your accuracy you need to decrease the speed. The inverse also applies: to increase the speed, you need to accept lower accuracy.
This is similar to the bias-variance trade off. We want a model with low bias and low variance, but the reality is that the only way to lower one is to increase the other. Because of the logic inside machine learning algorithms, the two are intrinsically connected.
In general, the way to reduce bias is to increase the complexity of your model. This fits your model more precisely to your training data. But, the closer you fit your model to the data, the more your model works to find patterns in the data that may be caused by something other than the signal. It could be anything from random noise, to measurement error, to a physical anomaly. Anything that isn’t part of the signal will weaken your results. So, while your model may fit the training data perfectly, it will give you different results with different data. This is an increase in variance. So, if you want to throw the ball harder (reduce your bias error), you end up being less accurate (increasing the variance error).
But, if you go too far the other way and reduce the complexity to reduce the variance, you capture less of the signal and reduce accuracy in your predictions. This is in increase in bias error. You, as the pitcher/data scientist, need to find a compromise between speed and accuracy that minimizes the overall error and gives you the best result. The figure below shows this concept graphically.
Trying to model the stock market is an insanely difficult task. If it weren’t so difficult, someone would have perfected it by now, and the rest of us would have given up. In fact, controlling the bias-variance tradeoff is especially important in finance, where data is scarce. Combine that with the low signal-to-noise ratio that is present in stock prices and it’s easy to find yourself with an unacceptably high variance.
So, when creating your machine learning models, you need to be as vigilant as possible about reducing error. You want to find the point at which the bias and variance errors are at their combined minimum, thus creating the happiest marriage possible. Unfortunately, you can't just buy your model chocolates tell it you love it and make things better, you have to actually put in some work.
The first and most obvious step is to ensure that the algorithm you are using is right for your data. There's no paint-by-numbers guide for this, it takes time and experience to learn how to choose correctly. Then, once you have your algorithm, you want to make sure you have the cleanest, most complete data possible. As you can imagine, a model is only as good as the data it was trained on. Most algorithms, then, contain a set of parameters that control the level of fitting to the data. Through a series of calculations, as well as trial and error, you can tune these precisely to minimize the bias-variance combined error, and maximize your predictive power.
Creating a good model requires understanding of the bias and variance tradeoff, much as understanding the compromises of living together creates a good marriage. And while it might not be as snappy as “Happy spouse, happy house”, “Accurate model, Accurate predictions” is just as good a motto to live by.