Logistic Regression: Repurposing Linear Regression For Probabilistic Outcomes
Image credit: Agência Brasil, CC BY 3.0 BR
In the summer of 2014, Brazil and Germany - the two most successful national soccer teams in history - squared off in the World Cup semi-finals. Sports bettors anticipated a tight game, giving each side an even chance of winning before kickoff. Germany had a well-rounded team consisting of disciplined defenders such as Philipp Lahm, complemented by creative artists such as Mesut Özil. Brazil, though without their star player Neymar, was still able to field regular starters for some of the best clubs in the world. Brazil, as the host of the World Cup, also held the home-field advantage.
It didn’t take long for the betting odds to shift. Germany’s Thomas Müller netted the team’s first goal in the 11th minute, and Germany’s odds of winning jumped to about 70% in the betting markets. Germany’s odds stepped up further to 90% when Miroslav Klose scored Germany’s second goal in the 23rd minute. When Toni Kroos bagged two more goals in rapid succession by the 26th minute, bettors essentially called the game, as Germany’s odds soared past 99%.
Though Germany scored three more goals thereafter, the betting odds barely budged. There wasn’t any more room for the odds to creep higher. Nor did Brazil’s goal towards the end of the match impact the odds. There was no hope of a comeback for Brazil, and even the goalscorer's celebration displayed more resignation than joy. That match garnered such infamy among Brazilians that it earned a name - Mineiraço - which translates roughly to “Agony of Mineirão”, after the venue where the match took place.
Let’s rewind to the beginning of the match, and suppose we want to create a statistical model that predicts Germany’s chances of winning. The two factors of the model are the goals scored by Germany, and the goals scored by Brazil. We wish to update the model’s predicted probabilities each time a team scores. How should we go about structuring such a model?
Linear regression would be the wrong tool for this task. As I explained in a previous article, linear regression doesn’t work well if there are special numbers around which the prediction target behaves differently. We have to contend with two such special numbers whenever we predict probabilities; that is, probabilities can’t fall below 0% or rise above 100%. Linear regression can’t be forced to respect such bounds. ‘Less than no chance’ or giving ‘110% effort’ might work well as figures of speech, but they don’t work for scientific models whose results are always interpreted literally.
We could cling to linear regression by making crude modifications to its outputs. Does the model predict a 120% chance of victory? Let’s pretend it’s actually predicting 100% instead. We can similarly upgrade any negative probability predictions to 0%. But this solution is imperfect. If Germany’s chance of victory maxes out at 100% after the team scores three goals, what happens if the team scores a fourth? Linear regression might increase the team’s chances to 120%, which gets clipped back down to 100%, effectively keeping Germany’s chances at a standstill. Yet we know that Germany’s chances should increase with the fourth goal, however slight the increase may be. Teams have come back from a three-goal deficit before. Though such events are rare, they’re not unheard of. A fourth goal would make Brazil’s comeback even more unlikely, but our current solution wouldn’t capture this small increase in Germany’s chances.
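To make the problem concrete, here is a minimal sketch of the clipping workaround. The intercept (0.5) and the per-goal slope (0.25) are made-up numbers chosen for illustration, not fitted values, so in this toy version the prediction saturates once the lead reaches two goals rather than three.

```python
def clipped_linear_probability(goal_lead, intercept=0.5, slope=0.25):
    """Linear prediction of a win probability, clipped into [0, 1].

    intercept and slope are hypothetical coefficients, not fitted ones.
    """
    raw = intercept + slope * goal_lead
    return min(1.0, max(0.0, raw))


# Once the raw prediction passes 100%, extra goals no longer change the output.
for goal_lead in range(0, 5):
    print(goal_lead, clipped_linear_probability(goal_lead))
```

Every lead beyond the saturation point maps to the same 100%, which is exactly the information loss described above.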
But what if, instead of trying to predict the winning probability directly, we used linear regression to predict a proxy for the probability? If the proxy quantity is better suited to linear regression, we could predict that proxy first and then translate it into a probability.
The obvious proxy for the Germany / Brazil game is goal difference. Unlike the probability of winning, there are no upper or lower bounds that goal difference can bump up against, making it a suitable target for linear regression. We can thus use linear regression to predict the goal difference at the end of the match, given the goals scored by Germany and Brazil up until the time of our analysis, and then translate that prediction into Germany’s chance of winning. We now just need to figure out how to translate expected goal differences into chances of winning.
Functions that translate proxies into prediction targets are called ‘activation functions’. We can theoretically choose any function to act as the activation function, as long as it satisfies a couple of requirements - the transformed outputs must range between 0 and 1, and their rate of change must slow as proxy numbers get very high or low. Statistical software packages will try their best to find a well fitting model regardless of the activation function chosen, but some activation functions will yield better models than others. Statisticians therefore generally choose from among several popular activation functions, and the most popular of these is called the sigmoid function.
There is mathematical justification for sigmoid’s popularity. Sigmoid is, simply put, the most general-purpose function, making the fewest assumptions about the proxy. If a grandfather who knows nothing about computers walks into Best Buy, the salesperson may recommend a MacBook Air because it covers most use cases. Sigmoid is the MacBook Air of activation functions.
Sigmoid does a good job of converting goal differences into chances of winning. Applying sigmoid to a goal difference of 0 (i.e. a tied game) gets us a 50% chance of victory for either team. If Germany is winning by one goal, its sigmoid-transformed chance of winning goes up to 73%. Extending that lead to 2 and 3 goals leads to 88% and 95% chances of winning, respectively. These figures align closely with empirical data.
Logistic regression is the combination of linear regression and the sigmoid activation function - i.e. the composition of the most popular regression model and the most popular activation function. Just as adding syrup transforms water into a completely different drink that we call ‘pop’, adding the sigmoid function transforms linear regression to such an extent that we call the combination by a different name.
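The percentages quoted above can be reproduced with a few lines of Python. This sketch applies the standard sigmoid formula, 1 / (1 + e^(-x)), directly to the goal difference; the function name is just a label chosen here.

```python
import math


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Goal difference from Germany's point of view, from 3 down to 3 up.
for goal_difference in range(-3, 4):
    print(f"{goal_difference:+d} -> {sigmoid(goal_difference):.0%}")
```

Negative goal differences mirror the positive ones: a one-goal deficit gives 27%, the complement of the 73% for a one-goal lead.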
As with all statistical models, logistic regression works best when its mechanics closely mirror the real world phenomenon. Logistic regression is therefore best for modeling probability outcomes for which proxies can be modeled using linear regression, such as whether or not a student will pass an exam, with the number of correct answers as the proxy, or whether a company will beat analyst consensus, with raw earnings as the proxy.
But what if we can’t identify a proxy by name? Can we still use logistic regression? The answer, thankfully, is yes. The proxy is merely the sum of the influences from factors specified in the linear regression portion of the logistic regression. Having a name attached to this sum is helpful for our intuition, but it’s not strictly necessary. Take, for instance, a model that predicts whether a baby will sleep through the night by the time they’re one year old. Suppose the factors to this model are babies’ eating habits and their propensity to take naps. The sum of these factor influences does not have a name that I’m aware of, but a baby with a high score - possessing good eating habits and a high propensity to take naps - should have a higher chance of sleeping through nights at an earlier age, and thus can be modeled using logistic regression.
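A minimal sketch of this idea follows: the unnamed proxy is simply a weighted sum of the factors, which is then passed through sigmoid. The factor names, their scales, the sensitivities and the intercept are all hypothetical values invented for illustration; a real model would learn the sensitivities and intercept from data.

```python
import math


def predict_probability(factors, sensitivities, intercept):
    """Sum the factor influences into a score, then squash it with sigmoid."""
    score = intercept + sum(
        sensitivities[name] * value for name, value in factors.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical sensitivities and intercept (not fitted to any data).
sensitivities = {"eating_habits": 0.8, "nap_propensity": 0.5}
intercept = -1.0

# Two hypothetical babies, scored on made-up scales.
good_sleeper = {"eating_habits": 2.0, "nap_propensity": 1.0}
poor_sleeper = {"eating_habits": 0.5, "nap_propensity": 0.0}

print(predict_probability(good_sleeper, sensitivities, intercept))
print(predict_probability(poor_sleeper, sensitivities, intercept))
```

The score never needs a name; it only needs to rise when the factors point towards the outcome and fall when they point away from it.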
Because the proxy is modeled using linear regression, many of linear regression’s quirks and mannerisms carry over to logistic regression. For instance, logistic regression assumes that its factors are at least somewhat independent of each other, and will throw a tantrum when they’re not. The directional relationship between factors and target probabilities must also stay consistent. Germany’s chances of winning can’t decrease after they score, and a baby’s chance of sleeping through the night can’t improve with worsened eating habits. Such developments would go against the grain of normal behaviour, and logistic regression models will discard such oddities even if the data points to their existence. If these phenomena are real and we wish to capture them, we’ll have to choose a more complex model structure.
One of the chief advantages of linear regression is its interpretability, and logistic regression retains some of that advantage, though not as strongly. Interpreting logistic regression involves making sense of its factor sensitivities, because those are the values that are set during model training. But to make sense of them, we must first become familiar with the concept of odds.
The odds of an event are the probability of the event occurring, divided by the probability of the event not occurring. If Germany has a 75% chance of winning a game, it has a 25% chance of not winning, and the odds are 0.75 / 0.25 = 3.
In logistic regression, factor sensitivities act as volume knobs that amplify changes in factor values into changes in odds. The changes are multiplicative rather than additive: each one-unit increase in a factor multiplies the odds by a fixed number, known as the odds ratio, which equals e raised to the power of the factor sensitivity. In our soccer game example, applying sigmoid directly to goal difference amounts to a sensitivity of 1 per goal, so each goal multiplies the odds by e, or about 2.72. So if the odds stand at 3 and Germany scores a goal, the odds increase to 3 × 2.72 ≈ 8.2, equivalent to an 8.2 / (8.2 + 1) ≈ 89% chance of victory. Another goal moves Germany’s odds further to 8.2 × 2.72 ≈ 22, which equates to a 22 / (22 + 1) ≈ 96% chance of victory.
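The arithmetic on the odds scale can be sketched as follows. The sketch assumes a factor sensitivity of 1 per goal, which is what applying sigmoid directly to goal difference implies; under that assumption each goal multiplies the odds by e (about 2.72). The helper names are chosen here for illustration.

```python
import math


def probability_to_odds(p):
    """Odds = chance of the event divided by chance of no event."""
    return p / (1.0 - p)


def odds_to_probability(odds):
    """Invert probability_to_odds."""
    return odds / (1.0 + odds)


def odds_after_goals(odds, goals, sensitivity=1.0):
    """Each goal multiplies the odds by exp(sensitivity).

    sensitivity=1.0 is an assumption, not a fitted coefficient.
    """
    return odds * math.exp(sensitivity * goals)


odds = probability_to_odds(0.75)
for goals in range(0, 3):
    new_odds = odds_after_goals(odds, goals)
    print(goals, round(new_odds, 2), f"{odds_to_probability(new_odds):.0%}")
```

Because the odds are multiplied rather than added to, the implied probability keeps creeping towards 100% without ever reaching it.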
Logistic regression is one of the simplest models available for predicting probabilities, and it therefore benefits from Occam’s Razor; that is, when faced with a choice between logistic regression and a more complex model that appears to have similar efficacy, it’s better to choose logistic regression.
Occam’s Razor, interpretability, and modelers’ general familiarity with logistic regression make it a very popular model structure among financial professionals for predicting anything from a borrower’s chance of defaulting on a mortgage to the chance that a stock will lose money. But as with linear regression, logistic regression is rather limited in its ability to delve deep into the data to find intricate patterns. Like a cell phone camera, logistic regression will get the job done most of the time, but professionals will often yearn for more sophisticated models.