Objective Functions: Carrots That Lead The Machine Learning Horse
Method acting is the practice of staying in character even off stage. Robert De Niro drove cabs in preparation for the movie ‘Taxi Driver’. Leonardo DiCaprio slept in an animal carcass to remain in character for ‘The Revenant’. This practice is not strictly necessary - plenty of actors leave their characters behind when they step off stage. But by planting themselves more firmly in their characters’ shoes, actors discover habits and emotions they wouldn’t have known otherwise, helping them elevate a merely ‘good’ performance to a ‘great’ one.
I’m not an actor. I’m a machine learning researcher whose face has the expressive versatility of stone. I have, however, found the principles of method acting useful when I construct new algorithms. Now, I obviously can’t dress and talk like a number. But I can imagine myself walking beside each number as it is transported and transformed by an algorithm’s mechanics. Tracing the variables’ steps carefully has yielded insights that I would have missed otherwise.
One detail that I’ve come to obsess over, as a result of this process, is the objective function - a mathematical formula that acts as a yardstick of how ‘good’ an algorithm is. The concept of objective functions is not unique to machine learning. Rhythmic gymnastics’ objective function consists of difficulty and execution scores of gymnastic moves. The objective function for the US presidency is the electoral college vote count.
Objective functions’ role in determining outcomes grants them outsized influence on participants' behaviours. Put more weight on difficulty scores, and you’ll see gymnasts attempt trickier moves with less success. Abolish the electoral college in favour of popular vote, and you’ll see presidential hopefuls campaigning more aggressively in population centres like California. Machine learning models, likewise, will tailor their behaviour to suit the objective function assigned.
To see how objective functions influence machine learning models, it helps to understand how they learn. Machine learning models learn much like human students. Imagine a class where the teacher hands each student a set of practice questions. The students scribble some answers, the teacher marks them, and the students try to learn from their mistakes. The instructor hands each student another set of questions, and the learning cycle repeats itself until the students stop improving. Machine learning models’ “answers” are their predictions, such as the prediction that Apple’s stock will rise by 10% next year. The teacher’s answer is the actual outcome - e.g. that Apple’s stock ended up rising by 15%. Machine learning models would make note of the 5% miss, and try to do better next time.
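For readers who prefer code to classrooms, here is a minimal sketch of that learning cycle. Everything in it is an assumption made for illustration: the one-parameter model (prediction = weight × input), the toy data, the learning rate, and the use of squared misses as the way of marking answers. It is one simple way to run the predict-mark-adjust loop, not how any particular library does it.

```python
def train(xs, ys, learning_rate=0.01, tolerance=1e-9, max_rounds=10_000):
    """Repeat the predict / mark / adjust cycle until the model stops improving."""
    weight = 0.0                      # the model's starting guess
    previous_loss = float("inf")
    for _ in range(max_rounds):
        # The "students" answer: one prediction per practice question.
        misses = [weight * x - y for x, y in zip(xs, ys)]
        # The "teacher" marks the answers (here: the average squared miss).
        loss = sum(m ** 2 for m in misses) / len(misses)
        # Stop once another round no longer brings a meaningful improvement.
        if previous_loss - loss < tolerance:
            break
        previous_loss = loss
        # Learn from the mistakes: nudge the weight against the gradient.
        gradient = sum(2 * m * x for m, x in zip(misses, xs)) / len(xs)
        weight -= learning_rate * gradient
    return weight


# Hypothetical practice questions whose right answers follow y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
weight = train(xs, ys)  # ends up very close to 2.0
```

The line that computes `loss` is where the objective function lives; swapping it (and its gradient) changes what the model chooses to care about.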
Objective functions define how models react to their prediction misses, by assigning numbers that dictate the amount of attention a model must direct towards correcting each prediction. Some objective functions, by assigning especially large numbers to larger prediction misses, encourage models to focus all their attention on those larger misses to the neglect of smaller misses. Other objective functions force the models to distribute their attention more evenly.
The most popular objective function for regression models is the mean squared error (MSE). This function assigns each prediction the square of the magnitude of its miss. If a model misses the right answer by two units, the function assigns four (2²) units to that prediction. If the model misses by three units, the function assigns nine (3²) units.
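The arithmetic is small enough to check by hand, but a short sketch makes it concrete. The function names and the numbers below are made up for illustration; the first prediction misses by two units and the second by three.

```python
def squared_errors(predictions, actuals):
    """Return the squared miss for each prediction."""
    return [(p - a) ** 2 for p, a in zip(predictions, actuals)]


def mean_squared_error(predictions, actuals):
    """Average the squared misses over all predictions."""
    errors = squared_errors(predictions, actuals)
    return sum(errors) / len(errors)


# Hypothetical data: the first prediction misses by 2 units, the second by 3.
predictions = [12.0, 7.0]
actuals = [10.0, 10.0]

per_prediction = squared_errors(predictions, actuals)  # [4.0, 9.0]
mse = mean_squared_error(predictions, actuals)         # 6.5
```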
MSE is a good choice when you want the model to fit reasonably well to all data points, much like a student determined not to embarrass himself on any of the exam questions. Such students, however, would find themselves focusing most of their efforts on a few difficult questions. Likewise, mean squared error often fixates machine learning models on relatively few outlier data points.
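One way to watch this fixation happen is to ask which single number a model would choose if it had to give the same prediction for every data point. The sketch below is a toy under stated assumptions: the data are invented, and the "model" is just a brute-force search over candidate constants rather than a real learning algorithm.

```python
def mean_squared_error(prediction, actuals):
    """MSE of one constant prediction against a list of actual values."""
    return sum((prediction - a) ** 2 for a in actuals) / len(actuals)


def best_constant(actuals, candidates):
    """Pick the candidate prediction with the lowest MSE."""
    return min(candidates, key=lambda c: mean_squared_error(c, actuals))


candidates = [i / 10 for i in range(0, 101)]  # 0.0, 0.1, ..., 10.0

calm = [1.0, 1.0, 1.0, 1.0]      # hypothetical, well-behaved data
with_outlier = calm + [6.0]      # the same data plus one outlier

best_calm = best_constant(calm, candidates)             # 1.0
best_outlier = best_constant(with_outlier, candidates)  # 2.0
```

A single outlier drags the best prediction from 1.0 to 2.0, which makes it wrong by a full unit on the four ordinary points in exchange for a smaller miss on the unusual one.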
There are situations where this focus on outliers is appropriate. If a covid epidemiological model has trouble predicting infection spikes, you’d want it to spend extra effort understanding those spikes. But the extra attention on outliers is not always warranted. Take, for example, a model that predicts that a soccer team will score one goal for each of the two upcoming games, but the team ends up scoring none in the first game and four in the second. Under MSE, the model would pay nine times as much attention towards the second game as it does the first. Is the disproportionate attention on the second game warranted? Maybe not.
Inappropriate use of MSE is sometimes to blame for the subpar performance of financial machine learning models. For example, many people create models that predict the raw percentage changes of stocks. One tricky aspect of stock prices is that they tend to behave like the waves of the ocean, with long periods of gentle movements punctuated by short bursts of violent storms. Most models have trouble predicting the “storms” given their infrequent nature, so predictions tend to miss widely during these periods. Under MSE, the models become obsessed with improving their predictions during these storms, while neglecting predictions during calmer periods. So when the model is set loose to operate during a calmer period, it fails to perform. This isn’t the only reason why such models fail in practice, but it’s one reason.
One way to quell models’ obsession with outliers is to preprocess the data. The use of dollar bars is one example. Instead of measuring stock price movements on daily intervals, this method breaks price histories into blocks containing equal amounts of traded dollar volumes. If $1 billion worth of Apple stock traded between 9:30am and 11am, and another $1 billion between 11am and 4pm, the modeler would measure price changes from 9:30am to 11am, and then from 11am to 4pm. Techniques such as dollar bars reduce the incidence of outlier data, allowing modelers to continue using MSE.
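A rough sketch of the idea follows. The trade list, the `dollar_bars` helper and the bar size are all hypothetical, and the sketch makes simplifying assumptions: a bar closes on the first trade that pushes its traded value to the threshold, any overshoot is discarded, and an unfinished bar at the end is dropped. Production implementations handle these details more carefully.

```python
def dollar_bars(trades, bar_size):
    """Group (price, shares) trades into bars of roughly `bar_size` dollars traded.

    Returns a list of (open_price, close_price) tuples, one per completed bar.
    A trailing partial bar is dropped.
    """
    bars = []
    traded = 0.0        # dollars traded so far in the current bar
    open_price = None   # price of the first trade in the current bar
    for price, shares in trades:
        if open_price is None:
            open_price = price
        traded += price * shares
        if traded >= bar_size:
            bars.append((open_price, price))
            traded = 0.0
            open_price = None
    return bars


# Hypothetical trades as (price in dollars, shares traded).
trades = [(100.0, 5), (101.0, 5), (102.0, 10), (103.0, 2), (104.0, 8), (105.0, 1)]
bars = dollar_bars(trades, bar_size=1000.0)
# bars == [(100.0, 101.0), (102.0, 102.0), (103.0, 104.0)]

# Price change measured per bar rather than per day.
returns = [close / open_ - 1 for open_, close in bars]
```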
Preprocessing is not a silver bullet. Using dollar bars, for instance, doesn’t remove outlier incidents altogether; it merely reduces them. Preprocessing can also be painful to implement. An alternative option is to abandon MSE altogether, in favour of an objective function that isn’t so sensitive to outliers.
Mean Absolute Error (MAE) is one popular alternative to MSE. This objective function assigns “attention” numbers equal to the extent of their prediction misses. If a model misses the first prediction by one unit and the second prediction by two units, MAE makes the model pay twice as much attention to the second prediction as the first, whereas MSE would have forced quadruple attention on the second.
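The comparison can be checked in a few lines. The predictions and actual values below are invented so that the first miss is one unit and the second is two.

```python
def absolute_error(prediction, actual):
    """Per-prediction term used by MAE."""
    return abs(prediction - actual)


def squared_error(prediction, actual):
    """Per-prediction term used by MSE."""
    return (prediction - actual) ** 2


# Hypothetical (prediction, actual) pairs: misses of 1 unit and 2 units.
first = (11.0, 10.0)
second = (12.0, 10.0)

mae_ratio = absolute_error(*second) / absolute_error(*first)  # 2.0
mse_ratio = squared_error(*second) / squared_error(*first)    # 4.0
```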
Though MAE solves the problem of giving too much focus to outliers, it can incur the opposite problem of giving too much attention to slight prediction misses. Suppose a model predicts that Facebook’s stock will rise by 5%, and it ends up rising by 4%. Finance professionals would be impressed. Perfect predictions are impossible, so they would accept that a 1% miss is as close as it can get. But if the model predicts that Tesla’s stock will rise by 5%, and it ends up declining by 5%? There would be murmurs of skepticism about the model. The modeler would therefore want their creation to spend all of its effort correcting the Tesla prediction. But under MAE, the model would still focus 10% as much effort on the Facebook prediction as it would on the Tesla prediction.
There is, luckily, an objective function that comes between MSE and MAE, and it’s called the Huber loss function. Under Huber, a model will rapidly increase its attention on a prediction miss as the miss gets larger, just like the MSE, but only up to a point. Beyond that point, the amount of attention increases more gradually, much like the MAE. The modeler sets the point, the choice of which requires some discretion.
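Here is a sketch of the standard Huber formula, where the switch-over point is usually called delta. The choice of delta = 2 and the misses of 1 and 10 (think percentage points) are assumptions picked purely for illustration.

```python
def huber_loss(error, delta):
    """Quadratic for misses up to `delta`, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * abs_error ** 2
    return delta * (abs_error - 0.5 * delta)


delta = 2.0  # hypothetical switch-over point chosen by the modeler

small_miss = huber_loss(1.0, delta)    # 0.5  (quadratic region)
at_boundary = huber_loss(2.0, delta)   # 2.0  (both formulas agree here)
large_miss = huber_loss(10.0, delta)   # 18.0 (linear region)
```

For comparison, the purely quadratic formula 0.5 × miss² would give 50 for the large miss, so Huber has noticeably tempered the model's attention to it.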
We’ve limited our discussion to regression models thus far, but the choice of objective functions matters equally for classification models. Cross entropy is the most popular objective function for classification models, but a function called ‘Hinge loss’ is sometimes used as well.
To see the difference between these two functions, let’s take a hypothetical model that predicts whether LA will experience an earthquake in a given year. LA most recently experienced earthquakes in 2014 and 2019. Let’s suppose our model predicted a 3% chance of an earthquake in 2014, and 7% in 2019. If one uses cross entropy, the model would concentrate 32% more attention on the 2014 miss than it would on the 2019 miss. But if the model were to use hinge loss, it would focus just 4% more attention on 2014 than on 2019. Hinge loss, applied here directly to the predicted probabilities, looks at the raw difference between them, in this case 7% - 3% = 4 percentage points. Cross entropy, on the other hand, makes note of the relative scale between the predicted probabilities - i.e. that 7% is more than twice as large as 3% - and punishes the 2014 miss more heavily for having the temerity to predict such a low probability. Relative to hinge loss, cross entropy thus discourages extreme probability predictions.
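The two percentages can be reproduced with a short sketch. One loud assumption: textbook hinge loss is defined on signed scores rather than probabilities, so the `hinge_style` function below, which simply penalises the shortfall 1 - p, is a simplification adopted to mirror the comparison above, not the canonical definition.

```python
import math


def cross_entropy(prob_of_true_outcome):
    """Penalty for one event that happened, given the probability predicted for it."""
    return -math.log(prob_of_true_outcome)


def hinge_style(prob_of_true_outcome):
    """Assumed hinge-like penalty: the raw shortfall from full confidence."""
    return max(0.0, 1.0 - prob_of_true_outcome)


# Predicted earthquake probabilities for years in which one occurred.
p_2014, p_2019 = 0.03, 0.07

ce_ratio = cross_entropy(p_2014) / cross_entropy(p_2019)   # about 1.32
hinge_ratio = hinge_style(p_2014) / hinge_style(p_2019)    # about 1.04
```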
In MSE, MAE, Huber, cross entropy and hinge, we’ve looked at some of the most popular objective functions on the menu. But if none of them suits your appetite, you can always cook up your own. Do you want your model to focus solely on correcting overpredictions? You can invent a function that does that. Do you want a loss function that puts more focus on false positives? You can invent that too. But a word of warning to those who would play mad scientist with objective functions - the functions must possess some mathematical properties, such as “smoothness”, for them to work properly, else the models will behave badly when they’re set loose in the real world. Creating new objective functions is therefore best left to mathematicians who understand the theory.
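As a taste of what a home-cooked objective function might look like, here is one possible sketch of a loss that only cares about overpredictions. The name `overprediction_loss` and the decision to square the excess are my own assumptions; squaring keeps the function free of a sharp corner at zero, but whether a function like this behaves well with a given model is exactly the kind of question the warning above is about.

```python
def overprediction_loss(prediction, actual):
    """Penalise only the amount by which the prediction exceeds the actual value."""
    excess = max(0.0, prediction - actual)
    return excess ** 2


over = overprediction_loss(12.0, 10.0)   # 4.0: predicted 2 units too high
under = overprediction_loss(8.0, 10.0)   # 0.0: underpredictions cost nothing
exact = overprediction_loss(10.0, 10.0)  # 0.0
```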
Regardless of whether the objective function is common or custom, the “correct” objective function is always that which aligns with your use case. Do you want a model that’s never off by too much? Use MSE. Do you want a model that works well most of the time, even if its predictions occasionally miss by a lot? Use MAE. Do you need a model that works well during full moons? Write a custom objective function.
Just how important is it to get the objective function right? Let’s go back to the teacher analogy for a minute, and imagine that your goal is to get as many students as you can into Stanford. Your students are very bright, and eager to get into the prestigious university. Even though you’re aware that Stanford looks at extracurriculars in addition to SAT scores, you focus all of your time on improving your students’ SAT scores. Some students end up getting accepted into Stanford, but some of your brightest miss out, due to the lack of extracurriculars. We are like that teacher if we neglect to pay attention to objective functions.