# Training Neural Networks: Why, As With Humans, Teaching Methods Matter


I achieved my life’s biggest accomplishment in 2004, when I defeated dozens of other contestants to clinch the Canadian Settlers of Catan championship. “Settlers”, as it’s called by its enthusiasts, is a strategy board game where players collect resources, build settlements, trade with and rob from each other to reach 10 points first. It remains a favourite among many board game aficionados to this day.

The final round of the championship was played on a special board. Each hexagonal tile measured a foot in diameter, making the complete board 5 feet wide. Sculpted artwork adorned the tiles, and the board more resembled a miniature village than a board game. The official “rule master” from Mayfair - the board game company that produced Settlers - not only adjudicated the game, but personally handed us our cards during every round.

The game was tense from the start. Every one of us was adept at reading the board, and each other. I needed luck, and God granted it to me. I had just enough resources on my final turn to build a settlement - worth 1 point - to reach the game-winning tally of 10 points. Had I not won that turn, I was almost certain the next player would have reached 10 points on her turn. If *she* hadn’t, I was sure the next player would have.

Upon winning, I was brought to the center of the convention centre, where other board game championships were progressing in earnest. An attractive lady came by my side to announce my victory, and I cowered at the attention I received. For the prize, I was handed a hat, an itchy t-shirt, and a plane ticket to Germany to participate in the world championship.

I’m a board game fanatic. I love dissecting the rules of a new game, forming strategies that optimize my path to victory, and adjusting those strategies based on how they play out on the board. My initial strategies are usually terrible. I often misjudge the importance of some rules, or miscalculate how other players behave. But eventually, I figure it out. I develop a feel for the game, and my strategy becomes increasingly effective.

The way I learn new board games shares a lot in common with how neural networks learn.

Neural networks contain a set of parameters which dictate their behaviour. A neural network that plays Settlers, for example, may contain parameters that dictate its eagerness to collect the ‘brick’ resource.

Neural networks start out with random parameter values which, like my initial board game strategies, usually lead to terrible scores. But they improve through training. They analyze each parameter, determining whether changing its value would lead to better outcomes. Perhaps toning down their eagerness to collect ‘brick’ would allow them to reach 10 points earlier. They adopt a new set of parameter values and obtain a new score. They repeat this learning process until the scores no longer improve.

But how do they decide to what extent and in which direction to change the parameters? Methods that determine these are called ‘optimization algorithms’ (not to be confused with ‘portfolio optimization algorithms’, which are finance-specific algorithms that weight investments in portfolios). There are many types of optimization algorithms. The choice of algorithm impacts not only the length of time it takes to train neural networks, but their performance as well. Performance differs because of the presence of ‘local minima’ during training. Let me explain this concept through an analogy.

Imagine a hiker whose goal is to climb to the tallest peak in a mountain range. The hiker doesn’t possess any maps, and can’t see very far in any direction. How does she accomplish her goal? One obvious method would be to keep climbing towards the highest point she sees, until she no longer sees a higher point. While this method will get the hiker to a reasonably high point, it doesn’t guarantee that she will reach the highest peak of the entire mountain range, which might be hundreds of kilometers away, far from the boundaries of her vision.

The coordinates of the hiker represent the values of a neural network’s parameters. The altitude of the hiker represents the neural network’s ‘loss’ value, which is a score that indicates the performance of a neural network. The analogy breaks in one respect: lower losses are better, so while hikers seek the highest altitude, optimization algorithms reach for the lowest loss. The algorithms’ job is to herd parameter values towards the lowest possible loss. Like the hiker, optimization algorithms are only aware of the loss value terrain adjacent to their current parameter values. They are thus susceptible to getting trapped in local minima, which are like the tallest peaks of small regions.

Different optimization algorithms are like different strategies that a hiker might use to find the highest peak; each algorithm has its own way of dealing with the trade-off between exploration and exploitation. An algorithm that emphasizes exploration too heavily will wander the parameter space forever without homing in on a particular set of values. An algorithm that emphasizes exploitation too much, on the other hand, will be too willing to settle on any local minimum, no matter how shallow.

Smart algorithms try to diminish the trade-off, delivering both good exploration and good exploitation. But the efficacy of those algorithms varies depending on the context. If the loss value terrain resembles that of a solitary mountain like Mount Fuji, the modeler may not need to use a sophisticated algorithm at all, and using one might even delay reaching the optimal point by having the algorithm think too much. But if we want to navigate the equivalent of the Rockies where local minima abound, we’d do well to carefully select one of the sophisticated algorithms.

One of the earliest optimization algorithms is ‘Gradient Descent’ (GD). The mechanics of GD are simple - it finds the slope of the loss value terrain at the current location, and moves in the downward direction.
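To make the mechanics concrete, here is a minimal sketch of gradient descent on a one-parameter loss. The loss function, the starting points and the step settings are all hypothetical, chosen only so that the terrain has one shallow valley (near x = 1.13) and one deeper valley (near x = -1.30). It is an illustration under those assumptions, not how any particular library implements GD.

```python
def loss(x):
    """A hypothetical 1-D loss with two valleys: a shallow one near
    x = 1.13 and a deeper one near x = -1.30."""
    return x**4 - 3 * x**2 + x


def grad(x):
    """Slope (derivative) of the loss at x."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, learning_rate=0.01, steps=1000):
    """Repeatedly step downhill: x <- x - learning_rate * slope."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


if __name__ == "__main__":
    # Same algorithm, two starting points, two different resting places.
    for start in (2.0, -2.0):
        end = gradient_descent(start)
        print(f"start={start:+.1f} -> x={end:+.3f}, loss={loss(end):.3f}")
```

Starting on the right-hand slope, the sketch settles in the shallow valley and never discovers the deeper one on the left. That is the local minimum problem from the hiking analogy, seen from the loss side.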

Optimization algorithms such as GD have ‘hyperparameters’, which are settings that govern the behaviour of the algorithm. These are not to be confused with neural network ‘parameters’ which are the objects of algorithms’ optimization. In the hiking analogy, parameters are the location of the hiker, while hyperparameters govern how the hiker moves.

GD has an important hyperparameter called the ‘learning rate’, which determines the magnitude of the parameter changes. It is akin to the number of steps that the hiker takes once she decides on the direction she wants to go. Virtually all optimization algorithms incorporate learning rate as one of their hyperparameters.
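A small sketch can show why the learning rate matters so much. It assumes the simplest possible loss, `x**2`, whose lowest point is at zero; the function name `descend` and the three learning rates are illustrative choices, not recommendations.

```python
def descend(learning_rate, start=10.0, steps=50):
    """Run gradient descent on loss(x) = x**2, whose slope is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


if __name__ == "__main__":
    for rate in (0.01, 0.1, 1.1):
        print(f"learning_rate={rate}: x after 50 steps = {descend(rate):.4g}")
```

With the smallest rate the parameter is still far from zero after 50 steps, with the middle rate it is very close, and with the largest rate every step overshoots the bottom by more than the previous one, so the parameter moves away from the minimum instead of towards it.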

Although GD works reasonably well to minimize loss values, it has one practical difficulty - the training takes too much time. The major reason for this is GD’s need to analyze every data point before updating its parameters, like a coach who won’t re-examine his tactics until full games have been played. To get around this problem, scientists invented a successor algorithm called ‘Stochastic Gradient Descent’ (SGD), which works just like GD except it updates its parameters after ingesting subsets of data. It’s up to the modeler to choose the size of the subsets. Smaller subsets will result in more frequent parameter updates, which can help the algorithm reach lower loss values sooner. But they also risk having the parameters buffeted by the winds of random idiosyncrasies in small samples.
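The difference between GD and SGD comes down to how much data is consumed per update. The sketch below fits a straight line to made-up data; the data generator, the model `y = w*x + b`, and every setting are assumptions for illustration. Setting `batch_size` to the full data set gives GD, and anything smaller gives mini-batch SGD.

```python
import random


def make_data(n=200, seed=0):
    """Hypothetical noisy samples of the line y = 2x + 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


def train(data, batch_size, learning_rate=0.1, epochs=50, seed=0):
    """Fit y = w*x + b by minimizing mean squared error.

    batch_size == len(data) gives full-batch GD (one update per pass);
    smaller values give mini-batch SGD (many updates per pass).
    """
    rng = random.Random(seed)
    w, b, updates = 0.0, 0.0, 0
    data = list(data)  # copy so shuffling leaves the caller's list alone
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradients of mean squared error over this batch only.
            errors = [(w * x + b - y, x) for x, y in batch]
            grad_w = 2 * sum(e * x for e, x in errors) / len(batch)
            grad_b = 2 * sum(e for e, _ in errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            updates += 1
    return w, b, updates


if __name__ == "__main__":
    data = make_data()
    for size in (len(data), 16):
        w, b, updates = train(data, batch_size=size)
        print(f"batch_size={size}: w={w:.3f}, b={b:.3f}, updates={updates}")
```

Over the same 50 passes through the data, the full-batch run gets 50 parameter updates while the 16-sample run gets 650. The price is that each of those 650 updates is based on a noisier estimate of the slope.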

Scientists have since invented numerous modifications to SGD that aim to expedite neural network training and avoid local minima. These modifications usually build on two major concepts - momentum and adaptive learning rates.

Momentum inclines algorithms to keep updating parameters in the same direction as they have recently. If a hiker found herself climbing higher by going west for the past hour, momentum would encourage her to continue west, even though the direction takes her downhill for the moment. Momentum thus helps prevent the neural network from getting stuck in local minima.
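Here is a sketch of the classical (‘heavy ball’) form of momentum, applied to a hypothetical loss with a shallow valley near x = 1.13 and a deeper one near x = -1.30. The loss, the starting point and the momentum value of 0.9 are hand-picked assumptions so that the effect is visible; momentum does not escape every local minimum.

```python
def grad(x):
    """Slope of the hypothetical loss x**4 - 3*x**2 + x, which has a shallow
    valley near x = 1.13 and a deeper one near x = -1.30."""
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, momentum=0.0, steps=500):
    """Gradient descent with classical ("heavy ball") momentum.

    velocity remembers recent updates; momentum=0.0 recovers plain GD.
    """
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous move, then add the new downhill step.
        velocity = momentum * velocity - learning_rate * grad(x)
        x += velocity
    return x


if __name__ == "__main__":
    print(f"plain GD ends at x = {descend(2.0):+.3f}")
    print(f"momentum ends at x = {descend(2.0, momentum=0.9):+.3f}")
```

From the same starting point, the plain run stops in the shallow valley, while the run with momentum carries enough speed to roll over the ridge between the valleys and settle in the deeper one.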

There is a drawback to using momentum, however. The hiker might find, after going west further, that the downhill slope was not an anomaly but the beginning of a new trend. The hiker would thus have to go back the way she came, and end up spending more time than she otherwise would have. But lost time is not the only risk with using momentum. The technique can mislead ‘online learning’ models, which are models that continuously update their parameters as they receive new information. An algorithm that uses momentum assumes that past trends will continue into the future, and may thus miss sudden regime changes where the future breaks with the past.

Whereas momentum influences the *direction* of the parameter changes, adaptive learning rates adjust their *magnitudes*. If the optimization algorithm thinks the neural network parameters are far from optimal, it will make large changes to the parameters to hurry them closer to the optimal point. But if the algorithm thinks the parameters are close to optimal, it will try to fine tune the parameters instead.
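One simple way to adapt the learning rate is to shrink it as squared slopes accumulate, so early updates are bold and later ones are cautious. The sketch below follows the Adagrad recipe, an algorithm this article doesn’t otherwise discuss; it is used here only because it is short. The loss `x**2` and all settings are assumptions.

```python
import math


def adagrad_descent(x=5.0, base_rate=1.0, steps=100, eps=1e-8):
    """Minimize loss(x) = x**2 with an Adagrad-style adaptive learning rate.

    Returns the final x and the effective learning rate used at each step.
    """
    grad_sq_sum = 0.0
    rates = []
    for _ in range(steps):
        g = 2 * x                 # slope of x**2
        grad_sq_sum += g * g      # running total of squared slopes
        # The more slope the algorithm has already seen, the smaller the rate.
        rate = base_rate / (math.sqrt(grad_sq_sum) + eps)
        rates.append(rate)
        x -= rate * g
    return x, rates


if __name__ == "__main__":
    final_x, rates = adagrad_descent()
    print(f"final x = {final_x:.6f}")
    print(f"first rate = {rates[0]:.4f}, last rate = {rates[-1]:.4f}")
```

Because the running total only ever grows, the effective learning rate in this recipe can only fall. That works on a simple bowl like this one, but on long training runs it can fall so far that learning stalls.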

Finding the right adaptation mechanism is tricky. If the technique results in learning rates that are too large, parameters will criss-cross the space of possible values forever. If the learning rates are too small, the parameters will gravitate strongly towards local minima.

Newer optimization algorithms introduce sophisticated ways of calibrating momentum and adaptive learning rates. The Nesterov Accelerated Gradient method, for example, nudges the direction of momentum. Adadelta and RMSprop prevent adaptive learning rates from becoming too small. Adam, which is perhaps the most popular optimization algorithm today, incorporates both momentum and adaptive learning rates. Then there’s a plethora of newer optimization algorithms that further tweak Adam or any of the predecessor algorithms.
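To show how momentum and adaptive learning rates fit together, here is a single-parameter sketch of the Adam update using its commonly quoted default values for `beta1`, `beta2` and `eps`. The function name `adam_minimize`, the toy loss `(x - 3)**2` and the learning rate of 0.1 are assumptions for illustration; real libraries apply the same idea to every parameter at once.

```python
import math


def adam_minimize(grad, x, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a 1-D loss given its slope function, using the Adam update."""
    m = 0.0  # running average of slopes (the momentum part)
    v = 0.0  # running average of squared slopes (the adaptive-rate part)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Both averages start at zero, so correct their early bias toward zero.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


if __name__ == "__main__":
    # Toy loss (x - 3)**2 has slope 2 * (x - 3) and its minimum at x = 3.
    print(f"x = {adam_minimize(lambda x: 2 * (x - 3), x=0.0):.4f}")
```

The `m` line is the momentum idea and the division by the square root of `v_hat` is the adaptive learning rate idea; Adam is, in essence, the two applied in the same update.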

It’s tempting to always use the newest, most complex optimization algorithm, especially since most of them accompany academic papers that show compelling improvements over older algorithms. But using newer algorithms indiscriminately is a mistake because the improvements shown are context-specific - i.e. they work well on some neural network structures but not for others. It’s therefore worthwhile to try older, better known algorithms first before trying out shiny new optimization algorithms.

The benefits of selecting the right algorithm can be substantial. It’s hard to quantify the benefits because they’re different for each neural network, but I’ve found that well-suited algorithms produce 10 to 20% better performance than merely workable algorithms. Using ill-suited algorithms or hyperparameters, on the other hand, will yield impotent neural networks, even if the structure of the networks is sound.

One piece of good news regarding optimization algorithms is that we generally don’t need a deep understanding of how each one works. You can usually get away with throwing different algorithms at the wall until you find one that sticks, though understanding them will help you select the right algorithm sooner. Regardless of how you do it, exploring the use of different optimization algorithms is an important step that shouldn’t be skipped.

Financial machine learning papers often gloss over the choice of optimization algorithm. I understand why - the structure of a neural network is much more interesting than the methods used to train it. But this is a shame. The time it takes to train a model matters in the real world. It’s also helpful to know whether researchers have explored the full potential of their neural network structures by trying different optimization algorithms.