Statistical Distributions and the Costliness of Hidden Assumptions
By the third year of my PhD program, I was impatient. I had endured 8 years of lectures, exams, and keeping close watch over my bank account’s balance. Meanwhile, my colleagues from undergrad had embarked on interesting projects with big potential, and were getting paid well to do so. Their lives were going somewhere. Mine felt at a standstill. I wanted a taste of what they had, so I took a few months off to work at a bank.
My manager asked me to create a statistical model that estimated the bank’s exposure to ‘operational’ losses, which fell under neither market risk (arising from changes in prices of stocks, bonds, and other securities) nor credit risk (arising from defaults on loans). Examples of such risks included bank robberies, employee litigation, and fraud.
The impetus to create such a model was new, having been spurred on by new regulations, so no work had been done on it previously. Although some literature discussed how other banks were approaching the same problem, the ideas it contained were embryonic. I had to grow my own ideas.
It took me two months to develop the first prototype. I presented the preliminary results in a meeting; they showed worst-case losses of $1 trillion. I snorted a laugh as I read the numbers out loud. I knew my prototype was deeply flawed, and that my final model would pull the numbers closer to earth. My superior, however, didn’t find any humor in my results. He probably worried that I was going to dent his career.
My challenge boiled down to finding the appropriate statistical distribution to fit my data. Statistical distributions are formulas that describe the ranges of numbers we expect data to take on. We ask ourselves - if there were a loss today, how much would we expect it to be? If we expect all losses to range from $1,000 to $10,000, then the appropriate statistical distribution would express this belief mathematically by assigning negligible probabilities to losses outside of this range. Conversely, if we expect $1 billion losses to occur on a semi-regular basis, the appropriate statistical distribution would assign meaningful probabilities to $1 billion losses.
Statisticians often visualize statistical distributions as histograms to aid their intuition. The x-axis of the graph shows the ranges of numbers, while the y-axis shows the probability of the data falling in that range. The regions with extreme x-values, far from the center where most data appear, are called the “tails” of a distribution.
Thin-tailed distributions are ones where the histograms hug the floor (i.e. the y=0 line) in tail regions. They describe situations where extreme values almost never occur. Fat-tailed distributions are the opposite, where extreme values, though still not common, do pop up with more regularity. The chart below shows a thin-tailed distribution in a black line, and a fat-tailed one in a red line.
Eitanlees, CC BY-SA 4.0, via Wikimedia Commons
Historical operational loss data possess very fat tails. Bank theft, for instance, rarely costs more than a few thousand dollars. But there have also been a few storied events where rogue traders bypassed internal security to place gigantic bets on the markets, only for those bets to go south and leave the banks on the hook for billions. One such loss was big enough to sink Barings Bank, the second oldest merchant bank in the world.
I leafed through many fat tailed statistical distributions and considered their candidacy, but had difficulty selecting because of the paucity of data. The best distribution is generally that which fits the historical data best. But different statistical distributions have varying degrees of flexibility, and several distributions were so flexible that they were all equally able to curl themselves to fit the data well. I had to base my decision not solely on hard data, but also on each distribution’s expectations of never-before-occurred extreme events. Some distributions, upon learning that there were two $10 million losses last year, will infer a 50% chance of a $1 billion loss this year. A different distribution, given the same data, will presume a 10% chance for the same event. Statistical distributions, in other words, have different personalities: some will panic and expect the worst to occur given a modicum of bad news, while others will remain more sanguine.
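The sketch below shows how two distributions can agree on the data we have and still disagree about the events we have never seen. It is not the model I built at the bank: the lognormal and Pareto distributions, and the two loss figures used to calibrate them, are hypothetical choices made for illustration. Both distributions are forced to match the same median and the same 99th percentile, and are then asked how likely a single loss is to exceed $1 billion.

```python
import math
from statistics import NormalDist

# Hypothetical calibration points, not real bank data.
MEDIAN_LOSS = 1e4    # half of all losses are below $10,000
P99_LOSS = 1e7       # 1 loss in 100 exceeds $10 million
EXTREME_LOSS = 1e9   # the never-before-seen event we care about


def fit_lognormal(median: float, p99: float) -> tuple[float, float]:
    """Return (mu, sigma) of a lognormal matching the median and 99th percentile."""
    mu = math.log(median)
    sigma = (math.log(p99) - mu) / NormalDist().inv_cdf(0.99)
    return mu, sigma


def lognormal_survival(x: float, mu: float, sigma: float) -> float:
    """P(loss > x) under the lognormal."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(x))


def fit_pareto(median: float, p99: float) -> tuple[float, float]:
    """Return (x_min, alpha) of a Pareto matching the median and 99th percentile."""
    alpha = math.log(0.5 / 0.01) / math.log(p99 / median)
    x_min = median * 0.5 ** (1.0 / alpha)
    return x_min, alpha


def pareto_survival(x: float, x_min: float, alpha: float) -> float:
    """P(loss > x) under the Pareto, valid for x >= x_min."""
    return (x_min / x) ** alpha


mu, sigma = fit_lognormal(MEDIAN_LOSS, P99_LOSS)
x_min, alpha = fit_pareto(MEDIAN_LOSS, P99_LOSS)

calm = lognormal_survival(EXTREME_LOSS, mu, sigma)
panicky = pareto_survival(EXTREME_LOSS, x_min, alpha)

if __name__ == "__main__":
    print(f"lognormal P(loss > $1bn) = {calm:.2e}")
    print(f"Pareto    P(loss > $1bn) = {panicky:.2e}")
    print(f"the Pareto is {panicky / calm:.0f}x more worried")
```

Both fits are indistinguishable at the two calibration points, yet the Pareto assigns a probability to the $1 billion loss that is more than ten times larger. With sparse data, that difference comes from the distribution’s personality rather than from the evidence.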
My prototype model warned that a $1 trillion catastrophe was possible because it used a rather panicky distribution. To solve this problem, I invented a new distribution that dialed its neuroticism down from a ‘10’ to a ‘9’. My higher-up looked a lot happier during the next meeting, when I presented results that showed maximum losses of around $3 billion.
Building a machine learning model, as with building a bridge, requires us to check our assumptions lest our model (or bridge) fall apart. The choice of statistical distributions is one such assumption we must check. In theory, we should use distributions that mirror empirical data faithfully. But in reality, simpler distributions are often chosen over more accurate distributions for mathematical convenience. Simple distributions yield simple formulas that can be penned in a line or two in an academic paper, whereas similar formulas using more complex distributions may have trouble fitting into one page. Complex models are also generally more computationally expensive to calculate.
Financial mathematicians are especially fond of the Gaussian distribution (also called the ‘Normal distribution’), whose tails are thin. The Gaussian is routinely used to model stock prices and interest rates, even though empirical data suggest fatter-tailed distributions are more representative. Investors, however, are often unaware of the Gaussian assumptions embedded in their models. The consequences have sometimes been costly. The 2008 financial crisis, for example, is partly blamed on the uncritical use of the Gaussian copula function. By assuming that the dependence between risks took a Gaussian shape, this model lulled investors into underestimating the risk that many mortgages could default at the same time. When investors were shaken out of their illusion, the financial system nearly collapsed.
Thanks to the financial crisis, investors today view the Gaussian copula as a burn victim eyes a hot stove. But the Gaussian distribution continues to proliferate in many other financial models. The lure of Gaussian mathematical elegance is often too hard to resist. Those who succumb, wittingly or not, often pay for their mistake with subpar performance.
Machine learning models that predict stock prices are one example. Many models I’ve seen train on past price changes using the mean squared error (MSE) objective function. But as I’ve explained in my article on objective functions, MSE is very sensitive to outliers, and this sensitivity becomes an even bigger issue for fat tailed distributions. Let me use an example to illustrate.
Suppose there’s a stock that moves between -1% and 1% most days. Since machine learning models learn from past data, they’ll make predictions within that range as well. One day, the stock crashes by 10% because of missed earnings, and the event causes models to reexamine their prediction-generating processes. Under MSE, the emphasis they would give to a missed prediction is the square of the magnitude of the error, so if a model predicted the stock to sit still (i.e. change by 0%) that day, the attention given to the event would be (10 - 0)^2 = 100 units. But if the stock had crashed by 20% instead, the attention given to the event would increase to (20 - 0)^2 = 400 units; in other words, the attention given to an event rises with the square of its size, so doubling the miss quadruples the attention. Fat-tailed statistical distributions indicate the presence of more extreme events, and machine learning models would come to obsess over those events to the exclusion of data from more typical days.
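Here is the same arithmetic as a small sketch, for a prediction of 0% against a 10% crash and a 20% crash. It also estimates how much of a year’s total squared error one crash day can claim. The 249 typical days, each missed by half a percentage point, are a hypothetical setup chosen for illustration.

```python
def squared_error(actual: float, predicted: float) -> float:
    """Squared error for one prediction, with moves in percentage points."""
    return (actual - predicted) ** 2


# Attention given to a single missed crash when the model predicted 0%.
crash_10 = squared_error(-10.0, 0.0)
crash_20 = squared_error(-20.0, 0.0)

# Hypothetical year: 249 ordinary days, each missed by 0.5 points, plus one 10% crash.
typical_errors = [0.5] * 249
typical_total = sum(e ** 2 for e in typical_errors)
crash_share = crash_10 / (typical_total + crash_10)

if __name__ == "__main__":
    print(f"10% crash: {crash_10:.0f} units, 20% crash: {crash_20:.0f} units")
    print(f"share of the year's squared error from one crash day: {crash_share:.1%}")
```

In this setup a single day out of 250 accounts for roughly 62% of the total squared error, so the model’s training is dominated by that one event.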
There are several ways to deal with the fat-tailed nature of stock price changes. One is to use an objective function that’s less sensitive to outliers. I mentioned this possibility in my article on objective functions. This solution, however, is no panacea. Extreme events, though thorns in our side during training, would nonetheless prove extremely valuable if we could predict their occurrence. The possibility that a stock might pop by 30% is much more interesting than the possibility of its rising by 3%. The choice of objective functions, unfortunately, would not induce a model to predict such extreme events.
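As one example of an objective function that is less sensitive to outliers, here is a minimal sketch of the Huber loss. The Huber loss is my choice for this illustration and the threshold `delta` of 1 percentage point is an arbitrary assumption; other robust losses would make the same point. It behaves like squared error for small misses and grows only linearly for large ones.

```python
def huber_loss(error: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    size = abs(error)
    if size <= delta:
        return 0.5 * size * size
    return delta * (size - 0.5 * delta)


def half_squared_loss(error: float) -> float:
    """Squared error with the same 1/2 scaling, for a like-for-like comparison."""
    return 0.5 * error * error


if __name__ == "__main__":
    for miss in (0.5, 1.0, 10.0, 20.0):
        print(
            f"miss of {miss:>4} points: "
            f"squared={half_squared_loss(miss):>6.2f}  huber={huber_loss(miss):>5.2f}"
        )
```

Doubling the miss from 10 to 20 points quadruples the squared loss but only roughly doubles the Huber loss, so crash days no longer crowd out typical days during training. The model learns more calmly, but it is no more willing to predict a crash.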
A model’s reticence to predict extreme events doesn’t stem from its learning habits, but from how it forms predictions. Many machine learning models implicitly or explicitly utilize ensembling, which involves training many smaller models and averaging their predictions. Random forests and multilayer perceptrons are examples that belong to this category. Averaging predictions triggers a phenomenon described by the ‘central limit theorem’, which states that averages of many independent quantities with finite variance tend toward a Gaussian distribution, whatever the shape of the individual quantities (one more reason why mathematicians love the Gaussian). Modelers who wish to generate fat-tailed predictions will need to modify the original ensemble models.
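A small simulation can illustrate the effect of averaging. It rests on a strong simplifying assumption: each ensemble member’s prediction is treated as an independent draw from a fat-tailed Student-t distribution with 5 degrees of freedom, which real ensemble members are not. Tail fatness is measured by excess kurtosis, which is 0 for a Gaussian. All names and sizes here are hypothetical.

```python
import math
import random

DF = 5             # degrees of freedom of the fat-tailed Student-t
N_MEMBERS = 30     # hypothetical number of models in the ensemble
N_TRIALS = 10_000  # number of simulated predictions


def student_t(rng: random.Random, df: int) -> float:
    """Draw from a Student-t as a Gaussian divided by a scaled chi-square root."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)


def excess_kurtosis(xs: list[float]) -> float:
    """Sample excess kurtosis: about 0 for Gaussian data, positive for fat tails."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


rng = random.Random(0)

# Predictions from a single fat-tailed model.
single = [student_t(rng, DF) for _ in range(N_TRIALS)]

# Predictions from an ensemble that averages N_MEMBERS such models.
averaged = [
    sum(student_t(rng, DF) for _ in range(N_MEMBERS)) / N_MEMBERS
    for _ in range(N_TRIALS)
]

if __name__ == "__main__":
    print(f"excess kurtosis, single model:   {excess_kurtosis(single):.2f}")
    print(f"excess kurtosis, averaged model: {excess_kurtosis(averaged):.2f}")
```

The single model’s predictions keep their fat tails, while the averaged predictions have an excess kurtosis close to 0. The averaging step itself irons out the extremes, regardless of the objective function used in training.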
The insidious influence of fat tails extends to portfolio optimization algorithms as well. Let’s say, as an example, that we want to allocate between two securities, A and B. A’s price has historically exhibited lower volatility than B’s. Both have similarly small but realistic chances of crashing down to 0, suggesting that price changes have fat tailed distributions.
Most popular portfolio optimization algorithms implicitly assume that price changes take on a Gaussian distribution (there it is again), and assign weights that minimize the portfolio’s volatility on typical days. Since A is less volatile, such an algorithm may assign ⅔ of its weight to A, and ⅓ to B. But this weighting scheme leaves the portfolio vulnerable to the event that A crashes to 0, as the portfolio would lose ⅔ of its value. The best way to insure against such crashes is to weight the securities equally, limiting the loss from either stock crashing to 50%. This example is exaggerated - large companies almost never lose their entire value overnight - but it does show how equal-weighted allocations provide better protection against extreme events. This is one reason why equal-weighted portfolios often outperform more sophisticated optimization algorithms in practice.
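The allocation example can be sketched as follows, under assumptions of my own: A and B are uncorrelated, and B’s variance is twice A’s, a figure chosen only because it reproduces the ⅔ and ⅓ split. For two uncorrelated assets, the minimum-variance weight on A is var_B / (var_A + var_B). The function names are hypothetical.

```python
def min_variance_weights(var_a: float, var_b: float) -> tuple[float, float]:
    """Minimum-variance weights for two uncorrelated assets."""
    w_a = var_b / (var_a + var_b)
    return w_a, 1.0 - w_a


def portfolio_variance(weights: tuple[float, float], var_a: float, var_b: float) -> float:
    """Variance of the portfolio on typical days, assuming zero correlation."""
    w_a, w_b = weights
    return w_a ** 2 * var_a + w_b ** 2 * var_b


def worst_single_wipeout(weights: tuple[float, float]) -> float:
    """Fraction of the portfolio lost if the most heavily held asset goes to 0."""
    return max(weights)


# Hypothetical variances: B is twice as variable as A.
VAR_A, VAR_B = 1.0, 2.0

optimized = min_variance_weights(VAR_A, VAR_B)
equal = (0.5, 0.5)

if __name__ == "__main__":
    for name, w in (("min-variance", optimized), ("equal-weight", equal)):
        print(
            f"{name}: weights=({w[0]:.3f}, {w[1]:.3f})  "
            f"typical variance={portfolio_variance(w, VAR_A, VAR_B):.3f}  "
            f"worst wipeout={worst_single_wipeout(w):.1%}"
        )
```

The optimized portfolio is calmer on typical days, with a variance of about 0.67 against 0.75, but it stands to lose two thirds of its value in a single wipeout instead of half. That is the trade the Gaussian assumption makes without telling us.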
I’ve thus far harped on the pitfalls of ignoring fat tails, a prevalent problem across many domains and not just in finance, but they’re by no means the only holes to steer clear of. Some data distributions, for example, have multiple ‘humps’ and need special handling. It’s impossible, in a short article, to detail every headache caused by the wide range of distributional shapes and prescribe remedies for each. My advice must therefore remain general: whenever modelers create machine learning models, they must keep the distributional shapes of both inputs and outputs in mind, and be aware of how distributional shapes change as data journeys through the model.
It’s unfair to place 100% of the blame for the financial crisis on the Gaussian copula. But if investors had used more realistic models, they’d probably have become wise to the dangers of the financial instruments they’d purchased, and might have mitigated some of the losses. Machine learning is increasingly being adopted in areas where its misuse can have dire consequences, such as in self-driving cars and in medicine. We modelers must guard against all potential sources of problems, and that includes keeping an eye on the distributional shapes of our data.