Choi KPCA: A Tool To Help You Create Many Factor Quant Models


Jin Won Choi


Machine Learning


April 26, 2021

"Tools" by Daniel Y. Go is licensed under CC BY-NC 2.0

In the mid-2000s, before superhero shows sprouted up like weeds, there was a show called ‘Heroes’ that featured characters with superpowers. While every character possessed unique powers, two of the characters could replicate other characters’ powers, and that ability made them stronger than all the others.

Have you ever wondered why most quantitative models only incorporate a few factors? The famous Fama French model only uses three factors. While Fama is working on increasing that number to five, even that number pales in comparison to the vast number of factors dotting the academic landscape. This paper alone, for instance, identifies 101 such factors. ‘Heroes’ makes the point that having many powers trumps having just a few. Shouldn’t quant models also use as many discovered factors as possible?

Some quants retort that many of the factor discoveries are suspect. Those discoveries are made on the basis of statistical significance, and there’s often a greater than 5% chance that a factor appears useful only by fluke. Rolling ‘12’ twice in a row doesn’t necessarily mean the dice are loaded, because there’s still a decent chance that you just got lucky. But why is 5% significance the established hurdle? No one has a good answer, and many of the discovered factors would be deemed significant with a modestly easier hurdle such as 10%.

You, as an investor, may not care about such academic debates. You just want an investment strategy that makes money, so you might nudge your portfolio manager to incorporate those borderline significant factors.

Let’s say your portfolio manager does incorporate dozens of factors into a new quantitative model. Your manager, however, would likely encounter a problem - the model doesn’t work! Not that the model errors out - it would still produce numbers, but those numbers would defy intuition. For example, the new model may find that value stocks will forever underperform, or that a slightly higher current ratio dramatically raises the outlook of a stock. The new model would also disappoint in trading, and may even underperform the old model that uses just a few factors. Why is this likely to be the case?

Problems Plaguing Many Factor Models

Statisticians have names for the problems plaguing the new model, and the names are ‘multicollinearity’ and the ‘curse of dimensionality’.

Multicollinearity describes situations where factors are highly correlated to one another. Let’s use an example to illustrate the issue. Suppose we want to predict the price of a house. It would be natural to use the square footage of the house, and the number of bedrooms, as factors of our model.

If we construct a model that only uses the square footage of the house, we’d expect the value of the house to increase in proportion to the square footage, say by $100,000 for each additional 700 sqft. If we construct a model using only the number of bedrooms instead, we’d also expect the house price to increase with the number of bedrooms, say by $100,000 for each additional bedroom. But what happens if you have a model that incorporates both factors? How would the house price change in relation to each factor?

We know that factor sensitivities wouldn’t stay constant. To see why, compare a 3 bedroom, 2,000 sqft home with a 4 bedroom, 2,700 sqft home. A model that only takes square footage into account would value the bigger home $100,000 higher. The model that only takes bedrooms into account would also value the bigger home $100,000 higher. But what of the double factor model? If it reuses factor sensitivities from the single factor models, it would suggest that the bigger home is worth $200,000 more ($100,000 for the additional sqft + $100,000 for the additional bedroom).

Everyone, except perhaps the selling real estate agent, would know that the double factor model is wrong. But why is it wrong? The problem is that both the square footage and the number of bedrooms indicate the size of the house, and by representing that size twice in our model, we double count its effect.

Statistical software packages fortunately avoid the mistake of double counting signals, but they don’t escape a related problem. Rather than double count the ‘house size’ signal, the software would try to split its effect among the factors. It may, for instance, suggest that each 700 sqft adds $60,000 to the value of the house, while an additional bedroom adds $40,000. Or, it may decide that the split should be $20,000 and $80,000, respectively.

Multicollinearity causes problems because it allows factor sensitivity splits to land among a wide range of possibilities. In fact, factor sensitivities can even go negative. A model may suggest that each additional 700 sqft causes the house price to decrease by $40,000 while each additional bedroom causes the house price to increase by $140,000. Sensitivities can traverse even more extreme ranges, so long as they add up to $100,000. Such situations occur frequently in financial models because multicollinearity is present in many financial factors. High profitability, for instance, is often associated with high stock price momentum.
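To see why the data can't pin down the split, here is a minimal Python sketch. It assumes a handful of made-up houses where every extra bedroom comes with exactly 700 extra sqft (real data would be correlated, not in perfect lockstep), and the `predict` and `price_gaps` helpers are hypothetical names used only for illustration. The three splits mentioned above produce identical price gaps between houses, so nothing in the data favors one split over another.

```python
# Hypothetical houses as (bedrooms, sqft). Assumption: each extra bedroom
# comes with exactly 700 extra sqft, so the two factors move in lockstep.
houses = [(3, 2000), (4, 2700), (5, 3400), (2, 1300)]

def predict(bedrooms, sqft, per_700_sqft, per_bedroom, base):
    """Linear price model with one sensitivity per factor."""
    return base + per_700_sqft * (sqft / 700) + per_bedroom * bedrooms

# Three different ways to split the same $100,000 'house size' effect.
splits = [(60_000, 40_000), (20_000, 80_000), (-40_000, 140_000)]

def price_gaps(per_700_sqft, per_bedroom):
    """Predicted price of each house relative to the first house."""
    ref = predict(*houses[0], per_700_sqft, per_bedroom, base=0)
    return [
        round(predict(b, s, per_700_sqft, per_bedroom, base=0) - ref)
        for b, s in houses
    ]

gaps = [price_gaps(*split) for split in splits]
for split, gap in zip(splits, gaps):
    print(split, gap)
```

Every split prints the same list of gaps. Because the fit is equally good for all of them, small amounts of noise in real data are what end up deciding where the split lands.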

The curse of dimensionality is a separate problem that arises when a model considers too many types of data. If you think that having more data is always better, think again. Here’s an example to illustrate this problem.

Suppose you want to build a model that predicts who’ll become the best soccer player in the world. Your dataset contains the top two best players today. One player is short and grew up in Brazil. The other player is tall and grew up in Portugal. Both can run very fast. Both have the last name ‘Ronaldo’.

Let’s say your first model uses player height and speed as factors. Since the two best players have different heights but are similarly fast, the model predicts that a player’s speed will determine their career potential. We’re happy with this model. But what if we also mix the country of origin and the players’ last names into the model as factors? While the model would discard the country of origin, it would latch on to the player name as a significant predictor of a player’s career potential. In fact, the model may even place more emphasis on players’ last names over their speeds, such that it predicts a slow player named ‘Fernando Ronaldo’ to have a better chance of becoming a world class player than a fast player named ‘Bukayo Saka’.

Statistical models fall prey to such problems because they don’t have the ability to distinguish between causation and coincidence, and the more factors you introduce into a model, the greater the chance that one of those factors will correlate with your predictive targets through sheer coincidence. Hollywood often depicts scenarios where a character uses magic to gain some benefit, but pays a price in return. The price paid for adding a new factor to a model is the ‘curse of dimensionality’.
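The coincidence effect is easy to reproduce. The sketch below is an illustration under assumed settings (10 observations, 200 candidate factors, everything drawn as pure random noise); `pearson` and `best_fluke` are hypothetical helper names. Since none of the factors has any real relationship to the target, any correlation it finds is a fluke, and the best fluke gets more impressive as more factors are screened.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_obs = 10              # assumed: very few observations

# The target and every factor are independent noise: no factor truly matters.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
factors = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(200)]

def best_fluke(k):
    """Strongest absolute correlation found among the first k noise factors."""
    return max(abs(pearson(f, target)) for f in factors[:k])

few = best_fluke(2)
many = best_fluke(200)
print(f"best of 2 factors:   {few:.2f}")
print(f"best of 200 factors: {many:.2f}")
```

A model handed all 200 factors would happily latch on to the best fluke, much like the ‘Ronaldo’ surname in the soccer example.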

Tools To Solve The Problems

The twin monsters of multicollinearity and the curse of dimensionality are powerful enough to keep us from models that utilize multitudes of factors. Fortunately though, there are weapons that can defeat them. Two such weapons are Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA). Both work on the same principle: they transform the original dataset into a smaller number of ‘orthogonalized’ factors.

Orthogonalization is a fancy word for making factors independent of, and unrelated to, one another. Let’s go back to our house price model that uses ‘square footage’ and ‘number of bedrooms’ as factors. It’s hard for us to imagine that the two factors would move independently of each other; a house with higher square footage would almost certainly contain more bedrooms. But what if we transformed the factors into ‘total square footage’ and ‘square footage per bedroom’? We can now imagine how these numbers might move independently, which means that the new factors would contain different signals. A model that uses the new factors doesn't have to worry about double counting the same signals, and thus orthogonalization solves the problem of multicollinearity.

PCA and KPCA also help us deal with the curse of dimensionality by shrinking the number of factors. Each factor transformed through these methods carries a prominence score, and we can use such scores to keep only the most prominent factors. For instance, house sizes may carry more prominence than room sizes, so we may decide to discard room sizes from our models.
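As a rough illustration of both ideas, here is a small PCA sketch in Python using NumPy. The eight houses are made-up numbers, and I'm treating the share of variance each new factor explains as its ‘prominence score’, which is one common reading rather than the only one.

```python
import numpy as np

# Hypothetical houses: columns are (square footage, bedrooms).
X = np.array([
    [650, 1], [1200, 2], [1500, 2], [1900, 3],
    [2100, 3], [2600, 4], [2800, 4], [3500, 5],
], dtype=float)

# Standardize each factor so that units (sqft vs. bedroom counts) don't matter.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized data.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

scores = Z @ Vt.T                 # the new, orthogonalized factors
explained = S**2 / np.sum(S**2)   # 'prominence' of each new factor

print("correlation between new factors:",
      round(float(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]), 6))
print("prominence scores:", np.round(explained, 3))

# Keep only the most prominent factor, discarding the rest.
reduced = scores[:, :1]
```

The two new factors come out uncorrelated, and nearly all of the prominence lands on the first one (roughly ‘house size’), so dropping the second loses very little.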

Although PCA and KPCA achieve the same goals, their methods differ, and these differences impact the qualities of the transformed factors. Like higher fidelity photographs, higher quality transformations retain more of the relevant features of the original data points, and models that use higher quality factors produce better results.

PCA uses factor correlations as the gears upon which its machinery turns. The square footage of a house generally increases with the number of rooms, so PCA extracts the concept of ‘house size’ from the common direction of those two factors. But while using correlations works in the house price example, there are situations where it does not. For example, a person’s comfort level goes up with temperature if the person is initially feeling cold, but the opposite happens if the person is initially feeling hot. To describe such situations, mathematicians say that a person’s comfort level is “non-linear” with temperature, and PCA has trouble processing factors that have such non-linear relationships.
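A toy calculation shows how correlation can miss such a relationship entirely. The sketch below assumes, purely for illustration, that comfort peaks at 20°C and falls off with the square of the distance from that temperature; the `pearson` helper is a plain correlation function written out so the example is self-contained.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Assumed comfort curve: best at 20 degrees C, worse the further away you get.
temps = list(range(0, 41))
comfort = [-(t - 20) ** 2 for t in temps]

# Over the full range, the rise and the fall cancel out.
overall = pearson(temps, comfort)

# Looking only at the cold half, warmer clearly means more comfortable.
cold_half = pearson(temps[:21], comfort[:21])

print(f"correlation over the full range: {overall:.2f}")
print(f"correlation when starting cold:  {cold_half:.2f}")
```

Temperature fully determines comfort in this toy setup, yet the overall correlation is zero, so a correlation-driven method sees no relationship to extract.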

KPCA is designed to handle non-linear situations better. There are several types of KPCA, so I’ll limit my discussion to the most popular one, the Gaussian KPCA. Whereas PCA looks at factor correlations, the Gaussian KPCA looks at the similarities between data points. Three bedroom detached houses will score as being similar to each other, as would one bedroom condominiums. The KPCA extracts several housing archetypes from the data, and creates new factors based on the similarity of each house to each archetype.

As with PCA, the KPCA-generated factors are orthogonal to each other, and since each archetype is scored based on its prominence, we can choose to keep only those factors associated with prominent archetypes. Using the new factors thus helps us deal with multicollinearity and the curse of dimensionality, even in situations where factors are non-linear. But even the Gaussian KPCA still has a flaw - it can’t handle missing data.
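For readers who want to see the mechanics, here is a minimal sketch of a textbook Gaussian KPCA in Python with NumPy. The six houses and the kernel width `gamma` are assumptions chosen for illustration, and the eigenvalue share is used as the prominence score.

```python
import numpy as np

# Hypothetical houses: (square footage in thousands, bedrooms).
X = np.array([
    [0.60, 1], [0.70, 1], [0.65, 1],   # one bedroom condominiums
    [2.00, 3], [2.20, 3], [2.10, 3],   # three bedroom detached houses
])

gamma = 0.5  # assumed kernel width; in practice this would be tuned

# Gaussian similarity: 1 for identical houses, near 0 for very different ones.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * sq_dist)

# Center the similarity matrix (the kernel analogue of subtracting the mean).
n = len(X)
J = np.full((n, n), 1.0 / n)
K_centered = K - J @ K - K @ J + J @ K @ J

# Eigen-decompose and sort from most to least prominent.
eigvals, eigvecs = np.linalg.eigh(K_centered)
order = np.argsort(eigvals)[::-1]
eigvals = np.clip(eigvals[order], 0.0, None)  # remove tiny negative round-off
eigvecs = eigvecs[:, order]

scores = eigvecs * np.sqrt(eigvals)   # new factors, one column per archetype
prominence = eigvals / eigvals.sum()

print("first factor:", np.round(scores[:, 0], 2))
print("prominence:  ", np.round(prominence, 3))
```

In this toy dataset the first factor takes one sign for the condominiums and the opposite sign for the detached houses, and it carries almost all of the prominence, so keeping that single factor preserves the condo-versus-detached distinction.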

Real world datasets tend to have patchworks of holes. Sometimes, data is missing because of measurement errors. We may not know who won the poker game last Saturday because Billy forgot to tally up everyone’s chips. In such situations, we can fill in the missing data using reasonable guesses (Susan always wins, so let’s assume she won that game too). Other data points, however, are missing because they have to be. The average price of books sold on Monday is non-existent if no books were sold that day.

Unfortunately, there’s no obvious way to handle such intentionally missing data. So, I (Jin Choi) invented one. I took the original blueprint of the Gaussian KPCA, which measures distances between data points, and modified the distance definition to work with missing data. If two data points don’t contain missing data, the distance remains the same in the new definition. But if only one of the data points has missing data, then I assign a fixed distance, a relatively large number in order to convey that missing data is very different from non-missing data. If both data points have missing data, then I assign another fixed distance, a smaller number to indicate their similarity. I then indulged my sense of vanity and called it the ‘Choi KPCA’.
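To make the three rules concrete, here is a loose Python sketch of such a distance. It only illustrates the description above and is not the patented implementation: applying the rules feature by feature, the two fixed values (`ONE_MISSING_SQ`, `BOTH_MISSING_SQ`), the kernel width, and the function names are all my assumptions for the sake of the example.

```python
from math import exp

# Hypothetical fixed squared distances (assumed values, not from the method):
ONE_MISSING_SQ = 9.0    # large: missing vs. non-missing is very different
BOTH_MISSING_SQ = 1.0   # smaller: two missing values are somewhat alike

def squared_distance(a, b):
    """Squared distance between two data points, where None marks missing data.

    Assumption: the three rules are applied to each feature and summed.
    """
    total = 0.0
    for x, y in zip(a, b):
        if x is None and y is None:
            total += BOTH_MISSING_SQ
        elif x is None or y is None:
            total += ONE_MISSING_SQ
        else:
            total += (x - y) ** 2   # unchanged when nothing is missing
    return total

def gaussian_similarity(a, b, gamma=0.5):
    """Gaussian similarity built on the modified distance (gamma is assumed)."""
    return exp(-gamma * squared_distance(a, b))

# Hypothetical standardized features: (books sold, average price of books sold).
monday = (-1.0, None)    # no books sold, so no average price exists
tuesday = (-1.0, None)
friday = (0.5, 0.2)
saturday = (0.5, 0.4)

print(gaussian_similarity(monday, tuesday))   # both missing: fairly similar
print(gaussian_similarity(monday, friday))    # one missing: very different
print(gaussian_similarity(friday, saturday))  # nothing missing: usual distance
```

The resulting similarities could then feed the same centering and eigen-decomposition steps as an ordinary Gaussian KPCA.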

The use of this novel KPCA allows quants to distill potentially thousands of factors, riddled with missing data, into relatively few orthogonal factors. The method thus allows quants to create models that take advantage of a multitude of factors, without dealing with the headaches of multicollinearity and the curse of dimensionality. I did file a patent on this method, however, so you can’t use it without my permission. But give me a call - there’s a good chance I’ll grant permissions for free in exchange for the pleasure of meeting bright minds.
