Choi KPCA: A Tool To Help You Create Many Factor Quant Models

Author

Jin Won Choi

Problems Plaguing Many Factor Models

Statisticians have names for the problems plaguing the new model: ‘multicollinearity’ and the ‘curse of dimensionality’.

Multicollinearity describes situations where factors are highly correlated to one another. Let’s use an example to illustrate the issue. Suppose we want to predict the price of a house. It would be natural to use the square footage of the house, and the number of bedrooms, as factors of our model.

If we construct a model that only uses the square footage of the house, we’d expect the value of the house to increase in proportion to the square footage, say by $100,000 for each additional 700 sqft. If we construct a model using only the number of bedrooms instead, we’d also expect the house price to increase with the number of bedrooms, say by $100,000 for each additional bedroom. But what happens if you have a model that incorporates both factors? How would the house price change in relation to each factor?

We know that factor sensitivities wouldn’t stay constant. To see why, compare a 3 bedroom, 2,000 sqft home with a 4 bedroom, 2,700 sqft home. A model that only takes square footage into account would value the bigger home at $100,000 more. The model that only takes bedrooms into account would also value the bigger home at $100,000 more. But what of the double factor model? If it reuses factor sensitivities from the single factor models, it would suggest that the bigger home is worth $200,000 more ($100,000 for the additional sqft + $100,000 for the additional bedroom).

Everyone, except perhaps the selling real estate agent, would know that the double factor model is wrong. But why is it wrong? The problem is that both the square footage and the number of bedrooms indicate the size of the house, and by representing that size twice in our model, we double count its effect.

Fortunately, statistical software packages avoid the mistake of double counting signals. But they don’t escape a related problem. Rather than double count the ‘house size’ signal, the software tries to split its effect among the factors. It may, for instance, suggest that each additional 700 sqft adds $60,000 to the value of the house, while an additional bedroom adds $40,000. Or, it may decide that the split should be $20,000 and $80,000, respectively.

Multicollinearity causes problems because it allows factor sensitivity splits to land among a wide range of possibilities. In fact, factor sensitivities can even go negative. A model may suggest that each additional 700 sqft causes the house price to decrease by $40,000 while each additional bedroom causes the house price to increase by $140,000. Sensitivities can traverse even more extreme ranges, so long as they add up to $100,000. Such situations occur frequently in financial models because multicollinearity is present in many financial factors. High profitability, for instance, is often associated with high stock price momentum.
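The arithmetic behind the double counting and the splits can be sketched in a few lines of Python. The two homes and the dollar figures come from the example above; the helper name `price_gap` is made up for illustration.

```python
# The two homes from the example: (square footage, bedrooms).
small_home = (2000, 3)
big_home = (2700, 4)

def price_gap(per_700_sqft, per_bedroom, small=small_home, big=big_home):
    """Extra value a two factor model assigns to the bigger home."""
    extra_sqft = big[0] - small[0]
    extra_bedrooms = big[1] - small[1]
    return per_700_sqft * extra_sqft / 700 + per_bedroom * extra_bedrooms

# Reusing both single factor sensitivities double counts house size.
double_counted = price_gap(100_000, 100_000)

# The splits mentioned in the text; each pair adds up to $100,000.
splits = [(60_000, 40_000), (20_000, 80_000), (-40_000, 140_000)]
gaps = [price_gap(per_sqft, per_bed) for per_sqft, per_bed in splits]

print(double_counted)
print(gaps)
```

Every split assigns the same extra value to the bigger home, so this pair of homes gives the software no way to prefer one split over another. That is why the fitted sensitivities are so unstable.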

The curse of dimensionality is a separate problem that arises when a model considers too many types of data. If you think that having more data is always better, think again. Here’s an example to illustrate this problem.

Suppose you want to build a model that predicts who’ll become the best soccer player in the world. Your dataset contains the top two best players today. One player is short and grew up in Brazil. The other player is tall and grew up in Portugal. Both can run very fast. Both have the last name ‘Ronaldo’.

Let’s say your first model uses player height and speed as factors. Since the two best players have different heights but are similarly fast, the model predicts that a player’s speed will determine their career potential. We’re happy with this model. But what if we also mix the country of origin and the players’ last names into the model as factors? While the model would discard the country of origin, it would latch on to the player name as a significant predictor of a player’s career potential. In fact, the model may even place more emphasis on players’ last names over their speeds, such that it predicts a slow player named ‘Fernando Ronaldo’ to have a better chance of becoming a world class player than a fast player named ‘Bukayo Saka’.

Statistical models fall prey to such problems because they don’t have the ability to distinguish between causation and coincidence, and the more factors you introduce into a model, the greater the chance that one of those factors will correlate with your predictive targets through sheer coincidence. Hollywood often depicts scenarios where a character uses magic to gain some benefit, but pays a price in return. The price paid for adding a new factor to a model is the ‘curse of dimensionality’.
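A small simulation gives a feel for how coincidence creeps in. It is a sketch under made-up assumptions: 10 players, a random ‘career potential’ score, and up to 200 factors that are pure noise. None of the names or numbers come from a real dataset.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_players = 10
target = [rng.gauss(0, 1) for _ in range(n_players)]

# 200 factors of pure noise: none has any real link to the target.
noise_factors = [[rng.gauss(0, 1) for _ in range(n_players)]
                 for _ in range(200)]

def best_spurious_fit(n_factors):
    """Largest absolute correlation among the first n_factors noise factors."""
    return max(abs(pearson(f, target)) for f in noise_factors[:n_factors])

for k in (2, 20, 200):
    print(k, round(best_spurious_fit(k), 2))
```

Because each larger set of factors contains the smaller one, the best coincidental fit can only stay the same or improve as factors are added, even though every factor is noise.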

Tools To Solve The Problems

The twin monsters of multicollinearity and the curse of dimensionality are powerful enough to keep us from models that utilize multitudes of factors. Fortunately though, there are weapons that can defeat them. Two such weapons are Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA). Both work on the same principle: they transform the original dataset into a smaller number of ‘orthogonalized’ factors.

Orthogonalized is a fancy word for factors that are independent of, and unrelated to, one another. Let’s go back to our house price model that uses ‘square footage’ and ‘number of bedrooms’ as factors. It’s hard for us to imagine that the two factors would move independently of each other; a house with higher square footage would almost certainly contain more bedrooms. But what if we transformed the factors into ‘total square footage’ and ‘square footage per room’? We can now imagine how these numbers might move independently, which means that the new factors would contain different signals. A model that uses the new factors doesn’t have to worry about double counting the same signals, and thus orthogonalization solves the problem of multicollinearity.

PCA and KPCA also help us deal with the curse of dimensionality by shrinking the number of factors. Each factor transformed through these methods carries a prominence score, and we can use such scores to keep only the most prominent factors. For instance, house sizes may carry more prominence than room sizes, so we may decide to discard room sizes from our models.
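For two factors, the PCA transformation is simple enough to write out by hand. The sketch below uses six made-up houses and relies on a standard result: once two factors are standardized, their principal components are the scaled sum and the scaled difference, with variances 1 + r and 1 - r, where r is their correlation. Reading ‘prominence’ as the share of variance explained is an interpretation, and the names `size` and `contrast` are illustrative labels.

```python
import math

# Made-up houses: square footage and bedroom counts (illustrative only).
sqft = [900, 1300, 1600, 2000, 2700, 3100]
bedrooms = [1, 2, 3, 3, 4, 4]

def standardize(values):
    """Rescale to zero mean and unit variance so units don't matter."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def covariance(xs, ys):
    """Covariance of two lists that are already centered on zero."""
    return sum(x * y for x, y in zip(xs, ys)) / len(xs)

z_sqft, z_bed = standardize(sqft), standardize(bedrooms)
r = covariance(z_sqft, z_bed)  # correlation between the two factors

# Principal components of two standardized factors: scaled sum and difference.
size = [(a + b) / math.sqrt(2) for a, b in zip(z_sqft, z_bed)]
contrast = [(a - b) / math.sqrt(2) for a, b in zip(z_sqft, z_bed)]

# Share of the total variance (which is 2) carried by each new factor.
prominence = {"size": (1 + r) / 2, "contrast": (1 - r) / 2}

print(abs(covariance(size, contrast)) < 1e-9)  # new factors are uncorrelated
print(prominence)
```

The `size` factor plays the role of ‘house size’, while `contrast` loosely measures whether a house has more floor space than its bedroom count would suggest. Dropping `contrast` because of its low share is the factor-shrinking step described above.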

Although PCA and KPCA achieve the same goals, their methods differ, and these differences impact the qualities of the transformed factors. Like higher fidelity photographs, higher quality transformations retain more of the relevant features of the original data points, and models that use higher quality factors produce better results.

PCA uses factor correlations as the gears upon which its machinery turns. The square footage of a house generally increases with the number of rooms, so PCA extracts the concept of ‘house size’ from the common direction of those two factors. But while using correlations works in the house price example, there are situations where it does not. For example, a person’s comfort level goes up with temperature if the person is initially feeling cold, but the opposite happens if the person is initially feeling hot. To describe such situations, mathematicians say that a person’s comfort level is “non-linear” with temperature, and PCA has trouble processing factors that have such non-linear relationships.
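A toy version of the comfort example shows why correlation is blind here. The comfort curve is an assumption made for illustration: temperatures run from 0 to 40 degrees and comfort peaks at 20.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

temps = list(range(0, 45, 5))              # 0, 5, ..., 40
comfort = [-(t - 20) ** 2 for t in temps]  # assumed curve, peaks at 20

overall = pearson(temps, comfort)              # whole range
cold_side = pearson(temps[:5], comfort[:5])    # 0 to 20: warmer is better
hot_side = pearson(temps[4:], comfort[4:])     # 20 to 40: warmer is worse

print(round(overall, 6), round(cold_side, 2), round(hot_side, 2))
```

The relationship is strong on each side of the peak, yet the two sides cancel out over the whole range, so a correlation-based method sees almost nothing.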

KPCA is designed to handle non-linear situations better. There are several types of KPCA, so I’ll limit my discussion to the most popular one, the Gaussian KPCA. Whereas PCA looks at factor correlations, the Gaussian KPCA looks at the similarities between data points. Three bedroom detached houses will score as being similar to each other, as would one bedroom condominiums. The KPCA extracts several housing archetypes from the data, and creates new factors based on the similarity of each house to each archetype.

As with PCA, the KPCA-generated factors are orthogonal to each other, and since each archetype is scored based on its prominence, we can choose to keep only those factors associated with prominent archetypes. Using the new factors thus helps us deal with multicollinearity and the curse of dimensionality, even in situations where factors are non-linear. But even the Gaussian KPCA still has a flaw - it can’t handle missing data.
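The mechanics of the Gaussian KPCA can be sketched with the standard recipe: build a matrix of Gaussian similarities, center it, and take its leading eigenvectors. The version below is a minimal illustration that extracts only the first factor using power iteration. The six houses, the `gamma` value and the function names are all made up; in practice a library such as scikit-learn’s `KernelPCA` would do this work.

```python
import math

def gaussian_kernel_matrix(points, gamma):
    """Similarity of every pair: exp(-gamma * squared distance)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in points] for p in points]

def center_kernel(K):
    """Double-center the kernel matrix (the kernel analogue of demeaning)."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def top_component(Kc, iterations=200):
    """Leading eigenvalue and unit eigenvector, found by power iteration."""
    n = len(Kc)
    v = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Kv = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, Kv))
    return eigenvalue, v

# Made-up houses: (square footage in thousands, bedrooms).
condos = [(0.60, 1), (0.70, 1), (0.65, 1)]
detached = [(1.9, 3), (2.0, 3), (2.1, 3)]
houses = condos + detached

Kc = center_kernel(gaussian_kernel_matrix(houses, gamma=0.5))
eigenvalue, v = top_component(Kc)

# Each house's value on the first KPCA factor.
scores = [math.sqrt(eigenvalue) * x for x in v]
print([round(s, 2) for s in scores])
```

In this toy dataset, the first factor gives the condos scores of one sign and the detached houses scores of the other, which is the sense in which the method picks up archetypes from similarities alone. Which group ends up positive is arbitrary.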

Real world datasets tend to have patchworks of holes. Sometimes, data is missing because of measurement errors. We may not know who won the poker game last Saturday because Billy forgot to tally up everyone’s chips. In such situations, we can fill in the missing data using reasonable guesses (Susan always wins, so let’s assume she won that game too). Other data points, however, are missing because they have to be. The average price of books sold on Monday is non-existent if no books were sold that day.

Unfortunately, there’s no obvious way to handle such intentionally missing data. So, I (Jin Choi) invented one. I took the original blueprint of the Gaussian KPCA, which measures distances between data points, and modified the distance definition to work with missing data. If two data points don’t contain missing data, the distance remains the same in the new definition. But if only one of the data points has missing data, then I assign a fixed distance, a relatively large number in order to convey that missing data is very different from non-missing data. If both data points have missing data, then I assign another fixed distance, a smaller number to indicate their similarity. I then indulged my sense of vanity and called it the ‘Choi KPCA’.
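To make the idea concrete, here is a rough sketch of a distance with this general shape. It is only one possible reading of the description above, not the patented method itself. Applying the rule factor by factor, the two penalty values, and names like `squared_distance_with_gaps` are all assumptions made for illustration.

```python
import math

def squared_distance_with_gaps(a, b, one_missing=4.0, both_missing=0.25):
    """Squared distance between two data points; None marks missing data.

    Assumed rule, applied to each factor in turn:
      - both values present: ordinary squared difference
      - exactly one missing: a fixed, relatively large penalty
      - both missing:        a fixed, smaller penalty
    """
    total = 0.0
    for x, y in zip(a, b):
        if x is None and y is None:
            total += both_missing
        elif x is None or y is None:
            total += one_missing
        else:
            total += (x - y) ** 2
    return total

def gaussian_similarity(a, b, gamma=0.5):
    """Gaussian similarity built on the gap-aware distance."""
    return math.exp(-gamma * squared_distance_with_gaps(a, b))

# Made-up bookstore days: (books sold, average price), both rescaled.
# The average price does not exist on days with no sales.
monday = (0.0, None)
tuesday = (0.0, None)
wednesday = (1.2, 0.8)

print(gaussian_similarity(monday, tuesday))    # two no-sale days: similar
print(gaussian_similarity(monday, wednesday))  # no-sale vs. sale day: dissimilar
```

With similarities defined this way, the rest of the Gaussian KPCA machinery can run unchanged on data that has holes in it.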

The use of this novel KPCA allows quants to distill potentially thousands of factors, riddled with missing data, into relatively few orthogonal factors. The method thus allows quants to create models that take advantage of a multitude of factors, without dealing with the headaches of multicollinearity and the curse of dimensionality. I did file a patent on this method, however, so you can’t use it without my permission. But give me a call - there’s a good chance I’ll grant permission for free in exchange for the pleasure of meeting bright minds.
