Principal Components Analysis Just Blew My Mind
I like to think of myself as a pretty data-savvy individual. I’m a 3-star black belt in spreadsheet jujitsu and I’m becoming more and more of a SAS-hole, but it turns out that there was a huge hole in my game.
Principal Components
I had heard the term “principal components” before, probably while reading a paper about some super-sexy analysis that someone else did, but I’ve never really known what the heck it meant or how to use it myself. In fact, I’ve sat and struggled with the very issue that principal components analysis is meant to solve without ever realizing that the solution was oh-so-simple and intuitive.
The big idea is this: when you have a lot of independent variables, you start to run into problems because many statistical techniques (regression in particular) behave badly when their inputs are highly correlated with one another. When you start to see high levels of correlation between variables, you have to either drop some variables (and lose the information they contain) or risk unstable, hard-to-interpret estimates in the model. At least that’s what I thought. Then along came a spreadsheet… (isn’t that how all romantic comedies should start?)
Principal components analysis lets you create new variables, each one a weighted combination of multiple, correlated, independent variables. The new variables are uncorrelated with each other, and the first few carry as much of the original variance as possible, so you can keep a handful of them and lose very little information.
So how does it work?
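In a simple, 2-dimensional context, think of principal components as fitting a new set of axes to a scatterplot (as shown above) so that the spread of the points along the first axis is as large as possible, which is the same as making the spread around that axis as small as possible. This is similar (intuitively, but not mathematically) to fitting a linear regression line: regression measures its misses vertically, while principal components measures them perpendicular to the line. The direction of that first axis is the first principal component, and the directions of the new axes are the eigenvectors of the data’s covariance matrix. After finding the eigenvectors, each point can be projected onto the new axes to create new variables, and you keep only the first one or few, thus limiting the number of variables for consideration in the model. This results in a set of uncorrelated variables that capture the largest amount of variance possible.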
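To make the 2-dimensional picture concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is made up (two variables built to be strongly correlated), and the helper names `covariance` and `pca_2d` are hypothetical, not from any package. It uses a closed-form shortcut that only works for two variables: the angle of the axis with the most spread can be read straight off the variances and the covariance.

```python
import math
import random


def covariance(u, v):
    """Sample covariance of two equal-length lists (variance when u is v)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)


def pca_2d(xs, ys):
    """Return the two new axes and each point's coordinates along them."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Center the data so the new axes pass through the middle of the cloud.
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # For a symmetric 2x2 covariance matrix, the direction of greatest
    # variance (the leading eigenvector) lies at this angle from the x-axis.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis1 = (math.cos(theta), math.sin(theta))
    axis2 = (-math.sin(theta), math.cos(theta))  # perpendicular to axis1
    # Project every centered point onto each new axis.
    pc1 = [a * axis1[0] + b * axis1[1] for a, b in zip(cx, cy)]
    pc2 = [a * axis2[0] + b * axis2[1] for a, b in zip(cx, cy)]
    return axis1, axis2, pc1, pc2


# Made-up data: y is mostly a multiple of x plus a little noise.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)]
ys = [0.8 * x + random.gauss(0, 0.3) for x in xs]

axis1, axis2, pc1, pc2 = pca_2d(xs, ys)

total = covariance(pc1, pc1) + covariance(pc2, pc2)
print("covariance of the original variables:", covariance(xs, ys))
print("covariance of the new variables:", covariance(pc1, pc2))
print("share of variance on the first axis:", covariance(pc1, pc1) / total)
```

The two original variables are heavily correlated, while the two new ones have a covariance of zero up to rounding error. If nearly all of the variance lands on the first axis, `pc1` alone can stand in for both `xs` and `ys` in a model.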
The math behind principal components analysis is scary looking, as you can see on Wikipedia, but luckily, just about every statistical software package will do the calculations for you.
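For the curious, the core of that scary-looking math fits in a few lines. This is a sketch in notation I am assuming here (it may not match Wikipedia's): X is the data matrix with n rows of observations and p columns of variables, each column already shifted to have mean zero; the v's are unit-length eigenvectors and the z's are the new variables.

```latex
\begin{aligned}
C &= \frac{1}{n-1} X^{\top} X && \text{(covariance matrix of the variables)} \\
C v_k &= \lambda_k v_k, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0 && \text{(eigenvectors and eigenvalues)} \\
z_k &= X v_k && \text{(the $k$-th new variable)} \\
\operatorname{Var}(z_k) &= \lambda_k, \quad \operatorname{Cov}(z_j, z_k) = 0 \ \text{for } j \ne k && \text{(uncorrelated, sorted by variance)}
\end{aligned}
```

The last line is the whole payoff: the new variables are uncorrelated, and the eigenvalues say how much variance each one carries, so you can tell how many are worth keeping.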
So there you have it. How many problems will that solve for me? Tons. How many will it solve for you?