Dimensionality reduction aims to change the representation of the data into a low-dimensional one while preserving the structure of the data. It is usually unsupervised. Its key uses are similar to those of clustering (e.g. K-means and mixtures of Gaussians, which we will see later). In fact, we will eventually see that PCA is like linear regression in having a probabilistic interpretation. Uses include visualization, compression, and producing fewer, more generalizable features for supervised learning.
Principal Component Analysis is a linear dimensionality reduction technique: it transforms the data by a linear projection onto a lower-dimensional space that preserves as much data variation as possible. Here’s a simple example of projecting 2D points into 1 dimension.
Here’s a really nice applet illustration of the idea: projection of three-dimensional information about countries into 2D.
Suppose our data consists, as usual, of $n$ $m$-dimensional examples $x_1, \ldots, x_n$ with $x_i \in \mathbb{R}^m$. We are looking for an orthonormal basis $u_1, \ldots, u_k$ (with $k < m$) that defines our projection.
We center the data by subtracting the mean $\bar{x} = \frac{1}{n} \sum_i x_i$, since the offset of the data is not relevant to the projection; below we assume the $x_i$ have already been centered. Given a basis, a projected point is

$$z_{ij} = u_j^\top x_i, \qquad j = 1, \ldots, k.$$
We can approximately reconstruct the original point $x_i$ from $z_i$ by the inverse mapping:

$$\hat{x}_i = \sum_{j=1}^{k} z_{ij} u_j.$$
Let the reconstruction or distortion error of our projection be the standard squared error:

$$J = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2.$$
So just like in K-means, we’re looking for a low-distortion representation of the data. PCA finds the $k$-dimensional basis that minimizes this distortion.
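To make the setup concrete, here is a small sketch in Python/NumPy (the course examples use MATLAB; this is a stand-in, and the data and basis are random placeholders). It centers data, projects it onto an arbitrary orthonormal $k$-basis, and measures the resulting distortion:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n = 100 examples, m = 5 features
Xc = X - X.mean(axis=0)              # center the data

# Any orthonormal k-basis defines a projection; here we take the k = 2
# orthonormal columns produced by a QR decomposition of a random matrix.
U, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # U is m-by-k, U.T @ U = I

Z = Xc @ U          # projection coefficients z_ij = u_j^T x_i
X_hat = Z @ U.T     # approximate reconstruction of each centered point
distortion = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
```

A random basis will typically have high distortion; PCA finds the basis that makes this number as small as possible.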
If we use $k = m$ bases, the number of original dimensions, the reconstruction is perfect: the distortion is zero.
Rewriting the distortion using this definition of $\hat{x}_i$, we have:

$$J = \frac{1}{n} \sum_{i=1}^{n} \Big\| \sum_{j=1}^{m} z_{ij} u_j - \sum_{j=1}^{k} z_{ij} u_j \Big\|^2 = \frac{1}{n} \sum_{i=1}^{n} \Big\| \sum_{j=k+1}^{m} z_{ij} u_j \Big\|^2.$$
Using orthonormality ($u_j^\top u_l = \delta_{jl}$), we have:

$$J = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=k+1}^{m} z_{ij}^2.$$
In words, the distortion is the average over examples of the sum of squares of the full-projection coefficients we discard.
We can express the distortion error in terms of the covariance of the features. Recall that $z_{ij} = u_j^\top x_i$ and that the covariance matrix for the (centered) data is:

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top.$$
Then we have:

$$J = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=k+1}^{m} u_j^\top x_i x_i^\top u_j = \sum_{j=k+1}^{m} u_j^\top \Sigma u_j.$$
It can be shown that minimizing the distortion is equivalent to setting the $u_j$’s to be the eigenvectors of $\Sigma$. Here’s a proof sketch. Consider the eigen-decomposition $\Sigma = \sum_{j=1}^{m} \lambda_j q_j q_j^\top$, where the $q_j$ are orthonormal eigenvectors with eigenvalues $\lambda_j$. Suppose $k = m - 1$, so we’re only looking to throw away one basis vector, $u_m$. Writing $u_m = \sum_j w_j q_j$ with $\sum_j w_j^2 = 1$, the distortion is

$$J = u_m^\top \Sigma u_m = \sum_{j=1}^{m} w_j^2 \lambda_j.$$
To minimize this expression (since the $\lambda_j$ are non-negative for a positive semi-definite matrix $\Sigma$), we need to put all the weight on the smallest eigenvalue: $u_m = q_{\min}$, the eigenvector with the smallest eigenvalue. More formally, we can use the Lagrangian formulation. Our problem is to minimize $u^\top \Sigma u$ such that $u$ is a unit vector:

$$\min_u \; u^\top \Sigma u \quad \text{subject to} \quad u^\top u = 1.$$
The Lagrangian is

$$L(u, \lambda) = u^\top \Sigma u - \lambda (u^\top u - 1).$$

Taking the derivative with respect to $u$ and setting it equal to zero, we get:

$$2 \Sigma u - 2 \lambda u = 0 \quad \Longrightarrow \quad \Sigma u = \lambda u.$$
Hence the stationary points of our problem are eigenvectors of $\Sigma$. Note that our problem has a non-convex constraint set, so there are multiple local minima; the lowest is attained at the eigenvector with the smallest eigenvalue. The proof for the general case uses essentially this idea.
If we let $u_1, \ldots, u_m$ be eigenvectors of $\Sigma$, then $\Sigma u_j = \lambda_j u_j$ and $u_j^\top \Sigma u_j = \lambda_j$, the $j$-th eigenvalue of $\Sigma$. Hence,

$$J = \sum_{j=k+1}^{m} \lambda_j.$$
Instead of thinking of PCA as minimizing distortion, we can look at it from the opposite angle and think of it as maximizing the variance of the projected points. The variance of the projected points is

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij}^2 = \sum_{j=1}^{k} u_j^\top \Sigma u_j.$$
Maximizing the variance of the projected points or minimizing the distortion of the reconstruction leads to the same solution, using the eigenvectors of $\Sigma$: the two quantities always sum to the constant $\operatorname{tr}(\Sigma)$. Note that when the $u_j$’s are eigenvectors, we have:

$$\text{variance} = \sum_{j=1}^{k} \lambda_j, \qquad \text{distortion} = \sum_{j=k+1}^{m} \lambda_j.$$
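This relationship can be checked numerically. The sketch below (Python/NumPy rather than the course’s MATLAB; the correlated data is synthetic) computes the eigenvectors of the sample covariance and compares the distortion of the top-$k$ projection with the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated data
Xc = X - X.mean(axis=0)
n, m = Xc.shape

Sigma = Xc.T @ Xc / n              # sample covariance matrix
lam, Q = np.linalg.eigh(Sigma)     # eigh returns ascending eigenvalues
lam, Q = lam[::-1], Q[:, ::-1]     # reorder to descending

k = 2
U = Q[:, :k]                       # top-k principal directions
X_hat = Xc @ U @ U.T               # project and reconstruct
distortion = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
# distortion matches lam[k:].sum(), the sum of the discarded eigenvalues
```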
Note that we only need the top $k$ eigenvectors, not all of them, which is a lot faster to compute. In Matlab, the eigs function returns the top k eigenvectors of a matrix.
PCA can also be found using the SVD command, which avoids explicitly constructing $\Sigma$.
PCA via SVD
Formally, the SVD of a matrix $X$ is just the decomposition of $X$ into three other matrices, which we’ll call $U$, $S$, and $V$. The dimensions of these matrices are given as subscripts in the formula below:

$$X_{n \times m} = U_{n \times n} \, S_{n \times m} \, V^\top_{m \times m}.$$
The columns of $U$ are orthonormal eigenvectors of $X X^\top$. The columns of $V$ are orthonormal eigenvectors of $X^\top X$. The matrix $S$ is diagonal, with the square roots of the eigenvalues of $X X^\top$ (or $X^\top X$; the nonzero eigenvalues of the two are the same) in descending order. These diagonal entries are called the singular values of $X$. For a proof that such a decomposition always exists, check out this SVD tutorial.
The key reason we would prefer to do PCA through SVD is that it is possible to compute the SVD of $X$ without ever forming the matrix $X^\top X$. Forming that matrix and computing its eigenvalues using one of the methods suggested at the end of the eigen-things section above can lead to numerical instabilities. There are ways to compute the SVD that have better numerical guarantees. (We won’t get into the particulars in this class though.)
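A sketch of PCA via SVD, again in Python/NumPy on synthetic data: the right singular vectors of the centered data matrix are the principal directions, and the squared singular values divided by $n$ match the eigenvalues of the covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Thin SVD of the centered data matrix: Xc = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt (columns of V) are the principal directions; the
# eigenvalues of the covariance (1/n) Xc^T Xc are s**2 / n.
lam_svd = s ** 2 / n
lam_eig = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / n))[::-1]
```

Note that the covariance matrix is computed here only to verify the equivalence; in practice the whole point is to skip it.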
SVD is a useful technique in many other settings also. For example, you might find it useful when working on the project data to try latent semantic indexing. This is a technique that computes the SVD of a matrix where each column represents a document and each row represents a particular word. After the SVD, each word is represented by a low-dimensional vector. Words that are related (i.e. that tend to show up in the same documents) will have similar vector representations. For example, if our original set of words was [hospital nurse doctor notebook Batman], then hospital, nurse, and doctor would all be closer to each other than to notebook or Batman. Similarly, one can compute eigenwords by taking the SVD between words and their contexts. If one uses the words before and after each target word as context, words that are used “similarly” are close: nurse and doctor would be similar, but not hospital. Examples here.
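Here is a toy illustration of latent semantic indexing on a made-up term-document matrix (the counts below are invented for the word list in the example; Python/NumPy stand-in for MATLAB):

```python
import numpy as np

# Toy term-document matrix: rows are words, columns are documents.
words = ["hospital", "nurse", "doctor", "notebook", "batman"]
A = np.array([
    [1, 1, 0, 0, 0],   # hospital appears in docs 1 and 2
    [1, 0, 1, 0, 0],   # nurse in docs 1 and 3
    [1, 1, 1, 0, 0],   # doctor in docs 1, 2, and 3
    [0, 0, 0, 1, 1],   # notebook in docs 4 and 5
    [0, 0, 0, 0, 1],   # batman in doc 5
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U[:, :2] * s[:2]               # 2-dimensional vector for each word

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim_nurse = cosine(W[2], W[1])     # doctor vs. nurse: high
sim_notebook = cosine(W[2], W[3])  # doctor vs. notebook: low
```

Because the medical words co-occur in the same documents, their low-dimensional vectors end up nearly parallel, while notebook and Batman land in a different direction.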
Suppose we had a set of face images as input data for some learning task. We could use each pixel in an image as an individual feature. But this is a lot of features. Most likely we could get better performance on the test set using fewer, more generalizable features. PCA gives us a way to directly reduce and generalize the feature space. Here’s an example of how to do this in MATLAB: Eigenfaces.m.
Here is an example where we applied PCA to a set of face images:
And here is an example of reconstruction [from Turk and Pentland, 91] using different numbers of bases.
Unlike in K-means or Gaussian mixtures, we can compute all the principal components at once and then choose $k$ appropriately. Often, $k$ is chosen based on computational needs, to minimize memory or runtime. We can use the eigenvalue spectrum as a guide to good values of $k$. If the eigenvalues drop off quickly after a certain $k$, then very little is lost by not including the later components.
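One common way to operationalize this is to pick the smallest $k$ that retains some target fraction of the total variance (the 95% threshold below is an arbitrary illustration, not a rule from the lecture; Python/NumPy on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: most variance lies in 2 directions, plus small noise.
X = (rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10))
     + 0.05 * rng.normal(size=(300, 10)))
Xc = X - X.mean(axis=0)

lam = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(Xc)))[::-1]
frac = np.cumsum(lam) / lam.sum()          # variance kept by the top-k basis
k = int(np.searchsorted(frac, 0.95)) + 1   # smallest k keeping 95% of variance
```

Since the data was built with two dominant directions, the spectrum drops off sharply after the second eigenvalue and a very small $k$ suffices.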
PCA can be re-formulated in a probabilistic framework as a latent variable model with continuous latent variables. Bishop Chapter 12.2 has a full account of the approach, but the basic idea is to define a Gaussian latent variable $z \sim \mathcal{N}(0, I_k)$ of dimension $k$ and let $x = W z + \mu + \epsilon$, where $W$ is an m-by-k matrix and $\epsilon$ is Gaussian observation noise with variance $\sigma^2$. Maximum likelihood estimation of $W$ (as $\sigma^2 \to 0$) leads to setting the columns of $W$ to be the eigenvectors of $\Sigma$, up to an arbitrary rotation.
In some cases, we want to reduce the dimensionality of a large amount of data or data with many, many dimensions. Performing PCA may be too costly (both in time and memory) in these cases. An alternative approach is to use random projections. The standard procedure for this is: sample a random $k$-by-$m$ matrix $R$ with i.i.d. $\mathcal{N}(0, 1)$ entries, and map each point to $z_i = \frac{1}{\sqrt{k}} R x_i$.
This will not yield the optimal basis we get with PCA, but it will be faster, and there are theoretical results (take a look at the Johnson-Lindenstrauss lemma if you’re curious) proving it is not too far from the PCA result as long as $k$ is large enough.
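A minimal sketch of the random projection procedure (Python/NumPy; the dimensions are arbitrary placeholders), checking that a pairwise distance is roughly preserved as the Johnson-Lindenstrauss lemma predicts:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 500, 1000, 200
X = rng.normal(size=(n, m))

# Random projection: a k-by-m matrix with i.i.d. N(0, 1/k) entries
# (equivalently, N(0, 1) entries scaled by 1/sqrt(k)).
R = rng.normal(scale=1.0 / np.sqrt(k), size=(k, m))
Z = X @ R.T                                # projected data, n-by-k

# Pairwise distances are approximately preserved (Johnson-Lindenstrauss).
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Z[0] - Z[1])
```

No eigendecomposition or SVD is needed at all, which is the point: the cost is a single matrix multiply.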
If we have data that is sparse in the original feature space, then subtracting the data mean, $\bar{x}$, will make the data matrix non-sparse and thus perhaps intractable to work with (it will take up too much memory). While you could just move from PCA to the random projection approach in order to get around this difficulty, it is also possible to simply run PCA without doing the mean subtraction step. Assuming the mean of the data is larger than the actual meaningful variance of the data, the mean would simply be captured by the first eigenvector. So, you could compute the top $k + 1$ eigenvectors and throw away the topmost, using the rest as your principal components.