Given the SVD {$X = UDV^T$}, the columns of {$Z = UD$} are the principal components (PCs), also called “component scores” or sometimes “factor scores”, and the columns of {$V$} are the corresponding loadings of the principal components. (Last time, instead of {$D$} for the matrix with the singular values on the diagonal, I used {$\Lambda$}.)
We write each vector {$x_i$} as {$x_i = \sum_k \alpha_{ik} v_k$}.
What is {$\alpha_{ik}$}?
As described last time, PCA can be viewed as an {$L_2$} optimization, minimizing the distortion (reconstruction error) {$||X -\hat{X}||_F = ||X -ZV^T||_F$}, where we use the thin SVD, keeping only the components corresponding to the {$k$} largest singular values.
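As a quick illustration (not part of the original notes), here is a minimal numpy sketch of this notation on synthetic data: the thin SVD, the component scores {$Z = UD$}, and the rank-{$k$} reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
X = X - X.mean(axis=0)                     # mean-center

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(d) V^T
Z = U * d                                  # component scores Z = UD; note Z = XV
assert np.allclose(Z, X @ Vt.T)            # so the coefficient alpha_ik is just z_ik = x_i . v_k

k = 3
X_hat = Z[:, :k] @ Vt[:k, :]               # rank-k reconstruction X ~ Z_k V_k^T
err = np.linalg.norm(X - X_hat, "fro")     # reconstruction error ||X - X_hat||_F
```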
This optimization can be generalized, for example by adding sparsity constraints, giving sparse PCA.
The idea behind sparse PCA is simple:
{$\mathrm{argmin}_{Z,V} ||X - Z V^T||_F^2$}
subject to the constraints {$||v_i||_1 \leq c_1$} and {$||z_i||_1 \leq c_2$} for {$i = 1, \ldots, k$}.
You can also view this as a penalized-regression version of PCA.
How do the constraints end up in the main loss function?
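In practice the constraints are usually folded into the objective as {$L_1$} penalty terms (the Lagrangian form). As a hedged illustration, scikit-learn’s SparsePCA solves such a penalized variant, with the penalty weight alpha playing the role of the constraint levels; a minimal sketch, assuming the scikit-learn API:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
X = X - X.mean(axis=0)

# alpha controls the L1 penalty on the loadings (larger alpha -> sparser loadings)
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
Z = spca.fit_transform(X)            # scores (analogue of Z above)
V = spca.components_                 # sparse loadings (rows play the role of the v_i)
print("fraction of exactly-zero loadings:", np.mean(V == 0))
```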
We can also use PCA as a way of generating features for use in a supervised setting. The simplest approach is to project the feature vectors {$x_i$} onto the largest principal components {$v_j$}, using the approximation
{$x_i \approx \sum_j z_{ij} v_j$} or {$X \approx ZV^T$},
getting a new set of features {$z_i = x_i V$} (or {$Z = XV$}), and then running the regression with {$z_i$} (or {$Z$}) as the features. This is principal components regression (PCR).
Keeping only the largest principal components is a form of regularization. (why?)
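A minimal numpy sketch of this PCR recipe (illustrative; the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 30, 5
X = rng.standard_normal((n, p)); X = X - X.mean(axis=0)
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                        # p x k: top-k loadings
Z = X @ V_k                           # n x k: new features z_i = x_i V_k
w_z, *_ = np.linalg.lstsq(Z, y, rcond=None)   # ordinary regression on the scores
y_hat = Z @ w_z                       # PCR fitted values
```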
Recall that OLS regression estimates {$\hat{w} = (X^TX)^{-1} X^T y$} or
{$\hat{y}^{OLS} = X(X^TX)^{-1} X^T y$}
but in a singular value decomposition with {$X = UDV^T$} , we have
covariance matrix {$X^TX = (UDV^T)^T(UDV^T) = VD^2V^T$}
and hat matrix
{$X(X^TX)^{-1}X^T = UDV^T (VD^{-2}V^T) VDU^T = UU^T$}
Aside: {$U^TU$} is the {$k \times k$} identity matrix, while {$UU^T$} is the {$n \times n$} hat matrix
so
{$\hat{y}^{OLS} = UU^T y$}
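A quick numerical check of this identity (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p)); X = X - X.mean(axis=0)
y = rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
y_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)   # X (X^T X)^{-1} X^T y
assert np.allclose(y_ols, U @ (U.T @ y))        # equals U U^T y
```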
Ridge regression estimates {$\hat{w} = (X^TX + \gamma I)^{-1} X^T y$} or
{$\hat{y}^{ridge} = X(X^TX + \gamma I)^{-1} X^T y$}
{$X(X^TX + \gamma I)^{-1}X^T = UDV^T \, V(D^2 + \gamma I)^{-1}V^T \, VDU^T = UD(D^2 + \gamma I)^{-1}DU^T = \sum_j u_j \frac{d_j^2}{d_j^2 + \gamma} u_j^T$}
or, equivalently, {$X(X^TX + \gamma I)^{-1} X^T = U \Gamma U^T$}, where {$\Gamma$} is a diagonal matrix with diagonal elements {$d_j^2/(d_j^2 + \gamma)$},
so ridge regression does shrinkage in the singular-value space, shrinking the components along directions with small singular values more than those along directions with large ones. Note that PCR instead does a “feature selection” in the singular-value space, simply zeroing out the directions with small singular values.
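An illustrative numpy check of this shrinkage view: ridge multiplies the coefficient of each {$u_j$} direction by {$d_j^2/(d_j^2+\gamma)$}, while PCR multiplies it by 1 (kept) or 0 (dropped).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, gamma, k = 50, 8, 2.0, 3
X = rng.standard_normal((n, p)); X = X - X.mean(axis=0)
y = rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
y_ridge = X @ np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)
shrink = d**2 / (d**2 + gamma)                 # soft shrinkage factors, one per direction
assert np.allclose(y_ridge, U @ (shrink * (U.T @ y)))

keep = (np.arange(p) < k).astype(float)        # PCR: hard 0/1 selection in the same basis
y_pcr = U @ (keep * (U.T @ y))
```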
Often we have multiple {$y$}’s for each observation, so that we might want to predict a vector {$y_i$} for each observation {$x_i$}. This can be done using standard (OLS) regression, or principal components regression (PCR), but can also be done in other ways.
For example, we can try to simultaneously minimize the criterion for linear regression, {$||y - \hat{y}||_2$} or equivalently {$||y - Zw||_2$}, and the criterion for PCA, {$||X - \hat{X}||_F$} or equivalently {$||X - ZV^T||_F$}. Partial least squares (PLS) can be viewed as doing something like this. (Note that there is a whole class of methods ranging from OLS to PCA, depending on how much weight they put on the two criteria.)
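A hedged sketch of the multi-output case using scikit-learn’s PLSRegression (the synthetic data and parameter choices here are just for illustration):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p, q = 200, 30, 4
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, q)) + 0.1 * rng.standard_normal((n, q))

# n_components plays a role similar to k in PCR, but the components are chosen
# to trade off summarizing X against predicting Y
pls = PLSRegression(n_components=5)
pls.fit(X, Y)
Y_hat = pls.predict(X)                # vector-valued predictions, one row per observation
```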
As we’ve seen, PCA can be viewed as trying to find the directions that capture the maximum variance in the {$X$} space. This is done by finding the eigenvectors of {$X^TX$}. (Assume throughout that we have mean-centered {$X$} and {$Y$}.)
If one has two sets of observations {$X$} and {$Y$}, one can instead try to find the pair of directions that maximizes the covariance between the two projections. This is done by finding the singular vectors of {$X^TY$}. This method is called canonical covariance analysis.
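A minimal numpy sketch (illustrative, with synthetic data): the top singular vectors of the cross-covariance {$X^TY$} give the pair of directions with maximal covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 10, 6
X = rng.standard_normal((n, p)); X = X - X.mean(axis=0)
Y = rng.standard_normal((n, q)); Y = Y - Y.mean(axis=0)

A, s, Bt = np.linalg.svd(X.T @ Y, full_matrices=False)
a1, b1 = A[:, 0], Bt[0]               # directions maximizing cov(X a, Y b) over unit vectors
```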
Having two sets of observations allows an important extension. Instead of maximizing the covariance, one can maximize the correlation. This leads to “canonical correlation analysis” or CCA, which finds the left and right singular vectors of
{$(X^TX)^{-1/2} X^TY (Y^TY)^{-1/2} $}
The left and right singular vectors {$u_1$} and {$v_1$} corresponding to the largest singular value give, after undoing the whitening, the projection directions {$a_1 = (X^TX)^{-1/2} u_1$} and {$b_1 = (Y^TY)^{-1/2} v_1$} that maximize the correlation between {$X a_1$} and {$Y b_1$}. One can view the pre- and post-multiplication of the cross-covariance {$X^TY$} as whitening it, both rescaling and removing redundancy.
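A minimal CCA sketch in numpy (illustrative; the inverse-square-root helper is written via an eigendecomposition just to keep the example self-contained):

```python
import numpy as np

def inv_sqrt(S):
    # symmetric inverse square root of a positive definite matrix
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

rng = np.random.default_rng(0)
n, p, q = 200, 10, 6
X = rng.standard_normal((n, p)); X = X - X.mean(axis=0)
Y = rng.standard_normal((n, q)); Y = Y - Y.mean(axis=0)

M = inv_sqrt(X.T @ X) @ (X.T @ Y) @ inv_sqrt(Y.T @ Y)   # whitened cross-covariance
U_, s, Vt_ = np.linalg.svd(M, full_matrices=False)

a1 = inv_sqrt(X.T @ X) @ U_[:, 0]     # canonical directions in the original coordinates
b1 = inv_sqrt(Y.T @ Y) @ Vt_[0]
r1 = np.corrcoef(X @ a1, Y @ b1)[0, 1]   # first canonical correlation (equals s[0])
```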
Before, in PCA, we found a way to map from {$X$} to a reduced-dimension coordinate system (in the thin SVD case), finding new coordinates for each observation {$x_i$}. CCA does the same, but now finds new coordinates for both {$X$} and {$Y$}, in a way that maximizes the correlation between them.
Note that because it uses correlation, CCA (unlike PCA or canonical covariance analysis) is “scale invariant”.