One can generalize eigenvalues/vectors to non-square matrices, in which case they are called singular values and singular vectors. The defining equations are much the same, but there are now both left singular vectors {$\mathbf{u}$} and right singular vectors {$\mathbf{v}$} (the singular values themselves are shared, and equal {$\sqrt{\lambda}$}): {$A^TA\mathbf{v} = \lambda\mathbf{v}$}
{$ AA^T \mathbf{u} = \lambda\mathbf{u}$}
{$A = U \Lambda V^T$}.
where {$A$} is {$n*p$}, {$U$} is {$n*n$}, {$\Lambda$} is {$n*p$}, and {$V^T$} is {$p*p$}. But {$A$} has rank at most {$min(n,p)$}, so there are at most {$min(n,p)$} nonzero singular values, and {$n-p$} (or {$p-n$}) of the columns of the larger of {$U$} and {$V$} are paired with zero singular values.
Exercise: show that the nonzero left and right singular values (i.e. the nonzero eigenvalues of {$A^TA$} and {$AA^T$}) are identical.
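A quick numerical sanity check of these relationships (a minimal numpy sketch; the matrix and its size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))          # an arbitrary n*p matrix, n=6, p=4

# Eigenvalues of A^T A (p of them) and A A^T (n of them), sorted descending
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]

# Singular values of A (min(n, p) of them)
sing = np.linalg.svd(A, compute_uv=False)

# The nonzero eigenvalues agree, and the singular values are their square roots
print(np.allclose(eig_AtA[:4], eig_AAt[:4]))      # True
print(np.allclose(sing, np.sqrt(eig_AtA[:4])))    # True
# The extra n - p = 2 eigenvalues of A A^T are (numerically) zero
print(np.allclose(eig_AAt[4:], 0))                # True
```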
Any matrix can be decomposed into a weighted sum of outer products of its left and right singular vectors, {$A = \sum_i \Lambda_{ii} \mathbf{u}_i \mathbf{v}_i^T$} (giving a singular value decomposition).
In practice, we often don’t care about decomposing {$A$} exactly, but only approximating it. For example, we will often take {$A$} to be our “design matrix” of observations {$X$}, and approximate it by the thin SVD obtained when one only keeps the top {$k$} singular vectors and values.
{$A \approx U_k \Lambda_k V_k^T$}, where {$A$} is {$n*p$}, {$U_k$} is {$n*k$}, {$\Lambda_k$} is {$k*k$}, and {$V_k^T$} is {$k*p$}.
The thin SVD gives the best rank {$k$} approximation to a matrix, where “best” means minimum Frobenius norm of the error. The Frobenius norm is basically the {$L_2$} norm of the elements of the matrix: the square root of the sum of the squares of all the elements, which magically equals the square root of the sum of the squares of the singular values.
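To make this concrete, here is a small numpy sketch (the matrix and the choice of {$n$}, {$p$}, {$k$} are illustrative assumptions) that builds the rank-{$k$} approximation and checks the Frobenius-norm claim:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100, 20, 5
A = rng.standard_normal((n, p))

# Full SVD: U is n*n, s holds the min(n, p) singular values, Vt is p*p
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Thin / truncated SVD: keep only the top k singular vectors and values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of the error equals sqrt(sum of squares of the discarded singular values)
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # True
```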
This is often useful. Consider a matrix {$A$} where each student is a row, and each column is a particular test result (e.g. the score on an exam question, or on an IQ test, or anything). (More generally, rows are observations, and columns are features.) We can now approximate the student grades by {$\hat{A} = U_k \Lambda_k V_k^T$}. Here we have approximated each original {$p$}-dimensional vector by a {$k$}-dimensional one, called the scores. Each student’s grade vector {$x$} can be written as a weighted combination of the basis vectors (the columns of {$V_k$}); the weights are the “scores”, and the components of the basis vectors are called the loadings of the original features.
For example, in a battery of tests of intellectual abilities, the score on the “largest singular vector” (the singular vector corresponding to the largest singular value; the singular vectors themselves are all normalized to length 1) is often called g, for “general intelligence”.
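As a sketch of how scores and loadings fall out of the thin SVD (the grades matrix here is simulated, and the variable names are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n_students, n_tests, k = 200, 10, 2
A = rng.standard_normal((n_students, n_tests))   # rows: students, columns: test results

U, s, Vt = np.linalg.svd(A, full_matrices=False)

loadings = Vt[:k, :]            # k*p: each row is a length-1 basis vector over the tests
scores = U[:, :k] * s[:k]       # n*k: each student's weights on those basis vectors

# Reconstruct the rank-k approximation from scores and loadings
A_hat = scores @ loadings
print(A_hat.shape)              # (200, 10)

# With real test data, scores[:, 0] (the weight on the largest singular
# vector) would be the "g"-like summary of each student.
```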
Why is all this useful? We’ll see many examples over the next couple classes, but here are a few:
but is sometimes a little better — or a little worse.
Linear regression requires estimating {$w$} in {$y = Xw$}, which can be viewed as computing a pseudo-inverse (or “generalized inverse”) {$X^+$} of {$X$}, so that
{$w = X^+y$}
Thus far, we have done this by
{$X^+ = (X^TX)^{-1} X^T$}
But one could also compute a generalized inverse using the SVD:
{$X^+ = (U \Lambda V^T)^+ = V \Lambda^{-1} U^T$}
Then {$X X^+ = U \Lambda V^T V \Lambda^{-1} U^T = U U^T = I$} if {$U$} and {$V$} are orthonormal and {$\Lambda$} is invertible. Similarly, {$X^+ X = V \Lambda^{-1} U^T U \Lambda V^T = V V^T = I$}.
Note that {$U^T U = U U^T = I$} and similarly with {$V$}.
But {$\Lambda$} is often not invertible, since it is, in general, rectangular. So we use a “thin SVD”: {$X^+ = V_k \Lambda_k^{-1} U_k^T$}. In the thin SVD, we keep only {$k$} nonzero singular values, so that the inverse {$\Lambda_k^{-1}$} is well-defined.
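Here is a minimal numpy sketch (simulated data; the variable names are just illustrative) computing {$w$} both ways and checking that they agree when {$X$} is full rank:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 50, 5, 5                      # k = p here, since X is full rank
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Normal-equations pseudo-inverse: (X^T X)^{-1} X^T applied to y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Thin-SVD pseudo-inverse: V_k Lambda_k^{-1} U_k^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv = Vt[:k, :].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T
w_svd = X_pinv @ y

print(np.allclose(w_normal, w_svd))     # True (X is well-conditioned here)
```

When {$X^TX$} is ill-conditioned or rank-deficient, the normal-equations route breaks down, while the thin-SVD route still works by simply dropping the near-zero singular values.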
The following is an extremely useful algorithm that shows how one really does SVD on big data sets.
Input: matrix {$A$} of size {$n \times p$}, the desired hidden state dimension {$k$}, and the number of “extra” singular vectors, {$l$}
Output: The left and right singular vectors {$U_{final}$}, {$V_{final}^T$}
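As a hedged illustration of how such an algorithm typically fits together, here is a minimal randomized-SVD-style sketch in numpy; it assumes the standard random range-finder approach, so treat it as one possible instantiation rather than the exact recipe:

```python
import numpy as np

def randomized_svd(A, k, l, n_iter=2, seed=0):
    """Approximate the top-k SVD of A using a random projection with l extra vectors."""
    n, p = A.shape
    rng = np.random.default_rng(seed)

    # Sketch the column space of A with a random (k + l)-dimensional projection
    Omega = rng.standard_normal((p, k + l))
    Y = A @ Omega                                  # n x (k + l)

    # A couple of power iterations sharpen the subspace (re-orthonormalizing
    # between iterations would be more numerically careful; omitted for brevity)
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)

    Q, _ = np.linalg.qr(Y)                         # orthonormal basis, n x (k + l)

    # Exact SVD of the small (k + l) x p matrix B = Q^T A
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)

    U_final = Q @ U_small[:, :k]                   # n x k left singular vectors
    return U_final, s[:k], Vt[:k, :]               # top-k values and right singular vectors

# Example on a matrix with a rapidly decaying spectrum (the setting where this shines)
rng = np.random.default_rng(4)
A = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 80)) \
    + 0.01 * rng.standard_normal((500, 80))
U_final, s_k, Vt_final = randomized_svd(A, k=10, l=5)
print(np.allclose(s_k, np.linalg.svd(A, compute_uv=False)[:10]))   # True
```

The point of the random projection is that {$A$} is only touched through matrix-vector products, so the expensive SVD is done on a small {$(k+l) \times p$} matrix rather than on the full {$n \times p$} data.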