One can generalize eigenvalues/vectors to non-square matrices, in which case they are called singular values and singular vectors. The defining equations are much the same, but there are now both left singular vectors {$\mathbf{u}$} and right singular vectors {$\mathbf{v}$} (the singular values themselves are shared, and equal {$\sqrt{\lambda}$}): {$A^TA\mathbf{v} = \lambda\mathbf{v}$}
{$ AA^T \mathbf{u} = \lambda\mathbf{u}$}
{$A = U \Lambda V^T$}.
where {$A$} is {$n*p$}, {$U$} is {$n*n$}, {$\Lambda$} is {$n*p$}, and {$V^T$} is {$p*p$}. But {$A$} has rank at most {$min(n,p)$}, so there are at most {$min(n,p)$} nonzero singular values, and {$n-p$} (or {$p-n$}) of the columns of the larger of {$U$} and {$V$} are paired with zero singular values.
Exercise: show that the nonzero left and right singular values (i.e. the nonzero eigenvalues of {$A^TA$} and {$AA^T$}) are identical.
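A quick numerical sanity check of these relationships (a minimal numpy sketch; the matrix and its size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))          # an arbitrary n*p matrix, n=6, p=4

# Eigenvalues of A^T A (p of them) and A A^T (n of them), sorted descending
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]

# Singular values of A (min(n, p) of them)
sing = np.linalg.svd(A, compute_uv=False)

# The nonzero eigenvalues agree, and the singular values are their square roots
print(np.allclose(eig_AtA[:4], eig_AAt[:4]))      # True
print(np.allclose(sing, np.sqrt(eig_AtA[:4])))    # True
# The extra n - p = 2 eigenvalues of A A^T are (numerically) zero
print(np.allclose(eig_AAt[4:], 0))                # True
```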
Any matrix can be decomposed into a weighted sum of outer products of its left and right singular vectors, {$A = \sum_i \Lambda_{ii} \mathbf{u}_i \mathbf{v}_i^T$} (giving a singular value decomposition).
In practice, we often don’t care about decomposing {$A$} exactly, but only approximating it. For example, we will often take {$A$} to be our “design matrix” of observations {$X$}, and approximate it by the thin SVD obtained when one only keeps the top {$k$} singular vectors and values.
{$A \approx U_k \Lambda_k V_k^T$}, where {$A$} is {$n*p$}, {$U_k$} is {$n*k$}, {$\Lambda_k$} is {$k*k$}, and {$V_k^T$} is {$k*p$}.
The thin SVD gives the best rank {$k$} approximation to a matrix, where “best” means minimum Frobenius norm of the error. The Frobenius norm is basically the {$L_2$} norm of the elements of the matrix: the square root of the sum of the squares of all the elements, which magically equals the square root of the sum of the squares of the singular values.
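To make this concrete, here is a small numpy sketch (the matrix and the choice of {$n$}, {$p$}, {$k$} are illustrative assumptions) that builds the rank-{$k$} approximation and checks the Frobenius-norm claim:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100, 20, 5
A = rng.standard_normal((n, p))

# Full SVD: U is n*n, s holds the min(n, p) singular values, Vt is p*p
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Thin / truncated SVD: keep only the top k singular vectors and values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of the error equals sqrt(sum of squares of the discarded singular values)
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))   # True
```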
This is often useful. Consider a matrix {$A$} where each student is a row, and each column is a particular test result (e.g. the score on an exam question, or on an IQ test, or anything). (More generally, rows are observations, and columns are features.) We can now approximate the student grades by {$\hat{A} = U_k \Lambda_k V_k^T$}. Here we have approximated each original {$p$}-dimensional vector by a {$k$}-dimensional one, called the scores. Each student’s grade vector {$x$} can be written as a weighted combination of the basis vectors (the columns of {$V_k$}); the weights are the “scores”, and the components of the basis vectors are called the loadings of the original features.
For example, in a battery of tests of intellectual abilities, the score on the “largest singular vector” (the singular vector corresponding to the largest singular value; the singular vectors themselves are all normalized to length 1) is often called g, for “general intelligence”.
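As a sketch of how scores and loadings fall out of the thin SVD (the grades matrix here is simulated, and the variable names are just illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n_students, n_tests, k = 200, 10, 2
A = rng.standard_normal((n_students, n_tests))   # rows: students, columns: test results

U, s, Vt = np.linalg.svd(A, full_matrices=False)

loadings = Vt[:k, :]            # k*p: each row is a length-1 basis vector over the tests
scores = U[:, :k] * s[:k]       # n*k: each student's weights on those basis vectors

# Reconstruct the rank-k approximation from scores and loadings
A_hat = scores @ loadings
print(A_hat.shape)              # (200, 10)

# With real test data, scores[:, 0] (the weight on the largest singular
# vector) would be the "g"-like summary of each student.
```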
Why is all this useful? We’ll see many examples over the next couple classes, but here are a few:
but is sometimes a little better — or a little worse.
Linear regression requires estimating {$w$} in {$y = Xw$}, which can be viewed as computing a pseudo-inverse (or “generalized inverse”) {$X^+$} of {$X$}, so that
{$w = X^+y$}
Thus far, we have done this by
{$X^+ = (X^TX)^{-1} X^T$}
But one could also compute a generalized inverse using the SVD:
{$X^+ = (U \Lambda V^T)^+ = V \Lambda^{-1} U^T$}
Then {$X X^+ = U \Lambda V^T V \Lambda^{-1} U^T = U U^T = I$} if {$U$} and {$V$} are orthonormal and {$\Lambda$} is invertible. Similarly, {$X^+ X = V \Lambda^{-1} U^T U \Lambda V^T = V V^T = I$}.
Note that {$U^T U = U U^T = I$} and similarly with {$V$}.
But {$\Lambda$} is often not invertible, since it is, in general, rectangular. So we use a “thin SVD”: {$X^+ = V_k \Lambda_k^{-1} U_k^T$}. In the thin SVD, we keep only {$k$} nonzero singular values, so that the inverse {$\Lambda_k^{-1}$} is well-defined.
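Here is a minimal numpy sketch (simulated data; the variable names are just illustrative) computing {$w$} both ways and checking that they agree when {$X$} is full rank:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 50, 5, 5                      # k = p here, since X is full rank
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Normal-equations pseudo-inverse: (X^T X)^{-1} X^T applied to y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Thin-SVD pseudo-inverse: V_k Lambda_k^{-1} U_k^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_pinv = Vt[:k, :].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T
w_svd = X_pinv @ y

print(np.allclose(w_normal, w_svd))     # True (X is well-conditioned here)
```

When {$X^TX$} is ill-conditioned or rank-deficient, the normal-equations route breaks down, while the thin-SVD route still works by simply dropping the near-zero singular values.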
The following is an extremely useful algorithm that shows how one really does SVD on big data sets.
Input: matrix {$A$} of size {$n \times p$}, the desired hidden state dimension {$k$}, and the number of “extra” singular vectors, {$l$}
Output: The left and right singular vectors {$U_{final}$}, {$V_{final}^T$}
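As a hedged illustration of how such an algorithm typically fits together, here is a minimal randomized-SVD-style sketch in numpy; it assumes the standard random range-finder approach, so treat it as one possible instantiation rather than the exact recipe:

```python
import numpy as np

def randomized_svd(A, k, l, n_iter=2, seed=0):
    """Approximate the top-k SVD of A using a random projection with l extra vectors."""
    n, p = A.shape
    rng = np.random.default_rng(seed)

    # Sketch the column space of A with a random (k + l)-dimensional projection
    Omega = rng.standard_normal((p, k + l))
    Y = A @ Omega                                  # n x (k + l)

    # A couple of power iterations sharpen the subspace (re-orthonormalizing
    # between iterations would be more numerically careful; omitted for brevity)
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)

    Q, _ = np.linalg.qr(Y)                         # orthonormal basis, n x (k + l)

    # Exact SVD of the small (k + l) x p matrix B = Q^T A
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)

    U_final = Q @ U_small[:, :k]                   # n x k left singular vectors
    return U_final, s[:k], Vt[:k, :]               # top-k values and right singular vectors

# Example on a matrix with a rapidly decaying spectrum (the setting where this shines)
rng = np.random.default_rng(4)
A = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 80)) \
    + 0.01 * rng.standard_normal((500, 80))
U_final, s_k, Vt_final = randomized_svd(A, k=10, l=5)
print(np.allclose(s_k, np.linalg.svd(A, compute_uv=False)[:10]))   # True
```

The point of the random projection is that {$A$} is only touched through matrix-vector products, so the expensive SVD is done on a small {$(k+l) \times p$} matrix rather than on the full {$n \times p$} data.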