MoreKernels
Recap: Kernels
In the last class, we introduced the concept of kernels and kernel functions. In the specific case of L2-regularized regression (a.k.a. ridge regression, or MAP estimation with a Gaussian prior), we formulate the problem as a regularized minimization of squared error (below, C corresponds to the {$\sigma^2/\lambda^2$} we had for MAP with a Gaussian prior; see the regression lecture):
{$ \min_{\mathbf{w}} \frac{1}{2}(\mathbf{X}\mathbf{w} - \mathbf{y})^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) + \frac{1}{2}C \mathbf{w}^\top\mathbf{w} $}
and we find two equivalent ways of solving the problem:
{$ \begin{array}{rcl} \mathbf{w} & = & (\mathbf{X}^\top\mathbf{X} + C\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} \qquad \textrm{(primal)} \qquad h(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} \\ \alpha & = & (\mathbf{X}\mathbf{X}^\top + C\mathbf{I})^{-1}\mathbf{y} \qquad \textrm{(dual)} \qquad h(\mathbf{x}) = \mathbf{x}^\top\mathbf{X}^\top\alpha \end{array} $}
If we have {$n$} examples and {$m$} features, then the first (primal) equation requires inverting an {$m \times m$} matrix (i.e., its cost depends on the number of features), and the second (dual) requires inverting an {$n \times n$} matrix (i.e., its cost depends on the number of examples). Intuitively, the dual formulation converts operations over features into operations that depend only on dot products between examples, which appear in the {$n \times n$} matrix {$\mathbf{X}\mathbf{X}^\top$}, known as the Gram matrix, and in the {$1 \times n$} vector {$\mathbf{x}^\top\mathbf{X}^\top$}:
{$(\mathbf{X}\mathbf{X}^\top)_{ij} = \mathbf{x}_i\cdot\mathbf{x}_j, \qquad (\mathbf{x^\top}\mathbf{X}^\top)_i = \mathbf{x}\cdot\mathbf{x}_i.$}
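To make the equivalence concrete, here is a minimal numpy sketch (the data, sizes, and value of C are made up for illustration) checking that the primal and dual solutions produce the same prediction:

```python
import numpy as np

# Hypothetical data: n = 50 examples, m = 5 features.
rng = np.random.default_rng(0)
n, m, C = 50, 5, 1.0
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

# Primal solution: invert an m x m matrix.
w = np.linalg.solve(X.T @ X + C * np.eye(m), X.T @ y)

# Dual solution: invert an n x n matrix (the Gram matrix plus C*I).
alpha = np.linalg.solve(X @ X.T + C * np.eye(n), y)

# Predictions on a new point x: w^T x  vs  x^T X^T alpha.
x_new = rng.normal(size=m)
primal_pred = w @ x_new
dual_pred = (X @ x_new) @ alpha   # x^T X^T alpha, written as (X x) . alpha

print(np.isclose(primal_pred, dual_pred))  # True: the two forms agree
```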
We usually have more examples than features, so at first blush it seems like all we've done is come up with a much less efficient way of solving regularized linear regression. The upside is that we get to use the Kernel Trick: we can replace {$\mathbf{X}\mathbf{X}^\top$} with a Kernel matrix {$\mathbf{K}$}, defined as:
{$\mathbf{K}_{ij} = k(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j), \qquad \phi: \mathcal{X} \rightarrow \mathcal{R}^M.$}
The Kernel matrix is, in a sense, the Gram matrix of {$\phi(\mathbf{X})$}, but the "trick" is that we can compute every element in {$\mathbf{K}$} without ever explicitly computing {$\phi(\mathbf{X})$}.
So long as {$k(\cdot,\cdot)$} is efficiently computable, this means that we can solve the regression problem without ever explicitly running computations on the features, and therefore we can potentially solve a regression problem with an infinite number of features!
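As a small illustration of the trick (not from the original notes), consider the quadratic kernel {$k(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2$}, whose feature map {$\phi$} contains all pairwise products of coordinates. The sketch below computes the same number two ways, with and without ever forming {$\phi$}:

```python
import numpy as np

def explicit_phi(x):
    # Explicit quadratic feature map: all pairwise products x_i * x_j (m^2 features).
    return np.outer(x, x).ravel()

def quadratic_kernel(x, z):
    # Kernel trick: the same quantity computed from a single dot product
    # in the original m-dimensional space.
    return (x @ z) ** 2

rng = np.random.default_rng(1)
x, z = rng.normal(size=3), rng.normal(size=3)

print(explicit_phi(x) @ explicit_phi(z))  # dot product in the 9-dim feature space
print(quadratic_kernel(x, z))             # identical value, never forming phi
```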
Note that if {$\phi(\mathbf{x}) = \mathbf{x}$}, then the Kernel matrix is exactly equal to the Gram matrix:
{$(\mathbf{X}\mathbf{X}^\top)_{ij} = \mathbf{x}_i^\top\mathbf{x}_j = k(\mathbf{x}_i, \mathbf{x}_j),$}
and in this case solving the dual and solving the primal forms of regression are exactly equivalent.
When we substitute the kernel, the resulting method of solving linear regression is called kernelized ridge regression (KRR).
Kernelized methods vs. nearest neighbors
The training and prediction rules for KRR are found by substituting the kernel function for dot products in the dual form of linear regression:
{$\alpha = (\mathbf{K} + C\mathbf{I})^{-1}\mathbf{y} \qquad h(\mathbf{x}) = \sum_{i=1}^n k(\mathbf{x}, \mathbf{x}_i) \alpha_i.$}
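Here is a minimal sketch of KRR training and prediction following these two equations, using a Gaussian (RBF) kernel; the toy data and the kernel width are assumptions for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, sigma2=2.5):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma2))

def krr_fit(X, y, C=1.0, sigma2=2.5):
    # alpha = (K + C I)^{-1} y
    K = gaussian_kernel(X, X, sigma2)
    return np.linalg.solve(K + C * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_test, sigma2=2.5):
    # h(x) = sum_i k(x, x_i) * alpha_i
    return gaussian_kernel(X_test, X_train, sigma2) @ alpha

# Toy usage on made-up 1-D data.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
alpha = krr_fit(X, y)
print(krr_predict(X, alpha, np.array([[0.0], [1.0]])))
```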
The prediction rule should look similar to something we have already seen in class. Recall that the prediction rule for kernel regression looks like this:
{$h(\mathbf{x}) = \frac{\sum_{i=1}^{n} k(\mathbf{x}, \mathbf{x}_i) y_{i}}{\sum_{i=1}^{n} k(\mathbf{x}, \mathbf{x}_i)},$}
which is a weighted average of the labels of the training data, weighted by the kernel function.
If we rearrange the equation a little, we can relate kernel regression to KRR:
{$h(\mathbf{x}) = \sum_{i=1}^{n} k(\mathbf{x}, \mathbf{x}_i) \overbrace{\frac{y_i}{\sum_{i=1}^{n} k(\mathbf{x}, \mathbf{x}_i)}}^{\alpha_i(\mathbf{x})},$}
where we see that the terms highlighted as {$\alpha_i(\mathbf{x})$} play a similar role to the {$\alpha_i$}'s in the KRR prediction rule. In other words, we can see that {$\alpha_i$} in KRR plays two important roles:
- {$\alpha_i$} provides information about the label {$y_i$}. For example, if we use a very peaked Gaussian kernel, {$k(\mathbf{x}_i, \mathbf{x}_i) = 1$} and {$k(\mathbf{x}_i, \mathbf{x}_j) \approx 0$} for all other points. In this case, it is necessary that {$\alpha_i \approx y_i$} to ensure that {$h(\mathbf{x}_i)$} is close to {$y_i$}.
- {$\alpha_i$} must also provide information about the relative weighting of the {$i$}'th example, since there is no automatic normalization by {$\sum_{i=1}^n k(\mathbf{x},\mathbf{x}_i)$} as in kernel regression.
Finally, we observe that we learn the values of {$\alpha_i$} that minimize the squared prediction error on the training data. In other words, KRR is like an intelligent nearest neighbors algorithm: rather than weighting neighbors automatically according to the kernel function, KRR optimizes the {$\alpha_i$} to discover (1) which examples are useful for prediction, and (2) how much those examples should influence the prediction at a new point.
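To make the contrast concrete, here is a minimal sketch (with made-up data) implementing both prediction rules side by side: kernel regression normalizes the kernel weights by {$\sum_i k(\mathbf{x},\mathbf{x}_i)$} and uses the labels directly, while KRR learns the {$\alpha_i$}'s by solving the regularized least-squares problem.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2=2.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma2))

def kernel_regression_predict(X_train, y_train, X_test, sigma2=2.5):
    # Kernel regression: weighted average of training labels.
    K = gaussian_kernel(X_test, X_train, sigma2)
    return (K @ y_train) / K.sum(axis=1)

def krr_predict(X_train, y_train, X_test, C=1.0, sigma2=2.5):
    # KRR: alpha is *learned* by minimizing regularized squared error.
    K_train = gaussian_kernel(X_train, X_train, sigma2)
    alpha = np.linalg.solve(K_train + C * np.eye(len(y_train)), y_train)
    return gaussian_kernel(X_test, X_train, sigma2) @ alpha

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X[:, 0] + 0.1 * rng.normal(size=40)
X_test = np.array([[-2.0], [0.0], [2.0]])
print(kernel_regression_predict(X, y, X_test))
print(krr_predict(X, y, X_test))
```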
Kernel regression vs. KRR: Example
Let's look at the following data as an example:
This data looks very linear, aside from the clump in the center, and in fact a standard linear regression would actually work pretty well here. But let's see what kernel regression does on this data, using a Gaussian kernel with width {$\sigma^2 = 2.5$}:
Attach:clust_example_kreg_small.png
What's happening is that at each point along the curve, we predict a weighted sum of the training points' values, where the weighting is determined by the kernel function. However, the points in the center are clustered close enough together that we can think of them as having roughly the same location. When we take a weighted sum over all training points, that center location gets added to the sum as many times as there are points clustered there; in this case, it has 6 times as much influence over the output as any other point in the training data. The result is that the cluster drags the predicted values at all other points towards its own, and the predictions are highly skewed.
The role of {$\alpha_i$} suggests an immediate remedy to this problem: we could predict much more accurately if the {$\alpha$}'s of the points in the center canceled each other out, reducing the influence of the center cluster and treating it more like any other point. This is exactly what happens if we run KRR on this dataset:
Attach:clust_krr2_small.png
We see that the curve is no longer dominated by the cluster in the center. We can investigate this further by plotting the alphas on the same x-axis as the data:
Attach:clust_alphas_small.png
Although it's not perfect, we can make two observations about the {$\alpha$}'s:
- Since {$y$} is increasing as we increase our {$x$} value, {$\alpha_i$} needs to increase as well, to reflect this.
- The values of the center points fluctuate between positive and negative; this leads to a canceling out of the "influence" of the cluster values.
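For reference, here is a sketch of how one might reproduce this kind of experiment: roughly linear data plus a tight cluster of 6 points near the center, fit with KRR, then the learned {$\alpha_i$}'s of the cluster points are inspected. The exact dataset, kernel width, and regularization constant behind the figures above are not reproduced here, so the specific {$\alpha$} values will differ; the qualitative behavior to look for is the alternating signs that cancel the cluster's influence.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2=2.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma2))

rng = np.random.default_rng(4)

# Roughly linear data ...
x_lin = np.linspace(-5, 5, 15)
y_lin = x_lin + 0.2 * rng.normal(size=15)

# ... plus a tight cluster of 6 points near the center (a guess at the setup above).
x_clust = 0.1 * rng.normal(size=6)
y_clust = 0.1 * rng.normal(size=6)

X = np.concatenate([x_lin, x_clust])[:, None]
y = np.concatenate([y_lin, y_clust])

# Fit KRR: alpha = (K + C I)^{-1} y
C = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + C * np.eye(len(y)), y)

# The alphas of the 6 cluster points: look for alternating signs / cancellation.
print(alpha[-6:])
```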
Locally weighted regression
In locally weighted regression, the linear regression weights {$w$} at each point {$x$} are computed by a regression that gives more weight to training points near the point {$x$} where the prediction is being made, and less weight to points further away:
{$\hat{w}(x) = \arg\min_w \frac{ \sum_i k(x, x_i)(w^\top x_i - y_i)^2 }{\sum_i k(x, x_i)}$}
Of course "nearness" is measured using the kernel function {$k(x,x_i)$} and the summation runs over all {$n$} observations in the training set.
SVMs: the Smarter Nearest Neighbor (TM)
This discussion leads naturally to support vector machines (SVMs). All the kernel methods we've discussed so far require storing all of the training examples for use during prediction. If we have a lot of training data, that is a huge burden! As we will see in class, SVMs push the "intelligent nearest neighbors" concept even further, by finding a compact set of examples that we can use for prediction: in other words, most of the {$\alpha_i$} that the SVM finds will be zero. The useful examples are called support vectors, for reasons that will become clear.
However, SVMs bring a new principle to our discussion, known as max-margin learning. (We first mentioned margins in our discussion of Boosting.) The SVM objective uses a new type of loss, called the "hinge loss", which is not differentiable everywhere, and we express the principle of maximizing the margin on training data through constraints on this new objective. The practical result is that the numerical and analytical methods we've been using so far for MLE and MAP (taking the derivative and setting it to zero, or using gradient descent) will no longer suffice for our needs.
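As a small preview (not part of these notes), the hinge loss for a label {$y \in \{-1,+1\}$} and a prediction score {$f(\mathbf{x})$} is {$\max(0, 1 - y f(\mathbf{x}))$}; a tiny sketch:

```python
import numpy as np

def hinge_loss(y, score):
    # Zero when the prediction is on the correct side with margin >= 1,
    # grows linearly otherwise; not differentiable at y * score == 1.
    return np.maximum(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))   # 0.0  (correct with a comfortable margin)
print(hinge_loss(+1, 0.5))   # 0.5  (correct but inside the margin)
print(hinge_loss(-1, 0.5))   # 1.5  (wrong side of the decision boundary)
```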
In order to solve the SVM problem, we need concepts from the theory of constrained optimization: namely, Lagrange Duality.
Back to Lectures