Sampling

Sampling in n

A naive version is just to pick a subsample of the data, but one can do much better!
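As a baseline, the naive version can be sketched as follows. This is only an illustration under assumed names (`subsample_ols`, the synthetic data, and the choice of ordinary least squares are not from the lecture); `r` is the fraction of rows kept.

```python
import numpy as np

def subsample_ols(X, y, r, rng):
    """Fit least squares on a uniformly random fraction r of the rows."""
    n = X.shape[0]
    m = max(1, int(round(r * n)))              # number of rows kept, about r*n
    idx = rng.choice(n, size=m, replace=False)  # uniform sample without replacement
    w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return w

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
n, p = 5000, 10
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_full, *_ = np.linalg.lstsq(X, y, rcond=None)   # uses all n rows
w_sub = subsample_ols(X, y, r=0.1, rng=rng)      # uses about 500 rows

err_full = np.linalg.norm(w_full - w_true)
err_sub = np.linalg.norm(w_sub - w_true)
print(err_full, err_sub)
```

The subsample estimate costs roughly a fraction `r` of the full fit but throws away the remaining rows entirely, which is the inefficiency the better methods try to recover.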

The Uluru method is sketched in a NIPS poster. Here {$r$} is the fraction of the data included in the subsample, i.e. {$rn$} of the {$n$} observations are used.

Both this method and the one below make optional use of the Hadamard transform, an FFT-like transform that spreads the signal over the observations.
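To see what "spreading the signal" means, here is a minimal sketch of a fast Walsh-Hadamard transform applied to a single spike. The function name `fwht` and the random sign flips applied before the transform are assumptions for illustration (sign flips are a common companion to the Hadamard transform in randomized methods, but the notes do not specify them).

```python
import numpy as np

def fwht(a):
    """Orthonormal fast Walsh-Hadamard transform along axis 0.

    The length must be a power of 2. Runs in O(n log n) like an FFT.
    """
    a = np.array(a, dtype=float)   # work on a copy
    n = a.shape[0]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of 2")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            top = a[i:i + h].copy()
            bot = a[i + h:i + 2 * h].copy()
            a[i:i + h] = top + bot           # butterfly: sums
            a[i + h:i + 2 * h] = top - bot   # butterfly: differences
        h *= 2
    return a / np.sqrt(n)           # scaling makes the transform orthonormal

rng = np.random.default_rng(0)
n = 8
signs = rng.choice([-1.0, 1.0], size=n)   # assumed random sign flips

x = np.zeros(n)
x[3] = 1.0                                # all the "signal" sits in one observation
z = fwht(signs * x)                       # now every entry has magnitude 1/sqrt(n)
print(z)
```

After the transform the spike is spread evenly over all coordinates while the norm is preserved, so a uniform subsample of the transformed rows is much less likely to miss the informative observation.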

Sampling in p

Start with the Dual version of Ridge regression (very much like we did for SVMs).

Derive the dual formulation of Ridge Regression: (For details, see this paper, but note that they use {$a$} for {$\lambda$} and {$t$} for {$i$})

Primal:

min {$\sum_i (y_i - w^T x_i)^2 + \lambda ||w||^2$}

Re-express as

min {$\sum_i \xi_i^2 + \lambda ||w||^2$}

with constraints (on the n residuals {$\xi_i$})

{$y_i - w^T x_i - \xi_i = 0$}

Introduce Lagrange multipliers {$\alpha_i$}, which will be the dual variables, to get the Lagrangian

{$L = \sum_i \xi_i^2 + \lambda ||w||^2 + \sum_i \alpha_i (y_i - w^T x_i - \xi_i)$}

Find the extremum in {$w$} by setting the derivative w.r.t. {$w$} to zero

{$2 \lambda w - \sum_i \alpha_i x_i = 0$}

{$ w = (1/2\lambda) \sum_i \alpha_i x_i$}

{$ w = (1/2\lambda) X^T\alpha$}

Similarly, setting the derivative w.r.t. {$\xi_i$} to zero shows

{$\xi_i = \alpha_i /2$}

i.e. the importance of the ith constraint is proportional to the corresponding residual

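and, substituting both results into the constraints (with {$K = XX^T$}, where the rows of {$X$} are the {$x_i^T$}), {$y = (1/2\lambda)(K + \lambda I)\alpha$}, so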

{$\alpha = 2 \lambda (K + \lambda I)^{-1} y$}

and thus the prediction at a new point {$x$} is

{$\hat{y} = w^T x = (1/2\lambda) (\sum_i \alpha_i x_i)^T x = (1/2\lambda) \alpha^T k = y^T (K + \lambda I)^{-1} k$}

where {$k = X x$}, i.e. {$k_i = x_i^T x$} (the dual version of the {$x$} at which {$y$} is evaluated)
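The dual formulas can be checked numerically against the usual primal solution. The sketch below assumes the rows of `X` are the {$x_i^T$} (so that {$K = XX^T$} and {$k_i = x_i^T x$}) and uses random data and an arbitrary {$\lambda$}; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.7
X = rng.standard_normal((n, p))   # rows are x_i^T
y = rng.standard_normal(n)
x_new = rng.standard_normal(p)    # point at which to predict

# Primal ridge solution: w = (X^T X + lam I)^{-1} X^T y  (a p x p solve)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Dual solution: alpha = 2 lam (K + lam I)^{-1} y  (an n x n solve)
K = X @ X.T
alpha = 2 * lam * np.linalg.solve(K + lam * np.eye(n), y)
w_dual = X.T @ alpha / (2 * lam)

# Prediction at x_new using only inner products with the training points
k = X @ x_new
y_hat_dual = y @ np.linalg.solve(K + lam * np.eye(n), k)
y_hat_primal = w_primal @ x_new

# Residuals should equal alpha / 2
xi = y - X @ w_dual

print(np.allclose(w_primal, w_dual), np.isclose(y_hat_primal, y_hat_dual))
```

The dual form solves an {$n \times n$} system instead of a {$p \times p$} one, which is why it is the natural starting point when {$p$} is much larger than {$n$}.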

The above is the starting point for the NIPS poster.

Random Features for SVM

As you noticed, running SVMs with kernels is really slow for data of any reasonable size. A clever approach, taken in this paper, is to use randomization: it shows how to construct feature spaces that uniformly approximate popular shift-invariant kernels k(x − y) to within {$\epsilon$} accuracy with only {$D = O(d \epsilon^{-2} \log (1/\epsilon^2))$} dimensions (where {$d$} is the dimension of the original observations {$x$}). Note that {$y$} is a second observed feature vector, not a label!

The method uses Randomized Fourier Features:

The core algorithm is really simple:

Require: A positive definite shift-invariant kernel k(x, y) = k(x − y).

For example, a Gaussian kernel.

Ensure: A randomized feature map {$z(x) : R^d \rightarrow R^D$} so that {$z(x)^Tz(y) \approx k(x-y)$}.

Compute the Fourier transform p of the kernel k.

For example, given the Gaussian kernel {$e^{-\frac{||x-y||_2^2}{2}}$}, the Fourier transform is {$p(\omega) = (2\pi)^{-d/2} e^{-\frac{||\omega||_2^2}{2}}$}, i.e. the standard normal density on {$R^d$}.

Draw D iid samples {$\omega_1, \omega_2, \ldots, \omega_D$} in {$R^d$} from p and D iid samples {$b_1, b_2, \ldots, b_D \in R$} from the uniform distribution on {$[0, 2\pi]$}.

Let {$z(x) = \sqrt{2/D} [\cos(\omega_1^T x+b_1) \cdots \cos(\omega_D^T x+b_D) ]^T$}.
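The steps above can be sketched in a few lines for the unit-bandwidth Gaussian kernel, whose Fourier transform is the standard normal density. The function name `rff_map`, the dimensions, and the test points are assumptions chosen for illustration, not part of the paper's presentation.

```python
import numpy as np

def rff_map(X, omega, b):
    """Random Fourier feature map z(x) applied to each row of X.

    omega: (D, d) frequencies drawn from p; b: (D,) phases uniform on [0, 2*pi].
    """
    D = omega.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ omega.T + b)

rng = np.random.default_rng(0)
d, D = 5, 20000                           # input dimension, number of random features

omega = rng.standard_normal((D, d))       # omega_j ~ N(0, I_d) for the Gaussian kernel
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

x = 0.5 * rng.standard_normal(d)
y = 0.5 * rng.standard_normal(d)          # a second feature vector, not a label

Z = rff_map(np.vstack([x, y]), omega, b)  # shape (2, D)
approx = Z[0] @ Z[1]                      # z(x)^T z(y)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
print(approx, exact)
```

Once the data are mapped through `rff_map`, a linear SVM (or ridge regression) on the {$D$}-dimensional features stands in for the kernel method, so training cost scales with {$D$} rather than with the number of pairs of observations.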
