Sampling

Sampling in n

A naive version is just to pick a subsample of the data, but one can do much better! The Uluru method is sketched in a NIPS poster. Note that {$r$} is the fraction of the data included in the subsample; i.e., {$rn$} observations are included in the subsample. Both this method and the one below make optional use of the Hadamard transform, a sort of FFT that spreads the signal over the observations.

Sampling in p

Start with the dual version of ridge regression (very much like we did for SVMs). Derive the dual formulation of ridge regression (for details, see this paper, but note that they use {$a$} for {$\lambda$} and {$t$} for {$i$}):

Primal: min {$\sum_i (y_i - w^T x_i)^2 + \lambda ||w||^2$}

Re-express as min {$\sum_i \xi_i^2 + \lambda ||w||^2$} subject to constraints (on the n residuals {$\xi_i$}) {$y_i - w^T x_i - \xi_i = 0$}

Introduce Lagrange multipliers {$\alpha_i$}, which will be the dual variables, to get the Lagrangian {$\sum_i \xi_i^2 + \lambda ||w||^2 + \sum_i \alpha_i (y_i - w^T x_i - \xi_i)$}

Find the extremum in {$w$} by setting the derivative w.r.t. {$w$} to zero: {$2 \lambda w - \sum_i \alpha_i x_i = 0$}, or {$w = (1/2\lambda) \sum_i \alpha_i x_i$}, i.e. {$w = (1/2\lambda) X^T\alpha$}

Setting the derivative w.r.t. {$\xi_i$} to zero gives {$\xi_i = \alpha_i / 2$}, i.e. the importance of the ith constraint is proportional to the corresponding residual; substituting back into the constraints gives {$\alpha = 2 \lambda (K + \lambda I)^{-1} y$}, where {$K = XX^T$} is the kernel (Gram) matrix

Thus the prediction at a new point {$x$} is {$\hat{y} = w^T x = (1/2\lambda) (\sum_i \alpha_i x_i)^T x = (1/2\lambda) \alpha^T k = y^T (K + \lambda I)^{-1} k$}, where {$k = Xx$} is the vector with entries {$k_i = x_i^T x$} (the dual representation of the point {$x$} at which the prediction is made)

The above is the starting point for the NIPS poster.

Random Features for SVM

As you noticed, running an SVM with kernels is really slow for data of any reasonable size. A clever approach is to use randomization, as done in this paper, which shows how to construct feature spaces that uniformly approximate popular shift-invariant kernels k(x − y) to within {$\epsilon$} accuracy with only {$D = O(d \epsilon^{-2} \log(1/\epsilon^2))$} dimensions (where {$d$} is the dimension of the original observations {$x$}). Note that {$y$} here is a second observed feature vector, not a label!

The method uses randomized Fourier features. The core algorithm is really simple:

Require: A positive definite shift-invariant kernel k(x, y) = k(x − y), for example a Gaussian kernel.

Ensure: A randomized feature map {$z(x) : R^d \rightarrow R^D$} so that {$z(x)^T z(y) \approx k(x-y)$}.

Compute the Fourier transform p of the kernel k. For example, for the Gaussian kernel {$e^{-\frac{||x-y||_2^2}{2}}$}, the Fourier transform is {$p(\omega) = (2\pi)^{-d/2} e^{-\frac{||\omega||_2^2}{2}}$}.

Draw D iid samples {$\omega_1, \omega_2, \ldots, \omega_D$} in {$R^d$} from p, and D iid samples {$b_1, b_2, \ldots, b_D \in R$} from the uniform distribution on {$[0, 2\pi]$}.

Let {$z(x) = \sqrt{2/D} [\cos(\omega_1^T x + b_1) \ldots \cos(\omega_D^T x + b_D)]^T$}.
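
As a minimal sketch of the recipe above (not code from the paper; the names gaussian_kernel and random_fourier_features are made up for illustration), here is a numpy version for the Gaussian kernel, where p is the standard normal density on {$R^d$}:

    import numpy as np

    def gaussian_kernel(x, y):
        # k(x, y) = exp(-||x - y||^2 / 2), a shift-invariant kernel
        return np.exp(-0.5 * np.sum((x - y) ** 2))

    def random_fourier_features(d, D, rng):
        # For this Gaussian kernel, the Fourier transform p(omega) is the
        # standard normal density on R^d, so omega_j ~ N(0, I_d).
        omega = rng.standard_normal((D, d))
        b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        def z(x):
            return np.sqrt(2.0 / D) * np.cos(omega @ x + b)
        return z

    rng = np.random.default_rng(0)
    d, D = 5, 2000
    z = random_fourier_features(d, D, rng)
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    print(gaussian_kernel(x, y))   # exact kernel value
    print(z(x) @ z(y))             # random-feature approximation

Since {$z(x)^T z(y)$} is a Monte Carlo average over the D random frequencies, its error shrinks like {$1/\sqrt{D}$}, so increasing D tightens the approximation.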
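
Going back to the dual ridge regression derivation in the Sampling in p section, here is a small numerical check (a numpy sketch on made-up data, not code from the lecture or the paper) that the usual primal closed form {$w = (X^T X + \lambda I)^{-1} X^T y$} (not written out above, but standard for that primal objective), the dual coefficients {$\alpha = 2\lambda (K + \lambda I)^{-1} y$}, and the prediction formula {$\hat{y} = y^T (K + \lambda I)^{-1} k$} all give the same prediction at a new point:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, lam = 50, 8, 0.3                 # made-up sizes and ridge penalty
    X = rng.standard_normal((n, d))        # rows are the observations x_i
    y = rng.standard_normal(n)
    x_new = rng.standard_normal(d)         # test point

    # Primal ridge: w = (X^T X + lambda I_d)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Dual ridge: alpha = 2*lambda*(K + lambda I_n)^{-1} y, with K = X X^T
    K = X @ X.T
    alpha = 2 * lam * np.linalg.solve(K + lam * np.eye(n), y)

    # Predictions agree: w^T x = (1/(2*lambda)) alpha^T k = y^T (K + lambda I)^{-1} k
    k = X @ x_new                          # k_i = x_i^T x_new
    print(w @ x_new)
    print(alpha @ k / (2 * lam))
    print(y @ np.linalg.solve(K + lam * np.eye(n), k))

The payoff of the dual form is that the linear system solved is n-by-n rather than d-by-d and involves the data only through inner products, which is what makes the kernel trick and the sampling-in-p ideas above possible.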