Online Learning

{$\newcommand{\ind}{\mathbf{1}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} $}

The Experts Problem (not covered)
“Separable” case

Assumption: there exists a perfect expert.

Question: how many mistakes will you (the Player) make in {$T$} rounds?

“Naive algorithm”: follow a single expert per round, and eliminate it when it is wrong. This leads to at most {$N-1$} mistakes.

Halving algorithm

Halving algorithm: {$p_t$} = majority vote of the remaining experts. When the prediction is wrong, eliminate the experts that were in the (wrong) majority.

Theorem. We will make at most {$\log_2 N$} mistakes.

Proof. On every mistake, more than half of the remaining experts are wrong, and therefore the number of surviving experts at least halves at each mistake. Since the perfect expert is never eliminated, at most {$\log_2 N$} mistakes are possible.
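As a concrete illustration, here is a minimal Python sketch of the Halving algorithm. The streaming interface (pairs of an advice vector and a label) is chosen just for this example, not part of the lecture.

```python
import numpy as np

def halving(advice_stream, n_experts):
    """Halving algorithm: predict the majority vote of the surviving experts.

    advice_stream yields (advice, y) pairs, where advice is a length-N array
    of +/-1 expert predictions and y is the true +/-1 label. If a perfect
    expert exists, the number of mistakes is at most log2(n_experts).
    """
    alive = np.ones(n_experts, dtype=bool)  # all experts start out alive
    mistakes = 0
    for advice, y in advice_stream:
        advice = np.asarray(advice)
        p = 1 if advice[alive].sum() >= 0 else -1  # majority vote of survivors
        if p != y:
            mistakes += 1
            alive &= (advice == y)  # eliminate every expert that was wrong
    return mistakes
```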
More difficult example: experts on the {$L_2$} ball

Consider the {$L_2$} ball of radius 1, {$B = \{x \in \mathbf{R}^d \mid ||x||_2 \le 1\}$}. Each {$w$} in the ball defines an expert by {$f_{w,t}(x_t) = \textrm{sign}(w\cdot x_t)$}. There is an uncountable number of experts (the surface of the sphere). Does the Halving algorithm apply? No, so we are stuck.

However, suppose there is a perfect expert {$w^\star$}, i.e. {$\forall t \quad \textrm{sign}(w^\star\cdot x_t) = y_t.$}

Assumption (margin). There is a {$\gamma > 0$} such that {$y_t(w^\star \cdot x_t) \ge \gamma$} for all {$t$}.

The margin assumption lets us discretize the sphere at resolution {$\gamma/2$}: there will be one point in the discretized set which also separates the data with margin {$\gamma/2$}. Reason: if {$||w - w^\star||_2 \le \gamma/2$} and {$||x_t||_2 \le 1$}, then {$y_t(w\cdot x_t) \ge y_t(w^\star\cdot x_t) - \gamma/2 \ge \gamma/2$}. Therefore we can run the Halving algorithm on the discretized set.

What is {$N$}? Discretize at the {$\approx\gamma$} level. In a box with sides of length 1 in {$d$} dimensions there are {$(1/\gamma)^d$} different discretized cells, and partitioning the ball gives a similar order of magnitude. Therefore we make at most {$\log_2 N$} mistakes, which is of the order {$\log_2 N = O\left(\log \left(\frac{1}{\gamma}\right)^d \right) = O\left(d \log \frac{1}{\gamma}\right).$}

In general, the margin allows us to be certain that we have reached the right answer. Otherwise, we could get infinitely close to the correct answer and still be wrong.

Pros: logarithmic dependence on {$\frac{1}{\gamma}$}. Cons: need to discretize the space, and linear dependence on the dimension of the data; i.e., not efficient.

The Perceptron Algorithm

History: the perceptron was invented by Rosenblatt in the late 1950s, at the beginning of machine learning. He was interested in the brain, learning, connections, etc. Then Minsky and Papert at MIT showed that it could not learn everything (for instance XOR), which killed machine learning for a decade. Later, people showed that with two or three layers of neural networks one can learn much richer functions.

Algorithm: start from {$\bw_1 = 0$}; predict {$\hat{y}_t = \textrm{sign}(\bw_t\cdot\bx_t)$}; on a mistake, update {$\bw_{t+1} = \bw_t + y_t\bx_t$}; otherwise set {$\bw_{t+1} = \bw_t$}.

Theorem. The perceptron algorithm makes at most {$R^2/\gamma^2$} mistakes, where {$R = \max_{t=1, \dots, T} ||\bx_t||_2$} and there is a {$\bw^\star$} with {$||\bw^\star||_2 \le 1$} and margin {$\gamma > 0$}, i.e. {$y_t(\bw^\star\cdot\bx_t) \ge \gamma$} for all {$t$}.

Proof. Look at the “correlation” {$\bw_{t+1}\cdot\bw^\star$}, which measures the angle between the iterate and {$\bw^\star$}. Hopefully this dot product is increasing, and the angle decreasing. If there is no mistake, nothing happens. If there is a mistake, then {$\bw_{t+1}\cdot\bw^\star = (\bw_t + y_t\bx_t)\cdot\bw^\star = \bw_t\cdot\bw^\star + \underbrace{y_t(\bw^\star\cdot\bx_t)}_{\ge\gamma}.$} Now {$\bw^\star$} is perfect with a margin of {$\gamma$}, therefore {$\bw_{t+1}\cdot\bw^\star \ge \bw_t\cdot\bw^\star + \gamma.$} In other words, every time we make a mistake the dot product increases by at least {$\gamma$}, and the angle therefore decreases. But we need to make sure that this increase is not outweighed by an increase in the norm of {$\bw_t$}. If there are {$m$} mistakes, {$\bw_{T+1}\cdot\bw^\star \ge m\gamma.$} Now, on a mistake, {$||\bw_{t+1}||^2 = ||\bw_t+y_t\bx_t||^2 = ||\bw_t||^2 + ||\bx_t||^2 + \underbrace{2y_t(\bw_t\cdot\bx_t)}_{\le 0} \le ||\bw_t||^2 + ||\bx_t||^2,$} where the middle term is nonpositive because a mistake means {$y_t(\bw_t\cdot\bx_t) \le 0$}. Therefore {$||\bw_{T+1}||^2 \le mR^2$}, i.e. {$||\bw_{T+1}||_2 \le \sqrt{m}R.$} Putting the two together with the Cauchy-Schwarz inequality, {$m\gamma \le \bw_{T+1}\cdot\bw^\star \le ||\bw_{T+1}||_2 \cdot ||\bw^\star||_2 \le \sqrt{m}R,$} and therefore {$m \le R^2/\gamma^2$}.
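A minimal Python sketch of the algorithm just analyzed; the streaming interface and the convention that ties count as mistakes are choices made for this example.

```python
import numpy as np

def perceptron(stream, d):
    """Perceptron: predict sign(w . x); on a mistake, add y * x to w.

    stream yields (x, y) pairs with x in R^d and y in {-1, +1}. In the
    separable case the number of mistakes is at most R^2 / gamma^2.
    """
    w = np.zeros(d)  # w_1 = 0
    mistakes = 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        if y * np.dot(w, x) <= 0:  # a mistake (ties count as mistakes)
            w = w + y * x          # w_{t+1} = w_t + y_t x_t
            mistakes += 1
    return w, mistakes
```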
Non-separable case

Instead of labels in {$\{-1,1\}$}, we now consider predictions in an interval, and instead of the binary indicator loss we suffer a loss function that is typically smoother.

Goal. We did not make any assumptions about the experts, so if the experts all make mistakes, we can’t possibly make few mistakes. Therefore we change the objective to Regret: {$\textrm{Regret} = \sum_{t=1}^T l(p_t,y_t) - \min_{i=1,\dots,N} \sum_{t=1}^T l(f_{i,t}, y_t).$} This is the difference between our cumulative loss and the loss we would have suffered if we had listened only to the best expert for the entire time. It can only be computed at the end of the game, but we still want an algorithm that keeps the regret small. Our final goal is regret {$R = O(\sqrt{T})$}. This implies that the per-round regret is {$\frac{1}{T} R = O\left(\frac{1}{\sqrt{T}}\right),$} which means that the per-round error converges to zero at a rate of {$\frac{1}{\sqrt{T}}$}.

Idea. We keep a distribution over experts that represents our “belief” or confidence in each expert. It is surprising that with this strategy we can eventually do as well as the best of the experts.
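To make the definition concrete, here is a small sketch of computing regret in hindsight from recorded losses; the array-based interface is chosen just for this example.

```python
import numpy as np

def regret(player_losses, expert_losses):
    """Regret of a sequence of plays, computed in hindsight.

    player_losses: shape (T,), our loss l(p_t, y_t) at each round.
    expert_losses: shape (T, N), loss l(f_{i,t}, y_t) of expert i at round t.
    """
    best_expert = expert_losses.sum(axis=0).min()  # best single expert in hindsight
    return player_losses.sum() - best_expert
```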
Weighted majority algorithm (AKA “Exponential Weights”)

Maintain one weight per expert and update it multiplicatively: {$w_{i,t+1} = w_{i,t} \exp\left( -\eta\, l(f_{i,t}, y_t) \right).$} Thus, each expert is exponentially weighted according to the loss it suffered on that example.

Theorem. The weighted majority algorithm with {$\eta = \sqrt{\frac{8 \log N}{T}}$} has regret {$R \le \sqrt{\frac{T}{2}\log N}$} for any sequence of expert advice and outcomes.

Note: this assumes that you know the “time horizon” {$T$}. This can be fixed with the doubling trick: double the guess for {$T$} and restart whenever the game runs past the current guess.

This is a pretty amazing result: without knowing who the best expert is, we can get within {$\sqrt{\frac{1}{T} \log N}$} of the best expert in per-round loss, with this incredibly simple algorithm. I.e., even if the experts are trying to mislead you, you will still have this bound on your regret. Keep in mind that regret is defined as the difference between our loss and the loss of the best expert.

Note: connection to boosting.
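A minimal sketch of exponential weights, assuming losses lie in {$[0,1]$} and the horizon {$T$} is known (so {$\eta$} can be tuned as in the theorem); working in log-weights is an implementation choice for numerical stability.

```python
import numpy as np

def exponential_weights(expert_losses):
    """Exponential weights over N experts, losses assumed to lie in [0, 1].

    expert_losses: array of shape (T, N), revealed one row per round.
    Yields the distribution p_t over experts to play at round t.
    """
    T, N = expert_losses.shape
    eta = np.sqrt(8.0 * np.log(N) / T)  # tuning from the theorem above
    log_w = np.zeros(N)                 # log-weights, for numerical stability
    for losses in expert_losses:
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                    # normalize the weights into p_t
        yield p
        log_w -= eta * losses           # w_{i,t+1} = w_{i,t} * exp(-eta * loss)
```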
Online Gradient Descent

Generalize the benchmark from the best single expert to the best fixed distribution over experts, with linear losses: {$ R = \sum_{t=1}^T l_t\cdot\bw_t - \min_{\bw^\star \in \textrm{simplex}} \sum_{t=1}^T l_t\cdot\bw^\star. $}

Comparison to the previous setting: instead of outputting a single expert’s prediction, we output the distribution over experts itself and suffer the corresponding loss. The regret then compares us to the best possible probability distribution over experts. We can generalize even further and allow arbitrary weights in {$\mathbf{R}^N$}, constrained to lie in a convex set.
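A minimal sketch of online gradient descent with linear losses over the simplex. The sort-based Euclidean projection routine and a fixed step size {$\eta$} (typically of order {$1/\sqrt{T}$}) are standard choices assumed here, not details from the lecture.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                 # sort coordinates in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def online_gradient_descent(loss_vectors, eta):
    """OGD with linear losses l_t . w, constrained to the simplex.

    loss_vectors: array of shape (T, N); the gradient of l_t . w is l_t itself.
    Yields the weight vector w_t played at each round.
    """
    T, N = loss_vectors.shape
    w = np.ones(N) / N                   # start at the uniform distribution
    for l_t in loss_vectors:
        yield w
        w = project_to_simplex(w - eta * l_t)  # gradient step, then project back
```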