OnlineLearning

$$\newcommand{\ind}{\mathbf{1}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} $$

Online learning applet

The Experts Problem

At each step $t = 1, \dots, T$:
- Player observes advice of $N$ experts $f_{1,t}, \dots, f_{N,t} \in \{-1,1\}$
- Player predicts $p_t \in \{-1,1\}$
- Outcome $y_t \in \{-1,1\}$ is revealed
- Player suffers loss $\ind[p_t \ne y_t]$

"Separable" case

Assumption: there exists a perfect expert. Question: How many mistakes will you (Player) make in $T$ rounds?

Naive algorithm: follow a single expert each round and eliminate it if it is wrong. This leads to at most $N-1$ mistakes, since each mistake eliminates one imperfect expert.

Halving algorithm

Halving algorithm: $p_t$ = majority vote of the remaining experts. If the prediction is wrong, eliminate the experts in the majority (they were all wrong).

Theorem. We will make at most $\log_2 N$ mistakes.

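Proof. If there is a mistake, at least half of the remaining experts were wrong, so the number of surviving experts at least halves at each mistake. The perfect expert is never eliminated, so starting from $N$ experts there can be at most $\log_2 N$ halvings.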
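The procedure can be sketched in a few lines of Python. This is a minimal sketch under some assumptions not in the notes: advice is stored as a list of lists, ties in the vote are broken toward $+1$, and wrong experts are dropped on every round (not only after a mistake), which can only shrink the pool faster. The function name `halving` and the synthetic data are illustrative.

```python
import math
import random

def halving(advice, outcomes):
    """advice[t][i] in {-1, 1} is expert i's advice at round t;
    outcomes[t] in {-1, 1}. Returns the player's number of mistakes."""
    alive = set(range(len(advice[0])))
    mistakes = 0
    for f_t, y_t in zip(advice, outcomes):
        vote = sum(f_t[i] for i in alive)
        p_t = 1 if vote >= 0 else -1      # tie broken toward +1 (arbitrary choice)
        if p_t != y_t:
            mistakes += 1
        # keep only the experts that were right this round
        alive = {i for i in alive if f_t[i] == y_t}
    return mistakes

# Synthetic separable data: expert `perfect` always matches the outcome.
rng = random.Random(0)
N, T, perfect = 8, 50, 3
outcomes = [rng.choice([-1, 1]) for _ in range(T)]
advice = [[y if i == perfect else rng.choice([-1, 1]) for i in range(N)]
          for y in outcomes]
m = halving(advice, outcomes)
```

With $N = 8$ the theorem guarantees that `m` is at most $\log_2 8 = 3$, whatever the other experts do.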

More difficult example - experts on the L2 ball

Consider the $l_2$ ball of radius 1, $b = \{x \in \mathbf{R}^d \mid ||x||_2 \le 1\}$. Choose a $w$ in the ball. Expert is defined by $f_{w,t}(x_t) = sign(w\cdot x_t)$.

Uncountable number of experts: one per direction, i.e. one per point on the surface of the sphere. Does the Halving algorithm apply? It needs a finite set of experts, so it cannot be applied directly.

However, suppose there is a perfect expert $w^\star$, i.e. $$\forall t \quad sign(w^\star\cdot x_t) = y_t.$$ Assumption. There is a $\gamma > 0$ such that, for all $t$, $$y_t(w^\star \cdot x_t) \ge \gamma.$$

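The margin assumption lets us discretize the sphere at a resolution of $\gamma/2$: there will be one point in the discretized set which also separates the data, with margin at least $\gamma/2$. Reason: if $||w - w^\star||_2 \le \gamma/2$ and $||x_t||_2 \le 1$ (assuming the data also lie in the unit ball), then $|w\cdot x_t - w^\star\cdot x_t| \le \gamma/2$, so $y_t(w\cdot x_t) \ge \gamma/2$.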

Therefore we can run the Halving algorithm. What is $N$? Discretize to $\approx\gamma$ level. In a box with sides of length 1 in $d$ dimensions, there will be $(1/\gamma)^d$ different discretized cubes. Partitioning the ball gives a count of a similar order of magnitude.

Therefore we make at most $\log_2 N$ mistakes, which is of the order $$\log N = O\left(\log \left(\frac{1}{\gamma}\right)^d \right) = O\left(d \log \frac{1}{\gamma}\right).$$
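As a rough worked example (the values $d = 10$ and $\gamma = 0.01$ are illustrative, and $N = (1/\gamma)^d$ is treated as exact rather than as an order of magnitude):

$$N = \left(\frac{1}{0.01}\right)^{10} = 10^{20}, \qquad \log_2 N = 10 \log_2 100 \approx 66.4.$$

Under these assumptions the Halving bound allows at most about 66 mistakes, but the algorithm would have to keep track of $10^{20}$ experts.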

In general, the margin allows us to be certain that we have reached the right answer. Without it, we can get arbitrarily close to the correct answer and still be wrong.

Pros: logarithmic dependence on $\frac{1}{\gamma}$.

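Cons: Need to discretize the space, so the algorithm has to track $N \approx (1/\gamma)^d$ experts, which is exponential in the dimension, i.e., not efficient. The mistake bound also depends linearly on the dimension of the data.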

The Perceptron Algorithm

History: the perceptron dates to the beginning of machine learning. It was invented in the late 1950s by Rosenblatt, who was interested in the brain, learning, connections, etc. Then Minsky and Papert at MIT showed in 1969 that it could not learn everything, for instance XOR. This stalled machine learning for about a decade. Then people showed that neural networks with two or three layers can learn much richer functions.

Algorithm:

Start with $\bw_1 = 0$
For $t = 1, \dots, T$:
- Receive $\bx_t \in \mathbf{R}^d$ and predict $sign(\bw_t \cdot \bx_t)$
- Outcome $y_t$ is revealed
- If $sign(\bw_t \cdot \bx_t) \ne y_t$, then update $\bw_{t+1} = \bw_t + y_t \bx_t$; otherwise keep $\bw_{t+1} = \bw_t$
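The loop above can be sketched with NumPy. This is a minimal sketch, not a reference implementation: the function name `perceptron`, the convention that $sign(0)$ counts as $+1$, and the synthetic data (points sampled from a Gaussian and filtered to have margin at least 0.1 with respect to a hypothetical unit vector `w_star`) are all assumptions made for illustration.

```python
import numpy as np

def perceptron(X, y):
    """X: (T, d) array of inputs, y: (T,) array of labels in {-1, 1}.
    Returns the final weight vector and the number of mistakes."""
    w = np.zeros(X.shape[1])                 # w_1 = 0
    mistakes = 0
    for x_t, y_t in zip(X, y):
        pred = 1.0 if w @ x_t >= 0 else -1.0  # sign(0) treated as +1 (assumption)
        if pred != y_t:
            w = w + y_t * x_t                # update only on a mistake
            mistakes += 1
    return w, mistakes

# Synthetic separable data with a known margin.
rng = np.random.default_rng(0)
d = 5
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)             # unit-norm perfect expert
X = rng.normal(size=(500, d))
X = X[np.abs(X @ w_star) >= 0.1]             # enforce a margin of at least 0.1
y = np.sign(X @ w_star)
gamma = np.min(y * (X @ w_star))             # realized margin
R = np.max(np.linalg.norm(X, axis=1))        # largest input norm
w, m = perceptron(X, y)
```

On data like this, the mistake count `m` can be compared directly against `(R / gamma) ** 2`.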

Theorem. The perceptron algorithm makes at most $R^2/\gamma^2$ mistakes, where $$R = \max_{t = 1, \dots, T} ||\bx_t||_2,$$ and $\gamma > 0$ is a margin achieved by some unit vector $\bw^\star$, $$||\bw^\star||_2 = 1, \qquad y_t(\bw^\star\cdot\bx_t) \ge \gamma \quad \forall t.$$

Proof. Look at the "correlation" $\bw_{t+1}\cdot\bw^\star$, which is tied to the angle between $\bw_{t+1}$ and $\bw^\star$. Hopefully this angle is decreasing, i.e. the dot product is increasing. If there is no mistake, nothing happens. If there is a mistake, then $$\bw_{t+1}\cdot\bw^\star = (\bw_t + y_t\bx_t)\cdot\bw^\star = \bw_t\cdot\bw^\star + \underbrace{y_t(\bw^\star\cdot\bx_t)}_{\ge \gamma}.$$ Now $\bw^\star$ is perfect with a margin of $\gamma$. Therefore $$\bw_{t+1}\cdot\bw^\star \ge \bw_t\cdot\bw^\star + \gamma.$$ In other words, every time we make a mistake, the dot product increases. For the angle to decrease, we also need to make sure that the dot product increase is not outweighed by an increase in the norm of $\bw_t$.

If there are $m$ mistakes, then since $\bw_1 = 0$, $$\bw_{T+1}\cdot\bw^\star \ge m\gamma.$$ Now, on a round with a mistake, $$||\bw_{t+1}||^2 = ||\bw_t+y_t\bx_t||^2 = ||\bw_t||^2 + ||\bx_t||^2 + \underbrace{2y_t(\bw_t\cdot\bx_t)}_{\le 0} \le ||\bw_t||^2 + ||\bx_t||^2,$$ where the cross term is nonpositive precisely because the prediction was wrong. Therefore $$||\bw_{T+1}||^2 \le mR^2, \qquad ||\bw_{T+1}||_2 \le \sqrt{m}R.$$ Therefore, by the Cauchy-Schwarz inequality, $$ \begin{eqnarray} m\gamma \le \bw_{T+1}\cdot\bw^\star & \le & ||\bw_{T+1}||_2 \cdot ||\bw^\star||_2 \\
& \le & \sqrt{m}R\cdot 1 \\
m\gamma & \le & \sqrt{m}R \\
m & \le & \left(\frac{R}{\gamma}\right)^2 \end{eqnarray} $$

Non-separable case

Instead of considering $\{-1,1\}$, we now consider the interval $[0,1]$. Instead of the binary indicator loss, we now suffer a loss given by a function that is typically smoother.

At each step $t = 1, \dots, T$:
- Player observes predictions of experts $f_{1,t}, \dots, f_{N,t} \in [0,1]$ and predicts $p_t \in [0,1]$.
- Outcome $y_t$ is revealed.
- Player suffers loss $l(p_t,y_t)$, for some loss function $l$ that is convex in its first argument.

Goal. We have not made any assumptions about the experts. So if the experts all make mistakes, we can't possibly make few mistakes. Therefore we change the objective to Regret, $$\textrm{Regret} = \sum_{t=1}^T l(p_t,y_t) - \min_{i=1,\dots,N} \sum_{t=1}^T l(f_{i,t}, y_t).$$ This is the difference between our cumulative loss and the cumulative loss we would have suffered if we had listened only to the best expert for the entire time. The best expert is only known in hindsight, so the regret can only be computed at the end of the game. But we still want an algorithm that keeps the regret from growing too large. Our final goal is to have regret $R = O(\sqrt{T})$. This implies that the per-round regret is $$\frac{1}{T} R = O\left(\frac{1}{\sqrt{T}}\right),$$ which means that the per-round regret converges to zero at a rate of $\frac{1}{\sqrt{T}}$.
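To make the definition concrete, here is a small sketch that computes the regret after the fact. The function names `regret` and `squared_loss`, the array layout, and the choice of squared loss (one example of a convex loss) are assumptions for illustration; the toy numbers are made up.

```python
import numpy as np

def squared_loss(p, y):
    """An example convex loss; any loss convex in p could be used instead."""
    return (p - y) ** 2

def regret(player_preds, expert_preds, outcomes, loss):
    """player_preds: (T,), expert_preds: (T, N), outcomes: (T,).
    Returns player's cumulative loss minus the best expert's cumulative loss."""
    player_loss = loss(player_preds, outcomes).sum()
    # cumulative loss of each expert over all T rounds, shape (N,)
    expert_losses = loss(expert_preds, outcomes[:, None]).sum(axis=0)
    return player_loss - expert_losses.min()

# Toy game: expert 0 always says 0, expert 1 always says 1, outcome is always 1.
expert_preds = np.array([[0.0, 1.0], [0.0, 1.0]])
outcomes = np.array([1.0, 1.0])
player_preds = np.array([0.5, 0.5])          # player hedges between the two
r = regret(player_preds, expert_preds, outcomes, squared_loss)
```

Here the player loses $0.25$ per round, the best expert loses nothing, and so the regret `r` is $0.5$.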

Idea. We keep a distribution over experts that represents our "belief" or confidence in each expert. It is surprising that with this strategy we can eventually do as well as the best of the experts.

Weighted majority algorithm (AKA "Exponential Weights")

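Maintain weights $\bw_t$ over the experts, which define a distribution after normalization. Start from equal weights $w_{i,1} = 1$.
On round $t$, predict $p_t = \frac{1}{\sum_{i=1}^N w_{i,t}} \sum_{i=1}^N w_{i,t} f_{i,t}$.
Update, for each expert $i$,

$$w_{i,t+1} = w_{i,t} \exp\left\{ -\eta l(f_{i,t}, y_t) \right\}.$$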

Thus, each expert is exponentially weighted according to the loss it suffered on that example.
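A minimal sketch of the algorithm, in which each weight is multiplied by $\exp\{-\eta\, l(f_{i,t}, y_t)\}$ after every round. The function names, the use of absolute loss (convex, with values in $[0,1]$), the uniform initial weights, and the random synthetic data are assumptions for illustration only.

```python
import numpy as np

def abs_loss(p, y):
    """Absolute loss: convex in p and bounded in [0, 1] for p, y in [0, 1]."""
    return np.abs(p - y)

def exponential_weights(expert_preds, outcomes, eta, loss):
    """expert_preds: (T, N) in [0, 1], outcomes: (T,).
    Returns the player's predictions p_t as an array of shape (T,)."""
    T, N = expert_preds.shape
    w = np.ones(N)                            # equal initial weights (assumption)
    preds = np.empty(T)
    for t in range(T):
        # weighted average of the experts' advice
        preds[t] = w @ expert_preds[t] / w.sum()
        # multiplicative update: experts with larger loss are shrunk more
        w = w * np.exp(-eta * loss(expert_preds[t], outcomes[t]))
    return preds

rng = np.random.default_rng(0)
T, N = 200, 5
expert_preds = rng.uniform(size=(T, N))
outcomes = rng.integers(0, 2, size=T).astype(float)
eta = np.sqrt(8 * np.log(N) / T)
preds = exponential_weights(expert_preds, outcomes, eta, abs_loss)

player_loss = abs_loss(preds, outcomes).sum()
best_expert_loss = abs_loss(expert_preds, outcomes[:, None]).sum(axis=0).min()
reg = player_loss - best_expert_loss
```

For very long games the raw weights can underflow; normalizing `w` after each update leaves the predictions unchanged and is a common remedy.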

Theorem. For a convex loss with values in $[0,1]$, the weighted majority algorithm has regret, $$R \le \sqrt{\frac{T}{2}\log N}, \qquad \eta = \sqrt{\frac{8 \log N}{T}},$$ for any sequence of expert advice and outcomes. Note: this choice of $\eta$ assumes that you know the "time horizon" $T$. If $T$ is unknown, this can be fixed by guessing $T$ and doubling the guess (restarting with a new $\eta$) whenever the game runs on past the current guess.
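The doubling idea can be sketched as follows. This is one common way to realize it, not necessarily the variant intended in the lecture: the game is split into epochs of length $1, 2, 4, \dots$, and the weights are reset at the start of each epoch with $\eta$ tuned to that epoch's length. Summing the per-epoch bounds gives the same $\sqrt{T \log N}$ rate up to a constant factor of $\sqrt{2}/(\sqrt{2}-1)$. All names and the synthetic data are illustrative.

```python
import numpy as np

def abs_loss(p, y):
    return np.abs(p - y)

def doubling_exponential_weights(expert_preds, outcomes, loss):
    """Exponential weights without knowing T in advance.
    expert_preds: (T, N) in [0, 1], outcomes: (T,). Returns predictions (T,)."""
    T, N = expert_preds.shape
    preds = np.empty(T)
    start, horizon = 0, 1
    while start < T:
        eta = np.sqrt(8 * np.log(N) / horizon)  # tuned to the current guess
        w = np.ones(N)                          # restart the weights each epoch
        for t in range(start, min(start + horizon, T)):
            preds[t] = w @ expert_preds[t] / w.sum()
            w = w * np.exp(-eta * loss(expert_preds[t], outcomes[t]))
        start += horizon
        horizon *= 2                            # double the guess
    return preds

rng = np.random.default_rng(1)
T, N = 300, 4
expert_preds = rng.uniform(size=(T, N))
outcomes = rng.integers(0, 2, size=T).astype(float)
preds = doubling_exponential_weights(expert_preds, outcomes, abs_loss)

player_loss = abs_loss(preds, outcomes).sum()
best_expert_loss = abs_loss(expert_preds, outcomes[:, None]).sum(axis=0).min()
reg = player_loss - best_expert_loss
```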

This is a pretty amazing result: without knowing which expert is the best, we can get within $\sqrt{\frac{\log N}{2T}}$ per round of the best expert with this incredibly simple algorithm. That is, even if the experts are trying to mislead you, you will still have this bound on your regret. Keep in mind that regret is defined as the difference between our loss and the loss of the best expert.

Note: connection to boosting.

Online Gradient Descent

For $t = 1, \dots, T$,
- Player predicts $\bw_t$, a distribution over $N$ experts.
- Vector of losses $l_t \in \mathbf{R}^N$ is revealed
- Player suffers $\bw_t \cdot l_t$
Goal: minimize the regret,

$$R = \sum_{t=1}^T l_t\cdot\bw_t - \min_{\bw^\star \in \textrm{simplex}} \sum_{t=1}^T l_t\cdot\bw^\star$$

Comparison to the earlier setting: instead of outputting a single prediction, we output the distribution over experts itself and suffer the expected loss under it. Regret then compares us to the best fixed probability distribution over experts.
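A short sketch of this regret computation. It relies on the fact that a linear function over the simplex is minimized at a vertex, so the best fixed distribution puts all its mass on the single expert with the smallest cumulative loss. The function name `linear_regret`, the array layout, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def linear_regret(W, L):
    """W: (T, N), row t is the player's distribution w_t over experts.
    L: (T, N), row t is the loss vector l_t.
    Returns sum_t l_t . w_t minus the loss of the best fixed distribution."""
    player_loss = np.sum(W * L)               # sum over t of w_t . l_t
    # The minimum over the simplex of a linear function is attained at a
    # vertex, i.e. at the expert with the smallest cumulative loss.
    best_fixed_loss = L.sum(axis=0).min()
    return player_loss - best_fixed_loss

# Toy game: expert 0 always loses 1, expert 1 never loses; player stays uniform.
L = np.array([[1.0, 0.0], [1.0, 0.0]])
W = np.array([[0.5, 0.5], [0.5, 0.5]])
r = linear_regret(W, L)
```

The uniform player loses $0.5$ per round while the best distribution (all mass on expert 1) loses nothing, so `r` is $1.0$.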

Generalizing even further, the weights can be arbitrary vectors in a convex subset of $\mathbf{R}^N$ rather than probability distributions.