Online Recitation
We have focused thus far on the "batch" learning setting, where all the training data is available to the learning algorithm at once. Consider instead the case where we don't have access to the data ahead of time. For instance, let's look at online algorithms that make predictions on binary sequences.
Suppose we have access to a set {$H$} of {$m$} experts (or hypotheses). Each expert {$h_j \in H$} predicts the outcome at every step {$i$} after receiving the input {$x_i$}; we write this prediction as {$h_j(x_i)$}.
Let {$\hat{y}_i$} be our prediction and {$y_i$} be the outcome at step {$i$} (the adversary's move). Suppose the loss is the usual 0/1 loss: 1 if we made a mistake and 0 otherwise. More formally, we can think of learning with expert advice as the following game (a small code sketch follows the list):
At each time step i = 1 to n,
- Player observes the expert predictions {$h_{1}(x_i), . . . , h_{m}(x_i)$} and predicts {$\hat{y}_i \in \{-1,1\}$}
- Outcome {$y_i \in \{-1,1\} $} is revealed
- Player incurs loss {$ 1(\hat{y}_i \neq y_i) $}
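As a concrete illustration, here is a minimal sketch of this protocol in Python. The experts, the inputs, and the adversary's outcomes below are hypothetical stand-ins, and the player simply follows an unweighted majority vote.

[@
# A hypothetical instance of the prediction-with-expert-advice game above.
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5                                    # number of rounds and experts
experts = [lambda x, j=j: 1 if (x * (j + 1)) % 3 == 0 else -1 for j in range(m)]

mistakes = 0
for i in range(n):
    x_i = rng.integers(0, 100)                  # input for this round
    advice = [h(x_i) for h in experts]          # expert predictions h_1(x_i), ..., h_m(x_i)
    y_hat = 1 if sum(advice) >= 0 else -1       # player's prediction (plain majority here)
    y_i = int(rng.choice([-1, 1]))              # adversary reveals the outcome
    mistakes += int(y_hat != y_i)               # 0/1 loss: 1(y_hat != y_i)
print("mistakes:", mistakes)
@]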
Halving algorithm: Recap
Suppose there is an expert that makes no mistakes. Then there is a very simple algorithm, called "halving," which makes no more than {$\log_2 m$} mistakes: at each time step, predict with the majority vote of the remaining experts, and then throw away all the experts that were wrong.
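Here is a minimal sketch of the halving procedure, assuming the advice and outcomes are supplied as lists (both hypothetical inputs) and that at least one expert is perfect.

[@
# Halving: predict by majority vote over the surviving experts, then
# discard every expert that got the outcome wrong.
def halving(advice, y):
    """advice[i][j] is expert j's prediction (+1/-1) at step i; y[i] is the outcome."""
    alive = set(range(len(advice[0])))           # experts still consistent with the data
    mistakes = 0
    for a_i, y_i in zip(advice, y):
        votes = [a_i[j] for j in alive]
        y_hat = 1 if votes.count(1) >= votes.count(-1) else -1   # majority vote
        mistakes += int(y_hat != y_i)
        alive = {j for j in alive if a_i[j] == y_i}              # throw away wrong experts
    return mistakes                              # at most log2(#experts) if one expert is perfect
@]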
Mistake bound
Let {$ H_i \subseteq \{h_1,\ldots,h_m\} $} and {$H_1 = \{h_1,\ldots,h_m\} $}.
Theorem. If the player predicts {$ \hat{y}_i = \mathrm{majority}(H_i) $} and sets {$H_{i+1} = \{h_j \in H_i : h_{j}(x_i) = y_i\},$} she will suffer no more than {$\log_2 m $} mistakes.
Proof. Each time the player makes a mistake, at least half of the experts in {$H_i$} are wrong. Hence, whenever a mistake occurs, {$|H_{i+1}| \leq |H_i|/2$}. Thus, if {$M$} mistakes have occurred by the end of step {$n$}, then {$|H_{n+1}| \leq m/2^M$}. Since the size of {$H_{n+1}$} can never be less than 1 (under our assumption that at least one expert is perfect), {$2^M$} cannot be larger than {$m$}, that is, {$2^M \leq m$}. Taking the log of both sides, we see that the number of mistakes, {$M$}, must be at most {$\log_2 m$}.
Weighted Majority
You've seen that the halving algorithm makes at most {$\log_2 N$} mistakes when there are {$N$} experts and at least one of them is perfect. However, if there are no perfect experts, this algorithm will not work: since every expert might make mistakes, they will all eventually be eliminated. Instead of throwing away the experts that make mistakes, we can use the weighted majority algorithm and multiply an expert's weight by some {$0\leq \beta \leq 1$} each time it makes a mistake. The algorithm is:
At each time step {$t = 1$} to {$T$}:
- Player observes the expert predictions {$h_{1,t},\dots,h_{N,t}$}, where expert {$i$} has weight {$w_{i,t}$}, and predicts
{$\hat{y}_t = \arg\max_y \sum_i w_{i,t}{\bf 1}[h_{i,t}=y]$}
- Outcome {$y_t$} is revealed.
- Weights are updated using the following formula (a code sketch of the full algorithm follows the update rule):
{$ w_{i,t+1} = \begin{cases} \beta w_{i,t} & \text{if }h_{i,t}\neq y_t \\ w_{i,t} & \text{if }h_{i,t}=y_t \end{cases} $}
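Below is a minimal sketch of the weighted majority algorithm. The advice and outcome sequences passed in are hypothetical; the label set can be {$\{-1,1\}$} or {$\{0,1\}$}.

[@
# Weighted majority: predict the label with the most weight behind it,
# then multiply the weight of every mistaken expert by beta.
def weighted_majority(advice, y, labels=(0, 1), beta=1/3):
    """advice[t][i] is expert i's prediction at step t; y[t] is the outcome."""
    w = [1.0] * len(advice[0])                   # w_{i,1} = 1 for every expert
    mistakes = 0
    for a_t, y_t in zip(advice, y):
        # the argmax above: label with the largest total weight voting for it
        y_hat = max(labels, key=lambda c: sum(wi for wi, h in zip(w, a_t) if h == c))
        mistakes += int(y_hat != y_t)
        w = [wi * beta if h != y_t else wi       # downweight experts that were wrong
             for wi, h in zip(w, a_t)]
    return mistakes
@]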
Mistake bound
For the following questions, assume {$y \in \{0,1\}$}.
- For what value of {$\beta$} is this algorithm equivalent to the halving algorithm?
(:ans:) {$\beta = 0$} (:ansend:)
- Let {$W_t=\sum_i w_{i,t}$} be the sum of all weights at time {$t$}. If {$\beta = 1/3$}, show that when a mistake is made at time {$t$}, then {$W_{t+1} \leq (2/3)W_{t}$}.
(:ans:) If a mistake is made, then at least half of the total weight voted for the wrong outcome. That portion is multiplied by {$1/3$} while the rest is unchanged, and the more weight that voted wrong, the more the total shrinks, so the worst case is when exactly half voted wrong. The new total weight is therefore at most {$(1/2)W_t + (1/3)(1/2)W_t = (3/6 + 1/6)W_t = (2/3)W_t$}, giving {$W_{t+1} \leq (2/3)W_{t}$}. (:ansend:)
- Suppose that there are {$N$} experts, each initialized with a weight of 1. Show that if {$M$} mistakes were made by time {$t$}, then {$W_{t} \leq N(2/3)^M$}.
(:ans:) The total weight at the beginning of the algorithm is {$N$}. By the previous part, each time a mistake is made the total weight is multiplied by a factor of at most {$2/3$}. Since {$M$} mistakes were made, this happened {$M$} times, and so {$W_{t} \leq N(2/3)(2/3)\dots(2/3) = N(2/3)^M$}. (:ansend:)
- It turns out that this algorithm doesn't make many more mistakes than the best expert, which you will now show. If the best expert makes {$m$} mistakes by time {$t$}, then its weight at time {$t$} will be {$(1/3)^m$}, and so the total weight is at least that much: {$W_t \geq (1/3)^m$}. Combining this inequality with your result from the previous question implies that {$(1/3)^m \leq N(2/3)^M$}. Use this inequality to show the following bound on the number of mistakes:
{$ M \leq \frac{1}{\log_3 3/2}[m+\log_3 N].$}
(:ans:) Starting with {$(1/3)^m \leq N(2/3)^M$}, we can take the log of both sides, giving {$ m\log_3(1/3) \leq \log_3 N + M\log_3(2/3).$}
Negating both sides gives {$m \geq -\log_3 N + M\log_3(3/2),$} and rearranging gives the desired result: {$M \leq \frac{1}{\log_3 3/2}\left[m+\log_3 N\right].$} (:ansend:)
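As a sanity check, the bound can also be verified numerically. The simulation below is hypothetical (random advice and outcomes) and reuses the weighted_majority sketch given earlier.

[@
# Numerically check M <= (m + log_3 N) / log_3(3/2) on random (hypothetical) data.
import math
import random

random.seed(0)
N, T = 10, 200
advice = [[random.randint(0, 1) for _ in range(N)] for _ in range(T)]
y = [random.randint(0, 1) for _ in range(T)]

M = weighted_majority(advice, y, labels=(0, 1), beta=1/3)                  # algorithm's mistakes
m = min(sum(a[i] != y_t for a, y_t in zip(advice, y)) for i in range(N))   # best expert's mistakes
bound = (m + math.log(N, 3)) / math.log(3 / 2, 3)
print(M, "<=", round(bound, 2), ":", M <= bound)
@]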