Online Recitation

We have focused thus far on the “batch” learning setting, where all of the training data is available to the learning algorithm at once. Consider instead the case where we do not have access to the data ahead of time. For instance, let’s look at online algorithms that make predictions on binary sequences. Suppose we have access to a set {$H$} of {$m$} experts (or hypotheses). An expert {$h_j \in H$} predicts the outcome {$h_j(x_i)$} at every step, after receiving the input {$x_i$}. Let {$\hat{y}_i$} be our prediction and {$y_i$} be the outcome at step {$i$} (the adversary’s move). Suppose the loss is the usual 0/1 loss: whether or not a mistake was made. More formally, we can think of learning with expert advice as the following game. At each time step {$i = 1$} to {$n$}:

1. the experts reveal their predictions {$h_1(x_i), \ldots, h_m(x_i)$};
2. the player predicts {$\hat{y}_i$};
3. the adversary reveals the true outcome {$y_i$};
4. the player suffers the loss {${\bf 1}[\hat{y}_i \neq y_i]$}.
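As a rough sketch, this game can be written as a loop; the function and parameter names below (play, reveal_outcome, predict) are illustrative placeholders, not notation from the notes.

```python
def play(experts, inputs, reveal_outcome, predict):
    """Run the expert-advice game and count the player's 0/1 mistakes.

    experts:         list of functions h_j mapping an input x_i to {0, 1}
    inputs:          the sequence x_1, ..., x_n
    reveal_outcome:  the adversary; maps the step index i to the outcome y_i
    predict:         the player's rule; maps the experts' advice to a guess
    """
    mistakes = 0
    for i, x in enumerate(inputs):
        advice = [h(x) for h in experts]   # experts reveal h_j(x_i)
        y_hat = predict(advice)            # player predicts
        y = reveal_outcome(i)              # adversary reveals y_i
        mistakes += int(y_hat != y)        # 0/1 loss
    return mistakes

# Toy usage: two constant experts, alternating outcomes, player copies expert 0.
print(play(experts=[lambda x: 0, lambda x: 1],
           inputs=range(6),
           reveal_outcome=lambda i: i % 2,
           predict=lambda advice: advice[0]))  # prints 3
```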
Halving algorithm: Recap

Suppose there is an expert that makes no mistakes. Then there is a very simple algorithm, called “halving”, which makes no more than {$\log_2 m$} mistakes: at each time step, predict with the majority vote and then throw away all of the experts that were wrong.

Mistake bound

Let {$H_i \subseteq \{h_1,\ldots,h_m\}$} be the set of experts still under consideration at step {$i$}, with {$H_1 = \{h_1,\ldots,h_m\}$}.

Theorem. If the player predicts {$\hat{y}_i = \mathrm{majority}(H_i)$} and sets {$H_{i+1} = \{h_j \in H_i : h_{j}(x_i) = y_i\}$}, she will suffer no more than {$\log_2 m$} mistakes.

Proof. Each time the player makes a mistake, at least half of {$H_i$} is wrong. Hence, whenever a mistake occurs, {$|H_{i+1}| \leq |H_i|/2$}. Thus, if {$M$} mistakes have occurred by the end of step {$n$}, then {$|H_{n+1}| \leq m/2^M$}. Since the size of {$H_{n+1}$} can never be less than 1 (under our assumption that at least one expert is perfect), {$2^M$} cannot be larger than {$m$}. That is, {$2^M \leq m$}. Taking the log of both sides, we see that the number of mistakes {$M$} must be at most {$\log_2 m$}.

Weighted Majority

You’ve seen that the halving algorithm makes at most {$\log_2 m$} mistakes. However, if there is no perfect expert, this algorithm will not work: since every expert might make mistakes, they will all eventually be eliminated. Instead of throwing away the experts that make mistakes, the weighted majority algorithm multiplies an expert’s weight by some {$0 \leq \beta \leq 1$} each time it makes a mistake. Writing {$h_{i,t}$} for expert {$i$}’s prediction at step {$t$} and {$w_{i,t}$} for its weight (with {$w_{i,1} = 1$}), the algorithm is as follows (a code sketch appears after the update rule below). At each time step {$t = 1$} to {$T$}:
Predict {$\hat{y}_t = \arg\max_y \sum_i w_{i,t} {\bf 1}[h_{i,t} = y]$}.
After observing {$y_t$}, update each weight: {$w_{i,t+1} = \begin{cases} \beta w_{i,t} & \text{if } h_{i,t} \neq y_t \\ w_{i,t} & \text{if } h_{i,t} = y_t \end{cases}$}
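The following is a minimal Python sketch of one round of this update (the toy weights and predictions are made up for illustration). Setting {$\beta = 0$} recovers the halving algorithm, which is the subject of the first question below.

```python
def weighted_majority_step(weights, expert_preds, y_true, beta=0.5):
    """One round of weighted majority for binary outcomes.

    weights:       current weights w_{i,t}, one per expert
    expert_preds:  each expert's prediction h_{i,t} in {0, 1}
    y_true:        the revealed outcome y_t
    Returns the player's prediction and the updated weights.
    """
    # Weighted vote: predict the label carrying the larger total weight.
    vote1 = sum(w for w, h in zip(weights, expert_preds) if h == 1)
    vote0 = sum(w for w, h in zip(weights, expert_preds) if h == 0)
    y_hat = 1 if vote1 >= vote0 else 0

    # Multiply the weight of every expert that was wrong by beta.
    new_weights = [w * beta if h != y_true else w
                   for w, h in zip(weights, expert_preds)]
    return y_hat, new_weights

# Toy usage: three experts, all weights start at 1, beta = 1/3.
weights = [1.0, 1.0, 1.0]
y_hat, weights = weighted_majority_step(weights, [1, 1, 0], y_true=0, beta=1/3)
print(y_hat, weights)  # 1 (a mistake), weights become [1/3, 1/3, 1.0]
```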
Mistake bound

For the following questions, assume {$y \in \{0,1\}$}, let {$N$} denote the number of experts, and let {$W_t = \sum_i w_{i,t}$} denote the total weight at step {$t$}.

For what value of {$\beta$} does weighted majority behave exactly like the halving algorithm? (:ans:) {$\beta = 0$}: an expert that makes a mistake gets weight zero, which is the same as eliminating it. (:ansend:)
Suppose {$\beta = 1/3$}. Show that whenever the algorithm makes a mistake, {$W_{t+1} \leq (2/3)W_t$}. (:ans:) If a mistake is made, then at least half of the total weight must have voted for the wrong prediction. That portion of the weight gets multiplied by {$1/3$}, so the new total weight is at most {$(1/2)W_t + (1/3)(1/2)W_t = (2/3)W_t$}, giving {$W_{t+1} \leq (2/3)W_t$}. (:ansend:)
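As an aside, the same argument goes through for a general {$\beta$}: a mistake means that at least half of the total weight is multiplied by {$\beta$}, so

{$W_{t+1} \leq \frac{1}{2}W_t + \beta \cdot \frac{1}{2}W_t = \frac{1+\beta}{2}W_t,$}

which gives the {$2/3$} factor above when {$\beta = 1/3$}.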
Suppose the algorithm has made {$M$} mistakes by step {$t$}. Give an upper bound on {$W_t$}. (:ans:) The total weight at the beginning of the algorithm is {$N$}, since every expert starts with weight 1. Each time a mistake is made, the total weight is multiplied by a factor of at most {$2/3$}. Since {$M$} mistakes were made, this happened {$M$} times, and so {$W_{t} \leq N(2/3)(2/3)\cdots(2/3) = N(2/3)^M$}. (:ansend:)
Suppose in addition that the best expert makes {$m$} mistakes, so that its weight is {$(1/3)^m$}. Show that {$M \leq \frac{1}{\log_3 3/2}\left[m+\log_3 N\right]$}. (:ans:) The best expert’s weight can be no larger than the total weight, so by the previous answer {$(1/3)^m \leq W_t \leq N(2/3)^M$}. Taking the log base 3 of both sides gives {$m\log_3(1/3) \leq \log_3 N + M\log_3(2/3)$}. Negating both sides (and using {$-\log_3(2/3) = \log_3(3/2)$}) gives {$m \geq -\log_3 N + M\log_3(3/2)$}, and rearranging gives the desired result: {$M \leq \frac{1}{\log_3 3/2}\left[m+\log_3 N\right]$}. (:ansend:)
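To get a feel for the bound, here is a quick numeric plug-in with made-up values ({$N = 100$} experts and a best expert that makes {$m = 10$} mistakes); the numbers are purely illustrative.

```python
import math

# Illustrative values: N experts, best expert makes m mistakes.
N, m = 100, 10

def log3(x):
    return math.log(x, 3)

# Mistake bound for weighted majority with beta = 1/3.
bound = (m + log3(N)) / log3(3 / 2)
print(f"M <= {bound:.1f}")  # prints "M <= 38.5", i.e. at most 38 mistakes
```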