(based on Sasha Rakhlin’s notes)
We have focused thus far on the “batch” learning setting, where all the training data is available to the learning algorithm at once. Many learning tasks are better described as incremental or “online”. Let us first compare and contrast the two. In “batch” learning, an algorithm takes the data D and returns a hypothesis h(x). Batch learning assumes a “static” nature of the world: a new example that we encounter will be similar to the training data. More precisely, we assume that all the training samples, as well as the test point, are independent and identically distributed. The questions addressed in the previous lecture on generalization, and by statistical learning theory more broadly, are: How many examples are needed to have small expected error with a certain confidence? What is the lowest error that can be guaranteed under certain conditions on the class of hypotheses? And so on.
If the world is not static, it might be difficult to take advantage of large amounts of data. We can no longer rely on statistical assumptions. In fact, we take an even more dramatic view of the world: we suppose that there are no correlations whatsoever between any two days. As there is no stationary distribution responsible for the data, we no longer aim to minimize some expected error. All we want is to survive, no matter how adversarial the world is. By surviving, we mean that we do not do too badly relative to other “agents” in the world. In fact, the goal is to do not much worse than the best “agent” (the difference between our cumulative loss and that of the best agent will be called regret). Note that this does not guarantee that we will be doing well in absolute terms, as the best agent might be quite bad. However, this is the price we have to pay for not having any coherence between today and tomorrow. The goal of “online learning” is, therefore, to do almost as well as the best competitor.
We’ll start with the simplest and most famous setting, “prediction with expert advice”, and then turn to the good old Perceptron.
Here’s a little game where the computer is learning online to read your mind: MindReader.
Suppose we have access to a set {$H$} of {$m$} experts (or hypotheses). An expert {$h_j \in H$} predicts the outcome {$h_j(x_i)$} at every step {$i$}, after receiving the input {$x_i$}.
Let {$\hat{y}_i$} be our prediction and {$y_i$} be the outcome at step {$i$} (the adversary’s move). Suppose the loss is the usual 0/1 loss: whether a mistake was made or not. More formally, we can think of learning with expert advice as the following game. At each time step {$i = 1$} to {$n$}: (1) the input {$x_i$} is revealed and each expert announces its prediction {$h_j(x_i)$}; (2) the player makes a prediction {$\hat{y}_i$}; (3) the adversary reveals the outcome {$y_i$}; (4) the player suffers a mistake if {$\hat{y}_i \neq y_i$}.
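To make the protocol concrete, here is a minimal Python sketch of this game loop. The function and argument names (`play`, `player`, `adversary`) are our own illustrative choices, not part of the lecture; any concrete player strategy (such as the Halving Algorithm below) can be plugged in.

```python
def play(experts, player, adversary, inputs):
    """Run the prediction-with-expert-advice game and count 0/1 mistakes."""
    mistakes = 0
    for i, x in enumerate(inputs):
        advice = [h(x) for h in experts]   # each expert h_j announces h_j(x_i)
        y_hat = player(advice)             # the player commits to a prediction
        y = adversary(i, x, y_hat)         # the adversary then reveals the outcome y_i
        mistakes += int(y_hat != y)        # suffer a mistake if y_hat != y_i
    return mistakes
```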
Suppose there is an expert which makes no mistakes. Then there is a very simple algorithm, called “halving”, which makes no more than {$\log_2 m$} mistakes. The algorithm is really very simple: at each time step, predict with the majority vote of the surviving experts and then throw away all the experts that were wrong. Let {$H_i \subseteq \{h_1,\ldots,h_m\}$} denote the set of surviving experts at the start of step {$i$}, with {$H_1 = \{h_1,\ldots,h_m\}$}.
Theorem. If the player predicts {$ \hat{y}_i = majority(H_i) $} and sets {$H_{i+1} = \{h_j \in H_i : h_{j}(x_i) = y_i\},$} she will suffer no more than {$\log_2 m $} mistakes.
Proof. Each time the player makes a mistake, at least half of {$H_i$} is wrong. Hence, whenever a mistake occurs, {$|H_{i+1}| \leq |H_i|/2$}. Thus, if {$M$} mistakes have occurred by the end of step {$n$}, then {$|H_{n+1}| \leq m/2^M$}. Since the size of {$H_{n+1}$} can never be less than 1 (under our assumption that at least one expert is perfect), {$2^M$} cannot be larger than {$m$}. That is, {$2^M \leq m$}. So, taking the log of both sides, we see that the number of mistakes, {$M$}, must be {$\leq \log_2 m$}.
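Since the algorithm is so short, it may help to see it written out. Below is a minimal Python sketch under the realizability assumption (at least one expert in the list never errs); representing experts as functions from inputs to labels is our own convention.

```python
from collections import Counter

def halving(experts, stream):
    """Halving Algorithm: predict by majority vote, discard experts that err.

    experts: list of functions x -> label; at least one is assumed perfect.
    stream:  iterable of (x_i, y_i) pairs.
    Returns the number of mistakes, at most log2(len(experts)) under realizability.
    """
    H = list(experts)                         # H_1: all experts are alive
    mistakes = 0
    for x, y in stream:
        votes = Counter(h(x) for h in H)
        y_hat = votes.most_common(1)[0][0]    # majority vote of the surviving experts
        if y_hat != y:
            mistakes += 1
        H = [h for h in H if h(x) == y]       # keep only the experts that were right
    return mistakes
```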
Suppose there are infinitely many experts, parametrized by vectors {$\mathbf{w}$} on the unit sphere in {$\Re^m$}. This looks like a difficult scenario, so let’s assume some linear structure on the way experts make predictions. On round {$i$}, the input {$\mathbf{x}_i$} is revealed and the expert parametrized by {$\mathbf{w}$} predicts {$h_{\mathbf{w}}(\mathbf{x}_i) = sign(\mathbf{w}^\top \mathbf{x}_i)$}. The outcome {$y_i$} is then the label of {$\mathbf{x}_i$}. As in the previous section, assume that there is a perfect expert {$h_{\mathbf{w}^*}$} such that the data {$(\mathbf{x}_1, y_1),\ldots, (\mathbf{x}_n , y_n)$} are linearly separable: {$y_i \mathbf{x}_i^\top\mathbf{w}^*\geq 0$} for all {$i$}.
For our purposes, however, this “separability” assumption is not enough. Suppose in addition that this perfect expert’s Euclidean norm is 1 ({$||\mathbf{w}^*||_2 = 1$}) and that there is some margin {$\gamma > 0$} such that for all {$i$}:
{$y_i \mathbf{x}_i^\top \mathbf{w}^* > \gamma$}
This second condition is called the margin assumption. Because of the margin assumption, we can discretize the uncountable space of all experts (i.e. the sphere) and pick a finite number of them for our problem. It is easy to show that if we discretize at a fine enough level, there will be an expert in the finite set which has margin at least {$\gamma/2$} on the data, i.e. it is also a perfect expert. The size of the covering is {$N = O((1/\gamma)^m)$}, and the previous section shows that we will make at most {$\log_2 N = O(m \log (1/\gamma))$} mistakes if we predict according to the majority. However, this algorithm is inefficient, as it ignores the important linear structure of the space. The nice thing about the reduction is its simplicity: it is transparent how the margin assumption allows us to discretize and only consider a finite number of experts. Another good aspect is that the dependence on the margin is only {$\log \gamma^{-1}$}. The bad aspects: running the majority algorithm over {$O((1/\gamma)^m)$} experts is computationally infeasible, and the dependence on the dimension is linear. The truth is that, by reducing to the Halving Algorithm, we forgot that we have a nice linear space.

The Perceptron algorithm is an efficient algorithm for this problem, and it has some remarkable history. Invented by Rosenblatt in 1957, this algorithm is the father of neural networks and the starting point of machine learning. A 1969 book by Minsky and Papert showed that perceptrons are limited in what they can learn, and this caused a decline in the field for at least a decade. It was later shown that neural networks with several layers are significantly more powerful, and the field of machine learning started to blossom.
Instead of reducing to the Halving Algorithm, we will do the following.
Start with {$\mathbf{w}_1 = 0$}. At each time step {$i = 1$} to {$n$}: (1) receive the input {$\mathbf{x}_i$} and predict {$\hat{y}_i = sign(\mathbf{w}_i^\top \mathbf{x}_i)$}; (2) observe the outcome {$y_i$}; (3) if {$\hat{y}_i \ne y_i$}, update {$\mathbf{w}_{i+1} = \mathbf{w}_i + y_i \mathbf{x}_i$}; otherwise set {$\mathbf{w}_{i+1} = \mathbf{w}_i$}.
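Here is a minimal NumPy sketch of this update rule; representing the data as a stream of (x, y) pairs and breaking the tie sign(0) as +1 are our own choices, not prescribed by the lecture.

```python
import numpy as np

def perceptron(stream, d):
    """Online Perceptron in R^d. stream yields (x_i, y_i) with y_i in {-1, +1}."""
    w = np.zeros(d)                        # w_1 = 0
    mistakes = 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        y_hat = 1 if w @ x >= 0 else -1    # predict sign(w^T x), ties broken as +1
        if y_hat != y:                     # mistake: move w toward the correct side
            w = w + y * x                  # w_{i+1} = w_i + y_i x_i
            mistakes += 1
    return w, mistakes
```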
Note that no update is performed when the correct label is predicted. Let’s look at how the “correlation” between {$\mathbf{w}^*$} and {$\mathbf{w}_i$} evolves. If a mistake is made on round {$i$},
{$ \begin{align*} \mathbf{w}_{i+1}^\top \mathbf{w}^* & = (\mathbf{w}_i + y_i \mathbf{x}_i)^\top \mathbf{w}^* \\ & = \mathbf{w}_i^\top \mathbf{w}^* + y_i \mathbf{x}_i^\top \mathbf{w}^* \ge \mathbf{w}_i^\top \mathbf{w}^* + \gamma. \end{align*} $}
Thus, every time there is a mistake, the dot product of our hypothesis {$\mathbf{w}_i$} with the unknown {$\mathbf{w}^*$} increases by at least {$\gamma$}. (Note that {$y_i \mathbf{x}_i^\top \mathbf{w}^*$} is always positive when the data are separable, since {$\mathbf{x}_i^\top \mathbf{w}^*$} then has the same sign as {$y_i$}; the margin assumption strengthens this to at least {$\gamma$}.) If {$M$} mistakes have been made over {$n$} rounds, we have {$\mathbf{w}_{n+1}^\top \mathbf{w}^* \ge \gamma M$}, since we started with the zero vector. Now, the main question is whether the dot product increases because the angle between the vectors shrinks (i.e. we are indeed getting closer to the unknown {$\mathbf{w}^*$} in terms of direction), or merely because the magnitude of {$\mathbf{w}_i$}, {$||\mathbf{w}_i||_2$}, grows. While the “correlation” with {$\mathbf{w}^*$} increases with every mistake, we can show that the size of the hypothesis, {$||\mathbf{w}_i||_2$}, cannot increase too fast. If there is a mistake on round {$i$},
{$ \begin{align*} ||\mathbf{w}_{i+1}||_2^2 & = ||\mathbf{w}_i + y_i \mathbf{x}_i||_2^2 \\ & = ||\mathbf{w}_i||_2^2 + 2y_i(\mathbf{w}_i^\top \mathbf{x}_i) + ||\mathbf{x}_i||_2^2 \\ & \le ||\mathbf{w}_i||_2^2 + ||\mathbf{x}_i||_2^2 \end{align*} $}
where the last inequality follows because we are assuming a mistake was made, which means {$sign(\mathbf{w}_i^\top \mathbf{x}_i) \ne y_i$} and so {$y_i(\mathbf{w}_i^\top \mathbf{x}_i)$} is not positive. (Note that little {$m$}, the dimension or number of features, is different from big {$M$}, the number of mistakes.)
Suppose that {$||\mathbf{x}_i||_2 \le R$} for all {$i$}. Since {$\mathbf{w}_1 = 0$} and the squared norm grows by at most {$R^2$} on each of the {$M$} mistakes (and does not change on the other rounds), after {$n$} rounds we have
{$||\mathbf{w}_{n +1}||_2^2 \le M R^2$}.
Combining the two arguments,
{$ \begin{align} \gamma M & \le \mathbf{w}_{n+1}^\top \mathbf{w}^* \\ & = ||\mathbf{w}_{n+1}||_2 \, ||\mathbf{w}^*||_2 \cos(\theta) \\ & \leq ||\mathbf{w}_{n+1}||_2 \, ||\mathbf{w}^*||_2 \\ & \leq ||\mathbf{w}_{n+1}||_2 \le \sqrt{M R^2} \end{align} $}
Equation (2) follows from the definition of the dot product, where {$\theta$} is the angle between {$\mathbf{w}_{n+1}$} and {$\mathbf{w}^*$}. Equation (3) follows from the fact that {$\cos(\theta) \leq 1$}. Finally, equation (4) follows from our assumption that {$\mathbf{w}^*$} is a unit vector, so {$||\mathbf{w}^*||_2 = 1$}. Rearranging {$\gamma M \le \sqrt{M R^2} = R\sqrt{M}$} (divide both sides by {$\gamma\sqrt{M}$} and square), the number of mistakes is bounded as follows:
{$M \le \frac{R^2}{\gamma^2}$}.
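As a sanity check, one can generate data satisfying the margin assumption and verify that the number of mistakes never exceeds {$R^2/\gamma^2$}. The sketch below reuses the perceptron function from the earlier sketch; the dimension, sample size, margin value, and rejection-sampling scheme are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma = 5, 2000, 0.1

w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # ||w*||_2 = 1

# Rejection-sample points whose margin |x^T w*| exceeds gamma, labeled by w*.
X, Y = [], []
while len(X) < n:
    x = rng.uniform(-1, 1, size=d)
    margin = x @ w_star
    if abs(margin) > gamma:
        X.append(x)
        Y.append(1 if margin > 0 else -1)   # then y_i x_i^T w* > gamma holds

R = max(np.linalg.norm(x) for x in X)       # bound on ||x_i||_2
w, mistakes = perceptron(zip(X, Y), d)      # Perceptron sketch from above
print(mistakes, "<=", R ** 2 / gamma ** 2)  # the mistake bound should hold
```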
The non-separable case is a little bit harder to analyze, but we will not cover it in this class. See this survey for details.