Perceptrons
The Perceptron Algorithm
Start with {$\mathbf{w}_1 = 0$}. At each time step {$i = 1, \ldots, n$}:
- Receive {$\mathbf{x}_i \in \Re^m$} and predict {$sign(\mathbf{w}_i ^\top \mathbf{x}_i) $}
- Outcome {$y_i \in \{-1, 1\}$} is revealed
- If {$sign(\mathbf{w}_i ^\top \mathbf{x}_i) \ne y_i$}, update {$\mathbf{w}_{i+1} = \mathbf{w}_i + y_i \mathbf{x}_i$}.
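To make the updates concrete, here is a minimal NumPy sketch of one online pass; the function name train_perceptron and its interface are illustrative choices rather than part of the algorithm's statement:

```python
import numpy as np

def train_perceptron(X, y):
    """One online pass of the perceptron over the sequence (X, y).

    X : (n, m) array whose rows are the feature vectors x_i
    y : (n,) array of labels in {-1, +1}
    Returns the final weight vector w_{n+1} and the number of mistakes made.
    """
    n, m = X.shape
    w = np.zeros(m)                   # start with w_1 = 0
    mistakes = 0
    for i in range(n):
        x_i, y_i = X[i], y[i]
        # Predict sign(w_i^T x_i); np.sign(0) == 0 never matches a label in
        # {-1, +1}, so a score of exactly zero also triggers an update.
        if np.sign(w @ x_i) != y_i:
            w = w + y_i * x_i         # mistake: w_{i+1} = w_i + y_i x_i
            mistakes += 1
    return w, mistakes
```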
Note that no update is performed when the correct label is predicted. Assume the data is linearly separable with margin {$\gamma > 0$}: there is some (unknown) true separator {$\mathbf{w}^*$}, which we can take to be a unit vector, such that {$y_i (\mathbf{x}_i^\top \mathbf{w}^*) \ge \gamma$} for all {$i$}. Let's look at how the "correlation" between {$\mathbf{w}^*$} and the current estimate {$\mathbf{w}_i$} evolves. If a mistake is made on round {$i$},
{$ \begin{align*} \mathbf{w}_{i+1}^\top \mathbf{w}^* & = (\mathbf{w}_i + y_i \mathbf{x}_i)^\top \mathbf{w}^* \\ & =\mathbf{w}_i^\top \mathbf{w}^* + y_i \mathbf{x}_i^\top \mathbf{w}^* \ge \mathbf{w}_i^\top \mathbf{w}^* + \gamma \end{align*}$}.
Thus, every time there is a mistake, the dot product of our hypothesis {$\mathbf{w}_i$} with the unknown {$\mathbf{w}^*$} increases by at least {$\gamma$}. Note that {$\gamma$} is always positive if the data is separable, because {$\mathbf{x}_i^\top \mathbf{w}^*$} then has the same sign as {$y_i$}. If {$M$} mistakes have been made over {$n$} rounds, we have {$\mathbf{w}_{n+1} ^\top \mathbf{w}^* \ge \gamma M$}: we started with the zero vector, the dot product grows by at least {$\gamma$} on each of the {$M$} mistake rounds, and it is unchanged on all other rounds. Now, the main question is whether this increase in the dot product is due to a smaller angle between the vectors (i.e., we are indeed getting closer to the unknown {$\mathbf{w}^*$} in direction), or merely due to an increase in the magnitude {$||\mathbf{w}_i||_2$} of {$\mathbf{w}_i$}. While the "correlation" with {$\mathbf{w}^*$} increases with every mistake, we can show that the size of the hypothesis, {$||\mathbf{w}_i||_2$}, cannot grow too fast. If there is a mistake on round {$i$},
{$ \begin{align*} ||\mathbf{w}_{i+1}||_2^2 & = ||\mathbf{w}_i + y_i \mathbf{x}_i||_2^2 \\ & = ||\mathbf{w}_i||_2^2 + 2y_i(\mathbf{w}_i^\top \mathbf{x}_i) + ||\mathbf{x}_i||_2^2 \\ & \le ||\mathbf{w}_i||_2^2 + ||\mathbf{x}_i||_2^2 \end{align*} $}
where the last inequality follows because we are assuming a mistake was made, which means {$sign(\mathbf{w}_i^\top \mathbf{x}_i) \ne y_i$} and so {$y_i(\mathbf{w}_i^\top \mathbf{x}_i) \le 0$}. (Note that little {$m$}, the number of features, is different from big {$M$}, the number of mistakes.)
Suppose that for all {$i$}, {$||\mathbf{x}_i||_2 \le R$}. Then after {$n$} rounds and {$M$} mistakes,
{$||\mathbf{w}_{n +1}||_2^2 \le M R^2$}.
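To be explicit about the last step: {$||\mathbf{w}_i||_2^2$} is unchanged on rounds where no mistake is made, so unrolling the per-round inequality over the {$M$} mistake rounds gives
{$ ||\mathbf{w}_{n+1}||_2^2 \le ||\mathbf{w}_1||_2^2 + \sum_{i \,:\, \text{mistake on round } i} ||\mathbf{x}_i||_2^2 \le 0 + M R^2 $}.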
Combining the two arguments,
{$ \begin{align} \gamma M & \le \mathbf{w}_{n+1}^\top \mathbf{w}^* \\ & = ||\mathbf{w}_{n+1}||_2 ||\mathbf{w}^*||_2 \cos(\theta) \\ & \leq ||\mathbf{w}_{n+1}||_2 ||\mathbf{w}^*||_2 \\ & = ||\mathbf{w}_{n+1}||_2 \le \sqrt{M R^2} \end{align} $}
Equation (2) follows from the definition of the dot product, where {$\theta$} is the angle between {$\mathbf{w}_{n+1}$} and {$\mathbf{w}^*$}. Equation (3) follows from the fact that {$\cos(\theta) \leq 1$}. Equation (4) follows from taking {$\mathbf{w}^*$} to be a unit vector, so {$||\mathbf{w}^*||_2 = 1$}, together with the norm bound {$||\mathbf{w}_{n+1}||_2^2 \le M R^2$} from above. Dividing both sides of {$\gamma M \le R \sqrt{M}$} by {$\gamma \sqrt{M}$} and squaring, the number of mistakes is bounded as follows:
{$ M \le \frac{R^2}{\gamma^2} $}.
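As an illustrative sanity check, the sketch below generates a small linearly separable dataset (the data-generating {$\mathbf{w}^*$}, the constants, and the variable names are all assumed for illustration, not taken from these notes), cycles the perceptron over it until a full pass makes no mistakes, and verifies that the total number of mistakes stays within {$R^2/\gamma^2$}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable data: labels are assigned by a known unit vector
# w_star, and points too close to the boundary are discarded so that the margin
# gamma is strictly positive.  All constants here are illustrative.
n, m = 200, 5
w_star = rng.normal(size=m)
w_star /= np.linalg.norm(w_star)            # take ||w*||_2 = 1
X = rng.normal(size=(n, m))
scores = X @ w_star
keep = np.abs(scores) > 0.1
X, scores = X[keep], scores[keep]
y = np.sign(scores)

R = np.max(np.linalg.norm(X, axis=1))       # R = max_i ||x_i||_2
gamma = np.min(y * (X @ w_star))            # gamma = min_i y_i x_i^T w*

# Cycle through the data until a full pass makes no mistakes, counting every
# update along the way; M is the total number of mistakes.
w = np.zeros(m)
M = 0
while True:
    made_mistake = False
    for x_i, y_i in zip(X, y):
        if np.sign(w @ x_i) != y_i:         # mistake: perceptron update
            w = w + y_i * x_i
            M += 1
            made_mistake = True
    if not made_mistake:
        break

print(f"mistakes M = {M}, bound R^2/gamma^2 = {R ** 2 / gamma ** 2:.1f}")
assert M <= R ** 2 / gamma ** 2             # the mistake bound derived above
```

Because the total number of updates is bounded by {$R^2/\gamma^2$}, this repeated-pass loop is guaranteed to terminate on separable data, which is exactly the perceptron convergence guarantee.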
The non-separable case is harder.