CIS520 Machine Learning | Lectures / Perceptrons

Perceptrons

 

The Perceptron Algorithm

Start with {$\mathbf{w}_1 = 0$}. At each time step {$i = 1, \ldots, n$}:

  • Receive {$\mathbf{x}_i \in \Re^m$} and predict {$sign(\mathbf{w}_i ^\top \mathbf{x}_i) $}
  • Outcome {$y_i \in \{-1, 1\}$} is revealed
  • If {$sign(\mathbf{w}_i ^\top \mathbf{x}_i) \ne y_i$}, update {$\mathbf{w}_{i+1} = \mathbf{w}_i + y_i \mathbf{x}_i$}.
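The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sign` and `perceptron` and the toy data are assumptions made for this example, and `sign(0)` is taken to be 0, so a zero score always counts as a mistake.

```python
from typing import List, Sequence, Tuple


def sign(z: float) -> int:
    """Return +1, -1, or 0 for a positive, negative, or zero input."""
    return (z > 0) - (z < 0)


def perceptron(xs: Sequence[Sequence[float]],
               ys: Sequence[int]) -> Tuple[List[float], int]:
    """Run one online pass; return the final weights and the mistake count."""
    w = [0.0] * len(xs[0])              # w_1 = 0
    mistakes = 0
    for x, y in zip(xs, ys):            # rounds i = 1, ..., n
        score = sum(wj * xj for wj, xj in zip(w, x))
        if sign(score) != y:            # prediction disagrees with the revealed label
            w = [wj + y * xj for wj, xj in zip(w, x)]   # w_{i+1} = w_i + y_i x_i
            mistakes += 1
    return w, mistakes


# Hypothetical toy data: the label is the sign of the first coordinate.
xs = [(1.0, 0.5), (-1.0, 0.2), (2.0, -1.0), (-0.5, -0.5)]
ys = [1, -1, 1, -1]
w, mistakes = perceptron(xs, ys)
print(w, mistakes)   # [1.0, 0.5] 1
```

On this data the only mistake happens on the first round, where the zero vector predicts 0; the single update already classifies the remaining three points correctly.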

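Note that no update is performed when the correct label is predicted. To analyze the algorithm, assume the data is linearly separable with a margin: there is a unit vector {$\mathbf{w}^*$} (unknown to the learner) and a {$\gamma > 0$} such that {$y_i \mathbf{x}_i^\top \mathbf{w}^* \ge \gamma$} for all {$i$}. Let’s look at how the “correlation” between the true {$\mathbf{w}^*$} and the current estimate {$\mathbf{w}_i$} evolves. If a mistake is made on round {$i$},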

{$ \begin{align*} \mathbf{w}_{i+1}^\top \mathbf{w}^* & = (\mathbf{w}_i + y_i \mathbf{x}_i)^\top \mathbf{w}^* \\ & =\mathbf{w}_i^\top \mathbf{w}^* + y_i \mathbf{x}_i^\top \mathbf{w}^* \ge \mathbf{w}_i^\top \mathbf{w}^* + \gamma \end{align*}$}.

Thus, every time there is a mistake, the dot product of our hypothesis with the unknown {$\mathbf{w}^*$} increases by at least {$\gamma$}. Note that {$\gamma$} can be taken to be positive if the data is separable, because {$\mathbf{x}_i^\top \mathbf{w}^*$} in that case has the same sign as {$y_i$}. If {$M$} mistakes have been made over {$n$} rounds, we have {$\mathbf{w}_{n+1} ^\top \mathbf{w}^* \ge \gamma M$}, since we started with the zero vector. Now, the main question is whether the dot product increases because the angle between the vectors is shrinking (i.e., we are indeed getting closer to the unknown {$\mathbf{w}^*$} in direction), or merely because the magnitude {$||\mathbf{w}_i||_2$} is growing. While the “correlation” with {$\mathbf{w}^*$} increases with every mistake, we can show that the size of the hypothesis {$||\mathbf{w}_i||_2$} cannot increase too fast. If there is a mistake on round {$i$},

{$ \begin{align*} ||\mathbf{w}_{i+1}||_2^2 & = ||\mathbf{w}_i + y_i \mathbf{x}_i||_2^2 \\ & = ||\mathbf{w}_i||_2^2 + 2y_i(\mathbf{w}_i^\top \mathbf{x}_i) + ||\mathbf{x}_i||_2^2 \\ & \le ||\mathbf{w}_i||_2^2 + ||\mathbf{x}_i||_2^2 \end{align*} $}

where the last inequality follows because we are assuming a mistake was made, which means {$sign(\mathbf{w}_i^\top \mathbf{x}_i) \ne y_i$} and so {$y_i(\mathbf{w}_i^\top \mathbf{x}_i)$} is at most zero. (Note that the little {$m$} in {$\Re^m$} is just the number of features, which is different from big {$M$}, the number of mistakes.)

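Suppose that for all {$i$}, {$||\mathbf{x}_i||_2 \le R$}. Since {$\mathbf{w}_1 = 0$}, and the squared norm grows by at most {$R^2$} on each mistake and is unchanged otherwise, after {$n$} rounds and {$M$} mistakes,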

{$||\mathbf{w}_{n +1}||_2^2 \le M R^2$}.

Combining the two arguments,

{$ \begin{align} \gamma M & \le \mathbf{w}_{n+1}^\top \mathbf{w}^* \\ & = ||\mathbf{w}_{n+1}||_2 ||\mathbf{w}^*||_2 \cos(\theta) \\ & \leq ||\mathbf{w}_{n+1}||_2 ||\mathbf{w}^*||_2 \\ & = ||\mathbf{w}_{n+1}||_2 \le \sqrt{M R^2} \end{align} $}

Equation (2) follows from the definition of the dot product, where {$\theta$} is the angle between {$\mathbf{w}_{n+1}$} and {$\mathbf{w}^*$}. Equation (3) follows from the fact that {$\cos(\theta) \leq 1$}. Finally, equation (4) follows from the fact that we can consider {$\mathbf{w}^*$} a unit vector, so {$||\mathbf{w}^*||_2 = 1$}, together with the norm bound {$||\mathbf{w}_{n+1}||_2^2 \le M R^2$}. The chain gives {$\gamma M \le R\sqrt{M}$}; dividing both sides by {$\gamma\sqrt{M}$} and squaring bounds the number of mistakes:

{$M \le \frac{R^2}{\gamma^2}$}.
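The bound can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original notes: the separator `w_star = [0.6, 0.8]`, the margin cutoff `0.1`, the sample size, and the random seed are all hypothetical choices. It builds separable data with a known unit-norm {$\mathbf{w}^*$}, runs the perceptron once over it, and asserts the two invariants from the proof after every mistake, followed by the final bound.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


random.seed(0)
w_star = [0.6, 0.8]                     # hypothetical separator with ||w*||_2 = 1
xs, ys = [], []
while len(xs) < 200:
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    score = dot(w_star, x)
    if abs(score) >= 0.1:               # keep only points at distance >= 0.1 from the boundary
        xs.append(x)
        ys.append(1 if score > 0 else -1)

gamma = min(y * dot(w_star, x) for x, y in zip(xs, ys))   # realized margin
R = max(math.sqrt(dot(x, x)) for x in xs)                 # largest ||x_i||_2

w = [0.0, 0.0]                          # w_1 = 0
M = 0                                   # number of mistakes
for x, y in zip(xs, ys):
    s = dot(w, x)
    pred = (s > 0) - (s < 0)            # sign, with sign(0) = 0
    if pred != y:
        w = [wj + y * xj for wj, xj in zip(w, x)]
        M += 1
        # Invariants from the proof (small tolerance for floating-point error).
        assert dot(w, w_star) >= gamma * M - 1e-9
        assert dot(w, w) <= M * R ** 2 + 1e-9

bound = R ** 2 / gamma ** 2
print(f"mistakes M = {M}, bound R^2 / gamma^2 = {bound:.1f}")
assert M <= bound
```

Because the points lie in {$[-1, 1]^2$} and the margin is at least 0.1, the bound here is at most {$2 / 0.1^2 = 200$}, no matter how many rounds are played.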

The Non-Separable Case is harder

Page last modified on 11 October 2017 at 11:20 AM