CIS520 Machine Learning | Lectures / Bias Variance

See also the unusually clear Wikipedia article.

This nice applet for polynomial regression is a lot of fun (as far as regression applets go). It illustrates the fundamental trade-off in learning predictive models: bias versus variance. In a nutshell, simple models often fit the training data poorly but change little from one training set to the next, while more complex models can fit the training data perfectly but are more likely to predict poorly on new examples. Let’s define this more precisely.

Let’s assume our regression data comes from some probabilistic process (nature) which we don’t know anything about except that its samples are i.i.d.:

{$ (x,y) \sim P(x,y), \;\;\;\ P(D) = \prod_{i=1}^n P(x_i,y_i) \equiv P^n $}

We have an algorithm that takes a dataset D and learns a model to predict y given x. We will call this model h(x;D), for example, a linear regression model: {$h(x;D) = w_D^\top x$}, where {$w_D$} was estimated from D using MLE or MAP or whatever else we invent. Now consider the expected squared error of our model on a new independent sample (x,y):

{$ \; \begin{align*} \mbox{Expected Error of h given D} & = \\ \mathbf{E}_{(x,y)\sim P}[(h(x;D) - y)^2] & = \int_x\int_y (h(x;D) - y)^2 P(x,y)dx dy \end{align*} \; $}
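To make this quantity concrete, here is a minimal sketch that approximates it by Monte Carlo on synthetic data. Everything about nature here is an assumption made up for illustration: x is uniform on [-1, 1], y is x squared plus Gaussian noise with standard deviation `SIGMA`, and h(x;D) is a least-squares line. The helper names `sample_nature` and `fit_line` are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
SIGMA = 0.1             # assumed noise standard deviation

def sample_nature(n):
    """Draw n i.i.d. pairs from a made-up P(x, y): x ~ U(-1, 1), y = x^2 + noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = x * x + rng.gauss(0.0, SIGMA)
        data.append((x, y))
    return data

def fit_line(data):
    """Least squares for h(x; D) = a + b * x; returns the predictor as a function."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

D = sample_nature(20)        # the one dataset we happened to observe
h = fit_line(D)              # the model learned from it
fresh = sample_nature(20000) # new independent samples (x, y)

# Sample average of the squared error approximates the double integral above.
expected_error_given_D = sum((h(x) - y) ** 2 for x, y in fresh) / len(fresh)
print(expected_error_given_D)
```

This only works because we chose P(x,y) ourselves and can draw as many fresh samples as we like.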

Since we don’t in general know P(x,y), this error is not computable, but hang on. We are usually interested in (minimizing) the expected error of our algorithm, and there is variability in which dataset D we get to observe in order to learn the model. So if we average over all the datasets we could have gotten from nature, we get the average model we would learn:

{$\overline{h}(x) = \textbf{E}_{D\sim P^n}[h(x;D)]$}

Now we are interested in how much squared error we will get by using our learning algorithm, on average. This average is over both datasets and the new sample:

{$ \; \begin{align*} \mbox{Expected Error of h} & = \\ \textbf{E}_{(x,y)\sim P, D\sim P^n}[(h(x;D) - y)^2] & = \int_x\int_y \int_D (h(x;D) - y)^2 P(x,y) P(D)dx dy dD \end{align*} \; $}

Again, in general, this is not computable unless we know P(x,y). It is, however, very instructive to see how it breaks down; we’ll see that it decomposes into three intuitive terms: {$\textrm{variance}\; +\; \textrm{bias}^2\; +\; \textrm{noise}$}, where variance is the variability of the model our algorithm learns as the dataset varies, bias is how ‘wrong’ the average model {$\overline{h}$} is, an error that persists no matter which particular dataset we happen to draw, and finally, noise is the average variance of {$y$} under nature’s {$P(y\mid x)$}.

We’ll use the following trick below. Suppose {$Z_1 \; {\rm and}\; Z_2$} are independent random variables with {$\textbf{E}[Z_1] = \overline{Z}_1$} and {$\textbf{E}[Z_2] = \overline{Z}_2$}. Consider computing the expected squared distance between them:

{$ \textbf{E}_{Z_1,Z_2}[ (Z_1-Z_2)^2] = Var(Z_1) + Var(Z_2) + (\overline{Z}_1-\overline{Z}_2)^2 $}
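The identity follows from expanding the square: by independence, {$\textbf{E}[(Z_1-Z_2)^2] = \textbf{E}[Z_1^2] + \textbf{E}[Z_2^2] - 2\,\overline{Z}_1\overline{Z}_2$}, and {$\textbf{E}[Z_i^2] = Var(Z_i) + \overline{Z}_i^2$}. As a sanity check, the sketch below evaluates both sides exactly for two small discrete distributions. The values and probabilities are made up for illustration.

```python
from itertools import product

# Two hypothetical independent discrete random variables,
# each given as a list of (value, probability) pairs.
Z1 = [(0.0, 0.5), (2.0, 0.3), (5.0, 0.2)]
Z2 = [(-1.0, 0.4), (3.0, 0.6)]

def mean(dist):
    return sum(v * p for v, p in dist)

def var(dist):
    m = mean(dist)
    return sum(p * (v - m) ** 2 for v, p in dist)

# Left-hand side: E[(Z1 - Z2)^2] under the product distribution,
# which is the joint distribution because Z1 and Z2 are independent.
lhs = sum(p1 * p2 * (v1 - v2) ** 2
          for (v1, p1), (v2, p2) in product(Z1, Z2))

# Right-hand side: Var(Z1) + Var(Z2) + (mean difference)^2.
rhs = var(Z1) + var(Z2) + (mean(Z1) - mean(Z2)) ** 2

print(lhs, rhs)  # the two values agree up to floating-point rounding
```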

First, let’s add and subtract the average predictor {$\overline{h}(x)$} inside the squared error {$(h(x;D) - y)^2$} of the expected error of h above:

{$ \textbf{E}_{x,y, D}[\left((h(x;D) - \overline{h}(x)) + (\overline{h}(x) - y)\right)^2] = \\ \textbf{E}_{x,y, D}[ (h(x;D) - \overline{h}(x))^2 + (\overline{h}(x) - y)^2 + 2(h(x;D) - \overline{h}(x))(\overline{h}(x) - y)] $}

Note that the final term, {$2(h(x;D) - \overline{h}(x))(\overline{h}(x) - y)$}, vanishes when we compute the expectation over D because {$\textbf{E}_D[h(x;D) - \overline{h}(x)]=0$} and the other factor, {$\overline{h}(x) - y$}, does not depend on D. So we have

{$ \textbf{E}_{x,y,D}[(h(x;D) - y)^2] = \underbrace{\textbf{E}_{x,D}[(h(x;D) - \overline{h}(x))^2]}_{\rm Variance} + \textbf{E}_{x,y}[(\overline{h}(x) - y)^2] $}

The first term is the variance of our regression procedure. The second term further decomposes into {$\textrm{bias}^2$} and noise. Let’s denote the average y at every x as:

{$\overline{y}(x) = \textbf{E}_{y|x}[y]$}

Then, using the same trick as above, by subtracting and adding {$\overline{y}(x)$}, we get:

{$\textbf{E}_{x,y}[(\overline{h}(x) - \overline{y}(x) + \overline{y}(x) - y)^2] = \\ \textbf{E}_{x,y}[(\overline{h}(x) - \overline{y}(x))^2 + (\overline{y}(x) - y)^2 + 2(\overline{h}(x) - \overline{y}(x))(\overline{y}(x) - y)]$}

We again note that the cross-term, {$2(\overline{h}(x)-\overline{y}(x))(\overline{y}(x) - y)$}, vanishes when we take expectation over y given x (and then x) since {$\textbf{E}_{y|x}[\overline{y}(x) - y] = 0$} for each x and {$\overline{h}(x)-\overline{y}(x)$} is constant with respect to y for each x. Hence we get:

{$\textbf{E}_{x,y}[(\overline{h}(x) - y)^2] = \underbrace{\textbf{E}_{x}[(\overline{h}(x) - \overline{y}(x))^2]}_{{\rm Bias}^2} + \underbrace{\textbf{E}_{x,y}[(\overline{y}(x) - y)^2]}_{\rm Noise}$}

The first term is the {$\textrm{bias}^2$} of our regression model and the second term is simply the noise of the data.

So in summary:

{${\rm \textbf{Error Decomposition:}}\;\; \textbf{E}_{x,y, D}[(h(x;D) - y)^2] = \underbrace{\textbf{E}_{x,D}[ (h(x;D) - \overline{h}(x))^2]}_{\rm Variance} + \underbrace{\textbf{E}_{x}[(\overline{h}(x) - \overline{y}(x))^2]}_{{\rm Bias}^2} + \underbrace{\textbf{E}_{x,y}[(\overline{y}(x) - y)^2]}_{\rm Noise}$}

Variance: {$ \textbf{E}_{x,D}[ (h(x;D) - \overline{h}(x))^2] $} — we can reduce this term by using simpler models or increasing the size of the data.

{$\textrm{Bias}^2$}: {$ \textbf{E}_{x}[(\overline{h}(x) - \overline{y}(x))^2] $} — we can reduce this term by using more complex models.

Noise: {$\textbf{E}_{x,y}[(\overline{y}(x) - y)^2]$} — this term we cannot control at all.
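The three terms can be estimated by simulation when we invent nature ourselves. The sketch below is one such setup, not a general recipe. It assumes x is uniform on [-1, 1], {$\overline{y}(x) = 0.5x$}, and Gaussian noise with standard deviation `SIGMA`. It compares a constant predictor (simple) with a least-squares line (more complex), approximates the expectation over x by an evenly spaced grid, and approximates the expectation over D by averaging over many sampled datasets. All function names are hypothetical.

```python
import random

rng = random.Random(0)   # fixed seed for reproducibility
SIGMA = 0.5              # assumed noise standard deviation of P(y | x)

def y_bar(x):
    """Assumed true conditional mean E[y | x]."""
    return 0.5 * x

def sample_dataset(n):
    """Draw D ~ P^n with x ~ U(-1, 1) and y = y_bar(x) + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, y_bar(x) + rng.gauss(0.0, SIGMA)) for x in xs]

def fit_constant(data):
    """Simple model: h(x; D) = mean of the training targets."""
    c = sum(y for _, y in data) / len(data)
    return lambda x: c

def fit_line(data):
    """More complex model: least-squares line h(x; D) = a + b * x."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def decompose(fit, n_train=20, n_datasets=1000):
    """Monte Carlo estimates of (variance, bias^2, noise, total error)."""
    grid = [i / 10.0 for i in range(-10, 11)]  # evenly spaced x, approximating E_x
    m = len(grid)
    # preds[k][j] = h(grid[j]; D_k) for independently drawn datasets D_k
    preds = [[h(x) for x in grid]
             for h in (fit(sample_dataset(n_train)) for _ in range(n_datasets))]
    h_bar = [sum(p[j] for p in preds) / n_datasets for j in range(m)]
    variance = sum((p[j] - h_bar[j]) ** 2
                   for p in preds for j in range(m)) / (n_datasets * m)
    bias2 = sum((h_bar[j] - y_bar(grid[j])) ** 2 for j in range(m)) / m
    noise = SIGMA ** 2  # known only because we chose P ourselves
    # Total error measured directly against fresh noisy targets y.
    total = sum((p[j] - (y_bar(grid[j]) + rng.gauss(0.0, SIGMA))) ** 2
                for p in preds for j in range(m)) / (n_datasets * m)
    return variance, bias2, noise, total

for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    v, b2, nz, tot = decompose(fit)
    print(f"{name:8s} variance={v:.4f} bias^2={b2:.4f} noise={nz:.4f} "
          f"sum={v + b2 + nz:.4f} total={tot:.4f}")
```

In this particular setup the constant predictor should show the smaller variance and the larger {$\textrm{bias}^2$}, the line the reverse, and for both the sum of the three terms should be close to the directly measured total error, up to Monte Carlo noise.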

Here’s a very simple example where we can compute all these quantities: https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Recitations.BiasVariance

Here’s a (draft) version of the trade-off as presented in class.