Bias Variance

See also the unusually clear wikipedia article. This nice applet for polynomial regression is a lot of fun (as far as regression applets go). It illustrates the fundamental trade-off in learning predictive models: bias versus variance. In a nutshell, simple models don’t fit the data too well but are likely to generalize better, while more complex models can fit the data perfectly but are not likely to predict well on new examples. Let’s define this more precisely.

Definition of Bias and Variance

An unbiased estimator is one which has a bias of zero. Let’s assume our regression data comes from some probabilistic process (nature) which we don’t know anything about except that it’s i.i.d.:

{$ (x,y) \sim p(x,y), \;\;\; D = \{(x_i,y_i)\}_{i=1}^n \sim p^n, \;\;\; p(D) = \prod_{i=1}^n p(x_i,y_i) $}

We have an algorithm that takes a dataset D and learns a model to predict y given x. We will call this model h(x;D), for example, a linear regression model: {$h(x;D) = w_D^\top x$}, where {$w_D$} was estimated from D using MLE or MAP or whatever else we invent. Now consider the expected squared error of our model on a new independent observation {$(x,y)$}:

{$ \; \begin{align*} \mbox{Expected Error of h given D} & = \\ \mathbf{E}_{(x,y)\sim p}[(h(x;D) - y)^2] & = \int_x\int_y (h(x;D) - y)^2 p(x,y)dx dy \end{align*} \; $}

Since we don’t in general know p(x,y), this error is not computable, but hang on. We are usually interested in (minimizing) the expected error of our algorithm, and there is variability in what dataset D we get to observe in order to learn the model. So if we average over all the datasets we could have gotten from nature, we get the average model we would learn:

{$\overline{h}(x) = \textbf{E}_{D\sim p^n}[h(x;D)]$}

Now we are interested in how much squared error we will get by using our learning algorithm, on average.
This average is over both datasets and the new sample:

{$ \; \begin{align*} \mbox{Expected Error of h} & = \\ \textbf{E}_{(x,y)\sim p, D\sim p^n}[(h(x;D) - y)^2] & = \int_x\int_y \int_D (h(x;D) - y)^2 p(x,y) p(D)dx dy dD \end{align*} \; $}

Again, in general, this is not computable unless we know p(x,y). It is however very instructive to see how it breaks down; we’ll see that it decomposes into three intuitive terms: {$\textrm{variance}\; +\; \textrm{bias}^2\; +\; \textrm{noise}$}, where variance is the variability of the model our algorithm learns as the dataset varies, bias is how ‘wrong’ our learned model is, even if it’s given infinite data, and finally, noise is the average variance of y under nature’s conditional distribution, {$p(y\mid x)$}.

We’ll use the following trick below. Suppose {$Z_1 \; {\rm and}\; Z_2$} are independent random variables with {$\textbf{E}[Z_1] = \overline{Z}_1$} and {$\textbf{E}[Z_2] = \overline{Z}_2$}. Consider computing the expected squared distance between them:

{$ \textbf{E}_{Z_1,Z_2}[ (Z_1-Z_2)^2] = Var(Z_1) + Var(Z_2) + (\overline{Z}_1-\overline{Z}_2)^2 $}

First, let’s add and subtract the average predictor {$\overline{h}(x)$} in the expression for the expected error of h:

{$ \textbf{E}_{x,y, D}[\left((h(x;D) - \overline{h}(x)) + (\overline{h}(x) - y)\right)^2] = \\ \textbf{E}_{x,y, D}[ (h(x;D) - \overline{h}(x))^2 + (\overline{h}(x) - y)^2 + 2(h(x;D) - \overline{h}(x))(\overline{h}(x) - y)] $}

Note that the final term, {$2(h(x;D) - \overline{h}(x))(\overline{h}(x) - y)$}, vanishes when we compute the expectation over D, because {$\textbf{E}_D[h(x;D) - \overline{h}(x)]=0$} and the other factor, {$\overline{h}(x) - y$}, does not depend on D. So we have

{$ \textbf{E}_{x,y,D}[(h(x;D) - y)^2] = \underbrace{\textbf{E}_{x,D}[(h(x;D) - \overline{h}(x))^2]}_{\rm Variance} + \textbf{E}_{x,y}[(\overline{h}(x) - y)^2] $}

The first term is the variance of our regression procedure. The second term further decomposes into {$\textrm{bias}^2$} and noise.
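The split of the expected error into a variance term plus {$\textbf{E}_{x,y}[(\overline{h}(x) - y)^2]$} can be checked numerically. The following is only a minimal sketch under assumptions that are not part of these notes: nature is taken to be a sine curve with Gaussian noise, the learner is a straight-line least-squares fit via `np.polyfit`, and the sample sizes are arbitrary. Expectations are replaced by averages over simulated datasets and a shared set of test points, so {$\overline{h}$} is itself only a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. (x, y) pairs from an assumed 'nature' p(x, y)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, size=n)  # assumed noise level
    return x, y

n_train, n_datasets, n_test = 20, 500, 2000   # illustrative sizes
x_test, y_test = sample(n_test)               # new observations, independent of D

# preds[d, t] = h(x_test[t]; D_d) for a straight-line fit on dataset D_d
preds = np.empty((n_datasets, n_test))
for d in range(n_datasets):
    x_tr, y_tr = sample(n_train)
    w = np.polyfit(x_tr, y_tr, deg=1)         # least-squares line
    preds[d] = np.polyval(w, x_test)

h_bar = preds.mean(axis=0)                    # estimate of the average model
total = np.mean((preds - y_test) ** 2)        # E_{x,y,D}[(h(x;D) - y)^2]
variance = np.mean((preds - h_bar) ** 2)      # E_{x,D}[(h(x;D) - h_bar(x))^2]
rest = np.mean((h_bar - y_test) ** 2)         # E_{x,y}[(h_bar(x) - y)^2]

print(total, variance + rest)
```

Because `h_bar` is the sample mean of the simulated predictions at each test point, the cross term cancels exactly in the sample averages, so the two printed numbers agree up to floating-point rounding. How close they are to the true expectations depends on the number of simulated datasets and test points.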
Let’s denote the average y at every x as:

{$\overline{y}(x) = \textbf{E}_{y|x}[y]$}

Then, using the same trick as above, by subtracting and adding {$\overline{y}(x)$}, we get:

{$\textbf{E}_{x,y}[(\overline{h}(x) - \overline{y}(x) + \overline{y}(x) - y)^2] = \\ \textbf{E}_{x,y}[(\overline{h}(x) - \overline{y}(x))^2 + (\overline{y}(x) - y)^2 + 2(\overline{h}(x) - \overline{y}(x))(\overline{y}(x) - y)]$}

We again note that the cross-term, {$2(\overline{h}(x)-\overline{y}(x))(\overline{y}(x) - y)$}, vanishes when we take the expectation over y given x (and then over x), since {$\textbf{E}_{y|x}[\overline{y}(x) - y] = 0$} for each x and {$\overline{h}(x)-\overline{y}(x)$} is constant with respect to y for each x. Hence we get:

{$\textbf{E}_{x,y}[(\overline{h}(x) - y)^2] = \underbrace{\textbf{E}_{x}[(\overline{h}(x) - \overline{y}(x))^2]}_{{\rm Bias}^2} + \underbrace{\textbf{E}_{x,y}[(\overline{y}(x) - y)^2]}_{\rm Noise}$}

The first term is the {$\textrm{bias}^2$} of our regression model and the second term is simply the noise of the data. So in summary:

{${\rm \textbf{Error Decomposition:}}\;\; \textbf{E}_{x,y, D}[(h(x;D) - y)^2] = \underbrace{\textbf{E}_{x,D}[ (h(x;D) - \overline{h}(x))^2]}_{\rm Variance} + \underbrace{\textbf{E}_{x}[(\overline{h}(x) - \overline{y}(x))^2]}_{{\rm Bias}^2} + \underbrace{\textbf{E}_{x,y}[(\overline{y}(x) - y)^2]}_{\rm Noise}$}

Variance: {$ \textbf{E}_{x,D}[ (h(x;D) - \overline{h}(x))^2] $}. We can reduce this term by using simpler models or increasing the size of the data.

{$\textrm{Bias}^2$}: {$ \textbf{E}_{x}[(\overline{h}(x) - \overline{y}(x))^2] $}. We can reduce this term by using more complex models.

Noise: {$\textbf{E}_{x,y}[(\overline{y}(x) - y)^2]$}. This term we cannot control at all.

Here’s a very simple example where we can compute all these quantities.

Bias Variance Decomposition for Linear regression

Let’s see how bias and variance show up in linear regression. For OLS, {$Bias(\hat{w})=0$}.
On average, the parameters are neither under- nor overestimated. If one applies regularization, {$\hat{w}$} will shrink and the estimator becomes biased. However, if there is noise in the training data, {$\hat{w}$} will have nonzero variance. Shrinkage reduces this variance. Overall prediction accuracy is a tradeoff between bias and variance. It is best understood by looking at the bias and variance of the prediction {$\hat{y}$}.

Definitions

So: draw infinitely many training sets, {$D$}, each of size {$n$}, from the distribution of {$(x,y)$}. Estimate a model {$\hat{y}$} (called “h” above) for each training set. We sometimes write {$\hat{y}(x;D)$} to emphasize that {$\hat{y}$} is a function of {$x$} and was estimated on {$D$}. Then repeatedly draw {$(x,y)$} pairs and compute the difference between the prediction and the truth. Average this difference over training sets {$D$} and over test data {$(x,y)$} (it includes the training data, but if there are an infinite number of points, then the training data is a vanishingly small fraction of them). This gives the bias. For OLS, {$E[y] = wx$} and {$Bias(\hat{y})=0$}.

We can view the expected error (squared error, here) from a regression model as:

Error = {$E[(y - \hat{y})^2]$} = {$Bias(\hat{y})^2 + Var(\hat{y}) + \sigma^2$}

Proof sketch

{$E[(y - \hat{y})^2] = E[(y - E[\hat{y}] + E[\hat{y}] - \hat{y})^2]$}

Now do a similar trick to pull out the bias and noise from within the first term:

{$E[(y - E[\hat{y}])^2] = E[(y - E[y] + E[y] - E[\hat{y}])^2]$}

To show that the cross term {$ E[ 2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})] $} vanishes, note that the expectation is over all samples of training data sets {$D$} (as well as over x and y). Take {$E_D$} first. Note that {$2(y - E[\hat{y}])$} does not depend on the training data {$D$}, and that {$E[\hat{y}] - \hat{y}$} is zero