To determine the complexity (the “description length”, measured in bits) of a model and associated data set, we code two things: the residual {$ r = y-\hat{y} $} and the model. For linear regression, the model will further be broken down into coding which features are present or absent from the model, and what the values of the associated weights are. This then gives us a loss function with two terms. The first, coding the residual, is proportional to the training err. The second term measures the model complexity. Minimizing their sum, i.e. finding the “minimum description length”, minimizes (in expectation) the test error.
There are two largely equivalent ways to think about coding the residual. Conceptually, the idea is that since the data are generated by a noisy process {$ y \sim N(w^\top x, \sigma^2) $}, there is no point in coding each residual {$ r $} any more accurately than to precision of roughly {$ \sigma $}. We can also look at the the entropy of the residual in terms of the log-likelihood of {$ y $} given {$x$} and {$w$}:
{$ - \log P(D \mid \textbf{w},\sigma) = - n\log(\frac{1}{\sigma\sqrt{2\pi}}) + \frac{1}{2\sigma^2}\sum_i (y_i - \textbf{x}_ī\textbf{w})^2 = n\log(\sigma\sqrt{2\pi}) + \frac{1}{2\sigma^2} Err $}
where {$ Err$} is, as usual, the sum of squared errors.
If we know {$\sigma$}, this is all well and good, but in practice we don’t know it. There are two ways to deal with this. Both rely on estimating it by noting that if we know the true model {$w$}, then {$ \sigma^2 = E((y-\hat{y})^2) = \frac{1}{n} Err $}. We don’t know {$w$}, so we’ll need to use an estimate of it.
If we’re doing stepwise (or streamwise) regression, at each iteration {$t$}, we can use the estimate of {$\sigma^2 = \frac{1}{n} Err^{(t-1)}$} from the previous iteration. Coding the residuals then costs {$\frac{Err^{(t)}}{2 Err^{(t-1)}/n}$}, which is reasonable. The question at each iteration is whether the ratio of the errors (times n) goes down faster than the complexity goes up.
If we use the estimate from the current iteration, then we get {$\frac{Err^{(t)}}{2 Err^{(t)}/n} = \frac{n}{2}$}, which is not so helpful, since n is a constant in this analysis. But if we look back at the full log-likelihood, we are see that although the “main” part of the Gaussian cancels out, we still have the “normalization factor: {$n\log(\sigma\sqrt{2\pi}) $} which is equal to the sum of {$n\log(\sigma) $} and a second tem, which we don’t care about (it does not change as we change the model complexity or accuracy. But {$ n\log(\sigma) $} is well approximated by {$ n\log( \frac{1}{n} Err^{1/2} )$} which is a “constant term” (only depending on n) plus {$ n\log( Err^{1/2} ) = \frac{n}{2} log(Err)$}
First, we code for each weight whether or not it is present. If we expect {$ q $} of the {$p$} features to enter the model, the entropy of this is
{$ - \sum_i \frac{q}{ p} log (\frac{q}{p}) + \frac{p-q}{p} log(\frac{p-q}{p}) $}
We consider two special cases:
(1) if {$q = p/2$} then we expect half the features to come in, then the cost of coding each feature presence or absence is 1 bit. Or, we expect on average to spend two bits to code each feature that is present.
(2) if {$q=1$} then we expect one feature to come in, then {$log(\frac{p-q}{p}) $} is close to zero and the cost of coding each feature is {$ - log(1/p) = log(p)$} bits.
We then need to code the value of the weight for each weight in the model. Weights can be estimated to an accuracy proportional to {$\sqrt{n}$} (i.e., the variance of {$ \hat{w}$} is proportional to {$ \frac{1}{n}$}, so we will use {$ log \sqrt{n} = \frac{1}{2} log(n)$} bits to code each feature. (We’re being a little sloppy here, since the weights really have an expected error proportional to {$\sigma / log \sqrt{n} $} (times the covariance of {$ \bf{x}$} which we don’t care about because it is an additive constant once we take the log.)
First consider the case where n becomes large, more precisely where {$ \sqrt{n} \gg p$}. In this case the cost of coding which features are present or absent is negligible, and we ‘charge’ {$ \frac{1}{2} log(n) $} bits per feature. This limit is called {$ {\bf BIC} $}.
Second, consider the opposite case, where {$p \gg \sqrt{n}$}. Now the cost of coding the values of the weights is negligible compared to specifying which features are present. Now we ‘charge’ {$ log(p) $} bits per feature. This limit is called {$ {\bf RIC} $}.
If {$p$} is comparable in size to {$\sqrt{n}$}, then one should, of course, include both terms and use a penalty of {$ \frac{1}{2} log(n) + log(p/q)$} bits for each feature added. (So a total penalty of {$q$} times this.) This doesn’t have a name.
Finally, consider a slightly odd (or old-fashioned) limit. Assume that n and p are both really tiny, and that half the features are expected to enter the model. In this limit, we might ‘charge’ {$ 1 $} bit per feature. This limit is called {$ {\bf AIC} $}.
More precisely, BIC, RIC, and AIC are defined as follows:
{$\mathrm{BIC} = {2 (- \ln{\hat L} + \frac{q}{2} \ln(n)}) = n \cdot \ln(\widehat{\sigma_e^2}) + q \ln(n) \ = n\cdot \ln(Err/n)+q \ln(n) $}
which is the same as minimizing {$ n \ln(Err)+q \ln(n) $}
{$ \mathrm {AIC} = 2(-\ln{\hat{L}} +q) = n \ln(Err/n)+2 q $} which is the same as minimizing {$ n\cdot \ln(Err)+2 q $}
Feature selection can be viewed as hypothesis testing: we want to reject, with some certainty, the null hypothesis that each weight is zero. If we’re reasonably sure it isn’t zero, then we add it to the model. This is often done is a way that controls the False Discovery Rate or FDR.