MDL

To determine the complexity (the “description length”, measured in bits) of a model and its associated data set, we code two things: the residual {$ r = y-\hat{y} $} and the model. For linear regression, the model is further broken down into coding which features are present or absent, and what the values of the associated weights are. This gives us a loss function with two terms. The first, coding the residual, is proportional to the training error. The second measures the model complexity. Minimizing their sum, i.e. finding the “minimum description length”, minimizes (in expectation) the test error.
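
As a rough sketch of this two-term objective (assuming, for illustration, a known noise level {$\sigma$} and a flat per-weight charge; the function and parameter names below are ours, not part of the lecture):

```python
import numpy as np

def description_length_bits(y, y_hat, w, sigma, bits_per_weight=10.0):
    """Two-part code length (in bits) for a linear model and its residuals.

    Illustrative only: assumes the noise level `sigma` is known and charges a
    flat `bits_per_weight` for every nonzero weight.
    """
    n = len(y)
    err = np.sum((y - y_hat) ** 2)  # sum of squared errors
    # Bits to code the residuals: the negative Gaussian log-likelihood of y
    # given the predictions, converted from nats to bits.
    residual_bits = (n * np.log(sigma * np.sqrt(2 * np.pi))
                     + err / (2 * sigma ** 2)) / np.log(2)
    # Bits to code the model itself (here, just its nonzero weights).
    model_bits = bits_per_weight * np.count_nonzero(w)
    return residual_bits + model_bits
```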

Coding the residual

There are two largely equivalent ways to think about coding the residual. Conceptually, the idea is that since the data are generated by a noisy process {$ y \sim N(w^\top x, \sigma^2) $}, there is no point in coding each residual {$ r $} any more accurately than to a precision of roughly {$ \sigma $}. We can also look at the entropy of the residual in terms of the log-likelihood of {$ y $} given {$x$} and {$w$}:

{$ - \log P(D \mid \textbf{w},\sigma) = - n\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) + \frac{1}{2\sigma^2}\sum_i (y_i - \textbf{x}_i^\top\textbf{w})^2 = n\log(\sigma\sqrt{2\pi}) + \frac{1}{2\sigma^2} Err $}

where {$ Err$} is, as usual, the sum of squared errors.
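
A quick numeric check of this identity on synthetic data (assuming numpy and scipy are available; the sizes and coefficients below are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, sigma = 200, 0.5
X = rng.normal(size=(n, 3))
w = np.array([1.0, -2.0, 0.5])
y = X @ w + rng.normal(scale=sigma, size=n)   # y ~ N(x.w, sigma^2)

err = np.sum((y - X @ w) ** 2)                # sum of squared errors

# Negative log-likelihood computed directly from the Gaussian density ...
nll_direct = -norm.logpdf(y, loc=X @ w, scale=sigma).sum()
# ... and via the rearranged form n*log(sigma*sqrt(2*pi)) + Err/(2*sigma^2).
nll_rearranged = n * np.log(sigma * np.sqrt(2 * np.pi)) + err / (2 * sigma ** 2)

print(np.isclose(nll_direct, nll_rearranged))  # True
```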

If we know {$\sigma$}, this is all well and good, but in practice we don’t know it. There are two ways to deal with this. Both rely on estimating it by noting that if we knew the true model {$w$}, then {$ \sigma^2 = E((y-\hat{y})^2) $}, which we estimate by {$ \frac{1}{n} Err $}. We don’t know {$w$}, so we’ll need to use an estimate of it.

If we’re doing stepwise (or streamwise) regression, then at each iteration {$t$} we can use the estimate {$\sigma^2 = \frac{1}{n} Err^{(t-1)}$} from the previous iteration. Coding the residuals then costs {$\frac{Err^{(t)}}{2 Err^{(t-1)}/n}$}, which is reasonable. The question at each iteration is whether this residual-coding cost (which is {$\frac{n}{2}$} times the ratio of the errors) goes down faster than the model complexity goes up.
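
A sketch of this streamwise decision rule, under our own illustrative choices (plain least-squares refits and a default BIC-like charge of {$\frac{1}{2}\ln(n)$} nats per feature; the function name is hypothetical):

```python
import numpy as np

def streamwise_mdl_select(X, y, penalty_nats=None):
    """Streamwise feature selection using the decision rule sketched above.

    A feature is kept if the drop in residual-coding cost, with sigma^2
    estimated as Err^(t-1)/n from the previous model, exceeds a per-feature
    complexity charge (defaulting to a BIC-like 0.5*ln(n) nats).
    """
    n, p = X.shape
    if penalty_nats is None:
        penalty_nats = 0.5 * np.log(n)
    selected = []
    err_prev = float(np.sum(y ** 2))          # empty model: predict zero
    for j in range(p):                        # consider features in stream order
        cols = selected + [j]
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        err_new = float(np.sum((y - X[:, cols] @ beta) ** 2))
        # Residual-coding saving at sigma^2 = err_prev / n, in nats.
        saving = (err_prev - err_new) / (2 * err_prev / n)
        if saving > penalty_nats:             # feature pays for its own description
            selected.append(j)
            err_prev = err_new
    return selected
```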

If instead we use the estimate from the current iteration, then we get {$\frac{Err^{(t)}}{2 Err^{(t)}/n} = \frac{n}{2}$}, which is not so helpful, since n is a constant in this analysis. But if we look back at the full log-likelihood, we see that although the “main” part of the Gaussian cancels out, we still have the normalization factor {$n\log(\sigma\sqrt{2\pi}) $}, which is the sum of {$n\log(\sigma) $} and a second term that we don’t care about (it does not change as we change the model complexity or accuracy). But with {$\sigma^2 = \frac{1}{n} Err$}, the term {$ n\log(\sigma) = n\log\left( (Err/n)^{1/2} \right)$} is a “constant term” (depending only on n) plus {$ n\log( Err^{1/2} ) = \frac{n}{2} \log(Err)$}.
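
A two-line numeric check of this rearrangement, with arbitrary values of n and Err:

```python
import numpy as np

n, err = 500, 37.2                  # arbitrary values
sigma_hat = np.sqrt(err / n)        # sigma estimated from the current fit

lhs = n * np.log(sigma_hat)
rhs = (n / 2) * np.log(err) - (n / 2) * np.log(n)   # (n/2)*log(Err) plus an n-only term
print(np.isclose(lhs, rhs))         # True
```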

Coding the weights

First we code for each weight whether or not it is present. If we expect {$ q $} of the {$p$} features to enter the model, the entropy of this is

{$ - \sum_{i=1}^{p} \left[ \frac{q}{p} \log\left(\frac{q}{p}\right) + \frac{p-q}{p} \log\left(\frac{p-q}{p}\right) \right] $}

We consider two special cases:

(1) If {$q = p/2$}, so that we expect half the features to come in, then the cost of coding each feature’s presence or absence is 1 bit.

(2) If {$q=1$}, so that we expect only one feature to come in, then {$\log(\frac{p-q}{p}) $} is close to zero and the cost of coding each feature that enters is {$ - \log(1/p) = \log(p)$} bits.
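
These two special cases can be verified numerically (a small sketch; the function name is ours, and we use base-2 logs so the answer comes out in bits):

```python
import numpy as np

def presence_bits(p, q):
    """Total bits to code which of the p features are present, expecting q of
    them, using the per-feature entropy from the formula above."""
    a = q / p
    h = -(a * np.log2(a) + (1 - a) * np.log2(1 - a))   # entropy of one indicator
    return p * h

p = 1024
print(presence_bits(p, p // 2) / p)   # 1.0  -> 1 bit per feature when q = p/2
print(presence_bits(p, 1))            # ~11.4 -> roughly log2(p) = 10 bits when q = 1
```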

We then need to code the value of each weight in the model. Weights can be estimated to an accuracy of roughly {$1/\sqrt{n}$} (i.e., the variance of {$ \hat{w}$} is proportional to {$ \frac{1}{n}$}), so we will use {$ \log \sqrt{n} = \frac{1}{2} \log(n)$} bits to code each weight. (We’re being a little sloppy here, since the weights really have an expected error proportional to {$\sigma / \sqrt{n} $} times a factor depending on the covariance of {$ \bf{x}$}, which we don’t care about because it becomes an additive constant once we take the log.)

We can consider a number of limiting cases of this, based on the relative sizes of n and p.

First consider the case where n becomes large, more precisely where {$ \sqrt{n} \gg p$}. In this case the cost of coding which features are present or absent is negligible, and we ‘charge’ {$ \frac{1}{2} \log(n) $} bits per feature. This limit is called {$ {\bf BIC} $}.

Second consider the opposite case, where {$p \gg \sqrt{n}$}. Now the cost of coding the values of the weights is negligible compared to saying which features are present, and we ‘charge’ {$ \log(p) $} bits per feature. This limit is called {$ {\bf RIC} $}.

If {$p$} is comparable in size to {$\sqrt{n}$}, then one should of course include both terms and use a penalty of {$ \frac{1}{2} \log(n) + \log(p)$} bits per feature added. (So the total penalty is {$q$} times this.) This doesn’t have a name.

Finally, consider a slightly odd (or old-fashioned) limit. Assume that n and p are both quite small, and that half the features are expected to enter the model. In this limit we might ‘charge’ {$ 1 $} bit per feature. This limit is called {$ {\bf AIC} $}.
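
For concreteness, here is how the per-feature charges under these limits compare for one (illustrative) choice of n and p:

```python
import numpy as np

def per_feature_penalty_bits(n, p):
    """Per-feature coding charge (in bits) under each of the limits above."""
    return {
        "AIC  (n and p both small)": 1.0,
        "BIC  (sqrt(n) >> p)":       0.5 * np.log2(n),
        "RIC  (p >> sqrt(n))":       np.log2(p),
        "both (p ~ sqrt(n))":        0.5 * np.log2(n) + np.log2(p),
    }

print(per_feature_penalty_bits(n=10_000, p=1_000_000))
# For these n and p: BIC charges ~6.6 bits and RIC ~19.9 bits per feature.
```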

More precisely, for linear regression BIC and AIC take the following forms:

{$ \mathrm{BIC} = 2\left(- \ln\hat{L} + \frac{q}{2} \ln(n)\right) = n \cdot \ln(\widehat{\sigma_e^2}) + q \ln(n) = n\cdot \ln(Err/n)+q \ln(n) $}

which is the same as minimizing {$ n \ln(Err)+q \ln(n) $}

{$ \mathrm{AIC} = 2(-\ln\hat{L} +q) = n \ln(Err/n)+2 q $}, which is the same as minimizing {$ n\cdot \ln(Err)+2 q $}
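
A sketch of using these criteria for model selection on synthetic data (the helper name, data sizes, and exhaustive subset search are illustrative choices, not part of the lecture):

```python
import numpy as np
from itertools import combinations

def bic_aic(err, n, q):
    """BIC and AIC (constants dropped) for a Gaussian linear model with
    residual sum of squares `err` and `q` fitted weights."""
    bic = n * np.log(err / n) + q * np.log(n)
    aic = n * np.log(err / n) + 2 * q
    return bic, aic

# Illustrative use on synthetic data: score every feature subset with BIC.
rng = np.random.default_rng(0)
n, p = 300, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

best = None
for q in range(1, p + 1):
    for subset in combinations(range(p), q):
        cols = list(subset)
        beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        err = np.sum((y - X[:, cols] @ beta) ** 2)
        bic, _ = bic_aic(err, n, q)
        if best is None or bic < best[0]:
            best = (bic, subset)
print(best[1])   # typically (0, 2): BIC recovers the two informative features
```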

Back to Lectures