L0 Regression

The simplest, greediest search procedure is to consider each feature once for inclusion in the model: add it if it reduces the penalized error, and never reconsider it if it does not. This is called Streamwise Regression.

The most common search procedure is to pick the best of all candidate features for inclusion in the model and add it if it reduces the penalized error; repeat until no candidate reduces the error or all features have been added. This is called Stepwise Regression.

In a greedier alternative, Stagewise Regression, instead of running a regression with all {$q$} of the features selected so far plus the new candidate feature, one just regresses the new candidate feature on the residual {$r = y - \hat{y}_q$}, the difference between the true value of {$y$} and the estimate based on the features already selected.

Non-greedy search is also possible for small sets of features: generate all possible subsets of features, compute the penalized error for each subset, and pick the subset with the lowest penalized error. This is called all subsets regression and is, of course, very expensive for a large number of predictors {$p$}. Sketches of these procedures are given after the complexity discussion below.

Complexity

There are two parts to the cost: how many features are tested for inclusion in the model, and how much each potential inclusion costs.

Streamwise regression considers each of the {$p$} features once. Stepwise regression considers each remaining feature at each step, stopping whenever the penalized error stops decreasing; if {$q$} features are added (with all {$p$} features tried at each step), then on the order of {$pq$} potential regressions are tried. In both of these cases one needs to invert a matrix that could be {$p \times p$}, but is in fact only {$q \times q$}, since only {$q$} features are added.

More precisely, each potential addition of a feature requires running a regression with that feature and all features already in the model, solving {$(X'X)^{-1} X'y$}, where {$X$} is an {$n \times q$} matrix ({$q-1$} features already selected, plus the candidate). This requires on the order of {$n q^2$} multiplications to form {$X'X$} and {$q^3$} multiplications to invert it.

In stagewise regression, each potential addition of a feature regresses that feature against the residual, solving {$(x'x)^{-1} x'y$} where {$x$} is a vector of length {$n$}; this takes on the order of {$n$} multiplications.

Note that it is often more expensive to compute {$X'X$} than to invert it: for dense {$X$} with all {$p$} features, forming {$X'X$} takes {$n p^2$} multiplications, versus {$p^3$} for the inversion, and typically {$n > p$}.
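As a concrete illustration, here is a minimal Python sketch of streamwise and stepwise selection. The specific penalty form (residual sum of squares plus a constant {$\lambda$} per feature, an L0-style penalty) and all function names are assumptions made for illustration, not something fixed by the discussion above.

```python
# Minimal sketch of streamwise and stepwise selection, assuming an
# L0-style penalty: penalized error = RSS + lam * (number of features).
# The helper names and the penalty constant `lam` are illustrative.
import numpy as np

def penalized_rss(X, y, cols, lam):
    """Fit OLS on the columns in `cols`; return RSS + lam * |cols|."""
    if not cols:
        rss = float(y @ y)  # empty model predicts 0 (assumes y is centered)
    else:
        Xs = X[:, cols]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ beta
        rss = float(r @ r)
    return rss + lam * len(cols)

def streamwise(X, y, lam):
    """Consider each feature exactly once, in order; keep it if it helps."""
    cols, best = [], penalized_rss(X, y, [], lam)
    for j in range(X.shape[1]):
        err = penalized_rss(X, y, cols + [j], lam)
        if err < best:          # feature reduces penalized error: keep it
            cols.append(j)
            best = err
    return cols                 # features that did not help are never retried

def stepwise(X, y, lam):
    """At each step add the single best remaining feature; stop when
    no addition reduces the penalized error."""
    cols, best = [], penalized_rss(X, y, [], lam)
    remaining = set(range(X.shape[1]))
    while remaining:
        errs = {j: penalized_rss(X, y, cols + [j], lam) for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best:
            break               # no candidate reduces penalized error: stop
        cols.append(j_best)
        remaining.remove(j_best)
        best = errs[j_best]
    return cols
```

For example, stepwise(X, y, lam=2.0) returns the indices of the selected columns; note that stepwise refits the full model for every candidate at every step, which is exactly the {$pq$} regressions counted above.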
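The stagewise variant only ever needs the univariate regression of each candidate on the current residual, which is where the order-{$n$} cost per candidate comes from. Again a sketch under the same assumed penalty; the residual update follows the description above.

```python
# Minimal sketch of stagewise selection: each candidate is fit against
# the current residual only, and previously fit coefficients are never
# revisited. Same assumed L0-style penalty as the sketch above.
import numpy as np

def stagewise(X, y, lam):
    n, p = X.shape
    r = y.astype(float).copy()   # residual starts at y (initial estimate is 0)
    cols, coefs = [], []
    best = float(r @ r)          # penalized error with no features
    remaining = set(range(p))
    while remaining:
        scores = {}
        for j in remaining:
            x = X[:, j]
            b = float(x @ r) / float(x @ x)   # univariate fit on residual: O(n)
            rss = float(((r - b * x) ** 2).sum())
            scores[j] = (rss + lam * (len(cols) + 1), b)
        j_best = min(scores, key=lambda j: scores[j][0])
        err, b = scores[j_best]
        if err >= best:
            break                # no candidate reduces penalized error
        cols.append(j_best)
        coefs.append(b)
        remaining.remove(j_best)
        r -= b * X[:, j_best]    # update residual; old coefficients stay fixed
        best = err
    return cols, coefs
```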
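All subsets regression can be written in a few lines, reusing the penalized_rss helper from the first sketch (a hypothetical helper, in scope from above); the {$2^p$} loop over subsets is what makes it infeasible for large {$p$}.

```python
# Minimal sketch of all subsets regression. With p features there are
# 2**p subsets, so this is only practical for small p. Assumes the
# penalized_rss helper from the first sketch is in scope.
from itertools import combinations

def all_subsets(X, y, lam):
    p = X.shape[1]
    best_cols, best_err = [], penalized_rss(X, y, [], lam)
    for q in range(1, p + 1):
        for cols in combinations(range(p), q):   # every subset of size q
            err = penalized_rss(X, y, list(cols), lam)
            if err < best_err:
                best_cols, best_err = list(cols), err
    return best_cols
```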