Feature Blocks

Often features come in groups, or 'blocks'. There might be a small set of demographic features (age and sex), a large, sparse set of word features (which words were used), and a large, dense set of pixels from an image. Using the same regularization penalty for all of them is a terrible idea. For example, an L1 penalty big enough to zero out irrelevant words will also shrink the demographic features, which should not be shrunk.

There are several ways to address this. The most elegant is perhaps to solve a single optimization that fits a different penalty to each feature block. For example, if the observations fall into three feature blocks {$x_1, x_2, x_3$} (e.g. corresponding to demographics, words, and images) with corresponding weight vectors {$w_1, w_2, w_3$}, then an elastic net formulation of the loss function is:

{$ w = \arg\min_w \sum_i (y_i - \hat{y}_i)^2 + \lambda_{11} ||w_1||_1 + \lambda_{12} ||w_1||_2^2 + \lambda_{21} ||w_2||_1 + \lambda_{22} ||w_2||_2^2 + \lambda_{31} ||w_3||_1 + \lambda_{32} ||w_3||_2^2 $}

Note that one would need to pick all six penalties by cross validation. (This risks a little overfitting, so to be safe, one should test accuracy on a held-out data set.) The most widely used version of this idea is called the group lasso. (A rough code sketch of the per-block formulation is given at the end of this section.)

A simpler alternative is to run three separate penalized regressions, predicting {$y$} from each of the blocks of features: predict {$y$} from {$x_1$} to get an estimate {$\hat{y}_1$}, from {$x_2$} to get {$\hat{y}_2$}, and from {$x_3$} to get {$\hat{y}_3$}. One then has to combine the three estimates of {$y$}. One could average them, but it is much better to run a regression predicting {$y$} from a weighted combination of the three estimates {$\hat{y}_1, \hat{y}_2, \hat{y}_3$}. (This approach is also sketched below.)

A third option, which is also easy to implement, is a 'block stagewise' procedure. First regress {$y$} on the first feature block {$x_1$}, getting predictions {$\hat{y}_1$}. Compute the residuals {$r_1 = y - \hat{y}_1$} and regress them on the second block of features {$x_2$}, getting a new set of predictions {$\hat{y}_2$}. (At this point, to make a real prediction, you would compute {$\hat{y} = \hat{y}_1 + \hat{y}_2$}.) Repeat for each remaining feature block, e.g. computing the further residual {$r_2 = y - \hat{y}_1 - \hat{y}_2$} and predicting it using the third feature block {$x_3$}. (See the last sketch below.)
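The following is a minimal sketch of the per-block elastic net objective above, written with cvxpy. It is not part of the lecture; the block boundaries, penalty values, and function name are illustrative assumptions, and in practice all six penalties would be chosen by cross validation.

```python
# Sketch: per-block elastic net via cvxpy (illustrative, not the lecture's code).
import numpy as np
import cvxpy as cp

def block_elastic_net(X, y, blocks, l1_pens, l2_pens):
    """blocks: list of column slices, one per feature block.
    l1_pens / l2_pens: one L1 penalty and one squared-L2 penalty per block."""
    n, p = X.shape
    w = cp.Variable(p)
    loss = cp.sum_squares(y - X @ w)
    for sl, l1, l2 in zip(blocks, l1_pens, l2_pens):
        loss += l1 * cp.norm1(w[sl]) + l2 * cp.sum_squares(w[sl])
    cp.Problem(cp.Minimize(loss)).solve()
    return w.value

# Example (assumed layout): demographics in columns 0-1, words in 2-501,
# pixels in 502-1501.
# blocks = [slice(0, 2), slice(2, 502), slice(502, 1502)]
```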
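Next, a sketch of the 'separate regressions, then combine' approach using scikit-learn. The per-block models and slices are assumptions for illustration; using out-of-fold predictions (via cross_val_predict) for the combining regression is one way to limit the overfitting mentioned above.

```python
# Sketch: one penalized regression per block, then a regression that combines them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import cross_val_predict

def blockwise_then_combine(X, y, blocks, models):
    # Out-of-fold predictions from each block become features for the combiner.
    yhats = np.column_stack([
        cross_val_predict(m, X[:, sl], y, cv=5) for sl, m in zip(blocks, models)
    ])
    combiner = LinearRegression().fit(yhats, y)   # learns the weighted combination
    fitted = [m.fit(X[:, sl], y) for sl, m in zip(blocks, models)]
    return fitted, combiner

def combine_predict(X, blocks, fitted, combiner):
    yhats = np.column_stack([m.predict(X[:, sl]) for sl, m in zip(blocks, fitted)])
    return combiner.predict(yhats)

# e.g. models = [Ridge(alpha=0.1), Lasso(alpha=0.05), Ridge(alpha=1.0)]
```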
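Finally, a sketch of the 'block stagewise' procedure: each block is regressed on the residuals left by the blocks before it, and the final prediction is the sum of the per-block predictions. Again, the block slices and penalty choices are assumptions.

```python
# Sketch: block stagewise fitting on successive residuals.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

def block_stagewise_fit(X, y, blocks, models):
    residual = y.copy()
    fitted = []
    for sl, m in zip(blocks, models):
        m.fit(X[:, sl], residual)                  # regress current residuals on this block
        residual = residual - m.predict(X[:, sl])  # e.g. r_1 = y - yhat_1, r_2 = r_1 - yhat_2, ...
        fitted.append(m)
    return fitted

def block_stagewise_predict(X, blocks, fitted):
    # yhat = yhat_1 + yhat_2 + yhat_3
    return sum(m.predict(X[:, sl]) for sl, m in zip(blocks, fitted))
```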