Scale Invariance

A machine learning method is ‘scale invariant’ if rescaling any (or all) of the features (i.e. multiplying each column by a different nonzero number) does not change its predictions. This is usually desirable: changing a feature from centimeters to meters (dividing that column by 100) should not change the model's predictions.

OLS is scale invariant. If you have a model {$\hat{y} = w_0 + w_1 x_1 + w_2 x_2$} and you replace {$x_1$} with {$x_1' = x_1/2$} and re-estimate the model, you get a new model {$\hat{y} = w_0 + 2 w_1 x_1' + w_2 x_2$} which gives exactly the same predictions. The new {$x_1'$} is half as big, so its coefficient is now twice as big. (See the sketch at the end of these notes.)

Ridge and {$L_1$}-penalized regression (and hence elastic net) are not scale invariant. The penalty depends on the size of the weights, and ridge shrinks the big weights more than the small ones, so if you rescale the features you change which weights are big and hence how much each one is shrunk. {$L_0$}-penalized regression is scale invariant: a feature is either in or out of the model, so the size of its coefficient does not matter.

PCA is not scale invariant. People therefore often rescale the data (standardize it) before running PCA.

When using a non-scale-invariant method, if the features have different units (e.g. dollars, miles, kilograms, and counts of products), people often standardize the data: subtract off the mean and divide by the standard deviation of each column of {$X$}. This makes all the features roughly the same size. If the features are all already on the same scale (e.g. pixels in an image), then rescaling is generally a bad idea: features that are always close to zero (and hence unimportant) become, after rescaling, as big as features that carry signal. (A standardize-then-PCA sketch is also given below.)
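As a quick empirical check of the claims above, here is a minimal sketch (not from the original notes; it assumes scikit-learn and synthetic data with made-up names). It rescales one feature column as if converting centimeters to meters, then compares predictions: OLS predictions are unchanged, ridge predictions are not.

    # Sketch: rescaling a column leaves OLS predictions unchanged but changes ridge.
    # Synthetic data and variable names are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # Rescale the first feature, e.g. centimeters -> meters.
    X_rescaled = X.copy()
    X_rescaled[:, 0] /= 100.0

    ols_a = LinearRegression().fit(X, y).predict(X)
    ols_b = LinearRegression().fit(X_rescaled, y).predict(X_rescaled)
    print("OLS predictions match:  ", np.allclose(ols_a, ols_b))    # True

    ridge_a = Ridge(alpha=1.0).fit(X, y).predict(X)
    ridge_b = Ridge(alpha=1.0).fit(X_rescaled, y).predict(X_rescaled)
    print("Ridge predictions match:", np.allclose(ridge_a, ridge_b))  # False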
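And a sketch of standardizing before PCA, again assuming scikit-learn and two invented columns with very different units (dollars and miles). Each column has its mean subtracted and is divided by its standard deviation, so no feature dominates the principal components just because its units make it numerically large.

    # Sketch: PCA on raw vs. standardized features. Column meanings are assumptions.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = np.column_stack([
        rng.normal(50_000, 15_000, size=200),   # e.g. dollars
        rng.normal(10, 3, size=200),            # e.g. miles
    ])

    # Without standardization, the first component is dominated by the dollar column.
    pca_raw = PCA(n_components=2).fit(X)

    # Standardize: subtract the column means, divide by the column standard deviations.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    pca_std = PCA(n_components=2).fit(X_std)

    print("explained variance ratio (raw):         ", pca_raw.explained_variance_ratio_)
    print("explained variance ratio (standardized):", pca_std.explained_variance_ratio_)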