Missing Data

Data are often missing. (Think about examples.)

Missing at Random (MAR)

Data are rarely missing at random. When they are, there is usually a simple EM algorithm to impute the missing values. One can then do machine learning on the ‘complete’ data set. The simplest imputation is to replace every missing value with the average value of that feature. An alternative is to do matrix completion (like we did for the Netflix problem). EM can also be used.

Missing Not at Random (MNAR)

Data are mostly not missing at random. They are missing for a good reason. For regression, a standard approach is
Oddly, most packages don’t automatically add the missing variable indicators, although plenty of them will do step 1 (“imputation”). If you aren’t doing feature selection, you can just use a zero instead of the mean in step 1. For more details, see here
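
The recipe discussed above can be sketched in a few lines of plain Python. The exact steps are not spelled out in these notes, so this sketch assumes that step 1 fills each missing value with the feature mean (or zero) and that the second step appends a 0/1 "was missing" indicator column for every feature that has at least one missing value. The function name `impute_with_indicators`, the `fill` parameter, and the use of `None` to mark missing entries are hypothetical choices made for illustration.

```python
def impute_with_indicators(rows, fill="mean"):
    """Impute missing values and append missing-value indicator columns.

    rows: list of equal-length lists of numbers; None marks a missing value.
    fill: "mean" fills with the mean of the observed values of that feature;
          any other value fills with zero.
    """
    n_features = len(rows[0])

    # Assumed step 1: choose one fill value per feature.
    fill_values = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        if fill == "mean" and observed:
            fill_values.append(sum(observed) / len(observed))
        else:
            fill_values.append(0.0)

    # Assumed step 2: flag only the features that have something missing.
    flagged = [j for j in range(n_features)
               if any(row[j] is None for row in rows)]

    out = []
    for row in rows:
        filled = [fill_values[j] if row[j] is None else float(row[j])
                  for j in range(n_features)]
        indicators = [1.0 if row[j] is None else 0.0 for j in flagged]
        out.append(filled + indicators)
    return out


rows = [[1.0, None],
        [3.0, 4.0],
        [None, 6.0]]
augmented = impute_with_indicators(rows)
```

Here `augmented` has four columns: the two imputed features followed by one indicator per feature. With the indicators in the regression, the coefficient on each indicator can absorb the effect of a value being missing, which is one way to see why zero works as well as the mean when no feature selection is done.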