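Missing Data

Data are often missing. (Think about examples.)

Missing at Random (MAR)

Data are rarely missing at random. When they are, there is usually a simple EM algorithm to impute the missing values. One can then do machine learning on the ‘complete’ data set. A more complete definition is here.

Missing Not at Random (MNAR)

Data are mostly not missing at random. They are missing for a good reason. For regression, a standard approach is

1. Impute each missing value with the mean of the observed values of that variable.
2. For each variable that has missing values, add a missing variable indicator: a new feature that is 1 where the value was missing and 0 otherwise.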
Oddly, most packages don’t automatically add the missing variable indicators, although plenty of them will do step 1 (“imputation”). If you aren’t doing feature selection, you can just use a zero instead of the mean in step 1.
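As a rough illustration of mean imputation combined with missing variable indicators, here is a minimal sketch in plain Python. The function name `impute_with_indicators`, the `use_zero` flag, and the use of NaN to mark a missing entry are all assumptions made for this example, not part of any particular package.

```python
import math


def impute_with_indicators(rows, use_zero=False):
    """Fill missing values (NaN) column by column and append one 0/1
    missing indicator for every column that had at least one gap."""
    n_cols = len(rows[0])
    # Columns containing at least one missing value.
    cols_with_gaps = [j for j in range(n_cols)
                      if any(math.isnan(r[j]) for r in rows)]

    # Fill value per column: mean of the observed entries, or zero.
    fills = {}
    for j in cols_with_gaps:
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        if use_zero or not observed:
            fills[j] = 0.0
        else:
            fills[j] = sum(observed) / len(observed)

    out = []
    for r in rows:
        # Step 1: imputation.
        filled = [fills[j] if math.isnan(v) else v for j, v in enumerate(r)]
        # Step 2: indicator is 1.0 where the original value was missing.
        indicators = [1.0 if math.isnan(r[j]) else 0.0 for j in cols_with_gaps]
        out.append(filled + indicators)
    return out


# Toy data: the first feature is missing in the second row.
X = [[1.0, 10.0],
     [math.nan, 20.0],
     [3.0, 30.0]]

X_aug = impute_with_indicators(X)
# X_aug has three columns: the two filled features plus one indicator.
```

With the indicator present, the column filled with the mean and the column filled with zero differ only by a multiple of the indicator, so an unpenalized linear regression can fit the same predictions either way. If feature selection might drop the indicator, the mean is the safer fill.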