CIS520 Machine Learning | Lectures / Missing Data

Data are often missing. (Think about examples)

Missing at Random (MAR)

Data are rarely missing at random. When they are, there is usually a simple EM algorithm to impute the missing values. One can then do machine learning on the ‘complete’ data set.

The simplest imputation is to replace all missing values by the average value of that feature.
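As a minimal sketch of mean imputation, assuming missing entries are encoded as NaN in a NumPy array (the helper name `impute_column_means` is made up for illustration):

```python
import numpy as np

def impute_column_means(X):
    """Return a copy of X with each NaN replaced by the mean of the
    observed (non-NaN) values in the same column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)      # means over observed entries only
    missing = np.isnan(X)
    # col_means broadcasts across rows, so each gap gets its own column's mean
    return np.where(missing, col_means, X)

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])
X_filled = impute_column_means(X)
```

A column with no observed values at all has no mean to fall back on and would stay NaN, so such columns need separate handling.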

An alternative is to do matrix completion (like we did for the Netflix problem).
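One simple way to sketch matrix completion is to alternate between a truncated SVD and refilling the missing entries. This is an illustrative choice and not necessarily the method used for the Netflix problem in lecture. The function name, the default rank, and the iteration count below are all assumptions.

```python
import numpy as np

def complete_low_rank(X, rank=1, n_iters=200):
    """Fill NaN entries of X by repeated rank-`rank` SVD approximation.
    Observed entries are always kept at their original values."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column-mean imputation
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Only the missing entries are overwritten by the low-rank guess
        filled = np.where(missing, low_rank, X)
    return filled

# Toy example: an exactly rank-1 matrix with one entry hidden
truth = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
X = truth.copy()
X[0, 0] = np.nan
completed = complete_low_rank(X, rank=1)
```

Because the toy matrix is exactly rank 1, the hidden entry should be pulled back toward its true value of 1. On real data the rank has to be chosen (for example by cross-validation) and the fit is only approximate.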

EM can also be used.

Missing Not at Random (MNAR)

Data are mostly not missing at random. They are missing for a good reason.

For regression, a standard approach is:

1. Replace any missing values with the average of the values that are there (“imputation”).
2. Add a separate column for each feature which is an indicator function: 1 if missing, 0 if present.
3. Run standard regression and feature selection.
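The steps above can be sketched as follows, assuming NaN marks a missing value. The helper name `impute_with_indicators` and the toy data are made up for illustration, and plain least squares stands in for "standard regression"; feature selection is left out of the sketch.

```python
import numpy as np

def impute_with_indicators(X):
    """Mean-impute each column and append one 0/1 'was missing'
    indicator column per original feature."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_imputed = np.where(missing, np.nanmean(X, axis=0), X)
    return np.hstack([X_imputed, missing.astype(float)])

# Hypothetical toy data: two features, some entries missing
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [np.nan, 1.0],
              [4.0, 3.0],
              [5.0, 5.0],
              [3.0, np.nan]])
y = np.array([3.0, 6.0, 2.0, 7.0, 10.0, 8.0])

design = impute_with_indicators(X)                     # shape (6, 4)
A = np.column_stack([np.ones(len(design)), design])    # add an intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)           # ordinary least squares
```

The coefficients on the indicator columns let the model learn that "missing" itself carries information, which is the point when data are not missing at random.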

Oddly, most packages don’t automatically add the missing-value indicators, although plenty of them will do step 1 (“imputation”).

If you aren’t doing feature selection, you can just use a zero instead of the mean in step 1.
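One way to see why: once the indicator column is in the model, filling with a constant c gives the zero-filled column plus c times the indicator, so the design matrix spans the same space for any c. The sketch below checks this numerically for unregularized least squares on synthetic data. The data, seed, and function name are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, size=50)
y = 2.0 * x + rng.normal(size=50)
missing = rng.random(50) < 0.3          # pretend these x values are unobserved

def fitted_values(fill_value):
    """Least-squares fit on [intercept, filled x, missing indicator]."""
    x_filled = np.where(missing, fill_value, x)
    A = np.column_stack([np.ones_like(x), x_filled, missing.astype(float)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

mean_fit = fitted_values(x[~missing].mean())   # mean of the observed values
zero_fit = fitted_values(0.0)
same = np.allclose(mean_fit, zero_fit)         # predictions should agree
```

With feature selection or regularization the indicator column may be dropped or shrunk, and then the fill value matters again, which is presumably why the mean is the safer default in that case.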

For more details, see here
