CIS520 Machine Learning | Lectures / Missing Data

Data are often missing. (Think about examples)

Missing at Random (MAR)

Data are rarely missing at random. When they are, the missing values can usually be imputed with a simple EM algorithm. One can then do machine learning on the ‘completed’ data set.

A more complete definition is here
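
The EM imputation mentioned above can be sketched as follows. The notes do not say which model the EM algorithm fits, so this sketch assumes the rows are drawn from a single multivariate Gaussian; the function name `em_gaussian_impute` and its parameters are made up for illustration. The E-step fills each missing entry with its conditional mean given the observed entries in the same row, and the M-step re-estimates the mean and covariance from the filled-in data.

```python
import numpy as np

def em_gaussian_impute(X, n_iter=50):
    """Impute NaNs assuming rows are i.i.d. multivariate Gaussian (an assumption).

    Expects a 2-D array with at least two columns.
    Returns the completed data plus the fitted mean and covariance.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)

    # Initialise with column means of the observed values.
    mu = np.nanmean(X, axis=0)
    X_hat = np.where(miss, mu, X)
    sigma = np.cov(X_hat, rowvar=False, bias=True) + 1e-6 * np.eye(d)

    for _ in range(n_iter):
        # E-step: conditional means for missing entries, and the
        # accumulated conditional covariance needed by the M-step.
        C_sum = np.zeros((d, d))
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the marginal.
                X_hat[i] = mu
                C_sum += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            K = np.linalg.solve(S_oo, S_mo.T).T  # S_mo @ inv(S_oo)
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T

        # M-step: update mean and covariance from the completed data.
        mu = X_hat.mean(axis=0)
        diff = X_hat - mu
        sigma = (diff.T @ diff + C_sum) / n

    return X_hat, mu, sigma

# Usage: two correlated features, with some values of the second removed.
rng = np.random.default_rng(0)
full = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=500)
with_gaps = full.copy()
with_gaps[rng.random(500) < 0.3, 1] = np.nan
completed, mu_est, sigma_est = em_gaussian_impute(with_gaps)
```

Observed entries are left untouched; only the NaN entries are filled. A Gaussian is just one convenient choice here, and other models would lead to different E- and M-steps.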

Missing Not at Random (MNAR)

Data are mostly not missing at random. They are missing for a good reason, so the fact that a value is missing can itself carry information.

For regression, a standard approach is:

1. Replace any missing values with the average of the observed values of that feature (“imputation”).
2. Add a separate column for each feature which is an indicator function: 1 if missing, 0 if present.
3. Run standard regression and feature selection.
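
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: missing values are assumed to be encoded as NaN, the helper names `impute_with_indicators` and `fit_linear_regression` are hypothetical, and step 3 is reduced to plain least squares with no feature selection.

```python
import numpy as np

def impute_with_indicators(X):
    """Steps 1 and 2: mean-impute NaNs and append one indicator column per feature."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: replace each NaN with the mean of the observed values in its column.
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(missing, col_means, X)
    # Step 2: indicator columns, 1 if the value was missing, 0 if present.
    return np.hstack([X_filled, missing.astype(float)])

def fit_linear_regression(Z, y):
    """Step 3 (simplified): ordinary least squares with an intercept."""
    A = np.hstack([np.ones((Z.shape[0], 1)), Z])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w  # w[0] is the intercept

# Usage on a toy data set with two features and two missing values.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [3.0, np.nan],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
Z = impute_with_indicators(X)      # shape (4, 4): 2 features + 2 indicators
w = fit_linear_regression(Z, y)    # 1 intercept + 4 coefficients
```

The column means should be computed on the training data only and reused at prediction time, which this sketch leaves out for brevity.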

Oddly, most packages don’t automatically add the missing-value indicators, although plenty of them will do step 1 (“imputation”).

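If you aren’t doing feature selection, you can just use a zero instead of the mean in step 1: in an ordinary least-squares fit, the coefficient on the indicator column absorbs whatever constant was filled in, so the predictions are unchanged.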
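
The claim about zero versus mean imputation can be checked numerically. The sketch below uses synthetic data and plain least squares (both assumptions made for illustration, as are the helper names `design` and `fitted_values`). It builds the design matrix twice, once filling with the column mean and once with zero, and compares the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic regression problem with a nonzero feature mean.
X_true = rng.normal(loc=5.0, size=(n, 2))
y = 2.0 * X_true[:, 0] - X_true[:, 1] + rng.normal(scale=0.1, size=n)

# Knock out some entries in both features.
X = X_true.copy()
X[rng.random(n) < 0.3, 0] = np.nan
X[rng.random(n) < 0.2, 1] = np.nan

def design(X, use_mean):
    """Intercept + filled features + missing indicators."""
    missing = np.isnan(X)
    fill = np.nanmean(X, axis=0) if use_mean else np.zeros(X.shape[1])
    X_filled = np.where(missing, fill, X)
    return np.hstack([np.ones((X.shape[0], 1)), X_filled, missing.astype(float)])

def fitted_values(A, y):
    """Least-squares fit, returning the predictions on the training rows."""
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ w

pred_mean = fitted_values(design(X, use_mean=True), y)
pred_zero = fitted_values(design(X, use_mean=False), y)

# The two fits should agree up to numerical precision.
print(np.allclose(pred_mean, pred_zero))
```

The individual coefficients on the indicator columns differ between the two fits; only the predictions coincide. Once feature selection or regularization is involved, that equivalence no longer holds in general, which is why the mean is the safer default there.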
