Missing Data

Data are often missing. (Think about examples)

Missing at Random (MAR)

Data are rarely missing at random. When they are, there is usually a simple EM algorithm to impute the missing values. One can then do machine learning on the 'complete' data set.

The simplest imputation is to replace each missing value with the average of the observed values of that feature.
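As a minimal sketch of mean imputation, assuming a feature is stored as a plain Python list in which `None` or NaN marks a missing entry (the helper name `mean_impute` and the sample column are made up for illustration):

```python
import math

def mean_impute(column):
    """Replace missing entries (None or NaN) with the mean of the observed ones."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in column]

# Hypothetical feature column with two missing entries.
ages = [20.0, None, 40.0, float("nan"), 30.0]
imputed_ages = mean_impute(ages)
```

The mean is computed from the observed entries only, so the imputed column has the same average as the observed data.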

An alternative is to do matrix completion (see the Netflix problem).

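EM, as noted above, is a third option: roughly, alternate between filling in the missing values from the current model and refitting the model to the filled-in data.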

Missing Not at Random (MNAR)

More often, data are not missing at random. They are missing for a good reason, so the fact that a value is missing is itself informative.

For supervised learning, a standard approach is

  1. Replace any missing real values with the average of the observed values of that feature ("imputation").
  2. For categorical variables, just let "missing" be its own category.
  3. For each feature, add a separate indicator column: 1 if the value was missing, 0 if it was present.
  4. Run standard supervised learning.
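Steps 1 to 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `None` marks a missing value, and the helper names and the tiny `income` and `color` columns are invented for the example.

```python
def prepare_numeric(column):
    """Steps 1 and 3: mean-impute a numeric column and build its missing indicator."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    values = [mean if v is None else v for v in column]
    indicator = [1 if v is None else 0 for v in column]
    return values, indicator

def prepare_categorical(column):
    """Step 2: treat a missing category as a category of its own."""
    return ["missing" if v is None else v for v in column]

# Hypothetical raw columns.
income = [50.0, None, 70.0, 60.0]
color = ["red", None, "blue", "red"]

income_filled, income_missing = prepare_numeric(income)
color_filled = prepare_categorical(color)

# One row per example, ready for step 4 once the categorical column
# has been encoded (for instance one-hot) for the chosen learner.
rows = list(zip(income_filled, income_missing, color_filled))
```

The indicator is built from the original column, before imputation, so the learner can still tell an imputed mean from a genuinely observed value.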

Oddly, most packages don't automatically add the missing-value indicators, although plenty of them will do step 1 ("imputation").

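If you aren't doing feature selection, you can just use a zero instead of the mean in step 1. In an unregularized linear model, for example, the coefficient on the missing indicator absorbs whatever constant was imputed, so the fitted predictions are the same either way.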
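A small check of this claim, assuming an ordinary least-squares fit on an intercept, the imputed feature, and its missing indicator. The data and helper names are made up, and the normal equations are solved by hand only to keep the sketch free of dependencies.

```python
def solve(a, b):
    """Solve the linear system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - s) / aug[i][i]
    return w

def fit_predict(x, y, fill):
    """Least-squares fit of y on [1, imputed x, missing indicator]; return fitted values."""
    design = [[1.0, fill if v is None else v, 1.0 if v is None else 0.0] for v in x]
    k = 3
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(design, y)) for i in range(k)]
    w = solve(xtx, xty)
    return [sum(wi * ri for wi, ri in zip(w, row)) for row in design]

# Hypothetical data: None marks a missing feature value.
x = [1.0, 2.0, None, 4.0, None, 6.0]
y = [1.1, 2.3, 5.0, 3.9, 7.0, 6.2]

observed = [v for v in x if v is not None]
mean = sum(observed) / len(observed)

pred_zero = fit_predict(x, y, fill=0.0)
pred_mean = fit_predict(x, y, fill=mean)

# Largest difference between the two sets of fitted values.
max_gap = max(abs(a - b) for a, b in zip(pred_zero, pred_mean))
```

Here `max_gap` comes out at the level of floating-point rounding: changing the fill value only shifts the indicator's coefficient. With regularization, or with feature selection that may drop the indicator, the two choices are no longer interchangeable.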

For more details, see this
