Missing Data

 

Data are often missing. (Think about examples)

Missing at Random (MAR)

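Data are rarely missing at random, meaning that whether a value is missing depends only on the observed data and not on the missing value itself. When they are, there is usually a simple EM algorithm to impute the missing values. One can then do machine learning on the ‘complete’ data set.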

A more complete definition is here
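
As a rough illustration of EM-based imputation, here is a minimal sketch that assumes the rows are draws from a single multivariate Gaussian and that every column has at least one observed value. The function name em_impute, the fixed iteration count, and the small ridge term added for numerical stability are assumptions made for this example, not something specified in these notes.

```python
import numpy as np

def em_impute(X, n_iter=50, ridge=1e-6):
    """Fill NaNs in X with conditional means under a Gaussian fitted by EM."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)

    # Initialise from mean imputation.
    mu = np.nanmean(X, axis=0)
    filled = np.where(missing, mu, X)
    sigma = np.cov(filled, rowvar=False, bias=True) + ridge * np.eye(d)

    for _ in range(n_iter):
        # E-step: conditional mean of each missing block given the observed
        # block, plus the conditional covariance the M-step needs.
        extra_cov = np.zeros((d, d))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the marginal.
                filled[i] = mu
                extra_cov += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            gain = np.linalg.solve(s_oo, s_mo.T).T  # Sigma_mo @ inv(Sigma_oo)
            filled[i, m] = mu[m] + gain @ (X[i, o] - mu[o])
            extra_cov[np.ix_(m, m)] += sigma[np.ix_(m, m)] - gain @ s_mo.T

        # M-step: re-estimate the mean and covariance from the completed data.
        mu = filled.mean(axis=0)
        centered = filled - mu
        sigma = (centered.T @ centered + extra_cov) / n + ridge * np.eye(d)

    return filled
```

Observed entries are returned unchanged; only the NaN entries are replaced. When the features are correlated, the imputed values track the observed features in the same row instead of collapsing to the column mean.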

Missing Not at Random (MNAR)

Data are mostly not missing at random: they are missing for a good reason, so the fact that a value is missing is itself informative.

For regression, a standard approach is

  1. Replace each missing value with the mean of the observed values of that feature. (“imputation”)
  2. For each feature, add a separate indicator column: 1 if the value was missing, 0 if it was present.
  3. Run standard regression and feature selection on the augmented data.
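
The three steps above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the helper name impute_with_indicators and the toy data are made up for the example, missing values are assumed to be encoded as NaN, and feature selection is left out, so step 3 is plain least squares.

```python
import numpy as np

def impute_with_indicators(X):
    """Mean-impute NaNs and append one missing-indicator column per feature."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                  # True where a value is absent
    col_means = np.nanmean(X, axis=0)      # step 1: mean of the observed values
    X_imputed = np.where(missing, col_means, X)
    indicators = missing.astype(float)     # step 2: 1 if missing, 0 if present
    return np.hstack([X_imputed, indicators])

# Toy data, invented for illustration; NaN marks a missing value.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0],
              [5.0, 8.0],
              [7.0, 2.0],
              [2.0, np.nan]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

Z = impute_with_indicators(X)

# Step 3: ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(Z)), Z])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

The augmented matrix Z has twice as many columns as X: the imputed features first, then their indicators in the same order.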

Oddly, most packages don’t add the missing-value indicators automatically, although plenty of them will do step 1 (“imputation”).

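If you aren’t doing feature selection, you can just use a zero instead of the mean in step 1: with the indicator column in the model, an unregularized linear regression gives the same fitted values either way, because the indicator’s coefficient absorbs the difference.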
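
A quick numerical check of the zero-versus-mean claim, under the assumption of unregularized least squares with an intercept and an indicator column. The synthetic data and variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(loc=5.0, size=n)
y = 2.0 * x + rng.normal(size=n)

# Hide about 30% of the feature values.
missing = rng.random(n) < 0.3
m = missing.astype(float)                  # indicator column

x_mean = np.where(missing, x[~missing].mean(), x)  # mean imputation
x_zero = np.where(missing, 0.0, x)                 # zero imputation

def fitted_values(x_imputed):
    """Least-squares fit on [intercept, imputed feature, indicator]."""
    A = np.column_stack([np.ones(n), x_imputed, m])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

# The two design matrices span the same column space, so the fits agree
# up to floating-point error.
same_fit = np.allclose(fitted_values(x_mean), fitted_values(x_zero))
print(same_fit)
```

The two imputed columns differ only by a multiple of the indicator column, which is why the fitted values coincide. Once a penalty or a selection step can shrink or drop the indicator, this equivalence no longer holds.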

Page last modified on 17 November 2016 at 12:52 PM