Imputation

Estimating missing data is a classic use of EM. Consider the simplest case of imputation: replacing missing real values in a single column. A sensible approach is to replace the missing values with the average of the values that are observed. This can be derived using EM.

EM requires a generative model. In this case, we assume that {$ x \sim N(\mu,\Sigma) $}, where {$ \Sigma $} is just a scalar variance because there is only one column.

EM then alternates between computing the expected value of the missing data given the current estimates of the model parameters, and taking the maximum likelihood estimate of the model parameters given the current estimates of the missing data.

The expected value {$ E(x) $} of observations from a normal distribution is just {$ \mu $}, so the E step will just set the missing data to the current estimate of {$ \mu $}.

The maximum likelihood estimate of the model parameters is equally easy. The MLE of {$ \mu $} is just the average of the observations, {$ (1/n) \sum_i x_i $}, so the M step will just compute the average of all the data, including the values filled in by the E step.

Let's walk through a couple of rounds of this. Assume we have {$ x = (1,2, NA, 1, NA) $}

Initialize {$ \mu = 0, \Sigma = 1 $} - or anything else.

E step: set the missing values {$ x_3 $} and {$ x_5 $} to {$ \mu $}, getting {$ x = (1, 2, 0, 1, 0) $}

M step: estimate {$ \mu = (1+2+0+1+0)/5 = 0.8 $}

E step: set the missing values {$ x_3 $} and {$ x_5 $} to {$ \mu $}, getting {$ x = (1, 2, 0.8, 1, 0.8) $}

M step: estimate {$ \mu = (1+2+0.8+1+0.8)/5 = 1.12 $}

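This will converge to {$\mu = (1+2+1)/3 = 4/3 $}, the average of the observed values. To see why, note that at a fixed point the M step must return the same {$ \mu $} it was given, so {$ \mu = (1+2+1+2\mu)/5 $}, which gives {$ 3\mu = 4 $}.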
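The rounds above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `em_mean_impute`, the use of NaN to stand for NA, and the stopping tolerance are all assumptions made for the example.

```python
import math

def em_mean_impute(values, mu=0.0, tol=1e-10, max_iter=1000):
    """Alternate E and M steps on one column; NaN marks a missing value.

    Hypothetical helper for illustration only. Returns the final estimate
    of mu, the filled-in column, and the sequence of M-step estimates.
    """
    missing = [math.isnan(v) for v in values]
    filled = list(values)
    history = []
    for _ in range(max_iter):
        # E step: set each missing entry to the current estimate of mu.
        filled = [mu if is_missing else v
                  for v, is_missing in zip(values, missing)]
        # M step: re-estimate mu as the average of the filled-in column.
        new_mu = sum(filled) / len(filled)
        history.append(new_mu)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu, filled, history

x = [1.0, 2.0, float("nan"), 1.0, float("nan")]
mu, filled, history = em_mean_impute(x)
print(history[:2])  # the first two M-step estimates, about 0.8 and 1.12
print(mu)           # close to 4/3
```

Starting from a different initial {$ \mu $} changes the intermediate estimates but not the limit, since the only fixed point is the average of the observed values.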

Fancier imputation

The above model treated each column independently of the others, which is not usually realistic. We can generalize that model in several ways. One would be to assume that {$ \mathbf{x} \sim N(\mu,\Sigma) $} with {$ \Sigma $} now being a {$ p \times p $} covariance matrix. EM then looks very similar to the single-column case above.
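To make the E step under this model a little more concrete, here is a sketch. The subscripts {$ o $} and {$ m $}, for the observed and missing entries of a row, are notation introduced here for illustration. Split {$ \mu $} and {$ \Sigma $} into the corresponding blocks; the standard conditional mean of a multivariate normal is then

{$ E(x_m | x_o) = \mu_m + \Sigma_{mo} \Sigma_{oo}^{-1} (x_o - \mu_o) $}

so the E step fills each missing entry with this conditional mean rather than the plain column mean. If {$ \Sigma $} is diagonal, then {$ \Sigma_{mo} = 0 $} and this reduces to the single-column rule of setting missing values to {$ \mu $}. A full EM update of {$ \Sigma $} would also need the conditional covariance of the missing entries, which this sketch leaves out.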

Or we could use a similar EM-style alternation with a regression model.

  • Replace all missing values with an initial value, say 0.
  • Repeat until convergence
    • For each of the columns {$ j=1:p$} with missing values,
      • Fit a regression on all of the other columns, trained to predict the non-missing {$x_j$}
      • Use that regression model to estimate the missing {$x_j$}
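The loop above might look roughly like the following NumPy sketch. It assumes ordinary least-squares linear regression with an intercept, NaN as the missing-value marker, and at least one observed value in every column; the function name `regression_impute` and its parameters are made up for this example.

```python
import numpy as np

def regression_impute(X, n_rounds=50, tol=1e-8):
    """Iteratively impute NaNs by regressing each column on the others.

    Hypothetical helper for illustration; uses plain least squares.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Replace all missing values with an initial value of 0.
    filled = np.where(missing, 0.0, X)
    n, p = X.shape
    for _ in range(n_rounds):
        previous = filled.copy()
        for j in range(p):
            miss_j = missing[:, j]
            # Skip columns with nothing to impute or nothing to train on.
            if not miss_j.any() or miss_j.all():
                continue
            # Design matrix: an intercept plus all of the other columns.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(n), others])
            # Fit on the rows where x_j was actually observed.
            coef, *_ = np.linalg.lstsq(
                design[~miss_j], filled[~miss_j, j], rcond=None
            )
            # Use the fitted model to estimate the missing x_j.
            filled[miss_j, j] = design[miss_j] @ coef
        # Stop once a full pass no longer changes the imputed values.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled

# Toy data (made up): the second column is 2 * first + 1 where observed.
X = np.array([[1.0, 3.0],
              [2.0, 5.0],
              [3.0, np.nan],
              [4.0, 9.0],
              [5.0, np.nan]])
filled = regression_impute(X)
print(filled[:, 1])  # imputed entries should land near 7 and 11
```

Any other regression method could be swapped in for the least-squares fit; the alternation between fitting and re-imputing stays the same.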

Remember what assumptions imputation makes!

We are assuming in imputation that the missing data are generated from the same distribution as the observed data, i.e., that they are missing at random.
