Imputation

Estimating missing data is a classic use of EM. Consider the simplest case of imputation: replacing missing real values in a single column. A sensible approach is to replace the missing values with the average of the values that are observed. This can be derived using EM.
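This baseline is a one-liner; a minimal sketch (assuming NumPy, with a made-up vector for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 1.0, np.nan])
x[np.isnan(x)] = np.nanmean(x)  # fill missing entries with the mean of the observed ones
# x is now [1, 2, 4/3, 1, 4/3]
```

The rest of this section shows that EM arrives at exactly this answer.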

EM requires a generative model. In this case, we assume that {$ x \sim N(\mu,\Sigma) $}

EM then alternates between computing the expected value of the missing data given the current estimates of the model parameters, and taking the maximum likelihood estimate of the model parameters given the current estimates of the missing data.
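The alternation above can be sketched as a short loop (assuming NumPy; the function name and iteration count are illustrative, not part of the original notes):

```python
import numpy as np

def em_impute_mean(x, n_iters=100):
    """EM for single-column mean imputation: alternate filling the
    missing entries with the current mean estimate (E step) and
    re-estimating the mean from the filled-in data (M step)."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    mu = 0.0                  # any initialization works
    for _ in range(n_iters):
        x[missing] = mu       # E step: expected value of a missing entry is mu
        mu = x.mean()         # M step: MLE of mu is the sample mean
    return mu, x
```

For example, `em_impute_mean([1, 2, np.nan, 1, np.nan])` returns a mean of 4/3, matching the worked example below.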

The expected value {$ E(x) $} of observations from a normal distribution is just {$ \mu $}, so the E step will just set the missing data to the current estimate of {$ \mu $}.

The maximum likelihood estimate of the model parameters is equally easy. The MLE of {$ \mu $} is just the average of the observations, {$ (1/n) \sum_i x_i $}, so the M step will just compute the average of all the data (observed and imputed).

Let's walk through a couple of rounds of this. Assume we have {$ x = (1, 2, NA, 1, NA) $}.

Initialize {$ \mu = 0, \Sigma = 1 $} (or anything else).

E step: set the missing values {$ x_3 $} and {$ x_5 $} to {$ \mu $}, getting {$ x = (1, 2, 0, 1, 0) $}

M step: estimate {$ \mu = (1+2+0+1+0)/5 = 0.8 $}

E step: set the missing values {$ x_3 $} and {$ x_5 $} to {$ \mu $}, getting {$ x = (1, 2, 0.8, 1, 0.8) $}

M step: estimate {$ \mu = (1+2+0.8+1+0.8)/5 = 1.12 $}

This will converge to {$ \mu = (1+2+1)/3 = 4/3 $}, the average of the observed values. To see why, note that at the fixed point {$ \mu = (1+2+\mu+1+\mu)/5 $}; solving gives {$ 3\mu = 4 $}, i.e. {$ \mu = 4/3 $}.
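The iterations above can be rerun in plain Python to check both the first two rounds and the limit (variable names are illustrative):

```python
observed = [1, 2, 1]   # the observed entries of x
n = 5                  # total length, including the 2 missing values
mu = 0.0               # initialization from above
history = []
for _ in range(50):
    total = sum(observed) + 2 * mu   # E step: both missing values become mu
    mu = total / n                   # M step: average all five values
    history.append(mu)
# history[0] = 0.8, history[1] = 1.12, and mu converges to 4/3
```

Each round pulls {$ \mu $} a fixed fraction (2/5) of the way toward the previous estimate, so convergence to 4/3 is geometric and fast.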
