LDA

See the supplemental reading on LDA, which is quite clear.

This page will just give the big picture on what EM for LDA looks like.

Assume k topics (a hyperparameter) and a vocabulary of v words.

Input the d documents with the tokens in them. (A token is an instance of a word, so "I thought I saw her" has two occurrences of the word "I" among its 5 tokens.)

0) Initialize by randomly assigning every token in every document to a topic. Estimate each document's probability distribution over topics by counting. (So each of the d documents will have a k-dimensional distribution {$\theta_k$} over the topics.)
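A minimal sketch of this initialization step in Python (my own illustration, not from the reading: `docs` is assumed to be a list of documents, each a list of integer word ids, and every document is assumed to have at least one token):

    import numpy as np

    def init_assignments(docs, k, seed=0):
        """Randomly assign a topic to every token, then estimate each document's
        topic distribution theta by counting."""
        rng = np.random.default_rng(seed)
        assignments = [rng.integers(k, size=len(doc)) for doc in docs]
        theta = np.zeros((len(docs), k))
        for d, z in enumerate(assignments):
            counts = np.bincount(z, minlength=k)   # tokens of document d assigned to each topic
            theta[d] = counts / counts.sum()       # normalize to a distribution over the k topics
        return assignments, theta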

M-Step. Given the topic assignments for every token, estimate {$\beta_{jk} = p(word=j|topic=k)$} by counting, across all tokens in all documents, how often each word was assigned to each topic.
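A sketch of this counting under the same assumptions (the tiny pseudocount is mine, just so a topic that received no tokens doesn't cause a division by zero):

    import numpy as np

    def m_step(docs, assignments, k, v):
        """Estimate beta[j, t] = p(word=j | topic=t) by counting how often
        word j was assigned to topic t across all tokens in all documents."""
        counts = np.zeros((v, k)) + 1e-12          # pseudocount guards against empty topics
        for doc, z in zip(docs, assignments):
            for word, topic in zip(doc, z):
                counts[word, topic] += 1
        return counts / counts.sum(axis=0, keepdims=True)   # each column is a distribution over the v words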

E-Step. Given the {$\beta_{jk}$} from the M-step, for every token in every document estimate {$ p(topic=k|word=j) \propto p(word=j|topic=k)\, p(topic=k) $}, where {$p(topic=k)$} comes from that document's {$\theta$}.
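And a matching E-step sketch, where p(topic=k) for a token is read off its document's {$\theta$} (again my own variable names):

    def e_step(docs, beta, theta):
        """For every token, compute p(topic|word) proportional to
        p(word|topic) * p(topic), then normalize over the k topics."""
        posteriors = []
        for d, doc in enumerate(docs):
            doc_post = []
            for word in doc:
                p = beta[word, :] * theta[d]       # unnormalized posterior over topics for this token
                doc_post.append(p / p.sum())       # normalize so it sums to 1
            posteriors.append(doc_post)
        return posteriors

In a hard-EM version you would then reassign each token to its most probable topic and recount; a soft version would instead accumulate these posteriors as fractional counts.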

I've glossed over a bunch of details. In particular, to get the distribution {$\theta_k$}, instead of the MLE (counting), we use a prior. When we did coin flipping, we estimated p(heads) using a beta distribution as the prior. Now, instead of two outcomes (heads/tails, a Bernoulli distribution), we have a k-dimensional distribution, so the conjugate prior is a Dirichlet instead of a beta. But the idea is the same.
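For example, with a symmetric Dirichlet prior with parameter {$\alpha$}, the smoothed estimate just adds pseudocounts to the raw ones (this particular formula, the posterior mean, is my illustration of the idea rather than something from the reading): {$ \theta_{dk} = \frac{n_{dk} + \alpha}{n_d + k\alpha} $}, where {$n_{dk}$} is the number of tokens in document d assigned to topic k and {$n_d$} is the total number of tokens in document d. Setting {$\alpha = 0$} recovers the plain counting (MLE) estimate.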