Expectation-Maximization (EM) Algorithm: Detailed Explanation, Proofs, and Derivations
The Expectation-Maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models, particularly when the model depends on unobserved latent variables. The EM algorithm alternates between performing an expectation (E) step and a maximization (M) step.
Definition of the EM Algorithm
Let \( X \) be the observed data, \( Z \) be the latent variables, and \( \theta \) be the parameters of the model. The goal is to find the parameter values that maximize the likelihood function \( L(\theta) = p(X \mid \theta) \). The complete data likelihood is given by \( p(X, Z \mid \theta) \).
The EM algorithm consists of the following steps:
E-step (Expectation step)
Compute the expected value of the log-likelihood function with respect to the conditional distribution of the latent variables given the observed data and the current estimate of the parameters:
$$ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} [ \log p(X, Z \mid \theta) ] $$
M-step (Maximization step)
Find the parameter values that maximize the expected log-likelihood computed in the E-step:
$$ \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}) $$
Convergence
The EM algorithm guarantees that the observed-data likelihood does not decrease from one iteration to the next, and under suitable regularity conditions the parameter sequence converges to a stationary point of the likelihood function, which in practice is usually a local maximum.
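To make the iteration concrete, here is a minimal, model-agnostic sketch of the EM loop in Python. The function names e_step and m_step, the array-valued parameter theta, and the convergence tolerance are illustrative assumptions rather than part of the algorithm's definition.

```python
import numpy as np

def em(X, theta_init, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop (illustrative sketch).

    e_step(X, theta): returns the statistics of p(Z | X, theta) needed
                      to form Q(. | theta), e.g. responsibilities.
    m_step(X, stats): returns argmax_theta of the resulting Q-function.
    theta is assumed to be a NumPy array so convergence can be checked
    by the change in parameters; one could instead monitor the
    observed-data log-likelihood, which is guaranteed not to decrease.
    """
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        stats = e_step(X, theta)          # E-step: posterior over Z at theta^(t)
        theta_new = m_step(X, stats)      # M-step: maximize Q(theta | theta^(t))
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```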
Derivations and Proofs
Derivation of the E-step
Starting from the observed data likelihood, we introduce the latent variables \( Z \):
$$ \log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta) $$
Multiplying and dividing inside the sum by the posterior \( p(Z \mid X, \theta^{(t)}) \) and applying Jensen’s inequality to the concave logarithm yields a lower bound:
$$ \log p(X \mid \theta) = \log \sum_Z p(Z \mid X, \theta^{(t)}) \, \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta^{(t)})} \geq \mathbb{E}_{Z \mid X, \theta^{(t)}} [ \log p(X, Z \mid \theta) ] - \mathbb{E}_{Z \mid X, \theta^{(t)}} [ \log p(Z \mid X, \theta^{(t)}) ] $$
The first term in the inequality is the Q-function used in the E-step:
$$ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} [ \log p(X, Z \mid \theta) ] $$
The second term is the entropy of the posterior \( p(Z \mid X, \theta^{(t)}) \); it does not depend on \( \theta \) and can therefore be ignored in the M-step.
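As a sanity check on the bound, the short script below evaluates both sides for a toy two-component Bernoulli mixture with a single observation. The model, the parameter values, and the function names are assumptions made purely for illustration; they are not part of the derivation above.

```python
import numpy as np

def log_joint(x, k, pi, p):
    # log p(x, Z=k | theta) = log pi_k + log Bernoulli(x | p_k)
    return np.log(pi[k]) + x * np.log(p[k]) + (1 - x) * np.log(1 - p[k])

def log_marginal(x, pi, p):
    # log p(x | theta) = log sum_k p(x, Z=k | theta)
    return np.log(sum(np.exp(log_joint(x, k, pi, p)) for k in range(2)))

def lower_bound(x, pi, p, pi_t, p_t):
    # q(k) = p(Z=k | x, theta^(t)), the E-step posterior at theta^(t)
    q = np.array([np.exp(log_joint(x, k, pi_t, p_t)) for k in range(2)])
    q /= q.sum()
    # E_q[log p(x, Z | theta)] - E_q[log q(Z)]
    return sum(q[k] * (log_joint(x, k, pi, p) - np.log(q[k])) for k in range(2))

x = 1
pi_t, p_t = np.array([0.4, 0.6]), np.array([0.3, 0.8])   # current estimate theta^(t)
for pi0, p0, p1 in [(0.4, 0.3, 0.8), (0.5, 0.2, 0.9), (0.7, 0.5, 0.5)]:
    pi, p = np.array([pi0, 1 - pi0]), np.array([p0, p1])
    print(log_marginal(x, pi, p), ">=", lower_bound(x, pi, p, pi_t, p_t))
# The bound holds for every theta and is tight (equality) when theta = theta^(t).
```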
Derivation of the M-step
In the M-step, we maximize the Q-function:
$$ \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}) $$
Because the Q-function is the expected complete-data log-likelihood, this step typically amounts to differentiating \( Q(\theta \mid \theta^{(t)}) \) with respect to \( \theta \), setting the derivative to zero, and solving for \( \theta \), subject to any constraints on the parameters (for example, mixture weights summing to one).
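Concretely, since the expectation defining the Q-function is taken under the fixed posterior \( p(Z \mid X, \theta^{(t)}) \), the gradient can be moved inside the expectation, so (assuming the complete-data log-likelihood is differentiable) the first-order condition solved in the M-step reads:
$$ \nabla_{\theta} Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} \left[ \nabla_{\theta} \log p(X, Z \mid \theta) \right] = 0 $$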
Example: Gaussian Mixture Model
Consider a Gaussian mixture model with \( K \) components. The observed data \( X = \{ x_1, x_2, \ldots, x_N \} \) is assumed to be generated from a mixture of \( K \) Gaussian distributions with parameters \( \theta = \{ \pi_k, \mu_k, \Sigma_k \}_{k=1}^K \), where \( \pi_k \) are the mixture weights, \( \mu_k \) are the means, and \( \Sigma_k \) are the covariances.
E-step for GMM
The latent variables \( Z = \{ z_{nk} \} \) indicate component membership: \( z_{nk} = 1 \) if data point \( x_n \) was generated by component \( k \), and \( z_{nk} = 0 \) otherwise. The Q-function for the GMM is:
$$ Q(\theta \mid \theta^{(t)}) = \sum_{n=1}^N \sum_{k=1}^K \gamma_{nk} \left( \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right) $$
where \( \gamma_{nk} \) are the responsibilities, computed as:
$$ \gamma_{nk} = \frac{\pi_k^{(t)} \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} \mathcal{N}(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})} $$
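The following Python sketch computes these responsibilities. It assumes \( X \) is an \( N \times D \) NumPy array and that pi, mus, and Sigmas hold the current parameter estimates; the helper names log_gaussian and e_step are illustrative.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x_n | mu, Sigma) for an (N, D) data matrix X."""
    D = X.shape[1]
    diff = X - mu                              # (N, D) deviations from the mean
    L = np.linalg.cholesky(Sigma)              # Sigma = L L^T
    sol = np.linalg.solve(L, diff.T)           # columns are L^{-1} (x_n - mu)
    maha = np.sum(sol ** 2, axis=0)            # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + maha)

def e_step(X, pi, mus, Sigmas):
    """Responsibilities gamma[n, k] = p(z_nk = 1 | x_n, theta^(t))."""
    N, K = X.shape[0], len(pi)
    log_r = np.empty((N, K))
    for k in range(K):
        log_r[:, k] = np.log(pi[k]) + log_gaussian(X, mus[k], Sigmas[k])
    log_r -= log_r.max(axis=1, keepdims=True)  # normalize in log space for stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```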
M-step for GMM
The parameters are updated as follows:
$$ \pi_k^{(t+1)} = \frac{\sum_{n=1}^N \gamma_{nk}}{N} $$
$$ \mu_k^{(t+1)} = \frac{\sum_{n=1}^N \gamma_{nk} x_n}{\sum_{n=1}^N \gamma_{nk}} $$
$$ \Sigma_k^{(t+1)} = \frac{\sum_{n=1}^N \gamma_{nk} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_{n=1}^N \gamma_{nk}} $$
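A corresponding sketch of the M-step updates, under the same assumptions as the E-step code above (the name m_step is again illustrative):

```python
def m_step(X, gamma):
    """M-step updates for the GMM, given responsibilities gamma (N x K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                    # effective number of points per component
    pi = Nk / N                               # pi_k^(t+1)
    mus = (gamma.T @ X) / Nk[:, None]         # mu_k^(t+1): responsibility-weighted means
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                     # (N, D) deviations from the new mean
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])
    return pi, mus, Sigmas

# One full EM iteration for the GMM, reusing e_step from the sketch above:
#   gamma = e_step(X, pi, mus, Sigmas)
#   pi, mus, Sigmas = m_step(X, gamma)
```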