Diffusion models are a class of generative models that learn a data distribution by gradually corrupting data with noise and then learning to reverse that corruption. These models are particularly powerful for tasks like image generation, where the goal is to learn the underlying distribution of the data and generate new samples from it.
Mathematical Formulation
Forward Diffusion Process
The forward diffusion process is defined as a sequence of Gaussian noise additions to the data. Let \( \mathbf{x}_0 \) be the initial data point (e.g., an image). The forward process generates a sequence of latent variables \( \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T \) by gradually adding noise:
$$ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) $$
where \( \beta_t \) is a variance schedule that controls the amount of noise added at each step \( t \).
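A single forward step follows directly from this definition. The sketch below is illustrative only: the linear schedule (\( \beta_t \) from \( 10^{-4} \) to \( 0.02 \) over \( T = 1000 \) steps) is a common convention assumed here, and `betas`, `x_prev`, and `forward_step` are hypothetical names.

```python
import torch

# Hypothetical linear variance schedule beta_1 ... beta_T; the range
# 1e-4 to 0.02 over T = 1000 steps is a common convention, not prescribed here.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)  # z_t ~ N(0, I)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise
```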
Reverse Diffusion Process
The reverse diffusion process aims to recover the original data \( \mathbf{x}_0 \) from the noisy latent variable \( \mathbf{x}_T \). The reverse process is parameterized by a neural network \( p_\theta \) and is defined as:
$$ p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t)\right) $$
The mean \( \mu_\theta \) and variance \( \Sigma_\theta \) are learned by the neural network.
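The architecture is not fixed by this formulation; for images a U-Net is the usual choice. As a minimal sketch of the interface only, the toy module below maps \( (\mathbf{x}_t, t) \) to a predicted mean and holds the variance fixed (a common simplification); the class name and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class ToyReverseModel(nn.Module):
    """Toy p_theta: predicts mu_theta(x_t, t); Sigma_theta is held fixed."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.SiLU(),
            nn.Linear(128, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on t by appending the normalized step index as a feature;
        # real models use sinusoidal time embeddings instead.
        t_feat = t.float().view(-1, 1) / 1000.0
        return self.net(torch.cat([x_t, t_feat], dim=-1))
```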
Objective Function
The training objective for the diffusion model is to maximize the likelihood of the data under the reverse process. This can be achieved by minimizing the variational bound on the negative log-likelihood:
$$ L = -\mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] $$
Expanding this expression, we get:
$$ L = \mathbb{E}_{q} \left[ D_{KL}\big(q(\mathbf{x}_T | \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\big) + \sum_{t=2}^{T} D_{KL}\big(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\big) - \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \right] $$
where \( D_{KL} \) is the Kullback-Leibler divergence and \( p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I}) \) is the prior over the final latent; the KL terms compare the tractable forward-process posteriors with the learned reverse transitions.
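In practice this bound is rarely optimized directly. In denoising diffusion probabilistic models (DDPM), it is reweighted into a simple noise-prediction loss, where \( \alpha_t = \prod_{i=1}^{t}(1 - \beta_i) \) (derived below) and \( \boldsymbol{\epsilon}_\theta \) is a network trained to predict the noise \( \mathbf{z} \) added in the forward process:
$$ L_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{z}} \left[ \left\| \mathbf{z} - \boldsymbol{\epsilon}_\theta\!\left(\sqrt{\alpha_t}\, \mathbf{x}_0 + \sqrt{1 - \alpha_t}\, \mathbf{z},\, t\right) \right\|^2 \right] $$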
Derivation
Forward Process
The forward process gradually adds noise to the data. For each time step \( t \), the noisy data is obtained as:
$$ \mathbf{x}_t = \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1} + \sqrt{\beta_t}\, \mathbf{z}_t $$
where \( \mathbf{z}_t \sim \mathcal{N}(0, \mathbf{I}) \).
By iterating this process, we can express \( \mathbf{x}_t \) in terms of the original data \( \mathbf{x}_0 \):
$$ \mathbf{x}_t = \sqrt{\alpha_t}\, \mathbf{x}_0 + \sqrt{1 - \alpha_t}\, \mathbf{z} $$
where \( \alpha_t = \prod_{i=1}^{t} (1 - \beta_i) \) and \( \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \).
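Because of this closed form, \( \mathbf{x}_t \) can be sampled for any \( t \) in one shot rather than by iterating. A minimal sketch, reusing the illustrative schedule from the forward-step example (all names hypothetical):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # same illustrative schedule as above
alphas = torch.cumprod(1.0 - betas, dim=0)  # alpha_t = prod_{i<=t} (1 - beta_i)

def sample_xt(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_t) x_0, (1 - alpha_t) I) directly."""
    noise = torch.randn_like(x0)  # z ~ N(0, I)
    return torch.sqrt(alphas[t]) * x0 + torch.sqrt(1.0 - alphas[t]) * noise
```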
Reverse Process
The reverse process is defined by a neural network that learns the parameters \( \mu_\theta \) and \( \Sigma_\theta \). The training objective is to minimize the KL divergence between the true posterior \( q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \), which is Gaussian and tractable once conditioned on \( \mathbf{x}_0 \), and the model's reverse transition \( p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \):
$$ D_{KL}\big(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)\big) $$
Because both distributions are Gaussian, this divergence has a closed form in their means and variances, and the parameters \( \theta \) can be optimized with stochastic gradient descent. In practice \( \mu_\theta \) is usually reparameterized through the noise-prediction network \( \boldsymbol{\epsilon}_\theta \) introduced above.
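Concretely, one gradient step under the simplified noise-prediction objective might look like the following sketch (`eps_model` stands for any network with the \( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \) interface; it and the other names are hypothetical):

```python
import torch
import torch.nn.functional as F

def training_step(eps_model, x0: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """One stochastic-gradient step of the simplified DDPM objective.

    eps_model(x_t, t) is any network predicting the noise z; alphas holds the
    cumulative products alpha_t defined above.
    """
    T = alphas.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))             # random timestep per sample
    z = torch.randn_like(x0)                            # target noise
    a = alphas[t].view(-1, *([1] * (x0.dim() - 1)))     # broadcast alpha_t over dims
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * z  # closed-form forward sample
    return F.mse_loss(eps_model(x_t, t), z)             # || z - eps_theta(x_t, t) ||^2
```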
Example: Image Generation
In the context of image generation, the diffusion model is trained on a dataset of images. During training, the model learns to denoise images corrupted by Gaussian noise. Once trained, it generates new images by starting from pure Gaussian noise and iteratively applying the reverse diffusion process.
The model is trained to minimize the KL divergence between the forward-process posteriors and the learned reverse transitions, effectively learning to generate realistic images from noise.
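Putting the pieces together, a minimal ancestral-sampling loop starts from \( \mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \) and applies the learned transitions step by step. The sketch below assumes a trained noise-prediction model and uses the standard DDPM posterior mean with \( \Sigma_t = \beta_t \mathbf{I} \); all names are hypothetical:

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas: torch.Tensor) -> torch.Tensor:
    """Generate a sample by running the reverse process from x_T down to x_0."""
    alphas = torch.cumprod(1.0 - betas, dim=0)  # cumulative alpha_t as above
    x = torch.randn(shape)                      # x_T ~ N(0, I)
    for t in reversed(range(betas.shape[0])):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)             # predicted noise
        # Posterior mean mu_theta(x_t, t) under the noise-prediction parameterization.
        mean = (x - betas[t] / torch.sqrt(1.0 - alphas[t]) * eps) \
               / torch.sqrt(1.0 - betas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # Sigma_t = beta_t I
        else:
            x = mean                            # no noise at the final step
    return x
```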
Conclusion
Diffusion models provide a robust framework for generative modeling by leveraging the principles of stochastic processes. They are particularly effective for high-dimensional data like images, offering a powerful tool for tasks such as image synthesis and denoising.