Denoising Diffusion Probabilistic Models (DDPM) and links with Flow Matching
- ((choice)) $\beta_t$: small added noise variance at step $t$
- $\alpha_t = 1 - \beta_t$: signal retention factor at step $t$
- $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$: overall signal retention factor from the start to step $t$
- $1 - \bar{\alpha}_t$: overall noise variance added from the start to step $t$
- ((choice)) $\sigma_t^2$: variance of the backward step at step $t$
- $q(x_t \mid x_{t-1})$: forward/noising step distribution
- $q(x_t \mid x_0)$: big-jump forward/noising step distribution
- $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$: big-jump sampling for step $t$
- $q(x_{t-1} \mid x_t, x_0)$: “sandwich view”
- $\tilde{\mu}_t(x_t, x_0)$: sandwich mean
- $\tilde{\beta}_t$: sandwich variance
This lesson introduces the principles of diffusion models. The aim is to clearly explain, with both the intuition and the detailed derivation, the original Denoising Diffusion Probabilistic Models (DDPM). The derivations are quite detailed, including some parts that are usually skipped in papers/blogs.
Diffusion models predate flow matching models, and were the first models to achieve high-quality image generation results comparable to GANs. The standard diffusion setting aims at learning to map a noise distribution to the data distribution, typically via a continuous normalizing flow. Even though the formulation is quite different and stochastic, in many ways, diffusion models can be seen as a special case of flow matching models, with a specific choice of noise distribution and trajectory parametrization.
A key breakthrough of diffusion models was the ability to specify the complete evolution of the probability distribution between noise and data. This makes it possible to supervise the generative/denoising process (a continuous normalizing flow) at every step, instead of just specifying the distribution at the first and last steps.
To achieve this, diffusion models imagine a forward noising process, which progressively adds noise to the data until it becomes pure noise. The generative process is then trained to reverse this noising process, and denoise the data step by step.
Common knowledge about Gaussians / normal distributions
The goal here is not to define the normal distribution, nor to explain all its properties. We will just recall some properties of Gaussians that will be useful in the rest of the lesson.
KL divergence between two normals
The KL divergence between two normal distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ is given by:

$$ D_{\mathrm{KL}}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big) = \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2} + \frac{1}{2}\left( \frac{\sigma_1^2}{\sigma_2^2} - 1 - \log \frac{\sigma_1^2}{\sigma_2^2} \right) $$

In this post, only the first term will be useful: it is the only term depending on the variable of interest (the means; the variances will all be fixed).
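As a sanity check, here is a minimal numerical sketch of this formula; the parameter values are arbitrary illustrative choices:

```python
# Numerical sanity check of the closed-form KL between two 1-D Gaussians.
import numpy as np
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.3, 0.8, -0.5, 1.2  # arbitrary illustrative parameters

# Closed form: mean term (the one used in this post) + variance terms.
kl_closed = (mu1 - mu2) ** 2 / (2 * s2**2) \
    + 0.5 * (s1**2 / s2**2 - 1 - np.log(s1**2 / s2**2))

# Monte-Carlo estimate of E_p[log p(x) - log q(x)].
x = norm.rvs(loc=mu1, scale=s1, size=1_000_000, random_state=0)
kl_mc = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))
print(kl_closed, kl_mc)  # the two estimates should agree to ~3 decimals
```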
Product of two normal densities
The product of two normal densities $\mathcal{N}(x; \mu_1, \sigma_1^2)$ and $\mathcal{N}(x; \mu_2, \sigma_2^2)$ is proportional to another normal density:

$$ \mathcal{N}(x; \mu_1, \sigma_1^2)\, \mathcal{N}(x; \mu_2, \sigma_2^2) \propto \mathcal{N}(x; \mu_{12}, \sigma_{12}^2) $$

or more explicitly,

$$ \frac{1}{\sigma_{12}^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} $$

and

$$ \mu_{12} = \sigma_{12}^2 \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right) $$
Proof/Derivation
We work in log space, keeping only the terms depending on $x$ (the rest is the normalization constant of a Gaussian).
So, first, let's see that the log of a Gaussian density can be expanded as:

$$ \log \mathcal{N}(x; \mu, \sigma^2) = -\frac{(x - \mu)^2}{2\sigma^2} + \mathrm{cst} = -\frac{x^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\, x + \mathrm{cst} $$
This formula is useful for identifying the parameters of a Gaussian density from its expanded log density.
For the product of two Gaussian densities, we have:

$$ \log\big( \mathcal{N}(x; \mu_1, \sigma_1^2)\, \mathcal{N}(x; \mu_2, \sigma_2^2) \big) = -\frac{x^2}{2} \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right) + x \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right) + \mathrm{cst} $$

We can identify $\sigma_{12}^2$ directly, and then deduce $\mu_{12}$:

$$ \frac{1}{\sigma_{12}^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \qquad \text{and} \qquad \frac{\mu_{12}}{\sigma_{12}^2} = \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} $$
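The result can also be checked numerically; a minimal sketch, with arbitrary illustrative parameters:

```python
# Check that N(x; mu1, s1^2) * N(x; mu2, s2^2) has a constant ratio to
# N(x; mu12, s12^2), i.e., that the product is proportional to it.
import numpy as np
from scipy.stats import norm

mu1, s1, mu2, s2 = 1.0, 0.5, -2.0, 1.5       # arbitrary illustrative parameters
s12 = (1 / s1**2 + 1 / s2**2) ** -0.5        # precisions add up
mu12 = s12**2 * (mu1 / s1**2 + mu2 / s2**2)  # precision-weighted means

x = np.linspace(-4.0, 4.0, 9)
ratio = norm.pdf(x, mu1, s1) * norm.pdf(x, mu2, s2) / norm.pdf(x, mu12, s12)
print(np.allclose(ratio, ratio[0]))  # True: constant ratio => proportional
```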
Identities on normal random variables
- scaling: if $x \sim \mathcal{N}(\mu, \sigma^2)$, then $a x \sim \mathcal{N}(a\mu, a^2\sigma^2)$,
- summing: if $x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $x_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then $x_1 + x_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$,
- combining both: $a x_1 + b x_2 \sim \mathcal{N}(a\mu_1 + b\mu_2, a^2\sigma_1^2 + b^2\sigma_2^2)$.
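These identities are easy to check empirically; a minimal sketch with arbitrary constants:

```python
# Empirical check of the combined identity: for independent x1, x2,
# a*x1 + b*x2 ~ N(a*mu1 + b*mu2, a^2*s1^2 + b^2*s2^2).
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.8, 0.6
x1 = rng.normal(2.0, 1.5, size=1_000_000)
x2 = rng.normal(-1.0, 0.5, size=1_000_000)
y = a * x1 + b * x2
print(y.mean(), a * 2.0 + b * (-1.0))          # means should match
print(y.var(), a**2 * 1.5**2 + b**2 * 0.5**2)  # variances should match
```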
Forward/noising process
In this section, we introduce the forward/noising process, and derive all properties that are necessary afterwards.
Global view
The forward/noising process progressively adds noise to the data until it becomes pure noise. As shown in Fig. A, the forward process starts from data points (e.g. images from the training set), also named $x_0$. The forward process progressively adds Gaussian noise to these data points, until they are completely shuffled and become close to pure noise. The total process is run for a finite (but high) number of steps $T$, and we (almost) obtain $x_T \sim \mathcal{N}(0, I)$.
Local view
The forward/noising process is defined as a Markov chain, where each step adds a bit of Gaussian noise to the previous step. As shown in Fig. A, the position $x_t$ at step $t$ depends only on the position $x_{t-1}$ at step $t-1$ (hence the Markov property). More precisely, the forward/noising process is defined as:

$$ q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) $$
Even more precisely:
- we suppose the original dataset has been normalized,
- we can decide on a variance addition schedule $(\beta_t)_{t=1}^{T}$, saying how much noise variance to add at each step,
- if we only added noise at each step, the distribution would become more and more spread out along time steps (its variance would grow with $t$) and would not reach a Gaussian noise with identity covariance,
- to avoid this, we also rescale the signal at each step by a factor $\sqrt{1 - \beta_t} = \sqrt{\alpha_t}$.
The goal of the rescaling is to ensure that the variance of the dataset at step $t$ is always $I$, whatever $t$ (given that the normalized dataset has unit variance). Overall, the forward/noising process is defined as:

$$ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right) $$
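As an illustration, here is a minimal sketch of one forward step; the linear $\beta_t$ schedule and the value of $T$ are illustrative choices, not prescribed by the derivation:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear schedule beta_t
alphas = 1.0 - betas                 # alpha_t
alpha_bars = np.cumprod(alphas)      # bar{alpha}_t

def forward_step(x_prev, t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise
```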
Big-jump view
Thanks to the properties of Gaussians, we can express the distribution of $x_t$ at step $t$ directly as a function of the initial data point $x_0$. Indeed, since each step is adding a Gaussian noise (and rescaling), the composition of all the steps is also a Gaussian distribution, as shown in Fig. A. More precisely, we have:

$$ q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right) $$

with $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ the overall signal retained from the start to step $t$, in which $\alpha_t = 1 - \beta_t$ is the signal retention factor at step $t$.
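Continuing the sketch above (reusing `betas`, `alpha_bars` and `forward_step`), we can check, at least statistically, that composing the local steps matches the big jump:

```python
def big_jump(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(bar_alpha_t) x_0, (1 - bar_alpha_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = np.ones(100_000)         # toy "dataset": a constant unit signal
t = 500
x_seq = x0
for s in range(t + 1):        # compose the local steps up to t
    x_seq = forward_step(x_seq, s, rng)
x_big = big_jump(x0, t, rng)
print(x_seq.mean(), x_big.mean())  # both close to sqrt(bar_alpha_t)
print(x_seq.std(), x_big.std())    # both close to sqrt(1 - bar_alpha_t)
```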
Sandwich view
As this will become useful, we can also express the distribution of $x_{t-1}$ as a function of the position $x_t$ at step $t$ and the initial data point $x_0$. This is illustrated in Fig. A:
- the two densities in blue correspond to the information on $x_{t-1}$ that we have thanks to $x_t$ (the peaky one) and thanks to $x_0$ (the wide one),
- their product (in pink), once renormalized, gives us the distribution of $x_{t-1}$ knowing both $x_t$ and $x_0$.
More precisely, we have:

$$ q(x_{t-1} \mid x_t, x_0) \propto q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0) $$

We will show that we can derive a closed-form expression for this distribution:

$$ q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right) $$

with

$$ \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 \qquad \text{and} \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t $$
The proof relies on showing that $q(x_{t-1} \mid x_t, x_0)$ (the anti-forward step) is proportional to a Gaussian density in $x_{t-1}$, and then using the product-of-two-Gaussians property that we derived in the preliminaries.
Proof/Derivation
Knowing $x_0$, we can express the anti-forward step using Bayes' rule. Remembering that this is a distribution over $x_{t-1}$, we can focus on the factors that depend on it (and thus drop $q(x_t \mid x_0)$ below):

$$ q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} \propto q(x_t \mid x_{t-1})\, q(x_{t-1} \mid x_0) $$

Seen as functions of $x_{t-1}$, $q(x_t \mid x_{t-1})$ is proportional to a Gaussian density with mean $x_t / \sqrt{\alpha_t}$ and variance $\beta_t / \alpha_t$, and $q(x_{t-1} \mid x_0)$ is a Gaussian density with mean $\sqrt{\bar{\alpha}_{t-1}}\, x_0$ and variance $1 - \bar{\alpha}_{t-1}$. Applying the product-of-two-Gaussians identities (precisions add up; means are precision-weighted) yields the expressions for $\tilde{\beta}_t$ and $\tilde{\mu}_t$ given above.
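Still reusing the schedule arrays from the forward-process sketch, the sandwich parameters can be cross-checked against the product-of-Gaussians rule (scalar case, illustrative values):

```python
def sandwich(x_t, x0, t):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0)."""
    a_bar_prev = alpha_bars[t - 1]
    mu = (np.sqrt(alphas[t]) * (1 - a_bar_prev) * x_t
          + np.sqrt(a_bar_prev) * betas[t] * x0) / (1 - alpha_bars[t])
    var = (1 - a_bar_prev) / (1 - alpha_bars[t]) * betas[t]
    return mu, var

# Cross-check via the product of the two Gaussian densities over x_{t-1}.
t, x_t, x0 = 500, 0.7, 1.0
prec = alphas[t] / betas[t] + 1 / (1 - alpha_bars[t - 1])
mu = (np.sqrt(alphas[t]) * x_t / betas[t]
      + np.sqrt(alpha_bars[t - 1]) * x0 / (1 - alpha_bars[t - 1])) / prec
print(sandwich(x_t, x0, t))  # should match the line below
print(mu, 1 / prec)
```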
Learnable backward process
Inspired by the forward/noising process, we define a backward/denoising process that has parameters $\theta$, so that we can learn it to reverse the forward/noising process. As shown in Fig. A, the backward/denoising process starts from some random noise $x_T \sim \mathcal{N}(0, I)$. It is also defined as a Markov chain, where each step $x_{t-1}$ only depends on the previous step $x_t$.
Said differently, the process is defined by the distributions $p_\theta(x_{t-1} \mid x_t)$ for $t = T, \dots, 1$, where $\theta$ are the learnable parameters of the model. The form of $p_\theta(x_{t-1} \mid x_t)$ is actually a Gaussian distribution, with a mean predicted by a neural network, and a variance that can be fixed or learned. We typically have, with a fixed denoising variance schedule $(\sigma_t^2)_{t=1}^{T}$:

$$ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right) $$

and finally write:

$$ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \qquad \text{with} \qquad p(x_T) = \mathcal{N}(x_T; 0, I) $$
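A minimal sketch of one backward step, where `model` stands for any hypothetical callable predicting the mean $\mu_\theta(x_t, t)$; it reuses the schedule arrays above, and $\sigma_t^2 = \beta_t$ is just one common fixed choice:

```python
sigmas = np.sqrt(betas)  # assumption: fixed schedule sigma_t^2 = beta_t

def backward_step(model, x_t, t, rng):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I)."""
    mu = model(x_t, t)
    if t == 0:
        return mu  # conventionally, no noise is added on the final step
    return mu + sigmas[t] * rng.standard_normal(x_t.shape)
```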
Overall goal: how to guide learning?
The parameters $\theta$ of the backward/denoising process will be learned by fitting the backward steps to the forward steps. More precisely, we want to minimize the KL divergence between the distribution induced by the forward process $q$ and the one induced by the backward process $p_\theta$, as hinted in Fig. A.
From global to local KL
This part details the key insight of how to transform a global KL loss into a sum of local ones, eventually expressed as square errors.
Overview
We will show that we can simplify the global objective into a sum of local objectives, one for each step, with the proper conditioning. We will follow a few steps:
- starting by writing the KL divergence between the two joint distributions of the Markov chains, in their natural direction (forward for the noising process, backward for the learned process),
- re-introducing an expectation on $x_0$ to make the expression tractable (below, using the “sandwich view”),
- “reversing” the conditional forward noising process,
- using the sandwich view,
- using the closed-form of the KL between Gaussians to get a final closed-form loss.
The final terms involved in the loss are KL divergences between two Gaussian distributions, which can be computed in closed form. These are illustrated in Fig. A.
Goal: KL divergence between the forward and backward processes
We decompose the global KL divergence between the forward noising process and the backward learnable process. A big part of these derivations is also detailed at the end of this page, in isolation from the rest. Up to an additive constant in $\theta$, the decomposition eventually yields:

$$ D_{\mathrm{KL}}\big(q(x_{0:T}) \,\|\, p_\theta(x_{0:T})\big) = \mathbb{E}_{q}\left[ D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \right] + \text{cst} $$

where the first term inside the expectation does not depend on $\theta$.
All involved distributions are either fixed (at $t = T$, the constant $p(x_T) = \mathcal{N}(0, I)$) or Gaussian with a known closed form (the others).
The empty sandwich, overriding $\tilde{\mu}_1$ and $\tilde{\beta}_1$ (case of $t = 1$)
The case of $t = 1$ is a bit special, as we condition on $x_0$ itself, and so there is no real “sandwich” (the sandwich equation gives $q(x_0 \mid x_1, x_0)$, which is a Dirac at $x_0$). The initial term is directly the reconstruction term, i.e., $\mathbb{E}_q[-\log p_\theta(x_0 \mid x_1)]$.
To avoid having to handle this special case below, we override $\tilde{\mu}_1(x_1, x_0) = x_0$ and $\tilde{\beta}_1 = 0$.
We thus get a closed-form expression for the loss, involving square norms (coming from the first term of the KL between Gaussians):

$$ L(\theta) = \mathbb{E}_{t, x_0, x_t}\left[ \frac{1}{2\sigma_t^2} \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2 \right] + \text{cst} $$
with $t$ uniform in $\{1, \dots, T\}$ and $x_0$ sampled from the dataset, and where, as above, $x_t$ can be equivalently sampled either with the big jump $q(x_t \mid x_0)$ or by composing the individual forward steps.
Overall, thanks to all these derivations, we managed to get a loss that is simple, as it is conditioned on the data point $x_0$ and is local in time (a sum over $t$).
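Put together, a hedged sketch of this mean-fitting loss for one sample, with `mu_model` a hypothetical callable standing for $\mu_\theta(x_t, t)$ (reusing the earlier sketches; indices are 0-based, and the $t = 1$ override is glossed over):

```python
def mu_loss(mu_model, x0, rng):
    """Monte-Carlo estimate of the per-sample, per-step mean-fitting loss."""
    t = int(rng.integers(1, T))                    # uniform step, t >= 1
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    mu_tilde, _ = sandwich(x_t, x0, t)             # target: the sandwich mean
    return np.mean((mu_tilde - mu_model(x_t, t)) ** 2) / (2 * sigmas[t] ** 2)
```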
All-in-one visual
What to fit?
The above derivations reason in terms of fitting the means $\mu_\theta(x_t, t)$ of the backward steps to the sandwich means $\tilde{\mu}_t(x_t, x_0)$.
Based on the “big jump view” of the forward process, and the Gaussian properties, we can reparametrize the sampling of $x_t$ (conditioned on $x_0$) as a function of $x_0$ and some noise $\epsilon \sim \mathcal{N}(0, I)$:

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon $$

Which, once reversed, gives:

$$ x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \right) $$

We can plug this expression into $\tilde{\mu}_t(x_t, x_0)$ to make it depend only on $x_t$ and $\epsilon$, not on $x_0$:

$$ \tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right) $$
The constants can be refined further, but for now, let's look at the implications. Using the sampling of $x_t$ reparametrized with $\epsilon$, and substituting the value of $\tilde{\mu}_t$ we just derived, we can rewrite the loss as:

$$ L(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[ \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right) - \mu_\theta(x_t, t) \right\|^2 \right] + \text{cst} $$

So, up to the time reweighting (a uniform weight instead of $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}$), as $\mu_\theta$ takes as input $x_t$ and $t$, we can equivalently train a network to predict the noise $\epsilon$ that was used to generate $x_t$ from $x_0$, instead of predicting the mean. We can thus define a network $\epsilon_\theta(x_t, t)$ and train it to minimize the loss:

$$ L_\epsilon(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right] $$

With the two-way mapping between $\mu_\theta$ and $\epsilon_\theta$ being:

$$ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) $$
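The corresponding noise-prediction loss and the two-way mapping, as a sketch, with `eps_model` a hypothetical callable standing for $\epsilon_\theta(x_t, t)$ (still reusing the schedule arrays above):

```python
def eps_loss(eps_model, x0, rng):
    """Simplified noise-prediction loss (uniform time weighting)."""
    t = int(rng.integers(1, T))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def mu_from_eps(eps_model, x_t, t):
    """Two-way mapping: recover mu_theta(x_t, t) from the predicted noise."""
    eps_hat = eps_model(x_t, t)
    return (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
```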
Aside: link with flow matching
Looking at the algorithms, we can uncover the link with flow matching. Conceptually, both sample a time step (although with different semantics), a data point and a unit noise. However, they differ in the path, i.e., the formula for $x_t$:
- diffusion aims at preserving the variance across time: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$,
- flow matching (in its typical form) aims at a linear interpolation between data and noise: $x_t = (1 - t)\, x_0 + t\, \epsilon$.
We can however instantiate a flow matching model that matches the diffusion path. The similarities/differences are then just in the time weighting and in what is fit. The details are left out for now.
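To make the difference concrete, a sketch of the two corruption paths side by side; the flow-matching time convention used here (data at $t = 0$, noise at $t = 1$) is an assumption:

```python
def diffusion_point(x0, eps, t):
    """Variance-preserving diffusion path; t is an integer step in [0, T)."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

def flow_matching_point(x0, eps, t):
    """Linear interpolation path; t is a real time in [0, 1]."""
    return (1 - t) * x0 + t * eps
```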
Aside: Bayes rule over a Markov chain
Even though the forward/noising process is defined forward in time, we can also express it in the opposite direction. It corresponds to a different decomposition of the joint distribution of the Markov chain $(x_0, \dots, x_T)$:

$$ q(x_{0:T}) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1}) = q(x_T) \prod_{t=1}^{T} q(x_{t-1} \mid x_t) $$
A different view / proof (keeping the conditioning on $x_0$)
We can develop the Markov chain, and apply Bayes' rule at each step:

$$ q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}, x_0) = \prod_{t=1}^{T} \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} $$

We can introduce the conditioning on $x_0$ thanks to the independence of $x_t$ from $x_0$ when $x_{t-1}$ is known (the Markov property).
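For completeness, here is how the ratio factors telescope under the conventions above (using $q(x_0 \mid x_0) = 1$ for the Dirac term), recovering the reversed decomposition:

```latex
\prod_{t=1}^{T} \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}
  = \frac{q(x_T \mid x_0)}{q(x_0 \mid x_0)}
  = q(x_T \mid x_0)
\quad\Longrightarrow\quad
q(x_{1:T} \mid x_0) = q(x_T \mid x_0) \prod_{t=1}^{T} q(x_{t-1} \mid x_t, x_0)
```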