This lesson is currently in beta. It may contain inaccuracies or incomplete information. Please give feedback on any issues or suggestions for improvement!

Denoising Diffusion Probabilistic Models (DDPM) and links with Flow Matching


  • $\beta_t$ (choice): small added noise variance at step $t$
  • $\alpha_t \triangleq 1-\beta_t$: signal retention factor at step $t$
  • $\bar\alpha_t \triangleq \prod_{s=1}^{t}\alpha_s$: overall signal retention factor from the start to step $t$
  • $1-\bar\alpha_t$: overall noise variance added from the start to step $t$
  • $\sigma_t^2$ (choice): variance of the backward step at step $t$
  • $q(x_t|x_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$: forward/noising step distribution
  • $q(x_t|x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$: big-jump forward/noising step distribution
  • $q(x_{t-1}|x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}}\,x_0,\ (1-\bar\alpha_{t-1}) I\right)$: big-jump for step $t-1$
  • $q(x_{t-1}|x_t,x_0) = \mathcal{N}\!\left(\tilde\mu,\ \tilde\beta I\right)$: “sandwich view”
  • $\tilde\mu(x_0,t,x_t) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t$: sandwich mean
  • $\tilde\beta(t) = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$: sandwich variance

This lesson introduces the principles of diffusion models. The aim is to clearly explain, with both the intuition and the detailed derivation, the original Denoising Diffusion Probabilistic Models (DDPM). The derivations are quite detailed, including some parts that are usually skipped in papers/blogs.

Diffusion models predate flow matching models, and were the first models to achieve high-quality image generation results comparable to GANs. The standard diffusion setting aims at learning to map a noise distribution, typically $\mathcal{N}(0,I)$, to the data distribution via a continuous normalizing flow. Even though the formulation is quite different and stochastic, diffusion models can in many ways be seen as a special case of flow matching models, with a specific choice of noise distribution and trajectory parametrization.

A key breakthrough of diffusion models was the ability to specify the complete evolution of the probability distribution between noise and data. This makes it possible to supervise the generative/denoising process (a continuous normalizing flow) at every step, instead of only specifying the distributions at the first and last steps.

To achieve this, diffusion models imagine a forward noising process, which progressively adds noise to the data until it becomes pure noise. The generative process is then trained to reverse this noising process, and denoise the data step by step.

Common knowledge about Gaussians / normal distributions

The goal here is not to define the normal distribution, nor to explain all its properties. We will just recall some properties of Gaussians that will be useful in the rest of the lesson.

KL divergence between two normals

The KL divergence between two normal distributions $\mathcal{N}(\mu_1,\sigma_1^2)$ and $\mathcal{N}(\mu_2,\sigma_2^2)$ is given by:

$$\mathrm{KL}\left(\mathcal{N}(\mu_1,\sigma_1^2)\ \|\ \mathcal{N}(\mu_2,\sigma_2^2)\right)=\frac{(\mu_1-\mu_2)^2}{2\sigma_2^2}+\left(\frac{\sigma_1^2}{2\sigma_2^2}\right)-\frac12+\log\left(\frac{\sigma_2}{\sigma_1}\right)$$

In this lesson, only the first term will be useful: it is the only term that depends on the variable of interest.
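As a sanity check, here is a minimal NumPy/SciPy sketch (the example parameters are arbitrary) comparing the closed form with a direct numerical integration of $\int p_1\ln(p_1/p_2)$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_closed_form(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ), using the formula above."""
    return (mu1 - mu2) ** 2 / (2 * s2**2) + s1**2 / (2 * s2**2) - 0.5 + np.log(s2 / s1)

def kl_numerical(mu1, s1, mu2, s2):
    """Same quantity, by numerically integrating p1(x) * ln(p1(x) / p2(x))."""
    f = lambda x: norm.pdf(x, mu1, s1) * (norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))
    return quad(f, -40, 40)[0]

print(kl_closed_form(0.3, 1.2, -0.5, 0.8))  # ~0.72
print(kl_numerical(0.3, 1.2, -0.5, 0.8))    # matches, up to integration error
```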

Product of two normal densities

The product of two normal densities $\mathcal{N}(\mu_1,\sigma_1^2)$ and $\mathcal{N}(\mu_2,\sigma_2^2)$ is proportional to another normal density:

$$\mathcal{N}(\mu_1,\sigma_1^2)\times\mathcal{N}(\mu_2,\sigma_2^2)\ \propto\ \mathcal{N}\!\left(\frac{\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}},\ \frac{1}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}\right)$$

or more explicitly,

$$\exists K_1,\ \exists\mu,\sigma,\ \forall x:\quad \mathcal{N}(\mu_1,\sigma_1^2)(x)\times\mathcal{N}(\mu_2,\sigma_2^2)(x)=K_1\times\mathcal{N}(\mu,\sigma^2)(x)$$

and

$$\mu=\frac{\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}=\frac{\sigma_2^2\mu_1+\sigma_1^2\mu_2}{\sigma_1^2+\sigma_2^2}\qquad \sigma^2=\frac{1}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}=\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}$$

Proof/Derivation

We work in log space, keeping only the terms that depend on $x$ (the rest is the normalization constant of a Gaussian).

So, first, let’s see that the log of a Gaussian density $\mathcal{N}(\mu,\sigma^2)$ can be expanded as:

$$\begin{aligned}
\ln\left(\mathcal{N}(\mu,\sigma^2)(x)\right)&=-\frac12\frac{(x-\mu)^2}{\sigma^2}+C_1=-\frac12\left(\frac{x^2}{\sigma^2}-\frac{2\mu x}{\sigma^2}+\frac{\mu^2}{\sigma^2}\right)+C_1\\
&=-\frac{1}{2\sigma^2}\left(x^2-2\mu x\right)+C_2=-\frac12\left(\frac{1}{\sigma^2}x^2-\frac{2\mu}{\sigma^2}x\right)+C_2
\end{aligned}$$

This formula is useful for identifying the parameters of a Gaussian density from its expanded log density.

For the product of two Gaussian densities, we have:

$$\begin{aligned}
\ln\left(\mathcal{N}(\mu_1,\sigma_1^2)(x)\times\mathcal{N}(\mu_2,\sigma_2^2)(x)\right)&=-\frac12\left(\frac{(x-\mu_1)^2}{\sigma_1^2}+\frac{(x-\mu_2)^2}{\sigma_2^2}\right)+C_3\\
\text{(developing)}\quad&=-\frac12\left(\frac{x^2}{\sigma_1^2}-\frac{2\mu_1x}{\sigma_1^2}+\frac{\mu_1^2}{\sigma_1^2}+\frac{x^2}{\sigma_2^2}-\frac{2\mu_2x}{\sigma_2^2}+\frac{\mu_2^2}{\sigma_2^2}\right)+C_3\\
\text{(factorizing, pushing constant terms into the constant)}\quad&=-\frac12\left(\left(\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}\right)x^2-2\left(\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}\right)x\right)+C_4\\
&=-\frac12\left(\frac{1}{\sigma^2}x^2-\frac{2\mu}{\sigma^2}x\right)+C_4
\end{aligned}$$

We can identify $\sigma^2$ directly, and then deduce $\mu$:

$$\sigma^2=\frac{1}{\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2}}=\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}\qquad \mu=\sigma^2\left(\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}\right)=\frac{\sigma_2^2\mu_1+\sigma_1^2\mu_2}{\sigma_1^2+\sigma_2^2}$$
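A quick numerical check of the identified parameters (a NumPy sketch, with arbitrary example values):

```python
import numpy as np

def npdf(mu, sigma, x):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu1, s1, mu2, s2 = 1.0, 0.7, -2.0, 1.5
x = np.linspace(-12.0, 12.0, 24001)

prod = npdf(mu1, s1, x) * npdf(mu2, s2, x)
prod /= np.trapz(prod, x)  # renormalize the product

var = (s1**2 * s2**2) / (s1**2 + s2**2)           # harmonic combination of variances
mu = (s2**2 * mu1 + s1**2 * mu2) / (s1**2 + s2**2)

print(np.max(np.abs(prod - npdf(mu, np.sqrt(var), x))))  # ~0, up to grid error
```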

Identities on normal mean

  • $\mathcal{N}(\mu,\sigma^2)(x)=\mathcal{N}(x,\sigma^2)(\mu)$
  • $\mathcal{N}(a\mu,\sigma^2)(x)\propto\mathcal{N}\!\left(\mu,\frac{\sigma^2}{a^2}\right)\!\left(\frac{x}{a}\right)$ (equality up to a constant factor $\frac{1}{|a|}$, which is all we need below)
  • combining both: $\mathcal{N}(ax,\sigma^2)(\mu)\propto\mathcal{N}\!\left(\frac{\mu}{a},\frac{\sigma^2}{a^2}\right)(x)$

Forward/noising process

In this section, we introduce the forward/noising process, and derive all properties that are necessary afterwards.

Global view

The forward/noising process progressively adds noise to the data until it becomes pure noise. As shown in Fig. A, the forward process starts from data points $\forall i,\ x_0^i\sim q(x_0)$ (e.g. images from the training set; this distribution is also named $\hat p_{\text{data}}$). The forward process progressively adds Gaussian noise to these data points, until they are completely scrambled and become close to pure noise. The total process is run for a finite (but large) number of steps $T$, and we (almost) obtain $\forall i,\ x_T^i\sim\mathcal{N}(0,I)$.

[A] Global forward noising process.

Local view

The forward/noising process is defined as a Markov chain, where each step adds a bit of Gaussian noise to the previous step. As shown in Fig. B, the position at step $t$ depends only on the position at step $t-1$ (hence the Markov property). More precisely, the forward/noising process is defined as:

$$x_0\sim q(x_0)\qquad \forall t\in[1..T],\ x_t\sim q(x_t|x_{t-1})$$

Even more precisely:

  • we suppose the original dataset has been normalized,
  • we decide on a variance addition schedule $\beta_1,\dots,\beta_T$ saying how much noise variance to add at each step,
  • as we add noise at each step, the distribution would spread more and more as $t$ grows, and would not converge to a Gaussian noise with identity covariance,
  • to avoid this, we also rescale the signal at each step by a factor $\sqrt{1-\beta_t}$.

The goal of the rescaling is to ensure that the variance of the dataset at step $t$ is always 1, whatever $t$. Overall, the forward/noising process is defined as:

$$x_0\sim q(x_0)\qquad \forall t\in[1..T],\ x_t\sim q(x_t|x_{t-1})=\mathcal{N}\!\left(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
[B] One-step noising process.
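As an illustration, here is a minimal NumPy sketch of the one-step process; the linear $\beta$ schedule is one common choice (it is not prescribed by the derivations above):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # a simple linear schedule (one common choice)

def forward_step(x_prev, t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    b = betas[t - 1]  # t is 1-indexed, as in the text
    return np.sqrt(1.0 - b) * x_prev + np.sqrt(b) * rng.standard_normal(x_prev.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)  # stands for a (normalized) data point x_0
for t in range(1, T + 1):
    x = forward_step(x, t, rng)
# after T steps, x is (approximately) a draw from N(0, I)
```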

Big-jump view

Thanks to the properties of Gaussians, we can express the distribution at step $t$ as a function of the initial data point $x_0$. Indeed, since each step adds Gaussian noise (and rescales), the composition of all the steps is also a Gaussian distribution, as shown in Fig. C. More precisely, we have:

$$\forall t\in[1..T],\ x_t\sim q(x_t|x_0)=\mathcal{N}\!\left(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)I\right)$$

with $\bar\alpha_t\triangleq\prod_{s=1}^{t}\alpha_s$ the overall signal retained from the start to step $t$, in which $\alpha_t\triangleq1-\beta_t$ is the signal retention factor at step $t$.

[C] Multi-step noising process.
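The big jump translates into a one-line sampler; a sketch continuing the previous snippet (same assumed schedule):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # same schedule as in the previous sketch
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{s <= t} alpha_s

def big_jump(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * rng.standard_normal(x0.shape)
```

A single call to `big_jump(x0, t, rng)` has exactly the same distribution as iterating `forward_step` $t$ times, which is the point of this view: it makes $x_t$ cheap to sample at any $t$.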

Sandwich view

As this will become useful later, we can also express the distribution at step $t-1$ as a function of the position at step $t$ and the initial data point $x_0$. This is illustrated in Fig. D:

  • the two densities in blue correspond to the information on $x_{t-1}$ that we have thanks to $x_t$ (the peaky one) and $x_0$ (the wide one),
  • their product (in pink), once renormalized, gives us the distribution of $x_{t-1}$ knowing both $x_t$ and $x_0$.

More precisely, we have:

$$\forall t\in[2..T],\quad q(x_{t-1}|x_t,x_0)\ \propto\ q(x_{t-1}|x_t)\times q(x_{t-1}|x_0)$$

We will show that we can derive a closed-form expression for this distribution:

$$\forall t\in[2..T],\quad q(x_{t-1}|x_t,x_0)=\mathcal{N}\!\left(\tilde\mu,\tilde\beta I\right)\quad\text{with}\quad \tilde\mu(x_0,t,x_t)=\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}x_0+\frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t\qquad \tilde\beta(t)=\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$$

The proof relies on showing that $q(x_{t-1}|x_t)$ (the anti-forward step) is proportional to a Gaussian density, and then using the product-of-two-Gaussians property that we derived in the preliminaries.

Proof/Derivation

Knowing $x_0$, we can express the anti-forward step $q(x_{t-1}|x_t,x_0)$ using Bayes' rule. Remembering that this is a distribution over $x_{t-1}$, we can focus on the factors that depend on it (and thus drop $q(x_t|x_0)$ below):

$$\begin{aligned}
\forall t\in[2..T],\quad q(x_{t-1}|x_t,x_0)&=\frac{q(x_t|x_{t-1},x_0)\,q(x_{t-1}|x_0)}{q(x_t|x_0)}\\
&\propto q(x_t|x_{t-1})\,q(x_{t-1}|x_0)\\
&\propto \mathcal{N}\!\left(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)(x_t)\times q(x_{t-1}|x_0)\\
\left(\text{using } \mathcal{N}(ax,\sigma^2)(\mu)\propto\mathcal{N}\!\left(\tfrac{\mu}{a},\tfrac{\sigma^2}{a^2}\right)(x)\right)\quad
&\propto \mathcal{N}\!\left(\frac{x_t}{\sqrt{1-\beta_t}},\ \frac{\beta_t}{1-\beta_t}I\right)(x_{t-1})\times\mathcal{N}\!\left(\sqrt{\bar\alpha_{t-1}}\,x_0,\ (1-\bar\alpha_{t-1})I\right)(x_{t-1})\\
\text{(product of Gaussian densities)}\quad
&\propto \mathcal{N}\!\left(\tilde\mu,\tilde\beta\right)(x_{t-1})
\end{aligned}$$

with

$$\tilde\beta=\frac{\frac{\beta_t}{1-\beta_t}(1-\bar\alpha_{t-1})}{\frac{\beta_t}{1-\beta_t}+(1-\bar\alpha_{t-1})}
=\frac{(1-\bar\alpha_{t-1})\,\beta_t}{\beta_t+(1-\beta_t)(1-\bar\alpha_{t-1})}
=\frac{(1-\bar\alpha_{t-1})\,\beta_t}{1-\bar\alpha_t}$$

(for the last step, note that $\beta_t+(1-\beta_t)(1-\bar\alpha_{t-1})=1-\alpha_t\bar\alpha_{t-1}=1-\bar\alpha_t$) and

$$\tilde\mu=\frac{\frac{x_t}{\sqrt{1-\beta_t}}(1-\bar\alpha_{t-1})+\sqrt{\bar\alpha_{t-1}}\,x_0\,\frac{\beta_t}{1-\beta_t}}{\frac{\beta_t}{1-\beta_t}+(1-\bar\alpha_{t-1})}
=\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}x_0+\frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t$$

(multiplying numerator and denominator by $1-\beta_t$, as for $\tilde\beta$).
[D] Closed-form conditional backward step.
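The closed form translates directly into code; a sketch continuing the previous snippets:

```python
import numpy as np

# same schedule as in the previous sketches
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sandwich(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), for t >= 2 (1-indexed)."""
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    b, a = betas[t - 1], alphas[t - 1]
    mu = (np.sqrt(ab_prev) * b / (1.0 - ab_t)) * x0 \
       + (np.sqrt(a) * (1.0 - ab_prev) / (1.0 - ab_t)) * xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * b
    return mu, var
```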

Learnable backward process

Inspired by the forward/noising process, we define a backward/denoising process with parameters, so that it can be learned to reverse the forward/noising process. As shown in Fig. E, the backward/denoising process starts from pure random noise. It is also defined as a Markov chain, where each step $t-1$ depends only on the previous step $t$.

Said differently, the process is defined by the distributions $p_\theta(x_{t-1}|x_t)$ for $t\in[1..T]$, where $\theta$ are the learnable parameters of the model. The form of $p_\theta$ is actually a Gaussian distribution, with a mean predicted by a neural network and a variance that can be fixed or learned. Taking a fixed denoising variance schedule $\sigma_1,\dots,\sigma_T$, we finally write:

$$x_T\sim p(x_T)=\mathcal{N}(0,I)\qquad \forall t\in[1..T],\ x_{t-1}\sim p_\theta(x_{t-1}|x_t)=\mathcal{N}\!\left(\mu_\theta(x_t,t),\ \sigma_t^2 I\right)$$
[E] Learnable backward process.
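A sketch of the corresponding sampling loop; `mu_theta` stands for the (trained) mean network, $\sigma_t=\sqrt{\beta_t}$ is one common fixed choice, and skipping the noise on the last step is the usual convention:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # forward schedule, as before

def sample(mu_theta, shape, rng, sigmas=np.sqrt(betas)):
    """Run the backward chain from pure noise down to t = 0."""
    x = rng.standard_normal(shape)               # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        x = mu_theta(x, t)
        if t > 1:                                # common convention: no noise at t = 1
            x = x + sigmas[t - 1] * rng.standard_normal(shape)
    return x                                     # approximately a draw from q(x_0)
```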

Overall goal: how to guide learning?

The parameters $\theta$ of the backward/denoising process will be learned by fitting the backward steps to the forward steps. More precisely, we want to minimize the KL divergence between the distribution induced by the forward process, $q(x_{0:T})$, and the one induced by the backward process, $p_\theta(x_{0:T})$, as hinted in Fig. F.

[F] Fitting the forward process with the backward process.

From global to local KL

This part details the key insight of how to transform a global KL loss into a sum of local ones, eventually expressed as square errors.

Overview

We will show that we can simplify the global objective into a sum of local objectives, one for each step, with the proper conditioning. We will follow a few steps:

  • starting by writing the KL divergence between the two joint distributions of the Markov chains, in their natural direction (forward for the noising process, backward for the learned process),
  • re-introducing an expectation over $x_0$ to make the expression tractable,
  • “reversing” the conditional forward noising process,
  • using the sandwich view,
  • using the closed-form of the KL between Gaussians to get a final closed-form loss.

The final terms involved in the loss are KL divergences between two Gaussian distributions, which can be computed in closed form. These are illustrated in Fig. G.

Goal: KL divergence between the forward and backward processes

We decompose the global KL divergence between the forward noising process and the backward learnable process. A large part of these derivations is also detailed at the end of this page, in isolation from the rest.


$$\begin{aligned}
\mathrm{KL}&\left(q(x_{0:T})\,\|\,p_\theta(x_{0:T})\right)
=\mathbb{E}_{x_{0:T}\sim q(x_{0:T})}\left[\ln\left(\frac{q(x_{0:T})}{p_\theta(x_{0:T})}\right)\right]\\
&=\mathbb{E}_{x_{0:T}\sim q(x_{0:T})}\left[\ln\left(\frac{q(x_0)\prod_{t=1}^{T}q(x_t|x_{t-1})}{p_\theta(x_T)\prod_{t=1}^{T}p_\theta(x_{t-1}|x_t)}\right)\right]\\
&=\mathbb{E}_{x_{0:T}\sim q(x_{0:T})}\left[\ln q(x_0)-\ln p_\theta(x_T)+\sum_{t=1}^{T}\ln q(x_t|x_{t-1})-\sum_{t=1}^{T}\ln p_\theta(x_{t-1}|x_t)\right]\\
&\quad\text{(keeping only the relevant variables in each expectation; all expectations are under } q(\cdot)\text{)}\\
&=\underbrace{\mathbb{E}_{x_0}\left[\ln q(x_0)\right]-\mathbb{E}_{x_T}\left[\ln p_\theta(x_T)\right]}_{A_1}
+\sum_{t=1}^{T}\mathbb{E}_{x_{t-1},x_t}\left[\ln q(x_t|x_{t-1})\right]
-\underbrace{\sum_{t=1}^{T}\mathbb{E}_{x_{t-1},x_t}\left[\ln p_\theta(x_{t-1}|x_t)\right]}_{A_2}\\
&\quad\text{(using the independence of } x_t \text{ and } x_0 \text{ given } x_{t-1} \text{ to introduce a conditioning on } x_0\text{)}\\
&=A_1-\mathbb{E}_{x_0}A_2+\mathbb{E}_{x_0}\sum_{t=1}^{T}\mathbb{E}_{x_{t-1},x_t}\left[\ln q(x_t|x_{t-1},x_0)\right]\\
&\quad\text{(taking out the particular case } t=1\text{)}\\
&=A_1-\mathbb{E}_{x_0}A_2+\mathbb{E}_{x_0}\mathbb{E}_{x_1}\left[\ln q(x_1|x_0)\right]+\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_{t-1},x_t}\left[\ln q(x_t|x_{t-1},x_0)\right]\\
&\quad\text{(using Bayes' rule to reverse } q(x_t|x_{t-1},x_0)\text{)}\\
&=A_1-\mathbb{E}_{x_0}A_2+\mathbb{E}_{x_0}\mathbb{E}_{x_1}\left[\ln\frac{q(x_0|x_1)\,q(x_1)}{q(x_0)}\right]+\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_{t-1},x_t}\left[\ln\frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{q(x_{t-1}|x_0)}\right]\\
&\quad\text{(unpacking logs, expectations and } A_2\text{)}\\
&=A_1-\mathbb{E}_{x_0}\sum_{t=1}^{T}\mathbb{E}_{x_{t-1},x_t}\left[\ln p_\theta(x_{t-1}|x_t)\right]\\
&\qquad+\mathbb{E}_{x_0}\mathbb{E}_{x_1}\ln q(x_0|x_1)+\mathbb{E}_{x_1}\ln q(x_1)-\mathbb{E}_{x_0}\ln q(x_0)\\
&\qquad+\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_{t-1},x_t}\ln q(x_{t-1}|x_t,x_0)+\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_t}\ln q(x_t|x_0)-\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_{t-1}}\ln q(x_{t-1}|x_0)\\
&\quad\text{(splitting the first sum, grouping terms into KLs, telescoping the two last sums)}\\
&=A_1+\mathbb{E}_{x_1}\ln q(x_1)-\mathbb{E}_{x_0}\ln q(x_0)+\mathbb{E}_{x_0}\mathbb{E}_{x_T}\ln q(x_T|x_0)-\mathbb{E}_{x_0}\mathbb{E}_{x_1}\ln q(x_1|x_0)\\
&\qquad+\mathbb{E}_{x_1}\mathbb{E}_{x_0}\left[\ln\left(\frac{q(x_0|x_1)}{p_\theta(x_0|x_1)}\right)\right]+\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_t}\mathbb{E}_{x_{t-1}}\left[\ln\left(\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)}\right)\right]\\
&\quad\text{(NB: averaging over all } x_1 \text{ and } x_0 \text{ is the same as averaging over all } x_1\text{, then all } x_0 \text{ given } x_1\text{;}\\
&\quad\ \text{we can thus write e.g. } \mathbb{E}_{x_1}\mathbb{E}_{x_0}=\mathbb{E}_{x_1}\mathbb{E}_{x_0|x_1} \text{ so that it matches the numerator in the log)}\\
&=\mathbb{E}_{x_1}\ln q(x_1)-\mathbb{E}_{x_T}\ln p_\theta(x_T)-\mathbb{E}_{x_0}\mathbb{E}_{x_1}\ln q(x_1|x_0)+\mathbb{E}_{x_0}\mathbb{E}_{x_T}\ln q(x_T|x_0)\\
&\qquad+\mathbb{E}_{x_1}\,\mathrm{KL}\left(q(x_0|x_1)\,\|\,p_\theta(x_0|x_1)\right)+\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_t}\left[\mathrm{KL}\left(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\right)\right]\\
&\quad\text{(forming the last KL; the leftover terms } \mathbb{E}_{x_1}\ln q(x_1)-\mathbb{E}_{x_0}\mathbb{E}_{x_1}\ln q(x_1|x_0) \text{ do not depend on } \theta\\
&\quad\ \text{and are absorbed, together with the } t=T \text{ term, into the constant } C\text{)}\\
&=\underbrace{C}_{\text{constant in }\theta}+\ \mathbb{E}_{x_1}\,\mathrm{KL}\left(q(x_0|x_1)\,\|\,p_\theta(x_0|x_1)\right)+\mathbb{E}_{x_0}\sum_{t=2}^{T}\mathbb{E}_{x_t}\left[\mathrm{KL}\left(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\right)\right]\\
&\quad\text{with } C=\mathbb{E}_{x_0}\,\mathrm{KL}\left(q(x_T|x_0)\,\|\,p_\theta(x_T)\right)+\mathbb{E}_{x_1}\ln q(x_1)-\mathbb{E}_{x_0}\mathbb{E}_{x_1}\ln q(x_1|x_0)
\end{aligned}$$

All remaining terms are either constant in $\theta$ (gathered in $C$, including the $t=T$ term) or KL divergences between Gaussians with a known closed form (the others).

The empty sandwich, overriding $\tilde\mu$ (case of $t=1$)

The case of $t=1$ is a bit special: there, $x_{t-1}=x_0$, so there is no “sandwich” (the sandwich equation would give $q(x_0|x_1,x_0)$, which is a Dirac). The initial term directly involves $q(x_0|x_1)$, i.e., $q(x_0|x_1)=\mathcal{N}\!\left(\frac{1}{\sqrt{1-\beta_1}}x_1,\ \frac{\beta_1}{1-\beta_1}I\right)$.

To avoid handling this special case separately below, we override $\tilde\mu(x_0,t{=}1,x_t)\triangleq\frac{1}{\sqrt{1-\beta_1}}x_1$ and $\tilde\beta(t{=}1)\triangleq\frac{\beta_1}{1-\beta_1}$.

We thus get a closed-form expression for the loss, involving squared norms (coming from the KL between Gaussians).

$$\begin{aligned}
&=C+\mathbb{E}_{x_0\sim q(x_0)}\sum_{t=1}^{T}\mathbb{E}_{x_t}\,\lambda_t\left\|\tilde\mu(x_0,t,x_t)-\mu_\theta(x_t,t)\right\|^2\\
&=C+\sum_{t=1}^{T}\lambda_t\,\mathbb{E}_{x_0\sim q(x_0)}\mathbb{E}_{x_t\sim q(x_t)}\left\|\tilde\mu(x_0,t,x_t)-\mu_\theta(x_t,t)\right\|^2\\
&=C+\sum_{t=1}^{T}\lambda_t\,\mathbb{E}_{x_0\sim q(x_0)}\mathbb{E}_{x_t\sim q(x_t|x_0)}\left\|\tilde\mu(x_0,t,x_t)-\mu_\theta(x_t,t)\right\|^2
\end{aligned}$$

with $\lambda_t=\frac{1}{2\sigma_t^2}$, and where, as above with $x_1$ and $x_0$, $x_t$ can equivalently be sampled either marginally ($x_t\sim q(x_t)$) or conditioned on $x_0$ ($x_t\sim q(x_t|x_0)$).

Overall, thanks to all these derivations, we obtained a loss that is simple: it is conditioned on the data point $x_0$ and is local in time (a sum over $t$).

[G] Temporal locality of the learning process.
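To make this temporal locality concrete, here is a sketch of one Monte-Carlo term of the loss, reusing the `big_jump` and `sandwich` helpers from the earlier snippets (with the assumed choice $\sigma_t^2=\beta_t$, and $t\ge2$ so that the sandwich form applies):

```python
import numpy as np

# reuses T, betas, big_jump and sandwich from the sketches above
def mu_matching_loss(mu_theta, x0, t, rng):
    """One Monte-Carlo term of the loss, for a given x0 and t >= 2,
    with lambda_t = 1 / (2 sigma_t^2) and the choice sigma_t^2 = beta_t."""
    xt = big_jump(x0, t, rng)           # x_t ~ q(x_t | x_0)
    mu_tilde, _ = sandwich(x0, xt, t)   # target mean of q(x_{t-1} | x_t, x_0)
    lam = 1.0 / (2.0 * betas[t - 1])
    return lam * np.sum((mu_tilde - mu_theta(xt, t)) ** 2)
```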
All-in-one visual
[H] All together.

What to fit?

The above derivations reason in terms of fitting the means of the backward steps $\mu_\theta(x_t,t)$.

Based on the “big-jump view” of the forward process and the Gaussian properties, we can reparametrize the sampling of $x_t$ (conditioned on $x_0$) as a function of $x_0$ and some noise $\varepsilon$:

$$x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\varepsilon\qquad \varepsilon\sim\mathcal{N}(0,I)$$

Which, once reversed, gives:

$$x_0=\frac{1}{\sqrt{\bar\alpha_t}}x_t-\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\varepsilon$$

We can plug this expression into $\tilde\mu$ to make it depend only on $x_t$ and $\varepsilon$, not on $x_0$:

$$\begin{aligned}
\tilde\mu(x_0,t,x_t)&=\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\left(\frac{1}{\sqrt{\bar\alpha_t}}x_t-\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\varepsilon\right)+\frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t\\
&=\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{\sqrt{\bar\alpha_t}\,(1-\bar\alpha_t)}x_t+\frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t-\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\varepsilon\\
&=\left(\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{\sqrt{\bar\alpha_t}\,(1-\bar\alpha_t)}+\frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\right)x_t-\frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\varepsilon\\
&\triangleq K_{x_t}(t)\,x_t+K_\varepsilon(t)\,\varepsilon
\end{aligned}$$

The constants can be simplified further, but for now, let’s look at the implications. Using the sampling of $x_t$ reparametrized with $\varepsilon$, and substituting the value of $\tilde\mu$ we just derived, we can rewrite the loss as:

$$\begin{aligned}
&=C+\sum_{t=1}^{T}\lambda_t\,\mathbb{E}_{x_0\sim q(x_0)}\mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\left\|K_{x_t}(t)\,x_t+K_\varepsilon(t)\,\varepsilon-\mu_\theta(x_t,t)\right\|^2\\
&=C+\sum_{t=1}^{T}\lambda_t\,\mathbb{E}_{x_0\sim q(x_0)}\mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\left\|K_\varepsilon(t)\left(\varepsilon-\frac{\mu_\theta(x_t,t)-K_{x_t}(t)\,x_t}{K_\varepsilon(t)}\right)\right\|^2\\
&=C+\sum_{t=1}^{T}\left(\lambda_t K_\varepsilon^2(t)\right)\mathbb{E}_{x_0\sim q(x_0)}\mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\left\|\varepsilon-\frac{\mu_\theta(x_t,t)-K_{x_t}(t)\,x_t}{K_\varepsilon(t)}\right\|^2
\end{aligned}$$

So, up to the time reweighting ($\gamma_t\triangleq\lambda_t K_\varepsilon^2(t)$ instead of $\lambda_t$), and since $\mu_\theta$ takes $x_t$ and $t$ as input, we can equivalently train a network to predict the noise $\varepsilon$ that was used to generate $x_t$ from $x_0$, instead of predicting $\tilde\mu$. We can thus define a network $\varepsilon_\theta(x_t,t)$ and train it to minimize the loss:

$$=C+\sum_{t=1}^{T}\gamma_t\,\mathbb{E}_{x_0\sim q(x_0)}\mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\left\|\varepsilon-\varepsilon_\theta(x_t,t)\right\|^2$$
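In code, this is the usual training objective; a sketch where the weighting $\gamma_t$ is dropped (i.e., set to 1), which corresponds to the commonly used “simplified” DDPM loss, and `eps_theta` is the noise-prediction network:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_matching_loss(eps_theta, x0_batch, rng):
    """Epsilon-prediction objective on a batch, with gamma_t dropped (set to 1)."""
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)                       # uniform time steps
    ab = alpha_bars[t - 1].reshape(n, *([1] * (x0_batch.ndim - 1)))
    eps = rng.standard_normal(x0_batch.shape)
    xt = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps    # reparametrized x_t
    return np.mean((eps - eps_theta(xt, t)) ** 2)
```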

With the two-way mapping between 𝜇𝜃 and 𝜀𝜃 being:

$$\mu_\theta(x_t,t)=K_{x_t}(t)\,x_t+K_\varepsilon(t)\,\varepsilon_\theta(x_t,t)\qquad \varepsilon_\theta(x_t,t)=\frac{\mu_\theta(x_t,t)-K_{x_t}(t)\,x_t}{K_\varepsilon(t)}$$
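In code, this mapping lets us reuse the earlier sampling loop with a noise-prediction network; one can check that the constants simplify to $K_{x_t}(t)=\frac{1}{\sqrt{\alpha_t}}$ and $K_\varepsilon(t)=-\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar\alpha_t}}$, the refinement alluded to above. A sketch:

```python
import numpy as np

# reuses T, betas, alphas, alpha_bars from the sketches above
def make_mu_theta(eps_theta):
    """Wrap a noise-prediction network into the mean form used by the sampler."""
    def mu_theta(xt, t):
        a, ab, b = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
        # K_xt(t) x_t + K_eps(t) eps_theta, with the simplified constants:
        #   K_xt(t)  = 1 / sqrt(alpha_t)
        #   K_eps(t) = -beta_t / (sqrt(alpha_t) * sqrt(1 - alpha_bar_t))
        return (xt - b / np.sqrt(1.0 - ab) * eps_theta(xt, t)) / np.sqrt(a)
    return mu_theta
```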

Looking at the training algorithms, we can uncover the link with flow matching. Conceptually, both sample a time step (although with different semantics), a data point, and a unit noise. However, they differ in the path, i.e., the formula for $x_t$:

  • diffusion aims at preserving the variance across time,
  • flow matching (in its typical form) aims at a linear interpolation between data and noise.

We can however instantiate flow matching so that it matches the diffusion path. The similarities/differences then reduce to the time weighting and the quantity being fitted. The details are left out for now.
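For concreteness, a sketch of the two path constructions side by side (the flow-matching form is the typical linear one mentioned above):

```python
import numpy as np

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def diffusion_point(x0, eps, t):
    """Diffusion path: variance-preserving mixing of data and noise (t in 1..T)."""
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

def flow_matching_point(x0, eps, s):
    """Typical flow-matching path: linear interpolation (s in [0, 1])."""
    return (1.0 - s) * x0 + s * eps
```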

Aside: Bayes' rule over a Markov chain

Even though the forward/noising process is defined forward in time, we can also express it in the opposite direction. This corresponds to a different decomposition of the joint distribution of the Markov chain $q(x_0,\dots,x_T)$:

$$q(x_{0:T})=q(x_0)\times\prod_{t=1}^{T}q(x_t|x_{t-1})\ \overset{\text{(anti-forward)}}{=}\ q(x_T)\times\prod_{t=1}^{T}q(x_{t-1}|x_t)$$
A different view / proof (and keeping a condition on $x_0$)

We can develop the Markov chain and apply Bayes' rule at each step.

$$\begin{aligned}
\ln q(x_{0:T})&=\ln\left(q(x_0)\times\prod_{t=1}^{T}q(x_t|x_{t-1})\right)=\ln q(x_0)+\sum_{t=1}^{T}\ln q(x_t|x_{t-1})\\
&=\ln q(x_0)+\sum_{t=1}^{T}\ln\left(\frac{q(x_{t-1}|x_t)\,q(x_t)}{q(x_{t-1})}\right)\\
&=\ln q(x_0)+\sum_{t=1}^{T}\ln q(x_{t-1}|x_t)+\sum_{t=1}^{T}\ln q(x_t)-\sum_{t=1}^{T}\ln q(x_{t-1})\\
&=\ln q(x_0)+\sum_{t=1}^{T}\ln q(x_{t-1}|x_t)+\sum_{t=1}^{T}\ln q(x_t)-\sum_{t=0}^{T-1}\ln q(x_t)\\
&=\ln q(x_0)+\sum_{t=1}^{T}\ln q(x_{t-1}|x_t)+\ln q(x_T)-\ln q(x_0)\\
&=\sum_{t=1}^{T}\ln q(x_{t-1}|x_t)+\ln q(x_T)
\end{aligned}$$

We can also introduce a conditioning on $x_0$, thanks to the independence of $x_t$ from $x_0$ when $x_{t-1}$ is known (Markov property).

$$\begin{aligned}
\ln q(x_{0:T})&=\ln\left(q(x_0)\times\prod_{t=1}^{T}q(x_t|x_{t-1})\right)=\ln q(x_0)+\sum_{t=1}^{T}\ln q(x_t|x_{t-1})\\
&=\underbrace{\ln q(x_0)+\ln q(x_1|x_0)}_{A}+\sum_{t=2}^{T}\ln q(x_t|x_{t-1})\\
&=A+\sum_{t=2}^{T}\ln q(x_t|x_{t-1},x_0)\\
&=A+\sum_{t=2}^{T}\ln\left(\frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{q(x_{t-1}|x_0)}\right)\\
&=A+\sum_{t=2}^{T}\ln q(x_{t-1}|x_t,x_0)+\sum_{t=2}^{T}\ln q(x_t|x_0)-\sum_{t=2}^{T}\ln q(x_{t-1}|x_0)\\
&=A+\sum_{t=2}^{T}\ln q(x_{t-1}|x_t,x_0)+\sum_{t=2}^{T}\ln q(x_t|x_0)-\sum_{t=1}^{T-1}\ln q(x_t|x_0)\\
&=A+\sum_{t=2}^{T}\ln q(x_{t-1}|x_t,x_0)+\ln q(x_T|x_0)-\ln q(x_1|x_0)\\
&=B+\sum_{t=2}^{T}\ln q(x_{t-1}|x_t,x_0)\\
\text{with}\quad B&=A+\ln q(x_T|x_0)-\ln q(x_1|x_0)=\ln q(x_0)+\ln q(x_T|x_0)
\end{aligned}$$