Variational Autoencoders (VAE)


In this lecture, we will derive the core principles behind Variational Autoencoders (VAEs). The goal is to learn a generative model through: i) a low-dimensional latent space, and ii) a decoder that maps from the latent space to the data space.

Autoencoders

Idea. The core idea of vanilla autoencoders is to jointly learn an encoder $e_\varphi$ and a decoder $d_\theta$. The encoder maps a data point $x \in \mathbb{R}^d$ to a low-dimensional latent representation $e_\varphi(x)$, and the decoder maps this latent representation back to the data space $\mathbb{R}^d$, producing $d_\theta(e_\varphi(x))$. The parameters of the encoder and the decoder are learned by minimizing the error between a data point $x$ and its reconstruction $d_\theta(e_\varphi(x))$:

$$\mathcal{L}(\varphi, \theta) = \sum_{x \in \text{data}} \| \underbrace{x}_{\mathrm{sample}} - \underbrace{d_\theta(e_\varphi(x))}_{\substack{\mathrm{sample} \\ \mathrm{reconstruction}}} \|^2 \enspace.$$
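As a concrete illustration, here is a minimal autoencoder sketch in PyTorch; the layer sizes, the latent dimension, and the training step are illustrative assumptions rather than the lecture's prescribed setup:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): data in R^784, latent space in R^16.
D, K = 784, 16

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, K))   # e_phi
decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, D))   # d_theta
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x):                                # x: tensor of shape (batch, D)
    z = encoder(x)                                # latent representation e_phi(x)
    x_hat = decoder(z)                            # reconstruction d_theta(e_phi(x))
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()   # squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage on random data standing in for a real dataset:
train_step(torch.randn(32, D))
```

The per-sample squared error summed over the batch is exactly the reconstruction loss $\mathcal{L}(\varphi, \theta)$ above, averaged over the mini-batch.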

A Probabilistic Interpretation of Autoencoders

As in most cases where a squared error is minimized, we can interpret the decoder as a Gaussian likelihood model:

$$p_\theta(x|z) = \underbrace{\mathcal{N}(x; d_\theta(z), I)}_{\substack{\text{density of a Gaussian variable with mean } d_\theta(z) \\ \text{ and identity covariance, evaluated at point } x}} \enspace.$$
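As a hedged sketch (the decoder and tensors below are placeholders, not the lecture's model), this likelihood can be evaluated with torch.distributions by treating the decoder output as the mean of an isotropic, unit-variance Gaussian:

```python
import torch
from torch.distributions import Normal, Independent

D, K = 784, 16                       # illustrative data / latent dimensions (assumptions)
decoder = torch.nn.Linear(K, D)      # placeholder standing in for d_theta

z = torch.randn(5, K)                # a batch of latent codes
x = torch.randn(5, D)                # a batch of data points

# p_theta(x | z) = N(x; d_theta(z), I): unit-variance Gaussian centered at the decoder output.
p_x_given_z = Independent(Normal(loc=decoder(z), scale=1.0), 1)
log_lik = p_x_given_z.log_prob(x)    # shape (5,): one log-density per data point
```

Up to the additive constant $-\frac{D}{2}\log(2\pi)$, `log_lik` equals $-\frac{1}{2}\|x - d_\theta(z)\|^2$, which is the connection made precise below.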

In other words, the decoder predicts the mean of a Gaussian distribution, and minimizing the squared reconstruction error is equivalent to maximizing the log-likelihood:

$$\arg\min_{\varphi, \theta} \sum_{x \in \text{data}} \| x - d_\theta(e_\varphi(x)) \|^2 = \arg\max_{\varphi, \theta} \sum_{x \in \text{data}} \log p_\theta(x | z = e_\varphi(x))$$

“Proof” of the equivalence

We start with the log-likelihood:

$$\begin{aligned} \log p_\theta(x|z) &= \log \mathcal{N}(x; d_\theta(z), I) \\ &= \log \left( (2\pi)^{-\frac{D}{2}} \exp\left( -\frac{1}{2} \| x - d_\theta(z) \|^2 \right) \right) \\ &= -\frac{D}{2} \log(2\pi) - \frac{1}{2} \| x - d_\theta(z) \|^2 \end{aligned}$$

where $D$ is the dimensionality of $x$. Collecting the parameter-independent constant into a single term $K$, we have:

$$\log p_\theta(x|z) = -\frac{1}{2} \| x - d_\theta(z) \|^2 + K, \qquad \text{with } K = -\frac{D}{2} \log(2\pi).$$

Since $K$ does not depend on $(\varphi, \theta)$ and the factor $\frac{1}{2}$ is positive, maximizing the log-likelihood is equivalent to minimizing the squared error, i.e.:

$$\arg\min_{\varphi, \theta} \sum_{x \in \text{data}} \| x - d_\theta(e_\varphi(x)) \|^2 = \arg\max_{\varphi, \theta} \sum_{x \in \text{data}} \log p_\theta(x | z = e_\varphi(x))$$
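As a quick sanity check of the closed form above, the following sketch compares SciPy's multivariate normal log-density with $-\frac{D}{2}\log(2\pi) - \frac{1}{2}\|x - \mu\|^2$ on random vectors; the dimensions and vectors are arbitrary placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 8                                   # illustrative dimensionality (assumption)
x = rng.normal(size=D)                  # a "data point"
mu = rng.normal(size=D)                 # stands in for the decoder output d_theta(z)

log_lik = multivariate_normal.logpdf(x, mean=mu, cov=np.eye(D))
closed_form = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

assert np.allclose(log_lik, closed_form)   # the derivation above checks out numerically
```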

Overview of Variational Autoencoders (VAEs)

In Variational Autoencoders (VAEs), the position $z_i$ in the latent space (for a data point $x_i$) is treated as a random variable. Indeed, there is some uncertainty about the exact position of $z_i$ that best explains $x_i$, especially since all points are considered jointly. In a VAE, one thus manipulates, for each point $x_i$, a distribution over its latent position $z_i$ that aims at best approximating the true (unknown and intractable) posterior distribution $p_\theta(z_i|X)$ of this latent variable given the whole training set.

We will now decompose the construction of the VAE, starting with formulations that involve only the decoder. The encoder will be introduced later as a trick (namely, amortization).

MAP Estimation of the Latent Variables

We are interested in estimating both the decoder parameters $\theta$ and the latent variables $z_i$ for each data point $x_i$.

One could think of optimizing, over $Z = \{z_i\}_{i=1}^N$ (all the latent positions), the log-likelihood of the data:

$$\log p_\theta(X|Z) = \sum_{i=1}^N \log p_\theta(x_i|z_i)$$

Decomposable loss from the i.i.d. assumption

Assuming that the data points are i.i.d. given the latent variables, we have:

$$p_\theta(X|Z) = \prod_{i=1}^N p_\theta(x_i|z_i)$$

and hence:

$$\log p_\theta(X|Z) = \log \prod_{i=1}^N p_\theta(x_i|z_i) = \sum_{i=1}^N \log p_\theta(x_i|z_i)$$

One could instead think of maximizing the joint log-likelihood of the data and the latent variables:

$$\log p_\theta(X, Z) = \sum_{i=1}^N \left( \log p_\theta(x_i|z_i) + \log p(z_i) \right)$$

where $p(z_i)$ is a prior distribution on the latent variables (e.g., a standard Gaussian). We could then maximize this joint log-likelihood w.r.t. both $\theta$ and $Z = \{z_i\}_{i=1}^N$. However, this would lead to overfitting, as we could always increase the likelihood by increasing the capacity of the decoder and setting $z_i$ to arbitrary values.
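To make this objective concrete, here is a hedged sketch in PyTorch that optimizes free per-point latent vectors $z_i$ jointly with the decoder parameters $\theta$ under a standard Gaussian prior; the architecture, data, and hyperparameters are placeholders, and the overfitting concern above applies to exactly this setup:

```python
import torch
import torch.nn as nn

# Illustrative setup (assumptions): N data points in R^D, latent dimension K.
N, D, K = 1000, 784, 16
X = torch.randn(N, D)                       # placeholder data standing in for the training set

decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, D))  # d_theta
Z = nn.Parameter(torch.zeros(N, K))         # one free latent vector z_i per data point

opt = torch.optim.Adam(list(decoder.parameters()) + [Z], lr=1e-3)

for step in range(100):
    X_hat = decoder(Z)
    # -log p_theta(x_i | z_i), up to constants: squared reconstruction error
    recon = 0.5 * ((X - X_hat) ** 2).sum(dim=1)
    # -log p(z_i) for a standard Gaussian prior, up to constants
    prior = 0.5 * (Z ** 2).sum(dim=1)
    loss = (recon + prior).mean()           # negative joint log-likelihood (MAP objective)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Here the decomposed sum over data points from the i.i.d. assumption appears directly as the per-row terms `recon + prior`; in practice one would mini-batch over data points, but full-batch steps keep the sketch short.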

Variational Inference

Reparameterization Trick

Doubly Stochastic Variational Inference

Prior and latent space misconception