Variational Autoencoders (VAE)


In this lecture, we will derive the core principles behind Variational Autoencoders (VAEs). The goal is to learn a generative model through: i) a low-dimensional latent space, and ii) a decoder that maps from the latent space to the data space.

Autoencoders

Idea. The core idea of vanilla autoencoders is to jointly learn an encoder $e_\varphi$ and a decoder $d_\theta$. The encoder maps a data point $x \in \mathbb{R}^d$ to a low-dimensional latent representation $e_\varphi(x)$, and the decoder maps this latent representation back to the data space $\mathbb{R}^d$, producing $d_\theta(e_\varphi(x))$. The parameters of the encoder and the decoder are learned by minimizing the error between a data point $x$ and its reconstruction $d_\theta(e_\varphi(x))$:

$$\mathcal{L}(\varphi, \theta) = \sum_{x \in \text{data}} \| \underbrace{x}_{\mathrm{sample}} - \underbrace{d_\theta(e_\varphi(x))}_{\substack{\mathrm{sample} \\ \mathrm{reconstruction}}} \|^2 \enspace.$$
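As a concrete illustration, here is a minimal autoencoder sketch in PyTorch; the layer sizes, the latent dimension, and the training step are illustrative assumptions rather than the lecture's prescribed setup:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): data in R^784, latent space in R^16.
D, K = 784, 16

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, K))   # e_phi
decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, D))   # d_theta
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def train_step(x):                                # x: tensor of shape (batch, D)
    z = encoder(x)                                # latent representation e_phi(x)
    x_hat = decoder(z)                            # reconstruction d_theta(e_phi(x))
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()   # squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage on random data standing in for a real dataset:
train_step(torch.randn(32, D))
```

The per-sample squared error summed over the batch is exactly the reconstruction loss $\mathcal{L}(\varphi, \theta)$ above, averaged over the mini-batch.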

A Probabilistic Interpretation of Autoencoders

As in most cases where a squared error is minimized, we can interpret the decoder as a Gaussian likelihood model:

$$p_\theta(x|z) = \underbrace{\mathcal{N}(x; d_\theta(z), I)}_{\substack{\text{density of a Gaussian variable with mean } d_\theta(z) \\ \text{ and identity covariance, evaluated at point } x}} \enspace.$$
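As a hedged sketch (the decoder and tensors below are placeholders, not the lecture's model), this likelihood can be evaluated with torch.distributions by treating the decoder output as the mean of an isotropic, unit-variance Gaussian:

```python
import torch
from torch.distributions import Normal, Independent

D, K = 784, 16                       # illustrative data / latent dimensions (assumptions)
decoder = torch.nn.Linear(K, D)      # placeholder standing in for d_theta

z = torch.randn(5, K)                # a batch of latent codes
x = torch.randn(5, D)                # a batch of data points

# p_theta(x | z) = N(x; d_theta(z), I): unit-variance Gaussian centered at the decoder output.
p_x_given_z = Independent(Normal(loc=decoder(z), scale=1.0), 1)
log_lik = p_x_given_z.log_prob(x)    # shape (5,): one log-density per data point
```

Up to the additive constant $-\frac{D}{2}\log(2\pi)$, `log_lik` equals $-\frac{1}{2}\|x - d_\theta(z)\|^2$, which is the connection made precise below.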

In other words, the decoder predicts the mean of a Gaussian distribution, and minimizing the squared reconstruction error is equivalent to maximizing the log-likelihood:

$$\arg\min_{\varphi, \theta} \sum_{x \in \text{data}} \| x - d_\theta(e_\varphi(x)) \|^2 = \arg\max_{\varphi, \theta} \sum_{x \in \text{data}} \log p_\theta(x | z = e_\varphi(x))$$

“Proof” of the equivalence

We start with the log-likelihood:

$$\begin{aligned} \log p_\theta(x|z) &= \log \mathcal{N}(x; d_\theta(z), I) \\ &= \log \left( (2\pi)^{-\frac{D}{2}} \exp\left( -\frac{1}{2} \| x - d_\theta(z) \|^2 \right) \right) \\ &= -\frac{D}{2} \log(2\pi) - \frac{1}{2} \| x - d_\theta(z) \|^2 \end{aligned}$$

where $D$ is the dimensionality of $x$. Collecting the parameter-independent constant into a single term $K$, we have:

$$\log p_\theta(x|z) = -\frac{1}{2} \| x - d_\theta(z) \|^2 + K, \qquad \text{with } K = -\frac{D}{2} \log(2\pi).$$

Since $K$ does not depend on $(\varphi, \theta)$ and the factor $\frac{1}{2}$ is positive, maximizing the log-likelihood is equivalent to minimizing the squared error, i.e.:

$$\arg\min_{\varphi, \theta} \sum_{x \in \text{data}} \| x - d_\theta(e_\varphi(x)) \|^2 = \arg\max_{\varphi, \theta} \sum_{x \in \text{data}} \log p_\theta(x | z = e_\varphi(x))$$
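As a quick sanity check of the closed form above, the following sketch compares SciPy's multivariate normal log-density with $-\frac{D}{2}\log(2\pi) - \frac{1}{2}\|x - \mu\|^2$ on random vectors; the dimensions and vectors are arbitrary placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 8                                   # illustrative dimensionality (assumption)
x = rng.normal(size=D)                  # a "data point"
mu = rng.normal(size=D)                 # stands in for the decoder output d_theta(z)

log_lik = multivariate_normal.logpdf(x, mean=mu, cov=np.eye(D))
closed_form = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

assert np.allclose(log_lik, closed_form)   # the derivation above checks out numerically
```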

Overview of Variational Autoencoders (VAEs)

In Variational Autoencoders (VAEs), the position $z_i$ in the latent space (for a data point $x_i$) is treated as a random variable. Indeed, there is some uncertainty about the exact position of $z_i$ that best explains $x_i$, especially since all points are considered jointly. In a VAE, one thus manipulates, for each point $x_i$, a distribution over its latent position $z_i$ that aims at best approximating the true (unknown and intractable) posterior distribution $p_\theta(z_i|X)$ of this latent variable given the whole training set.

We will now decompose the construction of the VAE, starting with formulations that involve only the decoder. The encoder will be introduced later as a trick (namely, amortization).

MAP Estimation of the Latent Variables

We are interested in estimating both the decoder parameters $\theta$ and the latent variables $z_i$ for each data point $x_i$.

One could think of optimizing, over $Z = \{z_i\}_{i=1}^N$ (all the latent positions), the log-likelihood of the data:

$$\log p_\theta(X|Z) = \sum_{i=1}^N \log p_\theta(x_i|z_i)$$

Decomposable loss from the i.i.d. assumption

Assuming that the data points are i.i.d. given the latent variables, we have:

$$p_\theta(X|Z) = \prod_{i=1}^N p_\theta(x_i|z_i)$$

and hence:

$$\log p_\theta(X|Z) = \log \prod_{i=1}^N p_\theta(x_i|z_i) = \sum_{i=1}^N \log p_\theta(x_i|z_i)$$

One could instead think of maximizing the joint log-likelihood of the data and the latent variables:

$$\log p_\theta(X, Z) = \sum_{i=1}^N \left( \log p_\theta(x_i|z_i) + \log p(z_i) \right)$$

where $p(z_i)$ is a prior distribution on the latent variables (e.g., a standard Gaussian). We could then maximize this joint log-likelihood w.r.t. both $\theta$ and $Z = \{z_i\}_{i=1}^N$. However, this would lead to overfitting, as we could always increase the likelihood by increasing the capacity of the decoder and setting $z_i$ to arbitrary values.
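To make this objective concrete, here is a hedged sketch in PyTorch that optimizes free per-point latent vectors $z_i$ jointly with the decoder parameters $\theta$ under a standard Gaussian prior; the architecture, data, and hyperparameters are placeholders, and the overfitting concern above applies to exactly this setup:

```python
import torch
import torch.nn as nn

# Illustrative setup (assumptions): N data points in R^D, latent dimension K.
N, D, K = 1000, 784, 16
X = torch.randn(N, D)                       # placeholder data standing in for the training set

decoder = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, D))  # d_theta
Z = nn.Parameter(torch.zeros(N, K))         # one free latent vector z_i per data point

opt = torch.optim.Adam(list(decoder.parameters()) + [Z], lr=1e-3)

for step in range(100):
    X_hat = decoder(Z)
    # -log p_theta(x_i | z_i), up to constants: squared reconstruction error
    recon = 0.5 * ((X - X_hat) ** 2).sum(dim=1)
    # -log p(z_i) for a standard Gaussian prior, up to constants
    prior = 0.5 * (Z ** 2).sum(dim=1)
    loss = (recon + prior).mean()           # negative joint log-likelihood (MAP objective)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Here the decomposed sum over data points from the i.i.d. assumption appears directly as the per-row terms `recon + prior`; in practice one would mini-batch over data points, but full-batch steps keep the sketch short.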

Variational Inference

Reparameterization Trick

Doubly Stochastic Variational Inference

Prior and latent space misconception