In this lecture, we will derive the core principles behind Variational Autoencoders (VAEs).
The goal is to learn a generative model through: i) a low-dimensional latent space, ii) a decoder that maps from the latent space to the data space.
Autoencoders
Idea. The core idea of vanilla autoencoders is to jointly learn an encoder $e_\phi$ and a decoder $d_\theta$. The encoder maps a data point $x \in \mathbb{R}^d$ to a low-dimensional latent representation $e_\phi(x)$, and the decoder maps this latent representation back to the data space $\mathbb{R}^d$, producing the reconstruction $d_\theta(e_\phi(x))$.
The parameters $\phi$ of the encoder and $\theta$ of the decoder are learned jointly by minimizing the error between each data point $x$ and its reconstruction $d_\theta(e_\phi(x))$.
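The following is a minimal sketch of this joint minimization, not the lecture's own implementation. It assumes a squared Euclidean reconstruction error, a linear encoder and a linear decoder, and plain gradient descent; the names `reconstruction_loss` and `gradient_step`, the dimensions, and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np

def reconstruction_loss(theta, phi, X):
    """Mean over data points of ||x - d_theta(e_phi(x))||^2 (linear maps, assumed)."""
    Z = X @ phi.T          # e_phi(x) for every row x of X, shape (N, k)
    X_hat = Z @ theta.T    # d_theta(e_phi(x)), shape (N, d)
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

def gradient_step(theta, phi, X, lr=0.01):
    """One joint gradient-descent step on the decoder (theta) and encoder (phi)."""
    N = X.shape[0]
    Z = X @ phi.T
    R = Z @ theta.T - X                          # residuals, shape (N, d)
    grad_theta = (2.0 / N) * R.T @ Z             # d loss / d theta, shape (d, k)
    grad_phi = (2.0 / N) * (R @ theta).T @ X     # d loss / d phi, shape (k, d)
    return theta - lr * grad_theta, phi - lr * grad_phi

# Synthetic data: N points in R^d, compressed to a k-dimensional latent space.
rng = np.random.default_rng(0)
N, d, k = 200, 5, 2
X = rng.normal(size=(N, d))
phi = 0.1 * rng.normal(size=(k, d))     # encoder parameters
theta = 0.1 * rng.normal(size=(d, k))   # decoder parameters

loss_before = reconstruction_loss(theta, phi, X)
for _ in range(500):
    theta, phi = gradient_step(theta, phi, X)
loss_after = reconstruction_loss(theta, phi, X)
```

Comparing `loss_before` and `loss_after` shows the reconstruction error going down as both sets of parameters are updated together. In practice the linear maps would be replaced by neural networks and the hand-written gradients by automatic differentiation.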
In Variational Autoencoders (VAEs), the position $z_i$ in the latent space of a data point $x_i$ is treated as a random variable.
Indeed, there is some uncertainty about the exact position $z_i$ that best explains $x_i$, especially since all points are considered jointly.
In a VAE, one thus manipulates, for each point $x_i$, a distribution over its latent position $z_i$, which aims to approximate the true (unknown and intractable) posterior distribution $p_\theta(z_i \mid X)$ of this latent variable given the whole training set.
We will now decompose the construction of the VAE, starting with formulations that have only the decoder.
The encoder will be introduced later as a trick (i.e., amortization).
MAP Estimation of the Latent Variables
We are interested in estimating both the decoder parameters $\theta$ and the latent variable $z_i$ of each data point $x_i$.
One could think of optimizing the log-likelihood of the data over $Z = \{z_i\}_{i=1}^N$ (all the latent positions):
$$\log p_\theta(X \mid Z) = \underbrace{\sum_{i=1}^N \log p_\theta(x_i \mid z_i)}_{\text{decomposable loss from the i.i.d. assumption}}$$
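To make the decomposition concrete, here is a small sketch that evaluates this log-likelihood as a sum of per-point terms. The likelihood model is an assumption made for illustration only: a linear decoder with isotropic Gaussian noise, $p_\theta(x_i \mid z_i) = \mathcal{N}(x_i; \theta z_i, \sigma^2 I)$. The function names and the value of `sigma` are hypothetical.

```python
import numpy as np

def log_lik_per_point(theta, X, Z, sigma=1.0):
    """log p_theta(x_i | z_i) for each i, under the assumed N(x_i; theta z_i, sigma^2 I)."""
    d = X.shape[1]
    mean = Z @ theta.T                          # decoder output for each z_i, shape (N, d)
    sq_err = np.sum((X - mean) ** 2, axis=1)    # ||x_i - d_theta(z_i)||^2 for each i
    return -0.5 * sq_err / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def log_lik(theta, X, Z, sigma=1.0):
    """log p_theta(X | Z): the sum of the per-point terms."""
    return float(np.sum(log_lik_per_point(theta, X, Z, sigma)))

# Synthetic data generated from the assumed model (small observation noise).
rng = np.random.default_rng(0)
N, d, k = 50, 4, 2
theta = rng.normal(size=(d, k))
Z = rng.normal(size=(N, k))
X = Z @ theta.T + 0.1 * rng.normal(size=(N, d))

total = log_lik(theta, X, Z)
# Because the loss decomposes over data points, disjoint subsets give partial sums.
first_half = log_lik(theta, X[:25], Z[:25])
second_half = log_lik(theta, X[25:], Z[25:])
```

Here `first_half + second_half` matches `total` up to floating-point error, which is the property that allows the objective to be estimated from mini-batches.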
Assuming that the data points are i.i.d. given the latent variables, we have:
$$\log p_\theta(X, Z) = \sum_{i=1}^N \left[ \log p_\theta(x_i \mid z_i) + \log p(z_i) \right]$$
where $p(z_i)$ is a prior distribution on the latent variables (e.g., a standard Gaussian).
We could then maximize this joint log-likelihood w.r.t. both $\theta$ and $Z = \{z_i\}_{i=1}^N$.
However, this would lead to overfitting, as we could always increase the likelihood by increasing the capacity of the decoder and setting $z_i$ to arbitrary values.
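As a worked instance of the MAP objective, suppose a Gaussian decoder likelihood $p_\theta(x_i \mid z_i) = \mathcal{N}(x_i; d_\theta(z_i), \sigma^2 I_d)$ with a fixed noise level $\sigma$, and a standard Gaussian prior $p(z_i) = \mathcal{N}(0, I_k)$, where $k$ denotes the latent dimension. These modeling choices, and the symbols $\sigma$ and $k$, are assumptions made for illustration. Summing the log-likelihood and log-prior terms over the data points then gives:

$$
\log p_\theta(X, Z) = -\frac{1}{2\sigma^2} \sum_{i=1}^N \| x_i - d_\theta(z_i) \|^2 - \frac{1}{2} \sum_{i=1}^N \| z_i \|^2 + C,
\qquad
C = -\frac{N d}{2} \log(2\pi\sigma^2) - \frac{N k}{2} \log(2\pi).
$$

Under these assumptions, maximizing over $\theta$ and $Z$ amounts to minimizing a squared reconstruction error plus a squared-norm penalty that pulls each $z_i$ toward the origin. The constant $C$ depends on neither $\theta$ nor $Z$ and can be dropped.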