data-science · machine-learning · deep-learning

Quantized Variational Auto-Encoders (QVAE) (ongoing)

Quantized Variational Auto-Encoders (QVAE) are a type of generative model that combines the principles of variational auto-encoders (VAEs) with quantization techniques.
2/9/2026
5 minutes

VAE Paper: https://arxiv.org/abs/1312.6114
β-VAE Paper: https://openreview.net/pdf?id=Sy2fzU9gl

VAEs are very useful for generation, and can be combined with transformers and LLMs to generate other kinds of data.

We first need to understand auto-encoders and variational auto-encoders.

An auto-encoder is composed of two networks, an encoder and a decoder, that are trained together. The encoder's goal is to take an input (for example an image) with a large dimension (many pixels) and compress it into a smaller representation that uses less data (a smaller vector). The decoder's goal is the inverse process: given a compressed representation, it is supposed to invert the operation and recover the original object (an image in our case).

The loss function for the auto-encoder network is just an MSE between the decoder's output and the original input. Auto-encoders can be trained in an unsupervised way, without labelled data, and the compressed representations can later be used as inputs for other networks, as they are supposed to contain meaningful information about the original input in a more compact format.
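The encode → decode → MSE structure can be sketched in a few lines of numpy. This is just a sketch: the linear encoder/decoder weights below are random stand-ins for what training would actually learn.

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 64, 8  # input dimension (D), latent / bottleneck dimension (d)

# Random stand-ins for learned weights (a real auto-encoder trains these).
W_enc = rng.normal(size=(d, D)) / np.sqrt(D)
W_dec = rng.normal(size=(D, d)) / np.sqrt(d)

def encode(x):
    # Compress: R^D -> R^d
    return W_enc @ x

def decode(z):
    # Reconstruct: R^d -> R^D
    return W_dec @ z

x = rng.normal(size=D)   # a fake "image" flattened into a vector
z = encode(x)            # compact representation
x_hat = decode(z)        # reconstruction

# The training objective: mean squared error between input and reconstruction.
mse = np.mean((x - x_hat) ** 2)
```

Training would consist of minimizing `mse` over a dataset with gradient descent on `W_enc` and `W_dec`.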

Now VAEs (Variational Auto-Encoders) are essentially the same thing; the only difference is that the encoder outputs the parameters of a distribution rather than the compact representation itself. To get the compact representation, we take a sample from the distribution described by those parameters. The distribution's parameters will be different for each input. The decoder works the same way: it takes the compact representation and reconstructs the original image.

The loss is different from a standard auto-encoder's: it is the sum of the original MSE loss and a KL divergence term, which is responsible for making sure that the distribution described by the encoder's parameters stays close to a standard normal distribution. The reason is that we want the latent space to be smooth and continuous, so that small changes in the input result in small changes in the output, and well structured, so that it can be easily sampled from. So the main advantage of a VAE over a standard auto-encoder is that it allows us to generate new data by sampling from the latent space, which is not possible with a standard auto-encoder, and the compact representation is supposedly more meaningful and structured than the one obtained from a standard auto-encoder.

Let the latent (compressed) variable be $z \in \mathbb{R}^d$ and the input data $x \in \mathbb{R}^D$. We first choose a prior (what we think / believe before seeing the data) over the latent variable (we usually take a standard normal):

$$p(z) = \mathcal{N}(z; 0, I)$$

We now define the likelihood model, which represents our decoder (parametrized by $\theta$):

$$p_\theta(x \mid z)$$

Now if we want a generative model, we need the marginal likelihood of a datapoint, meaning the probability of a given data point; this way we can generate new data points from this distribution:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$
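To make this integral concrete, here is a hedged 1-D toy example where the "decoder" is fixed to $p(x \mid z) = \mathcal{N}(x; z, 1)$ (my own choice, not from the paper). In this special case the integral is tractable, $p(x) = \mathcal{N}(x; 0, 2)$, so we can check a Monte Carlo estimate of the integral against the exact answer:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, var):
    # Density of a 1-D Gaussian N(x; mu, var).
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 0.5

# Monte Carlo: p(x) = E_{z ~ p(z)}[p(x|z)], with z ~ N(0,1), p(x|z) = N(x; z, 1).
z = rng.normal(size=200_000)
p_x_mc = gauss_pdf(x, z, 1.0).mean()

# Analytic check: integrating z out gives p(x) = N(x; 0, 2).
p_x_exact = gauss_pdf(x, 0.0, 2.0)
```

For a real decoder network this integral has no closed form and naive Monte Carlo becomes hopeless in high dimensions, which is exactly why the rest of the derivation is needed.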

And now to train our generative model, we need this $p_\theta$ to be accurate given our dataset, so we are going to maximize it over our whole dataset of size $N$:

$$\sum_{i=1}^N \log p_\theta(x_i)$$

But the issue is that the integral is intractable, and the same goes for the posterior (which we need, as it tells us which latent variables $z$ are plausible explanations for the observed data $x$ under our current model $\theta$). The posterior represents our encoder.

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$$

To get around this issue, we'll use variational inference: we introduce an approximate posterior (encoder) $q_\phi(z \mid x)$, which is typically a Gaussian whose parameters come from a neural network:

$$q_\phi(z \mid x) = \mathcal{N}\left(z;\, \mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\right)$$

So what we need to do now is optimize the parameters $\phi$ of our approximate posterior $q$ (which can be a neural network or anything else) so that the output distribution is as close as possible to the ground-truth posterior distribution.

So let's get back to the quantity we are interested in for our objective (generating new samples):

$$\begin{aligned} \log p_\theta(x) &= \log \int p_\theta(x \mid z)\, p(z)\, dz\\ &= \log \int \frac{q_\phi(z \mid x)}{q_\phi(z \mid x)}\, p_\theta(x \mid z)\, p(z)\, dz\\ &= \log \int q_\phi(z \mid x)\, \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\, dz\\ &= \log \mathbb{E}_{q_\phi(z\mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right] \end{aligned}$$

Now we want to optimize this quantity over all our datapoints $x$. It is hard to do directly, so instead we are going to optimize another quantity that is a lower bound on this one, meaning that by maximizing it we also push up the original quantity; this is called the evidence lower bound (ELBO). We get it by applying Jensen's inequality, which states that for a convex function $f$ and a random variable $X$ we have:

$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$$

In our case we take $f$ to be the logarithm, which is a concave function, so the inequality is reversed and we get:

$$\begin{aligned} \log p_\theta(x) &= \log \mathbb{E}_{q_\phi(z\mid x)}\left[\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]\\ &\ge \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right]\\ &= \mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\right] \end{aligned}$$
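A quick numeric sanity check of the Jensen step used above, in its concave form $\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$ (the random variable here is an arbitrary positive one, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# X is a positive random variable (log-normal, an arbitrary choice).
X = np.exp(rng.normal(size=100_000))

# For the concave log, Jensen's inequality reads: log E[X] >= E[log X].
lhs = np.log(X.mean())   # log of the expectation
rhs = np.log(X).mean()   # expectation of the log
```

The gap `lhs - rhs` is exactly the kind of slack the ELBO leaves below $\log p_\theta(x)$.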

This last expression is the ELBO. Taking its negative gives the loss we will minimize:

$$\begin{aligned} -\text{ELBO}(\theta, \phi) &= -\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\right]\\ &= -\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x \mid z)\right] + \text{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \end{aligned}$$

Because

$$\text{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) = \mathbb{E}_{q_\phi(z\mid x)}\left[\log q_\phi(z \mid x) - \log p(z)\right]$$

So we added the negative sign to get a quantity that we want to minimize; this is the loss function for our VAE: we minimize the negative ELBO over all our datapoints $x$. It is better than the original quantity because it is easier to compute and optimize, and it has a nice interpretation as a trade-off between the reconstruction error (the first term) and the KL divergence (the second term), which encourages the approximate posterior to be close to the prior.
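The per-datapoint loss can be sketched numerically. This is a minimal sketch, assuming a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to a constant) and hypothetical encoder outputs; the linear "decoder" and the encoder's `mu`/`log_var` are random stand-ins for what real networks would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 4, 16  # latent and input dimensions

# Hypothetical encoder outputs for one datapoint x (a real encoder network
# would compute these from x).
mu = rng.normal(size=d)
log_var = rng.normal(size=d) * 0.1
x = rng.normal(size=D)
W_dec = rng.normal(size=(D, d)) / np.sqrt(d)  # toy linear "decoder"

# 1) Reparameterized sample z ~ q_phi(z|x).
eps = rng.normal(size=d)
z = mu + np.exp(0.5 * log_var) * eps

# 2) Reconstruction term: -log p_theta(x|z) with a unit-variance Gaussian
#    decoder is 0.5 * ||x - x_hat||^2 plus a constant we can drop.
x_hat = W_dec @ z
recon = 0.5 * np.sum((x - x_hat) ** 2)

# 3) Closed-form KL(q || N(0, I)) for a diagonal Gaussian q.
kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

neg_elbo = recon + kl  # the per-datapoint loss to minimize
```

In practice this is averaged over a minibatch and minimized with gradient descent on both the encoder and decoder parameters.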

A variant of the VAE is the $\beta$-VAE, where we add a hyperparameter $\beta$ on the KL divergence term to control the trade-off between the reconstruction error and the KL divergence; this allows us to learn more disentangled representations in the latent space.

$$-\text{ELBO}_\beta(\theta, \phi) = -\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x \mid z)\right] + \beta\, \text{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

Fortunately, in the case of Gaussian distributions we can compute the KL divergence in closed form, so we don't need any approximation for this term and can just compute it directly. With

$$q_\phi(z\mid x) = \mathcal{N}\left(z;\, \mu,\, \mathrm{diag}(\sigma^2)\right),\quad p(z)=\mathcal{N}(0,I)$$

the KL has a closed form:

$$\mathrm{KL}(q\,\|\,p) = \frac{1}{2}\sum_{j=1}^d \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
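We can verify this closed form against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$ (the `mu` and `sigma2` values below are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gauss(mu, sigma2):
    # KL( N(mu, diag(sigma2)) || N(0, I) ), the closed form above.
    return 0.5 * np.sum(mu ** 2 + sigma2 - np.log(sigma2) - 1.0)

mu = np.array([0.5, -1.0])
sigma2 = np.array([0.8, 1.5])

# Monte Carlo check: KL = E_{z ~ q}[log q(z) - log p(z)].
z = mu + np.sqrt(sigma2) * rng.normal(size=(100_000, 2))
log_q = -0.5 * np.sum((z - mu) ** 2 / sigma2 + np.log(2 * np.pi * sigma2), axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)
```

Note that the closed form is zero exactly when $q$ equals the prior ($\mu = 0$, $\sigma^2 = 1$), as a KL should be.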

And then there is a final crucial part: sampling. A rule for training neural networks is that we need to be able to backpropagate through all the operations, but sampling is a non-differentiable operation, so we use a trick called the reparameterization trick, which allows us to backpropagate through the sampling step by expressing the sampled variable as a deterministic function of the parameters and some noise. So instead of sampling $z$ directly from $q_\phi(z \mid x)$, we sample $\epsilon$ from a standard normal distribution and then compute $z$ as:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,\quad \epsilon \sim \mathcal{N}(0, I)$$
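A small numpy sketch of the trick, with `mu` and `sigma` as hypothetical encoder outputs for one input: the randomness lives entirely in `eps`, while `z` is a deterministic (and hence differentiable) function of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input x (arbitrary values here).
mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([0.1, 0.5, 2.0])

# Reparameterization: sample eps ~ N(0, I), then z = mu + sigma * eps.
# Gradients w.r.t. mu and sigma flow through this expression, while
# sampling z directly from N(mu, diag(sigma^2)) would block them.
eps = rng.normal(size=(100_000, 3))
z = mu + sigma * eps  # elementwise product, matching the diagonal Gaussian
```

Drawing many samples lets us confirm that `z` really is distributed as $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$.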

In the ELBO, the reconstruction term pushes $z$ to retain information about $x$ so the decoder can explain $x$, while the KL term pushes the approximate posterior toward $\mathcal{N}(0, I)$, indirectly limiting the amount of information $z$ can retain about $x$ and thus encouraging the model to learn a more compact representation of the data.


  • Do a toy example where we generate images of clocks with a given time.

Original Article: https://arxiv.org/pdf/1711.00937

Quantization is ...

Now on to VQ-VAEs, Vector Quantized Variational Auto-Encoders. The idea is the same, a generative model, but the latent space is discrete and represented by indices into a learned codebook, a sort of learned set of tokens. In a VQ-VAE we no longer have a Gaussian posterior, and we run into a differentiation issue again due to the quantization step.

The encoder produces a continuous latent vector $z_e(x) \in \mathbb{R}^d$. We have a codebook (a learned set of embeddings) $E = \{e_k \in \mathbb{R}^d\}_{k=1}^K$. To quantize, we choose the codebook vector nearest to $z_e$:

$$k^{*} = \operatorname{argmin}_{k}\, \|z_e(x) - e_k\|_2$$

so $z_{\text{quantized}}(x) = z_q(x) = e_{k^{*}}$; the latent is the discrete index $k^{*}$, or a grid of indices for images if we process the image by patches.
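The nearest-neighbour lookup is easy to sketch in numpy. The codebook here is random rather than learned, and the straight-through gradient estimator the VQ-VAE paper uses to get around the non-differentiable `argmin` is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 8, 4  # codebook size, embedding dimension
codebook = rng.normal(size=(K, d))  # stands in for learned embeddings e_1..e_K

def quantize(z_e):
    # Pick the codebook vector nearest to z_e in L2 distance.
    dists = np.linalg.norm(codebook - z_e, axis=1)
    k_star = int(np.argmin(dists))
    return k_star, codebook[k_star]

z_e = rng.normal(size=d)      # continuous encoder output
k_star, z_q = quantize(z_e)   # discrete index + quantized latent
```

For an image processed by patches, `quantize` would be applied to each spatial position of the encoder's output, yielding a grid of indices.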


  • TODO: organize notes.
© Raideno.