If you read or implement machine learning papers, there is a high probability that you have come across the Kullback–Leibler divergence, a.k.a. the KL divergence loss. I frequently stumble upon it when I read about latent variable models (like VAEs). I am fairly sure most of us know what the term means (don’t worry if you don’t, as I have provided a brief explanation below, and Google will get you hundreds of resources on it), but we may not have actually derived it till the end. In my opinion, deriving this term makes its implementation much clearer.

Below, I derive the KL divergence for the case of two univariate Gaussian distributions, which can be extended to the multivariate case as well [1].

What is KL Divergence?

KL divergence is a measure of how one probability distribution (in our case q) differs from a reference probability distribution (in our case p). Its value is always >= 0. I should remind you, though, that it is not a distance metric, since it is not symmetric: KL(q || p) is not equivalent to KL(p || q).

KL(q || p) = Cross Entropy(q, p) - Entropy(q), where q and p are two univariate Gaussian distributions.

More specifically:

$$ KL(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)} \, dx = \underbrace{-\int q(x) \log p(x) \, dx}_{\text{Cross Entropy}(q,\, p)} \; - \; \underbrace{\left(-\int q(x) \log q(x) \, dx\right)}_{\text{Entropy}(q)} $$
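
To make the definition above concrete, here is a small Python sketch that estimates KL(q || p) by Monte Carlo: sample from q and average log q(x) - log p(x). The particular Gaussian parameters are arbitrary, chosen only for illustration (they anticipate the Gaussian case derived below).

```python
# Monte Carlo estimate of KL(q || p) = E_q[log q(x) - log p(x)].
# The parameter values are illustrative only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

mu_q, sigma_q = 1.0, 0.5  # parameters of q (arbitrary example values)
mu_p, sigma_p = 0.0, 1.0  # parameters of p (a standard Normal)

x = rng.normal(mu_q, sigma_q, size=1_000_000)  # samples drawn from q
kl_mc = np.mean(norm.logpdf(x, mu_q, sigma_q) - norm.logpdf(x, mu_p, sigma_p))
print(kl_mc)  # should be close to the closed-form value derived below (~0.818)
```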

KL Divergence for Gaussian Distributions

We know that the PDF of a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ can be written as:

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$

After taking the logarithm of the PDF above we get:

$$ \log f(x) = -\frac{1}{2} \log(2\pi) - \log \sigma - \frac{(x - \mu)^2}{2\sigma^2} $$

Let’s also assume that our two distributions have parameters as follows: $q(x) = \mathcal{N}(\mu_1, \sigma_1^2)$ and $p(x) = \mathcal{N}(\mu_2, \sigma_2^2)$.

To add some more context in terms of latent variable models: we try to fit an approximate posterior to the true posterior by minimizing the reverse KL divergence (computationally better than the forward one; read more here [2]). Think of z as the latent variable, q(z) as the approximate distribution, and p(z) as the prior distribution. Usually, we model both q and p as Gaussian distributions. The prior is assumed to have a mean of 0 and a variance of 1 (the standard Normal distribution), and the parameters of q are the output of the inference (encoder) network.

Now, let’s look at the Cross Entropy and the Entropy separately for ease of evaluation.

Entropy

$$ \text{Entropy}(q) = -\int q(x) \log q(x) \, dx = \int q(x) \left[ \frac{1}{2} \log(2\pi) + \log \sigma_1 + \frac{(x - \mu_1)^2}{2\sigma_1^2} \right] dx = \frac{1}{2} \log(2\pi) + \log \sigma_1 + \frac{1}{2} $$

where the last term follows from $\mathbb{E}_q\left[(x - \mu_1)^2\right] = \sigma_1^2$.

Cross Entropy

$$ \text{Cross Entropy}(q, p) = -\int q(x) \log p(x) \, dx = \int q(x) \left[ \frac{1}{2} \log(2\pi) + \log \sigma_2 + \frac{(x - \mu_2)^2}{2\sigma_2^2} \right] dx $$

$$ = \frac{1}{2} \log(2\pi) + \log \sigma_2 + \frac{\mathbb{E}_q[x^2] - 2\mu_2 \mathbb{E}_q[x] + \mu_2^2}{2\sigma_2^2} = \frac{1}{2} \log(2\pi) + \log \sigma_2 + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} $$

Note that:

  1. The integral over a PDF is always 1, which takes care of the constant terms.
  2. The expectation of the square of a random variable is equal to the sum of the square of its mean and its variance: $\mathbb{E}_q[x^2] = \mu_1^2 + \sigma_1^2$.

Cross Entropy - Entropy

Now let’s put both the terms together:

$$ KL(q \,\|\, p) = \text{Cross Entropy}(q, p) - \text{Entropy}(q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2} $$
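
As a sanity check on this closed-form expression, here is a sketch that compares it against direct numerical integration of the definition. The helper function name and the parameter values are my own, purely illustrative choices.

```python
# Closed-form KL(q || p) for univariate Gaussians, checked by quadrature.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl_univariate_gaussians(mu1, sigma1, mu2, sigma2):
    """KL(q || p) with q = N(mu1, sigma1^2) and p = N(mu2, sigma2^2)."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

mu1, sigma1 = 1.0, 0.5  # q (same illustrative values as earlier)
mu2, sigma2 = 0.0, 1.0  # p (a standard Normal)

# Integrand of the definition: q(x) * (log q(x) - log p(x))
integrand = lambda x: norm.pdf(x, mu1, sigma1) * (
    norm.logpdf(x, mu1, sigma1) - norm.logpdf(x, mu2, sigma2))
kl_numeric, _ = quad(integrand, -20, 20)

print(kl_univariate_gaussians(mu1, sigma1, mu2, sigma2), kl_numeric)
```

Both numbers agree (roughly 0.818 for these values), and they also match the Monte Carlo estimate from the earlier sketch.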

With a small stretch of the imagination, the closed-form expression above can be generalized to the multivariate case (D dimensions, with diagonal covariances) by summing over all the dimensions:

$$ KL(q \,\|\, p) = \sum_{d=1}^{D} \left[ \log \frac{\sigma_{2,d}}{\sigma_{1,d}} + \frac{\sigma_{1,d}^2 + (\mu_{1,d} - \mu_{2,d})^2}{2\sigma_{2,d}^2} - \frac{1}{2} \right] $$

In the latent variable setting described earlier, where the prior $p(z)$ is a standard Normal ($\mu_{2,d} = 0$, $\sigma_{2,d} = 1$), this reduces to the familiar VAE term $-\frac{1}{2} \sum_{d=1}^{D} \left( 1 + \log \sigma_{1,d}^2 - \mu_{1,d}^2 - \sigma_{1,d}^2 \right)$.
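
To make that concrete, here is a minimal PyTorch sketch of the standard-Normal-prior case, assuming the encoder outputs `mu` and `logvar` (the log of sigma^2) for each latent dimension. The function name and tensor shapes are illustrative, not taken from any particular library.

```python
# KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior,
# summed over latent dimensions and averaged over the batch.
import torch

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Per-dimension term: -0.5 * (1 + log sigma^2 - mu^2 - sigma^2)
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    return kl_per_dim.sum(dim=1).mean()

# Illustrative usage with a batch of 8 latent codes of dimension 4:
mu = torch.randn(8, 4)
logvar = torch.randn(8, 4)
print(kl_to_standard_normal(mu, logvar))
```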

The above equation can be easily implemented in frameworks like PyTorch (see the sketch above). I hope the post helped you to understand this concept a little better!

References:

  1. Auto-Encoding Variational Bayes by Kingma and Welling
  2. KL-divergence as an objective function by Tim Vieira
  3. Allison Chaney for the post image.