Note on Denoising Diffusion Probabilistic Models
I've recently discovered a fantastic online course titled "TinyML and Efficient Deep Learning Computing", taught by Prof. Song Han at MIT. The course delves into the latest advancements in large language models and generative AI. While Lecture 16 provides a comprehensive overview of diffusion models and their recent generalizations, it skips some mathematical details regarding Denoising Diffusion Probabilistic Models (DDPM).
This post serves as my notes on those skipped mathematical details from the lecture. In particular,
- We provide a simplified and much more transparent derivation of the training loss than the one presented in the DDPM paper.
- We show that the dropped L_T term in the DDPM paper should not appear at all if we start with the correct loss.
- No special treatment is needed for the L_0 term in the DDPM paper, i.e. L_{t-1} is applicable for t=1 as well.
Forward diffusion process
The forward diffusion process gradually adds white noise z_t to the image at each step t: \begin{equation}x_t = a_t\,x_{t-1} + b_t\,z_t\tag{1}\end{equation} with z_t\overset{\mathrm{iid}}{\sim} {\cal{N}}(0, \mathbf{I}).
Starting with Eq. (1), we can write x_t directly in terms of x_0 as \begin{equation}x_t = A_t\,x_{0} + B_t\,\epsilon_t\,,\tag{2}\end{equation} where \epsilon_t\sim {\cal{N}}(0, \mathbf{I}) is the accumulated noise and \begin{eqnarray}A_t =a_t A_{t-1}\,,\quad\quad B^2_t = b_t^2+a_t^2\,B_{t-1}^2\,.\tag{3}\end{eqnarray} Notes:
- If we require that x_t eventually becomes white noise, i.e. A_t\rightarrow 0 and B^2_t\rightarrow 1 as t\rightarrow\infty, then Eq. (3) naturally leads to the relation a^2_t + b^2_t=1. This is the reason that, at the beginning of the DDPM paper, a_t and b_t are directly parameterized as \sqrt{1-\beta_t} and \sqrt{\beta_t}.
- When a^2_t + b^2_t=1, we can prove by induction that A^2_t + B^2_t=1: indeed, A^2_t + B^2_t = a_t^2 A^2_{t-1} + b_t^2 + a_t^2 B^2_{t-1} = a_t^2\left(A^2_{t-1} + B^2_{t-1}\right) + b_t^2 = a_t^2 + b_t^2 = 1. So A_t and B_t are denoted by \sqrt{\bar{\alpha}_t} and \sqrt{1-\bar{\alpha}_t} in the DDPM paper. A numerical check of Eqs. (1)-(3) is sketched below.
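To make the recursions concrete, here is a minimal NumPy sketch of the forward process; the linear schedule \beta_t\in[10^{-4}, 0.02] is an assumption borrowed from common DDPM implementations, not something fixed by the derivation above:

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)              # assumed linear schedule for beta_t
a, b = np.sqrt(1.0 - beta), np.sqrt(beta)      # a_t = sqrt(1 - beta_t), b_t = sqrt(beta_t)

# Eq. (3): A_t = a_t A_{t-1}, B_t^2 = b_t^2 + a_t^2 B_{t-1}^2, with A_0 = 1, B_0 = 0
A = np.cumprod(a)
B2 = np.empty(T)
B2[0] = b[0] ** 2
for t in range(1, T):
    B2[t] = b[t] ** 2 + a[t] ** 2 * B2[t - 1]
assert np.allclose(A ** 2 + B2, 1.0)           # A_t^2 + B_t^2 = 1, as proved by induction

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 32))                 # a toy "image"

# Eq. (1): step-by-step forward process
x = x0
for t in range(T):
    x = a[t] * x + b[t] * rng.normal(size=x.shape)

# Eq. (2): a sample from the same marginal q(x_T | x_0) in one shot
xT = A[-1] * x0 + np.sqrt(B2[-1]) * rng.normal(size=x0.shape)
```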
Reverse diffusion process
In the reverse process, we consider the conditional probability q\left(x_{t-1}|x_t, x_0\right). By Bayes' rule \begin{eqnarray}q\left(x_{t-1}|x_t, x_0\right) &=&\frac{q\left(x_t|x_{t-1}, x_0\right)q\left(x_{t-1}|x_0\right)}{q\left(x_t|x_0\right)}\\&\propto&\exp\left[-\frac{1}{2}\left(\frac{x_t-a_t\,x_{t-1}}{b_t}\right)^2-\frac{1}{2} \left(\frac{x_{t-1}-A_{t-1}\,x_0}{B_{t-1}}\right)^2+\cdots \right]\,, \end{eqnarray} we see that q\left(x_{t-1}|x_t, x_0\right) is Gaussian. To determine its mean \mu_t and variance \sigma^2_t, we read off the coefficients of the linear and quadratic terms of x_{t-1} in the above exponent, respectively: \begin{eqnarray}\frac{\mu_{t}}{\sigma_t^2}&=&\frac{a_t\,x_t}{b^2_t}+\frac{A_{t-1}\,x_0}{B^2_{t-1}}\,,\\ \frac{1}{\sigma^2_t}&=&\frac{a_t^2}{b_t^2}+\frac{1}{B_{t-1}^2}\,.\end{eqnarray} With a little algebra and the relation (3), we have \begin{eqnarray}\mu_t &=& a_t\frac{B_{t-1}^2}{B^2_t}\,x_t + A_{t-1}\frac{b_t^2}{B^2_t}\,x_0 \,,\\ \sigma_t &=& b_t\frac{B_{t-1}}{B_t}\,.\tag{4}\end{eqnarray}
Finally, eliminating x_0 via Eq. (2), we have \begin{equation}\mu_t=\frac{1}{a_t}\left(x_t - \frac{b_t^2}{B_t}\epsilon_t\right)\,.\tag{5}\end{equation}
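In code, Eq. (4) is only a few lines. The sketch below reuses the 0-indexed schedule arrays from the forward-process snippet (with B = np.sqrt(B2)) and takes A_0=1, B_0=0, which are justified in the remarks below:

```python
import numpy as np

def posterior_mean_std(t, x_t, x0, a, b, A, B):
    """Mean mu_t and std sigma_t of q(x_{t-1} | x_t, x_0) from Eq. (4).

    t is 1-indexed; the 0-indexed arrays a, b, A, B store a_t, b_t, A_t, B_t
    at position t - 1, as in the forward-process sketch above.
    """
    A_prev = A[t - 2] if t >= 2 else 1.0       # A_0 = 1 (see the remarks below)
    B_prev = B[t - 2] if t >= 2 else 0.0       # B_0 = 0, hence sigma_1 = 0
    mu = (a[t - 1] * B_prev ** 2 * x_t + A_prev * b[t - 1] ** 2 * x0) / B[t - 1] ** 2
    sigma = b[t - 1] * B_prev / B[t - 1]
    return mu, sigma
```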
Remarks:
- x_{t-1}|x_t,x_0\sim {\cal{N}}(\mu_t,\sigma^2_t) is Gaussian while x_{t-1}|x_t is NOT. This is why the reverse process focuses on q\left(x_{t-1}|x_t, x_0\right) instead of q\left(x_{t-1}|x_t\right).
- No special treatment is required for computations involving q\left(x_{t-1}|x_t, x_0\right) at t=1, since x_{t-1}|x_t,x_0\sim {\cal{N}}(\mu_t,\sigma^2_t) holds at t=1 as well.
- Proof: for the relation (3) to hold at t=1, we should define A_0=1 and B_0=0. Then by Eq. (4), we have \mu_1=x_0 and \sigma_1=0. As a result, x_{0}|x_1,x_0\sim {\cal{N}}(x_0,0) becomes degenerate and q\left(x_0=x|x_1, x_0\right)=\delta(x-x_0). The latter expression is often written as q\left(x_0|x_1, x_0\right)=1 in the discrete case.
Training loss
The goal is to train a neural network that can recover x_0 from x_T in the training phase (and generate a new x_0 from pure white noise in the inference phase).
We denote the distribution of model outputs in the training phase by p_{\theta}(x_0|x_T), where \theta represents all the neural network parameters. Note that x_T is generated from x_0 through the forward process, so the overall reconstructed distribution of x_0 by the neural network is \begin{equation}R_{\theta}(x_0)\equiv \int dx_T\,q(x_T|x_0)\,p_{\theta}(x_0|x_T)\,.\tag{6}\end{equation} To make R_{\theta}(x_0) as close as possible to the true distribution q(x_0), we consider the cross entropy as explained in this post: \begin{eqnarray}{\cal{l}}_{\text{C.E.}}(\theta)\equiv -\mathbb{E}_{q(x_0)}\,\log\,R_{\theta}(x_0)=-\mathbb{E}_{q(x_0)}\,\log\left[\int dx_T\,q(x_T|x_0)\,p_{\theta}(x_0|x_T)\right]\,.\tag{7}\end{eqnarray} Since the cross entropy (7) is hard to optimize in this form, a common trick is to work with the evidence lower bound instead: \begin{eqnarray}{\cal{l}}_{\text{C.E.}}(\theta)&=&-\mathbb{E}_{q(x_0)}\,\log\left[\int dx_{1:T}\,q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)\right]\\&=&-\mathbb{E}_{q(x_0)}\log\left[\mathbb{E}_{q(x_{1:T}|x_0)}\frac{q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)}{q(x_{1:T}|x_0)}\right]\\ &\leq& -\mathbb{E}_{q(x_{0:T})}\log\,\frac{q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)}{q(x_{1:T}|x_0)}\equiv l_{elbo}(\theta)\,,\tag{8}\end{eqnarray} where we use Jensen's inequality in the last line. Since l_{elbo}(\theta) (the negative of the evidence lower bound) upper-bounds the cross entropy, we minimize it in place of the cross entropy.
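The inequality in the last line is just Jensen's inequality for the concave logarithm, -\log\mathbb{E}[w]\leq\mathbb{E}[-\log w]. A short numerical illustration (the lognormal here is an arbitrary stand-in for the positive random variable inside the log):

```python
import numpy as np

# Jensen's inequality for the concave log: -log E[w] <= E[-log w].
rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
assert -np.log(w.mean()) <= -np.log(w).mean()
```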
Unlike what is presented in the DDPM paper, we now simplify l_{elbo}(\theta) in Eq. (8) by keeping only the terms that contain \theta: \begin{eqnarray}l_{elbo}(\theta)&\sim&-\mathbb{E}_{q(x_{0:T})}\log\,p_{\theta}(x_{0:T-1}|x_T)\\&=&-\sum_{t=1}^T \mathbb{E}_{q(x_{0:T})}\log\,p_{\theta}(x_{t-1}|x_t)\\&=&-\sum_{t=1}^T \mathbb{E}_{q(x_0, x_{t-1}, x_t)}\log\,p_{\theta}(x_{t-1}|x_t)\\&=& \boxed{-\sum_{t=1}^T \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t|x_0)}\mathbb{E}_{q(x_{t-1}|x_t,x_0)}\log\,p_{\theta}(x_{t-1}|x_t)}\,,\tag{9}\end{eqnarray} where in the second line, we use the Markov property p_{\theta}(x_{0:T-1}|x_T)=\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t). In the third line, we integrate out all the irrelevant variables except x_0, x_{t-1}, x_t. In the fourth line, we break the total expectation into a chain of conditional expectations.
Finally, we design the neural network so that the generated distribution p_{\theta}(x_{t-1}|x_t) is a Gaussian: \begin{equation}p_{\theta}(x_{t-1}|x_t) \equiv \frac{1}{\sqrt{2\pi}\Sigma_t}\exp\left\{-\frac{\left[x_{t-1} - \mu_{\theta}(x_t)\right]^2}{2\Sigma^2_t}\right\}\,,\end{equation} where the mean \begin{equation}\mu_{\theta}(x_t)\equiv \frac{1}{a_t}\left[x_t - \frac{b_t^2}{B_t}\epsilon_{\theta}(x_t)\right]\tag{10}\end{equation} has the same form as Eq. (5) but with a model-predicted noise \epsilon_{\theta}(x_t). With this design, we have \begin{eqnarray}-\mathbb{E}_{q(x_{t-1}|x_t,x_0)}\log\,p_{\theta}(x_{t-1}|x_t)&\sim&\frac{1}{2\Sigma^2_t}\,\mathbb{E}_{q(x_{t-1}|x_t,x_0)}\Big[x_{t-1} - \mu_{\theta}(x_t)\Big]^2\\&=&\frac{1}{2\Sigma^2_t}\Big[\mathbb{E}_{q(x_{t-1}|x_t,x_0)} x_{t-1}^2 - 2 \mu_{\theta}(x_t)\,\,\mathbb{E}_{q(x_{t-1}|x_t,x_0)}x_{t-1} + \mu_{\theta}^2(x_t) \Big]\\&=&\frac{1}{2\Sigma^2_t}\Big[\mu^2_t + \sigma_t^2 - 2 \mu_{\theta}(x_t)\,\,\mu_t + \mu_{\theta}^2(x_t)\Big]\\&\sim&\frac{1}{2\Sigma^2_t}\Big[ \mu_t - \mu_{\theta}(x_t)\Big]^2\tag{11}\\&=&\frac{1}{2}\left(\frac{b^2_t}{a_tB_t\Sigma_t}\right)^2\Big[\epsilon_t - \epsilon_{\theta}(x_t)\Big]^2\,,\tag{12}\end{eqnarray} where in the first line we ignore terms that do not contain \theta. In the third line, we use \mathbb{E}_{q(x_{t-1}|x_t,x_0)} x_{t-1}^2=\mu_t^2+\sigma_t^2 and \mathbb{E}_{q(x_{t-1}|x_t,x_0)} x_{t-1}=\mu_t because x_{t-1}|x_t,x_0\sim {\cal{N}}(\mu_t,\sigma_t^2), as proved in the previous section. In the fourth line, we drop the \sigma_t^2 term. Finally, in the last line, we substitute Eq. (5) and Eq. (10).
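With p_{\theta}(x_{t-1}|x_t) in this form, inference is ancestral sampling down the reverse chain. Below is a hedged sketch: eps_theta stands for a hypothetical trained noise predictor \epsilon_{\theta}, and setting the model variance \Sigma_t equal to \sigma_t from Eq. (4) is one common choice, assumed here:

```python
import numpy as np

def sample(eps_theta, a, b, B, T, shape, rng):
    """Ancestral sampling with the reverse kernel p_theta(x_{t-1} | x_t)."""
    x = rng.normal(size=shape)                       # x_T is pure white noise
    for t in range(T, 0, -1):                        # t = T, ..., 1
        mu = (x - b[t - 1] ** 2 / B[t - 1] * eps_theta(x, t)) / a[t - 1]  # Eq. (10)
        B_prev = B[t - 2] if t >= 2 else 0.0
        sigma = b[t - 1] * B_prev / B[t - 1]         # Sigma_t = sigma_t (assumed choice)
        x = mu + sigma * rng.normal(size=shape)      # no noise is injected at t = 1
    return x
```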
Putting everything together, we obtain the final form of the training loss \begin{equation}l_{elbo}(\theta)=\frac{1}{2} \sum_{t=1}^T \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t|x_0)}\left(\frac{b^2_t}{a_tB_t\Sigma_t}\right)^2\Big[\epsilon_t - \epsilon_{\theta}(x_t)\Big]^2\,,\tag{13}\end{equation} which directly manifests the training algorithm: in the training phase, we first sample an x_0, then generate an x_t|x_0 for a randomly chosen 1\leq t \leq T, and finally backpropagate using the gradient \nabla_{\theta}||\epsilon_t-\epsilon_{\theta}(x_t)||^2.
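In code, one stochastic step of this procedure might look as follows. This is a PyTorch-flavored sketch rather than the paper's reference implementation; model is a hypothetical noise predictor \epsilon_{\theta}, and the per-step weight is dropped, matching the simple gradient above:

```python
import torch

def training_step(model, x0, A, B, T):
    """One stochastic estimate of the loss (13), with the per-step weight dropped."""
    n = x0.shape[0]                                  # x0: a batch of images (N, C, H, W)
    t = torch.randint(1, T + 1, (n,))                # t uniform in {1, ..., T}
    eps = torch.randn_like(x0)                       # the true noise eps_t
    At = A[t - 1].view(n, 1, 1, 1)                   # A, B: 1-D tensors of length T
    Bt = B[t - 1].view(n, 1, 1, 1)
    xt = At * x0 + Bt * eps                          # Eq. (2): sample x_t | x_0
    loss = ((eps - model(xt, t)) ** 2).mean()        # ||eps_t - eps_theta(x_t)||^2
    loss.backward()
    return loss.detach()
```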
Remarks
- First of all, a significant portion of the DDPM paper's derivation is devoted to reducing the training loss to KL divergences of the form \mathbb{E}_q\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}, which conveys a FALSE impression that the pairing between q(x_{t-1}|x_t,x_0) and p_{\theta}(x_{t-1}|x_t) in the numerator and denominator is critical.
- But as revealed in this post, what really matters is \mathbb{E}_q\log\frac{1}{p_{\theta}(x_{t-1}|x_t)} and the true key is to expand \mathbb{E}_q into \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t|x_0)}\mathbb{E}_{q(x_{t-1}|x_t,x_0)} as in Eq. (9).
- Secondly, it is also unnecessary to separate the L_0 term from the other L_{t-1} terms as in the DDPM paper's derivation: when t=1, \mathbb{E}_q\log\frac{q(x_{0}|x_1,x_0)}{p_{\theta}(x_{0}|x_1)} can still be viewed as the KL divergence between two Gaussian distributions, since q(x_{0}|x_1,x_0) is a degenerate Gaussian with zero variance.
- One can also verify that our derivation of Eqs. (11) and (12) applies to the t=1 case as well; no special treatment is needed.
- More importantly, the DDPM paper does not distinguish between R_{\theta}(x_0) and p_{\theta}(x_0|x_T) as in Eq. (6). Instead of considering the cross entropy (7), the DDPM paper starts with an incorrect loss -\mathbb{E}_{q(x_0)}\,\log\,p_{\theta}(x_0|x_T), which leads to an extra L_T term in their loss.
- As shown in the appendix, there will be no L_T term if we start with the correct loss (7).
- The notation p_{\theta}(x_T) in the L_T term is a reminder that something is wrong. Mathematically, its subscript \theta makes the L_T term \theta-dependent. Conceptually, x_T is the input and should never be the output of the model.
Appendix
In this appendix, we also reduce the loss l_{elbo}(\theta) in Eq. (8) to KL divergences as in the DDPM paper: \begin{eqnarray}{\cal{l}}_{\text{elbo}}(\theta)&\equiv& \mathbb{E}_{q(x_{0:T})}\log\,\frac{q(x_{1:T}|x_0)}{q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)}\\&=&\mathbb{E}_{q(x_{0:T})} \log\left[\frac{1}{q(x_T|x_0)}\prod_{t=1}^T\frac{q(x_t|x_{t-1},x_0)}{p_{\theta}(x_{t-1}|x_t)}\right]\\&=&\mathbb{E}_{q(x_{0:T})} \log\left[\frac{1}{q(x_T|x_0)}\prod_{t=1}^T\frac{q(x_{t-1}|x_t,x_0)q(x_t|x_0)}{p_{\theta}(x_{t-1}|x_t)q(x_{t-1}|x_0)}\right]\\&=&\mathbb{E}_{q(x_{0:T})} \log\left[\prod_{t=1}^T\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\right]\,,\tag{14}\end{eqnarray} where in the second line, we use the Markov property q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1}) to expand q(x_{1:T}|x_0)=\prod_{t=1}^Tq(x_t|x_{t-1},x_0), together with p_{\theta}(x_{0:T-1}|x_T)=\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t). In the third line, we use Bayes' rule to convert q(x_t|x_{t-1},x_0) to q(x_{t-1}|x_t,x_0). In the fourth line, we use q(x_0|x_0)=1 and cancel all the remaining q(x_t|x_0) factors for 1\leq t \leq T within the telescoping product.
Remarks:
- Indeed, there is no L_T term in Eq. (14), in contrast to the DDPM paper. Although the DDPM paper drops the L_T term at the end, it should not appear in the loss at all.
- Because of the Markov property q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1}), we could also convert q(x_t|x_{t-1}) to q(x_{t-1}|x_t) by Bayes' rule and derive the above {\cal{l}}_{\text{elbo}}(\theta), up to \theta-independent terms, in terms of q(x_{t-1}|x_t) instead of q(x_{t-1}|x_t,x_0): \begin{equation}{\cal{l}}_{\text{elbo}}(\theta) =\mathbb{E}_{q(x_{0:T})}\left[\sum_{t=1}^T\log \frac{q(x_{t-1}|x_t)}{p_{\theta}(x_{t-1}|x_t)} \right]+\text{const}\,.\end{equation} The problem is that we don't know the explicit form of q(x_{t-1}|x_t).