Note on Denoising Diffusion Probabilistic Models
I've recently discovered a fantastic online course titled "TinyML and Efficient Deep Learning Computing", taught by Prof. Song Han at MIT. The course delves into the latest advancements in large language models and generative AI. While Lecture 16 provides a comprehensive overview of diffusion models and their recent generalizations, it skips some mathematical details regarding Denoising Diffusion Probabilistic Models (DDPM).
This post serves as my notes on those skipped mathematical details from the lecture. In particular,
- We provide a simplified and much more transparent derivation of the training loss than the one presented in the DDPM paper.
- We show that the dropped L_T term in the DDPM paper should not appear at all if we start with the correct loss.
- No special treatment is needed for the L_0 term in the DDPM paper, i.e. L_{t-1} is applicable for t=1 as well.
Forward diffusion process
The forward diffusion process gradually adds white noise z_t to the image at each step t: \begin{equation}x_t = a_t\,x_{t-1} + b_t\,z_t\tag{1}\end{equation} with z_t\overset{\mathrm{iid}}{\sim} {\cal{N}}(0, \mathbf{I}).
Starting with Eq. (1), we can write x_t directly in terms of x_0 as \begin{equation}x_t = A_t\,x_{0} + B_t\,\epsilon_t\,,\tag{2}\end{equation} where \epsilon_t\sim {\cal{N}}(0, \mathbf{I}) is the accumulated noise and \begin{eqnarray}A_t =a_t A_{t-1}\,,\quad\quad B^2_t = b_t^2+a_t^2\,B_{t-1}^2\,.\tag{3}\end{eqnarray} Notes:
- If we require that x_t eventually becomes white noise, i.e. A_t\rightarrow 0 and B^2_t\rightarrow 1 as t\rightarrow\infty, then Eq. (3) naturally leads to the relation a^2_t + b^2_t=1. This is the reason that, at the beginning of the DDPM paper, a_t and b_t are directly parameterized as \sqrt{1-\beta_t} and \sqrt{\beta_t}.
- When a^2_t + b^2_t=1, we can prove by induction that A^2_t + B^2_t=1: indeed, A^2_t + B^2_t = a_t^2 A^2_{t-1} + b_t^2 + a_t^2 B^2_{t-1} = a_t^2\left(A^2_{t-1} + B^2_{t-1}\right) + b_t^2 = a_t^2 + b_t^2 = 1. So A_t and B_t are denoted by \sqrt{\bar{\alpha}_t} and \sqrt{1-\bar{\alpha}_t} in the DDPM paper. A numerical check of Eqs. (1)-(3) is sketched below.
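To make the recursions concrete, here is a minimal NumPy sketch of the forward process; the linear schedule \beta_t\in[10^{-4}, 0.02] is an assumption borrowed from common DDPM implementations, not something fixed by the derivation above:

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)              # assumed linear schedule for beta_t
a, b = np.sqrt(1.0 - beta), np.sqrt(beta)      # a_t = sqrt(1 - beta_t), b_t = sqrt(beta_t)

# Eq. (3): A_t = a_t A_{t-1}, B_t^2 = b_t^2 + a_t^2 B_{t-1}^2, with A_0 = 1, B_0 = 0
A = np.cumprod(a)
B2 = np.empty(T)
B2[0] = b[0] ** 2
for t in range(1, T):
    B2[t] = b[t] ** 2 + a[t] ** 2 * B2[t - 1]
assert np.allclose(A ** 2 + B2, 1.0)           # A_t^2 + B_t^2 = 1, as proved by induction

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 32))                 # a toy "image"

# Eq. (1): step-by-step forward process
x = x0
for t in range(T):
    x = a[t] * x + b[t] * rng.normal(size=x.shape)

# Eq. (2): a sample from the same marginal q(x_T | x_0) in one shot
xT = A[-1] * x0 + np.sqrt(B2[-1]) * rng.normal(size=x0.shape)
```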
Reverse diffusion process
In the reverse process, we consider the conditional probability q\left(x_{t-1}|x_t, x_0\right). By Bayes' rule \begin{eqnarray}q\left(x_{t-1}|x_t, x_0\right) &=&\frac{q\left(x_t|x_{t-1}, x_0\right)q\left(x_{t-1}|x_0\right)}{q\left(x_t|x_0\right)}\\&\propto&\exp\left[-\frac{1}{2}\left(\frac{x_t-a_t\,x_{t-1}}{b_t}\right)^2-\frac{1}{2} \left(\frac{x_{t-1}-A_{t-1}\,x_0}{B_{t-1}}\right)^2+\cdots \right]\,, \end{eqnarray} we see that q\left(x_{t-1}|x_t, x_0\right) is Gaussian. To determine its mean \mu_t and variance \sigma^2_t, we read off the coefficients of the linear and quadratic terms of x_{t-1} in the above exponent, respectively: \begin{eqnarray}\frac{\mu_{t}}{\sigma_t^2}&=&\frac{a_t\,x_t}{b^2_t}+\frac{A_{t-1}\,x_0}{B^2_{t-1}}\,,\\ \frac{1}{\sigma^2_t}&=&\frac{a_t^2}{b_t^2}+\frac{1}{B_{t-1}^2}\,.\end{eqnarray} With a little algebra and the relation (3), we have \begin{eqnarray}\mu_t &=& a_t\frac{B_{t-1}^2}{B^2_t}\,x_t + A_{t-1}\frac{b_t^2}{B^2_t}\,x_0 \,,\\ \sigma_t &=& b_t\frac{B_{t-1}}{B_t}\,.\tag{4}\end{eqnarray}
Finally, eliminating x_0 via Eq. (2), we have \begin{equation}\mu_t=\frac{1}{a_t}\left(x_t - \frac{b_t^2}{B_t}\epsilon_t\right)\,.\tag{5}\end{equation}
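In code, Eq. (4) is only a few lines. The sketch below reuses the 0-indexed schedule arrays from the forward-process snippet (with B = np.sqrt(B2)) and takes A_0=1, B_0=0, which are justified in the remarks below:

```python
import numpy as np

def posterior_mean_std(t, x_t, x0, a, b, A, B):
    """Mean mu_t and std sigma_t of q(x_{t-1} | x_t, x_0) from Eq. (4).

    t is 1-indexed; the 0-indexed arrays a, b, A, B store a_t, b_t, A_t, B_t
    at position t - 1, as in the forward-process sketch above.
    """
    A_prev = A[t - 2] if t >= 2 else 1.0       # A_0 = 1 (see the remarks below)
    B_prev = B[t - 2] if t >= 2 else 0.0       # B_0 = 0, hence sigma_1 = 0
    mu = (a[t - 1] * B_prev ** 2 * x_t + A_prev * b[t - 1] ** 2 * x0) / B[t - 1] ** 2
    sigma = b[t - 1] * B_prev / B[t - 1]
    return mu, sigma
```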
Remarks:
- x_{t-1}|x_t,x_0\sim {\cal{N}}(\mu_t,\sigma^2_t) is Gaussian while x_{t-1}|x_t is NOT. This is why the reverse process focuses on q\left(x_{t-1}|x_t, x_0\right) instead of q\left(x_{t-1}|x_t\right).
- No special treatment is required for computations involving q\left(x_{t-1}|x_t, x_0\right) at t=1, since x_{t-1}|x_t,x_0\sim {\cal{N}}(\mu_t,\sigma^2_t) holds at t=1 as well.
- Proof: for the relation (3) to hold at t=1, we should define A_0=1 and B_0=0. Then by Eq. (4), we have \mu_1=x_0 and \sigma_1=0. As a result, x_{0}|x_1,x_0\sim {\cal{N}}(x_0,0) becomes degenerate and q\left(x_0=x|x_1, x_0\right)=\delta(x-x_0). The latter expression is often written as q\left(x_0|x_1, x_0\right)=1 in the discrete case.
Training loss
The goal is to train a neural network that can recover x_0 from x_T in the training phase (and generate a new x_0 from pure white noise in the inference phase).
We denote the distribution of model outputs in the training phase by p_{\theta}(x_0|x_T), where \theta represents all the neural network parameters. Note that x_T is generated from x_0 through the forward process, so the overall reconstructed distribution of x_0 by the neural network is \begin{equation}R_{\theta}(x_0)\equiv \int dx_T\,q(x_T|x_0)\,p_{\theta}(x_0|x_T)\,.\tag{6}\end{equation} To make R_{\theta}(x_0) as close as possible to the true distribution q(x_0), we consider the cross entropy as explained in this post: \begin{eqnarray}{\cal{l}}_{\text{C.E.}}(\theta)\equiv -\mathbb{E}_{q(x_0)}\,\log\,R_{\theta}(x_0)=-\mathbb{E}_{q(x_0)}\,\log\left[\int dx_T\,q(x_T|x_0)\,p_{\theta}(x_0|x_T)\right]\,.\tag{7}\end{eqnarray} Since the cross entropy (7) is hard to optimize in this form, a common trick is to work with the evidence lower bound instead: \begin{eqnarray}{\cal{l}}_{\text{C.E.}}(\theta)&=&-\mathbb{E}_{q(x_0)}\,\log\left[\int dx_{1:T}\,q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)\right]\\&=&-\mathbb{E}_{q(x_0)}\log\left[\mathbb{E}_{q(x_{1:T}|x_0)}\frac{q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)}{q(x_{1:T}|x_0)}\right]\\ &\leq& -\mathbb{E}_{q(x_{0:T})}\log\,\frac{q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)}{q(x_{1:T}|x_0)}\equiv l_{elbo}(\theta)\,,\tag{8}\end{eqnarray} where we use Jensen's inequality in the last line. Since l_{elbo}(\theta) (the negative of the evidence lower bound) upper-bounds the cross entropy, we minimize it in place of the cross entropy.
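The inequality in the last line is just Jensen's inequality for the concave logarithm, -\log\mathbb{E}[w]\leq\mathbb{E}[-\log w]. A short numerical illustration (the lognormal here is an arbitrary stand-in for the positive random variable inside the log):

```python
import numpy as np

# Jensen's inequality for the concave log: -log E[w] <= E[-log w].
rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
assert -np.log(w.mean()) <= -np.log(w).mean()
```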
Unlike what is presented in the DDPM paper, we now simplify l_{elbo}(\theta) in Eq. (8) by keeping only the terms that contain \theta: \begin{eqnarray}l_{elbo}(\theta)&\sim&-\mathbb{E}_{q(x_{0:T})}\log\,p_{\theta}(x_{0:T-1}|x_T)\\&=&-\sum_{t=1}^T \mathbb{E}_{q(x_{0:T})}\log\,p_{\theta}(x_{t-1}|x_t)\\&=&-\sum_{t=1}^T \mathbb{E}_{q(x_0, x_{t-1}, x_t)}\log\,p_{\theta}(x_{t-1}|x_t)\\&=& \boxed{-\sum_{t=1}^T \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t|x_0)}\mathbb{E}_{q(x_{t-1}|x_t,x_0)}\log\,p_{\theta}(x_{t-1}|x_t)}\,,\tag{9}\end{eqnarray} where in the second line, we use the Markov property p_{\theta}(x_{0:T-1}|x_T)=\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t). In the third line, we integrate out all the irrelevant variables except x_0, x_{t-1}, x_t. In the fourth line, we break the total expectation into a chain of conditional expectations.
Finally, we design the neural network so that the generated distribution p_{\theta}(x_{t-1}|x_t) is a Gaussian: \begin{equation}p_{\theta}(x_{t-1}|x_t) \equiv \frac{1}{\sqrt{2\pi}\Sigma_t}\exp\left\{-\frac{\left[x_{t-1} - \mu_{\theta}(x_t)\right]^2}{2\Sigma^2_t}\right\}\,,\end{equation} where the mean \begin{equation}\mu_{\theta}(x_t)\equiv \frac{1}{a_t}\left[x_t - \frac{b_t^2}{B_t}\epsilon_{\theta}(x_t)\right]\tag{10}\end{equation} has the same form as Eq. (5) but with a model-predicted noise \epsilon_{\theta}(x_t). With this design, we have \begin{eqnarray}-\mathbb{E}_{q(x_{t-1}|x_t,x_0)}\log\,p_{\theta}(x_{t-1}|x_t)&\sim&\frac{1}{2\Sigma^2_t}\,\mathbb{E}_{q(x_{t-1}|x_t,x_0)}\Big[x_{t-1} - \mu_{\theta}(x_t)\Big]^2\\&=&\frac{1}{2\Sigma^2_t}\Big[\mathbb{E}_{q(x_{t-1}|x_t,x_0)} x_{t-1}^2 - 2 \mu_{\theta}(x_t)\,\,\mathbb{E}_{q(x_{t-1}|x_t,x_0)}x_{t-1} + \mu_{\theta}^2(x_t) \Big]\\&=&\frac{1}{2\Sigma^2_t}\Big[\mu^2_t + \sigma_t^2 - 2 \mu_{\theta}(x_t)\,\,\mu_t + \mu_{\theta}^2(x_t)\Big]\\&\sim&\frac{1}{2\Sigma^2_t}\Big[ \mu_t - \mu_{\theta}(x_t)\Big]^2\tag{11}\\&=&\frac{1}{2}\left(\frac{b^2_t}{a_tB_t\Sigma_t}\right)^2\Big[\epsilon_t - \epsilon_{\theta}(x_t)\Big]^2\,,\tag{12}\end{eqnarray} where in the first line we ignore terms that do not contain \theta. In the third line, we use \mathbb{E}_{q(x_{t-1}|x_t,x_0)} x_{t-1}^2=\mu_t^2+\sigma_t^2 and \mathbb{E}_{q(x_{t-1}|x_t,x_0)} x_{t-1}=\mu_t because x_{t-1}|x_t,x_0\sim {\cal{N}}(\mu_t,\sigma_t^2), as proved in the previous section. In the fourth line, we drop the \sigma_t^2 term. Finally, in the last line, we substitute Eq. (5) and Eq. (10).
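With p_{\theta}(x_{t-1}|x_t) in this form, inference is ancestral sampling down the reverse chain. Below is a hedged sketch: eps_theta stands for a hypothetical trained noise predictor \epsilon_{\theta}, and setting the model variance \Sigma_t equal to \sigma_t from Eq. (4) is one common choice, assumed here:

```python
import numpy as np

def sample(eps_theta, a, b, B, T, shape, rng):
    """Ancestral sampling with the reverse kernel p_theta(x_{t-1} | x_t)."""
    x = rng.normal(size=shape)                       # x_T is pure white noise
    for t in range(T, 0, -1):                        # t = T, ..., 1
        mu = (x - b[t - 1] ** 2 / B[t - 1] * eps_theta(x, t)) / a[t - 1]  # Eq. (10)
        B_prev = B[t - 2] if t >= 2 else 0.0
        sigma = b[t - 1] * B_prev / B[t - 1]         # Sigma_t = sigma_t (assumed choice)
        x = mu + sigma * rng.normal(size=shape)      # no noise is injected at t = 1
    return x
```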
Putting everything together, we obtain the final form of the training loss \begin{equation}l_{elbo}(\theta)=\frac{1}{2} \sum_{t=1}^T \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t|x_0)}\left(\frac{b^2_t}{a_tB_t\Sigma_t}\right)^2\Big[\epsilon_t - \epsilon_{\theta}(x_t)\Big]^2\,,\tag{13}\end{equation} which directly manifests the training algorithm: in the training phase, we first sample an x_0, then generate an x_t|x_0 for a randomly chosen 1\leq t \leq T, and finally backpropagate using the gradient \nabla_{\theta}||\epsilon_t-\epsilon_{\theta}(x_t)||^2.
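In code, one stochastic step of this procedure might look as follows. This is a PyTorch-flavored sketch rather than the paper's reference implementation; model is a hypothetical noise predictor \epsilon_{\theta}, and the per-step weight is dropped, matching the simple gradient above:

```python
import torch

def training_step(model, x0, A, B, T):
    """One stochastic estimate of the loss (13), with the per-step weight dropped."""
    n = x0.shape[0]                                  # x0: a batch of images (N, C, H, W)
    t = torch.randint(1, T + 1, (n,))                # t uniform in {1, ..., T}
    eps = torch.randn_like(x0)                       # the true noise eps_t
    At = A[t - 1].view(n, 1, 1, 1)                   # A, B: 1-D tensors of length T
    Bt = B[t - 1].view(n, 1, 1, 1)
    xt = At * x0 + Bt * eps                          # Eq. (2): sample x_t | x_0
    loss = ((eps - model(xt, t)) ** 2).mean()        # ||eps_t - eps_theta(x_t)||^2
    loss.backward()
    return loss.detach()
```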
Remarks
- First of all, a significant portion of the DDPM paper's derivation is devoted to reducing the training loss to KL divergences of the form \mathbb{E}_q\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}, which conveys a FALSE impression that the pairing between q(x_{t-1}|x_t,x_0) and p_{\theta}(x_{t-1}|x_t) in the numerator and denominator is critical.
- But as revealed in this post, what really matters is \mathbb{E}_q\log\frac{1}{p_{\theta}(x_{t-1}|x_t)} and the true key is to expand \mathbb{E}_q into \mathbb{E}_{q(x_0)}\mathbb{E}_{q(x_t|x_0)}\mathbb{E}_{q(x_{t-1}|x_t,x_0)} as in Eq. (9).
- Secondly, it is also unnecessary to separate the L_0 term from the other L_{t-1} terms as in the DDPM paper's derivation: when t=1, \mathbb{E}_q\log\frac{q(x_{0}|x_1,x_0)}{p_{\theta}(x_{0}|x_1)} can still be viewed as the KL divergence between two Gaussian distributions, since q(x_{0}|x_1,x_0) is a degenerate Gaussian with zero variance.
- One can also verify that our derivation of Eqs. (11) and (12) applies to the t=1 case as well; no special treatment is needed.
- More importantly, the DDPM paper does not distinguish between R_{\theta}(x_0) and p_{\theta}(x_0|x_T) as in Eq. (6). Instead of considering the cross entropy (7), the DDPM paper starts with an incorrect loss -\mathbb{E}_{q(x_0)}\,\log\,p_{\theta}(x_0|x_T), which leads to an extra L_T term in their loss.
- As shown in the appendix, there will be no L_T term if we start with the correct loss (7).
- The notation p_{\theta}(x_T) in the L_T term is a reminder that something is wrong. Mathematically, its subscript \theta makes the L_T term \theta-dependent. Conceptually, x_T is the input and should never be the output of the model.
Appendix
In this appendix, we also reduce the loss l_{elbo}(\theta) in Eq. (8) to KL divergences as in the DDPM paper: \begin{eqnarray}{\cal{l}}_{\text{elbo}}(\theta)&\equiv& \mathbb{E}_{q(x_{0:T})}\log\,\frac{q(x_{1:T}|x_0)}{q(x_T|x_0)\,p_{\theta}(x_{0:T-1}|x_T)}\\&=&\mathbb{E}_{q(x_{0:T})} \log\left[\frac{1}{q(x_T|x_0)}\prod_{t=1}^T\frac{q(x_t|x_{t-1},x_0)}{p_{\theta}(x_{t-1}|x_t)}\right]\\&=&\mathbb{E}_{q(x_{0:T})} \log\left[\frac{1}{q(x_T|x_0)}\prod_{t=1}^T\frac{q(x_{t-1}|x_t,x_0)q(x_t|x_0)}{p_{\theta}(x_{t-1}|x_t)q(x_{t-1}|x_0)}\right]\\&=&\mathbb{E}_{q(x_{0:T})} \log\left[\prod_{t=1}^T\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\right]\,,\tag{14}\end{eqnarray} where in the second line, we use the Markov property q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1}) to expand q(x_{1:T}|x_0)=\prod_{t=1}^Tq(x_t|x_{t-1},x_0), together with p_{\theta}(x_{0:T-1}|x_T)=\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t). In the third line, we use Bayes' rule to convert q(x_t|x_{t-1},x_0) to q(x_{t-1}|x_t,x_0). In the fourth line, we use q(x_0|x_0)=1 and cancel all the remaining q(x_t|x_0) factors for 1\leq t \leq T within the telescoping product.
Remarks:
- Indeed, there is no L_T term in Eq. (14), in contrast to the DDPM paper. Although the DDPM paper drops the L_T term at the end, it should not appear in the loss at all.
- Because of the Markov property q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1}), we could also convert q(x_t|x_{t-1}) to q(x_{t-1}|x_t) by Bayes' rule and derive the above {\cal{l}}_{\text{elbo}}(\theta), up to \theta-independent terms, in terms of q(x_{t-1}|x_t) instead of q(x_{t-1}|x_t,x_0): \begin{equation}{\cal{l}}_{\text{elbo}}(\theta) =\mathbb{E}_{q(x_{0:T})}\left[\sum_{t=1}^T\log \frac{q(x_{t-1}|x_t)}{p_{\theta}(x_{t-1}|x_t)} \right]+\text{const}\,.\end{equation} The problem is that we don't know the explicit form of q(x_{t-1}|x_t).