Posts

Showing posts with the label machine learning

Note on Denoising Diffusion Probabilistic Models

I've recently discovered a fantastic online course titled "TinyML and Efficient Deep Learning Computing" taught by Prof. Song Han at MIT. This course delves into the latest advancements in large language models and generative AI. While Lecture 16 provides a comprehensive overview of diffusion models and their recent generalizations, it skips some mathematical details regarding Denoising Diffusion Probabilistic Models (DDPM). This post serves as my notes on those skipped mathematical details. In particular: we provide a simplified and much more transparent derivation of the training loss than the one presented in the DDPM paper; we show that the dropped $L_T$ term in the DDPM paper should not appear at all if we start with the correct loss; and no special treatment is needed for the $L_0$ term in the DDPM paper, i.e. $L_{t-1}$ is applicable for $t=1$ as well. Forwa...
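As a companion to these notes, here is a minimal numpy sketch of the standard DDPM forward (noising) process in its closed form $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$; the linear $\beta$ schedule, its endpoints, and the toy sample are my own illustrative assumptions, not taken from the lecture or the post.

```python
import numpy as np

# Sketch of the DDPM forward (noising) process, using the closed form
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
# The linear beta schedule below is an illustrative assumption.
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_1 ... beta_T (hypothetical schedule)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{s <= t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot; t is a 0-based index."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = np.ones(4)                         # toy "clean" sample
xt, eps = q_sample(x0, t=500, rng=rng)
print(xt)                               # heavily noised version of x0
```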

More on Entropy

I reviewed some basics of thermodynamics as well as the Boltzmann entropy in this post. Here, I continue with the Gibbs entropy, the information entropy, and the cross-entropy that is widely used in machine learning. In the modern view, entropy is a measure of an observer's ignorance about the system. Boltzmann Entropy Physics students first learn the concept of entropy from the Carnot engine and the Boltzmann entropy. The Boltzmann entropy applies to a single system with fixed energy: \begin{equation}S=k_B\log \Omega\,,\tag{1}\end{equation}where $k_B$ is the Boltzmann constant and $\Omega$ is the number of microstates corresponding to the system's macrostate. Boltzmann also derived another form of entropy that is mathematically very close to the modern form. Consider a single system consisting of $N$ particles, and partition the particles into $n$ groups. Let $N_i$ be the number of particles in group $i$; then the number of microstates is \begin{equa...
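As a small illustration of the information entropy and the cross-entropy mentioned above, here is a toy numpy sketch; the two distributions are made up, and the entropies are in nats.

```python
import numpy as np

# Toy illustration: Shannon/Gibbs entropy and the cross-entropy loss, in nats.
def entropy(p):
    p = np.asarray(p, dtype=float)
    logp = np.log(p, where=p > 0, out=np.zeros_like(p))   # treat 0*log(0) as 0
    return -np.sum(p * logp)

def cross_entropy(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    logq = np.log(q, where=q > 0, out=np.zeros_like(q))
    return -np.sum(p * logq)

p = [0.5, 0.25, 0.25]        # "true" distribution
q = [1/3, 1/3, 1/3]          # model distribution
print(entropy(p))            # H(p)
print(cross_entropy(p, q))   # H(p, q) >= H(p); the gap is the KL divergence
```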

Karush-Kuhn-Tucker (KKT) conditions

When I first took Andrew Ng's CS229 machine learning course, I did not really understand the KKT conditions in his lecture note 3, in which the KKT conditions are presented without proof and mixed with the content of Lagrangian duality. Recently, I came across a blog that provides a clearer and more intuitive illustration of the KKT conditions. Let me try to rephrase the argument in that blog. Lagrange Multiplier The KKT conditions are about optimization under inequality constraints. So before jumping into the KKT conditions, we should first recall optimization under equality constraints. Let's consider the optimization problem \begin{eqnarray} \begin{split} \min_{x}&&\quad f(x) \\ \text{s.t.}&& \quad g(x)=0 \end{split}\tag{1}\end{eqnarray} The key insight is that if $x^*$ is the optimal solution of problem (1), then \begin{eqnarray} \nabla f(x^*)&=&-\lambda\nabla g(x^*)\tag{2}\\ g(x^*)&=&0\tag{3}\end{eqnarray} for some $\lambda \in \mathbb{R}$. Ther...
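As a sanity check of conditions (2)-(3), here is a small numeric sketch on a toy equality-constrained problem of my own choosing (minimize $f(x)=x_1^2+x_2^2$ subject to $x_1+x_2-1=0$), solved with scipy's generic constrained optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Toy check of (2)-(3): minimize f(x) = x1^2 + x2^2 s.t. g(x) = x1 + x2 - 1 = 0.
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: x[0] + x[1] - 1.0

res = minimize(f, x0=[0.0, 0.0], constraints=[{"type": "eq", "fun": g}])
x_star = res.x                                   # expect (0.5, 0.5)

grad_f = 2.0 * x_star                            # gradient of f at x*
grad_g = np.array([1.0, 1.0])                    # gradient of g (constant here)
lam = -grad_f[0] / grad_g[0]                     # solve (2) for lambda; expect -1
print(x_star, lam)
print(np.allclose(grad_f, -lam * grad_g, atol=1e-6),   # condition (2)
      np.isclose(g(x_star), 0.0, atol=1e-6))           # condition (3)
```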

Support Vector Machine (SVM)

The training dataset of SVM consists of $n$ training examples of the form $\left(x^{(i)}, y^{(i)}\right)$ for $i=1, \cdots, n$, where each ${x}^{(i)}$ is a $p$-dimensional input vector and $y^{(i)}$ is the corresponding label. Furthermore, there is a fixed feature map $\phi$ in SVM that maps an input vector $x$ to a feature vector $\phi(x)$. Such a feature map is implicitly specified by a kernel in the dual formulation of SVM. The task of SVM is to find two parallel linear boundaries (hyperplanes) in the feature space by maximizing their margin, for either binary classification or regression. SVM as a binary classifier For binary classification, $y^{(i)}\in \{1, -1\}$ labels the class to which ${x}^{(i)}$ belongs. The task of SVM is to find two parallel hyperplanes in the feature space that separate the two classes. Fig. 1: Two parallel hyperplanes in the feature space with the maximum margin for binary classification: (a) Linearly separable data: taken from wiki. (b) Non-linearl...
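For concreteness, here is a minimal usage sketch of a soft-margin, kernelized SVM classifier on made-up 2-D data with labels in $\{1, -1\}$; the RBF kernel and the value of $C$ are illustrative choices, not taken from the post.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: two Gaussian blobs with labels +1 / -1.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(50, 2))
X_neg = rng.normal(loc=-2.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

# The RBF kernel implicitly defines the feature map phi; C sets the soft-margin
# trade-off between margin width and margin violations.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.support_vectors_.shape)               # the examples that determine the margin
print(clf.predict([[1.5, 1.5], [-1.5, -1.5]]))  # expected: [ 1 -1]
```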

Quantum Computing: Quantum Machine Learning

We introduced the HHL algorithm and its subroutine, the quantum phase estimation (QPE) algorithm, in this blog. Now we introduce two quantum versions of machine learning algorithms: quantum principal component analysis (qPCA) and quantum support vector machine (qSVM), which rely on QPE and HHL, respectively. Data Preparation Suppose we have $n$ data samples and each data sample has $p$ features, i.e., $\mathbf{x}^{(i)}\equiv[x_1^{(i)}, \cdots, x_p^{(i)}]^T$ for $i=1,\cdots, n$. To apply quantum algorithms, we encode each data sample $\mathbf{x}^{(i)}$ in a quantum state \begin{eqnarray}\mathbf{x}^{(i)}\equiv[x_1^{(i)}, \cdots, x_p^{(i)}]^T \quad\xrightarrow{p=2^m} \quad |\psi_i\rangle\equiv \frac{1}{\left|\left| \mathbf{x}^{(i)}\right|\right|_2}\sum_{\tau\in \{0,1\}^m}x_{\tau}^{(i)}|\tau\rangle\,.\tag{1}\end{eqnarray} To encode the entire dataset, we entangle the data state $|\psi_i\rangle$ with an auxiliary state $|i\rangle$ and form a pure state \begin{equation}|\psi\rangle\equiv\sum_{i=...
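Here is a plain numpy sketch of the amplitude encoding in Eq. (1); the data values are made up for illustration.

```python
import numpy as np

# Amplitude encoding of Eq. (1): a p-dimensional sample with p = 2^m becomes
# the 2^m amplitudes of an m-qubit state |psi_i>. Toy data with p = 4 (m = 2 qubits).
X = np.array([[3.0, 0.0, 4.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
states = X / np.linalg.norm(X, axis=1, keepdims=True)   # row-wise L2 normalization
print(states)                                # amplitudes of |psi_1> and |psi_2>
print(np.sum(states ** 2, axis=1))           # each encoded state is normalized: [1. 1.]
```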

L1 regression (Lasso)

L1 regularization adds an L1-norm term: \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{\sum_{i=1}^n\left(y^{(i)}-\beta_0-\sum_{j=1}^p \beta_jx^{(i)}_{j}\right)^2+\lambda\sum_{j=1}^p \left|\beta_j\right|\right\}\,.\end{equation} Note that there is no regularization for the intercept term $\beta_0$, since $\beta_0$ is a global shift of all $y$. We can remove $\beta_0$ by centering $\mathbf{y}$ and $\mathbf{X}$. With centered inputs, we have \begin{equation}\hat{\boldsymbol{\beta}}^{\text{lasso}}=\text{argmin}_{\boldsymbol{\beta}}\left\{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\lambda \left|\left|\boldsymbol{\beta}\right|\right|_1\right\}\,.\tag{1}\end{equation} Compared with ridge, lasso can be used for continuous subset (feature) selection in addition to regularization, since it leads to sparse coefficients. The argument is from an equivalent optimization problem of (1): \begin{...
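To see the sparsity claim in action, here is a toy comparison of lasso and ridge fits on synthetic data of my own making; note that scikit-learn's Lasso scales the squared-error term by $1/(2n)$, so its alpha plays the role of $\lambda$ only up to that constant.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 2 of 10 features matter (illustrative example).
rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)      # L1 penalty: exact zeros -> feature selection
ridge = Ridge(alpha=10.0).fit(X, y)     # L2 penalty: shrinkage only
print(np.round(lasso.coef_, 3))
print(np.round(ridge.coef_, 3))
```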

L2 regression (Ridge)

L2 regularization adds an L2-norm term to the optimization problem $\min_{\mathbf{x}}f(\mathbf{x})$ as \begin{equation}\min_{\mathbf{x}}\,l_{\lambda}(\mathbf{x})\equiv f(\mathbf{x})+\frac{\lambda}{2} \left|\left| \mathbf{x}\right|\right|^2_2\,.\tag{1}\end{equation} Remarks: In deep learning, L2 regularization is also referred to as weight decay. This is because in vanilla SGD \begin{eqnarray}\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}-\gamma\left[\nabla f(\mathbf{x}^{(t)})+\lambda\mathbf{x}^{(t)}\right]=(1-\gamma\, \lambda)\,\mathbf{x}^{(t)}-\gamma\,\nabla f(\mathbf{x}^{(t)})\,,\end{eqnarray} L2 regularization decays the weight $\mathbf{x}^{(t)}$ by a factor $1-\gamma \lambda$. In more sophisticated optimizers beyond vanilla SGD, L2 regularization is different from weight decay: L2 regularization: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla f(\mathbf{x})+\lambda\mathbf{x}$; Weight decay: replace each $\nabla f(\mathbf{x})$ in the optimizer by $\nabla ...
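Here is a minimal sketch (toy quadratic objective and hyperparameters of my own choosing) contrasting the two update rules; for vanilla SGD they coincide, as the algebra above shows.

```python
import numpy as np

# Toy objective f(x) = 0.5 * ||x - 1||^2, so grad f(x) = x - 1.
def grad_f(x):
    return x - 1.0

gamma, lam = 0.1, 0.01
x_l2 = np.array([5.0, -3.0])        # "L2 regularization": add lambda * x to the gradient
x_wd = x_l2.copy()                  # "weight decay": shrink x, then take the plain gradient step

for _ in range(100):
    x_l2 = x_l2 - gamma * (grad_f(x_l2) + lam * x_l2)
    x_wd = (1.0 - gamma * lam) * x_wd - gamma * grad_f(x_wd)

print(np.allclose(x_l2, x_wd))      # True: the two rules coincide for vanilla SGD
```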

R-squared

Given $n$ samples $y^{(k)}$ and the corresponding fitted values $\hat{y}^{(k)}$, one can evaluate the goodness of fit by \begin{equation} R^2 = 1 - \frac{\sum_{k=1}^n \left(y^{(k)}-\hat{y}^{(k)}\right)^2}{\sum_{k=1}^n\left(y^{(k)}-\bar{y}\right)^2}\,,\tag{1}\end{equation} where $\bar{y}\equiv \frac{1}{n}\sum_{k=1}^n y^{(k)}$. Writing $Y=\hat{Y}+e$, we can express $R^2$ compactly as \begin{equation}R^2=1-\frac{\mathbb{E}(e^2)}{\text{Var}(Y)}\,.\end{equation} In the following, we discuss $R^2$ in the scope of linear regression. With an intercept term, as shown in this post, we have $\mathbb{E}(e)=0$ and $\text{Var}(Y)=\text{Var}(\hat{Y})+\text{Var}(e)$. As a result, \begin{equation}R^2=1-\frac{\text{Var}(e)}{\text{Var}(Y)}=\frac{\text{Var}(\hat{Y})}{\text{Var}(Y)}\geq 0\,.\tag{2}\end{equation} Without an intercept term, as shown in this post, we only have $\mathbb{E}(Y^2)=\mathbb{E}(\hat{Y}^2)+\mathbb{E}(e^2)$. As a result, \begin{equation} R^2=\frac{\mathbb{E}\left(...
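Eq. (1) translates directly into code; here is a short sketch with made-up numbers.

```python
import numpy as np

# Direct implementation of Eq. (1).
def r_squared(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)          # sum of squared residuals
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])             # toy observations
y_hat = np.array([1.1, 1.9, 3.2, 3.8])         # toy fitted values
print(r_squared(y, y_hat))
```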

Geometric interpretation of linear regression

Hilbert space Given a linear space $V$ over the field $\mathbb{C}$, we define an inner product, i.e., a map $\langle \cdot, \cdot\rangle : \, V\times V \rightarrow \mathbb{C}$ that satisfies the following axioms for all vectors $\mathbf{x}, \mathbf{y}, \mathbf{z}\in V$ and all scalars $\alpha \in \mathbb{C}$: $\langle \mathbf{x}, \mathbf{y}\rangle =\langle \mathbf{y}, \mathbf{x}\rangle^*$ $\langle \mathbf{x}+\mathbf{y}, \mathbf{z}\rangle = \langle \mathbf{x}, \mathbf{z}\rangle + \langle \mathbf{y}, \mathbf{z}\rangle$ $\langle \alpha\,\mathbf{x}, \mathbf{y}\rangle=\alpha \langle \mathbf{x}, \mathbf{y}\rangle$ $\langle \mathbf{x}, \mathbf{x}\rangle \geq 0$, with equality iff $\mathbf{x}=\mathbf{0}\,.$ We can then define the norm $\left|\left|\mathbf{x}\right|\right|^2\equiv \langle \mathbf{x}, \mathbf{x}\rangle$. The norm satisfies the following two properties: Cauchy-Schwarz inequality: $\left|\langle \mathbf{x}, \mathbf{y}\rangle\right|^2\leq \left|\left|\mathbf{x}\right|\right|^2\,\...
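To connect this inner-product machinery to linear regression, here is a toy numeric check (my own example): under the Euclidean inner product, the least-squares fit $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$, so the residual is orthogonal to every column of $\mathbf{X}$.

```python
import numpy as np

# Toy data: the geometric picture of least squares under the Euclidean inner product.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares coefficients
y_hat = X @ beta_hat                              # projection of y onto col(X)
e = y - y_hat                                     # residual
print(np.allclose(X.T @ e, 0.0, atol=1e-10))      # <x_j, e> = 0 for every column x_j
```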

Principal components analysis

PCA is a common technique for reducing high-dimensional data to a lower-dimensional subspace, either for visualization or for data compression. Suppose we have a dataset consisting of $n$ samples, each with $p$ features. We represent the dataset by an $n\times p$ matrix $\mathbf{X}$. We also assume that $\mathbf{X}$ is centered so that the sample covariance matrix $\boldsymbol{\Sigma}$ in the feature space is: \begin{eqnarray} \boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X}\,.\tag{1}\end{eqnarray} PCA finds a set of $k\leq p$ feature vectors $\mathbf{v}_1, \cdots, \mathbf{v}_k$ (called principal component directions) such that these new feature vectors have the largest $k$ sample variances $\mathbf{v}^T_1 \boldsymbol{\Sigma} \mathbf{v}_1,\cdots, \mathbf{v}^T_k \boldsymbol{\Sigma} \mathbf{v}_k$. That is, $\mathbf{v}_1, \cdots, \mathbf{v}_k$ are the eigenvectors of $\boldsymbol{\Sigma}$ corresponding to the $k$ largest eigenvalues $\lambda_1\geq \cdots \geq \la...
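A minimal numpy sketch of this eigen-decomposition route on synthetic data; the data and the choice $k=2$ are illustrative assumptions.

```python
import numpy as np

# Synthetic correlated data, centered as assumed in Eq. (1).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)

Sigma = X.T @ X / X.shape[0]                 # sample covariance, Eq. (1)
eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
k = 2
V = eigvecs[:, order[:k]]                    # top-k principal component directions

Z = X @ V                                    # data projected onto the k directions
print(Z.shape)                               # (200, 2)
print(eigvals[order][:k])                    # the k largest sample variances
```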

Generalized linear model

A generalized linear model (GLM) is said to have three components: A random component: $Y\sim \mathbb{P}(Y|\theta)$ in the exponential family A linear predictor: $\eta=\mathbf{x}\cdot\boldsymbol{\beta}$ A link function $f$: $f(\mathbb{E}[Y|\theta])=\eta$ Remarks: Given the input $\mathbf{x}$, a GLM predicts the conditional expectation of $Y$ instead of $Y$ itself. In a GLM, the link function directly relates the conditional expectation of $Y$ to the linear predictor $\mathbf{x}\cdot\boldsymbol{\beta}$ as $f(\mathbb{E}[Y|\theta]) = \mathbf{x}\cdot\boldsymbol{\beta}$. In contrast, one may transform $Y$ and then do ordinary linear regression, which implies $\mathbb{E}[f(Y)|\theta] = \mathbf{x}\cdot\boldsymbol{\beta}$. From $\mathbb{P}(Y|\theta)$, we can compute $\mathbb{E}[Y|\theta]=\mu(\theta)$. The linear predictor $\eta$ relates to $\theta$ by $f\left(\mu(\theta)\right)=\eta$. If we set $\eta=\theta$, ${\mu}^{-1}$ is the link function, called the canonical link function. For example, the canonical l...
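As a concrete instance of the three components, here is a toy sketch of a Poisson GLM with its canonical log link, i.e. $\log \mathbb{E}[Y|\mathbf{x}] = \mathbf{x}\cdot\boldsymbol{\beta}$, fitted with statsmodels; the data-generating coefficients are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Toy Poisson GLM: log E[Y | x] = beta_0 + beta_1 * x (canonical log link).
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 2.0, size=n)
X = sm.add_constant(x)                          # intercept + one predictor
beta_true = np.array([0.5, 1.2])                # illustrative coefficients
y = rng.poisson(np.exp(X @ beta_true))          # E[Y | x] = exp(eta)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # canonical link is the default
print(fit.params)                               # should be close to beta_true
```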

Exponential family

An exponential family is a large class of distributions of the form \begin{equation} \mathbb{P}\left(Y|\theta\right)= \frac{1}{Z(\theta)} g(Y)\,e^{\theta \,H(Y)}\,.\end{equation} This form is just the Boltzmann distribution in statistical physics, where $Y$ is a microstate of the ensemble, $H(Y)$ and $g(Y)$ are the energy and degeneracy, $\theta$ relates to the temperature, and $Z(\theta)$ is the partition function: \begin{equation}Z(\theta)=\int g(y)\,e^{\theta \,H(y)}\,dy\,. \end{equation} As is well known in statistical physics, from the partition function we can compute \begin{equation}\mu(\theta) \equiv\mathbb{E}\left[H(Y) |\theta\right]=\frac{d}{d\theta} \log Z(\theta)\,,\tag{1}\end{equation}\begin{equation}\sigma^2(\theta) \equiv \text{Var}\left[H(Y)|\theta\right] = \frac{d^2}{d\theta^2} \log Z(\theta)\,. \end{equation} Given observations $Y_1, \cdots, Y_n$, the log-likelihood of the exponential family is \begin{equation} l_n(\theta)\equiv \frac{1}{n}\sum_{i=1}^n\log \mathbb{P}...
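Eq. (1) can be checked numerically. Below is a toy sketch for the family member with $g(y)=1$ on $[0,\infty)$, $H(y)=y$, and $\theta<0$, i.e. an exponential distribution with rate $-\theta$, for which $\mathbb{E}[H(Y)|\theta]=-1/\theta$.

```python
import numpy as np
from scipy.integrate import quad

# Toy member of the family: g(y) = 1 on [0, inf), H(y) = y, theta < 0,
# i.e. an exponential distribution with rate -theta and mean -1/theta.
def log_Z(theta):
    val, _ = quad(lambda y: np.exp(theta * y), 0.0, np.inf)   # partition function
    return np.log(val)

theta, h = -0.7, 1e-5
dlogZ = (log_Z(theta + h) - log_Z(theta - h)) / (2.0 * h)     # finite-difference d/dtheta log Z
print(dlogZ, -1.0 / theta)                                    # both ~ 1.42857, confirming Eq. (1)
```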

Interview problem: linear regression (1)

Problem: A collection of 2D points $(x_i, y_i)$ is i.i.d. sampled from a uniform distribution on the triangular region $0\leq X \leq 1, 0 \leq Y \leq 1, X+Y\geq 3/2$. If we fit a linear regression, say $y=kx+b$, on the sampled points, what are the expected values of $k$ and $b$? Brute force solution: Linear regression solves the optimization problem $\min_{k, b} \mathbb{E}\left(Y-kX-b\right)^2$. As a result, we can derive $k=\frac{\text{Cov}(X, Y)}{\text{Var}(X)}$ and $b=\mathbb{E}(Y)-k\,\mathbb{E}(X)$. For the given distribution, we can compute $\mathbb{E}(X)=\mathbb{E}(Y)=5/6$, $\text{Cov}(X, Y)=-1/144$ and $\text{Var}(X)=1/72$. The final answer is $k=-1/2$ and $b=5/4$. Quick solution: As proved in Chapter 2.4 of the ESL book, the solution to the optimization problem $\min_f \mathbb{E}(Y-f(X))^2$ is $f(X)=\mathbb{E}(Y|X)$. For the given distribution, $Y|X\sim \text{Uniform}[3/2-X, 1]$. As a result, $\mathbb{E}(Y|X)=\frac{1}{2}(3/2-X+1)=-\frac{1}{2}X + 5/4$, which is linear ...
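A quick Monte Carlo sanity check of the answer (my own addition), using rejection sampling for the triangular region and the population formulas from the brute-force solution:

```python
import numpy as np

# Rejection-sample the uniform distribution on the triangle x + y >= 3/2 in [0, 1]^2.
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(2_000_000, 2))
mask = pts.sum(axis=1) >= 1.5
x, y = pts[mask, 0], pts[mask, 1]

k = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)   # Cov(X, Y) / Var(X)
b = y.mean() - k * x.mean()                                # E(Y) - k E(X)
print(k, b)                                                # approximately -0.5 and 1.25
```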