More on Entropy

I reviewed some basics of thermodynamics as well as the Boltzmann entropy in this post. Here, I continue with the Gibbs entropy, the information entropy, and the cross entropy that is widely used in machine learning. In the modern view, entropy is a measure of an observer's ignorance about the system.

Boltzmann Entropy

Physics students first learn the concept of entropy from the Carnot engine and the Boltzmann entropy. The Boltzmann entropy applies to a single system with a fixed energy: \begin{equation}S=k_B\log \Omega\,,\tag{1}\end{equation}where $k_B$ is the Boltzmann constant and $\Omega$ is the number of microstates corresponding to the system's macrostate.

Boltzmann also derived another form of entropy that is mathematically very close to the modern form. Consider a single system consisting of $N$ particles, which we partition into $n$ groups. Let $N_i$ be the number of particles in group $i$; then the number of microstates is \begin{equation}\Omega=\frac{N!}{N_1!N_2!\cdots N_n!}\,.\end{equation} Therefore, the Boltzmann entropy as defined in Eq. (1) is \begin{eqnarray}S&=&k_B\log \frac{N!}{N_1!N_2!\cdots N_n!} \\ &\approx& k_B N \log N-k_B\sum_{i=1}^n N_i \log N_i\\ &=&-k_B N \sum_{i=1}^n p_i \log p_i\,,\tag{2}\end{eqnarray} where $p_i\equiv{N_i}/{N}$ is the probability that a particle is assigned to group $i$. In the second line of (2), we have used Stirling's approximation $\log N!\approx N\log N$ for large $N$ (the subleading terms cancel because $\sum_{i=1}^n N_i=N$).

Note that there is a prefactor of the total number of particles $N$ in Eq. (2), since $p_i$ is interpreted as the probability for a single particle.
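
As a quick numerical sanity check, here is a minimal Python sketch (with $k_B=1$ for simplicity) comparing the exact $\log\Omega$ of the multinomial count with the Stirling form of Eq. (2):

    import math

    # Exact log Omega = log( N! / (N_1! N_2! ... N_n!) ) via log-gamma.
    def exact_log_omega(counts):
        N = sum(counts)
        return math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)

    # Stirling form of Eq. (2) with k_B = 1: -N * sum_i p_i log p_i.
    def stirling_log_omega(counts):
        N = sum(counts)
        return -N * sum((c / N) * math.log(c / N) for c in counts if c > 0)

    counts = [500_000, 250_000, 125_000, 125_000]  # N_i for n = 4 groups
    print(exact_log_omega(counts))     # ~1.213e6
    print(stirling_log_omega(counts))  # ~1.213e6, close for large N

The two numbers agree up to corrections of order $\log N$, which are negligible for a macroscopic number of particles.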

Gibbs Entropy

The Gibbs entropy generalizes the Boltzmann entropy to a system with a probability distribution over different energy levels instead of a single fixed energy.

Suppose the system can have energy $E_i$ with probability $P_i$ for $i=1,\cdots, n$.

We consider $N$ identical copies of this system, with $N$ very large, and let them reach thermal equilibrium with a giant heat reservoir. The total energy of the reservoir and the $N$ copies of the system is fixed, and thus the Boltzmann entropy (1) is applicable. The total number of microstates is $\Omega=\Omega_r\Omega_s$, where $\Omega_r$ is the number of microstates of the reservoir and $\Omega_s$ is the number of microstates of the $N$ copies of the system.

Note that among the $N$ copies of the system, $P_iN$ copies have energy $E_i$. As a result, \begin{equation} \Omega_s=\frac{N!}{(P_1N)!\,(P_2 N)!\,\cdots\,(P_n N)!}\,.\end{equation} Following exactly the same derivation as for Eq. (2), we obtain the total entropy of the $N$ copies of the system, \begin{equation}S_s=k_B\log\Omega_s =-k_B N \sum_{i=1}^n P_i \log P_i\,,\end{equation} and the average entropy of a single copy of the system, \begin{equation}S =-k_B\sum_{i=1}^n P_i \log P_i\,.\tag{3}\end{equation} Remarks:

  • The mathematical forms of the Boltzmann entropy (2) and the Gibbs entropy (3) are almost the same, but the underlying physical meanings are different. In the jargon of modern statistical physics, the Boltzmann entropy (2) is for the microcanonical ensemble, while the Gibbs entropy (3) is for the canonical ensemble.
  • The Boltzmann entropy (1) is a special case of the Gibbs entropy (3) when $n=\Omega$ and $P_i = 1/\Omega$, as the short check after this list illustrates.
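
As a quick check of the last remark, here is a minimal sketch (again with $k_B=1$) showing that the Gibbs formula with a uniform distribution over $\Omega$ microstates reduces to $\log\Omega$:

    import math

    # Gibbs entropy, Eq. (3), with k_B = 1.
    def gibbs_entropy(P):
        return -sum(p * math.log(p) for p in P if p > 0)

    Omega = 1024
    uniform = [1 / Omega] * Omega     # P_i = 1/Omega for every microstate
    print(gibbs_entropy(uniform))     # 6.931..., equals log(1024)
    print(math.log(Omega))            # the Boltzmann entropy of Eq. (1)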

Information Entropy

In information theory, the entropy is defined as the minimal average number of bits required to convey a message during communication.

For example, if there are only two weather conditions with equal probability, a weather station only needs to convey 1 bit to broadcast which of the two conditions is happening. Similarly, if there are four weather conditions with equal probability, the station only needs to convey 2 bits, encoding the four conditions as 00, 01, 10, and 11, respectively.

A more interesting case is when the probabilities of the four weather conditions are not equal but are, say, 1/2, 1/4, 1/8, and 1/8 instead. In this case, if we encode the four conditions as 0, 10, 110, and 111, then the average number of bits is \begin{equation} \frac{1}{2}\times 1 +  \frac{1}{4}\times 2 + \frac{1}{8}\times  3+  \frac{1}{8}\times  3= 1.75 < 2\,.\end{equation} Note that a fractional number of bits is meaningful: given that the expected number of bits is 1.75 per day, the expected message length across 100 days is 175 bits, an integer.

In general, let $I(p)$ be the minimal number of bits required to encode a single event happening with probability $p$. To determine an explicit form of $I(p)$, consider an event consisting of two independent sub-events with probabilities $p_1$ and $p_2$. In this case, we require \begin{equation} I(p_1p_2) = I(p_1) + I(p_2)\,,\end{equation} which suggests that $I(p)$ should take the form \begin{equation}I(p)=-\log_2\,p\,,\end{equation} with base 2 so that an event with $p=1/2$ costs exactly one bit. Therefore, the average number of bits needed to encode all events is \begin{equation} H(p)=-\sum_{i=1}^n p_i \log_2\,p_i\,.\tag{4}\end{equation}
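
To connect Eq. (4) back to the weather-station example, here is a minimal sketch computing the expected length of the prefix code 0, 10, 110, 111 and comparing it with $H(p)$:

    import math

    # Information entropy in bits, Eq. (4).
    def entropy_bits(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    p = [1/2, 1/4, 1/8, 1/8]
    code = ["0", "10", "110", "111"]   # prefix code from the example above
    avg_len = sum(pi * len(c) for pi, c in zip(p, code))
    print(avg_len)          # 1.75 bits per message
    print(entropy_bits(p))  # 1.75 bits: this code attains the entropy bound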

Remarks:

  • Entropy in physics is a measure of disorder, i.e., the degree of randomness or uncertainty. Modern information theory extends entropy to a measure of ignorance, the lack of information. In this view, entropy appears in statistical physics because we lack information about the underlying microstates.
  • Conveying messages helps reduce ignorance: the larger the ignorance, the longer the message required. This is why information theory directly defines the entropy as the average number of bits in the conveyed message.
  • The information entropy (4) differs from the Gibbs entropy (3) by an overall constant factor.

Cross Entropy and KL Divergence

The view of entropy as the average number of bits makes it easy to understand the concept of cross entropy in machine learning.

Given a collection of events with true probabilities $p_1,\cdots, p_n$, theoretically we should encode them with $-\log_2p_1,\cdots, -\log_2p_n$ bits, respectively. But in practice, we usually don't know the true probabilities. Instead, we estimate or predict the probabilities of these events as $q_1, \cdots, q_n$ and consequently encode the events with $-\log_2q_1,\cdots, -\log_2q_n$ bits, respectively. As a result, the actual average number of bits becomes \begin{equation}H(p, q)=-\sum_{i=1}^n p_i \log_2\,q_i\,,\tag{5}\end{equation} which is the so-called cross entropy.
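
As a toy illustration with made-up numbers, suppose the true probabilities $p$ are those of the weather example but the code is built from a mistaken uniform estimate $q$; the average message length then exceeds $H(p)$:

    import math

    # Cross entropy in bits, Eq. (5): average code length when events with
    # true probabilities p are encoded with lengths -log2(q_i).
    def cross_entropy_bits(p, q):
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

    p = [1/2, 1/4, 1/8, 1/8]   # true probabilities
    q = [1/4, 1/4, 1/4, 1/4]   # mistaken uniform estimate
    print(cross_entropy_bits(p, q))  # 2.0 bits, larger than H(p) = 1.75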

Remarks:

  • The information entropy (4) is the average number of bits required in theory, while the cross entropy (5) is the actual average number of bits used in practice.
  • When I first learned the concept of cross entropy in machine learning, I found it unnatural that the cross entropy $H(p, q)$ is not symmetric in $p$ and $q$. The above view explains well why the cross entropy should be $H(p, q)$ rather than $H(q, p)$.
  • In the case of binary classification, the cross entropy (5) becomes \begin{equation}H(p, q)=-p\log q -(1-p)\log(1-q)\,,\end{equation} where the base of the logarithm is a matter of convention (the natural logarithm is common in machine learning).

From the above view, the cross entropy (5) should always be larger than or equal to the information entropy (4), since the information entropy is the theoretical lower bound. The so-called Kullback–Leibler divergence is defined as the gap between the cross entropy and the information entropy: \begin{equation}D_{KL}(p|| q)\equiv H(p, q)-H(p)=-\sum_{i=1}^n p_i \log_2\frac{q_i}{p_i}\,.\tag{6}\end{equation} Indeed, $D_{KL}(p|| q)\geq 0$ by Jensen's inequality: \begin{equation}D_{KL}(p|| q)\geq -\log_2\left(\sum_{i=1}^n p_i  \frac{q_i}{p_i}\right) = -\log_2\left(\sum_{i=1}^n q_i\right)=-\log_21=0\,. \end{equation}
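
Continuing the same toy numbers, a minimal sketch confirms that the KL divergence equals the gap $H(p, q)-H(p)$ and is non-negative:

    import math

    # KL divergence in bits, Eq. (6): D_KL(p || q) = H(p, q) - H(p) >= 0.
    def kl_divergence_bits(p, q):
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [1/2, 1/4, 1/8, 1/8]
    q = [1/4, 1/4, 1/4, 1/4]
    print(kl_divergence_bits(p, q))  # 0.25 = 2.0 - 1.75, non-negative as expected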

