Why do Markov chains with larger spectral gaps mix faster?
Starting from the basics of Markov chains, this article will help you develop intuitions about why Markov chains with large spectral gaps mix faster.
A Markov chain is a stochastic model describing a sequence of state transitions in which the probability of each transition depends only on the current state. Markov chains have proven useful in a wide range of fields, such as biology (e.g., DNA evolution models), chemistry (e.g., chemical reaction networks), finance (e.g., stock market forecasting), speech recognition, and even music composition. In my research field, researchers use Markov state models to analyze molecular dynamics simulations and gain insights into biomolecular systems. There are a number of classes of Markov chains or Markov models. In this article, we will restrict our discussion to finite-state Markov chains, which are more common than their infinite-state analogs.
As exemplified in Figure 1, the state transitions in a Markov chain are frequently described by a transition/stochastic matrix, which encodes key characteristics of the Markov chain. For example, the spectral gap of a transition matrix is closely related to the mixing time of the corresponding Markov chain, which is relevant to one of the topics I'm interested in in my Ph.D. research: assessing the sampling efficiencies of different algorithms in molecular dynamics. Roughly, the mixing time is the number of steps it takes for the deviation from the stationary distribution to drop by a factor of $e$. A shorter mixing time is generally reflected by a larger spectral gap. Below we define a spectral gap: given the eigenvalues of the transition matrix ordered as $1 = \lambda_1 > |\lambda_2| \geq \dots \geq |\lambda_n|$, the spectral gap is $\gamma = 1 - |\lambda_2|$.
Notably, the definition above implies that for any transition matrix, the largest eigenvalue is always 1. While this statement, along with the definition of the spectral gap, is almost common sense in relevant fields, introductory materials tend not to delve into the theory behind the relation between the spectral gap and the mixing time. Still, some mathematical details that are generally missing from introductory materials could help beginners develop intuitions about why and how the spectral gap and mixing time are related. To fill in this gap, this article aims to equip you with the mathematical background required to develop intuitions about the following statement:
Markov chains with larger spectral gaps mix faster.
In the following sections, I will first review some conventions and definitions of stochastic matrices and the eigenvalue equation. Then, I will prove that 1 is always the largest eigenvalue of a transition matrix and provide intuitions about why Markov chains with larger spectral gaps mix faster. Finally, I will conclude the article with a note about transition matrices in molecular dynamics. If you are already familiar with Markov chains and just curious about intuitions about the spectral gap and mixing time, feel free to jump to the last section of the article. For a deep dive into mixing times in Markov chains, I recommend the textbook written by Montenegro and Tetali or this handout by Dabbs.
Different definitions of stochastic/transition matrices
A stochastic matrix, or a transition matrix, is a square matrix frequently used to describe the transitions of a Markov chain. There are several definitions of a stochastic/transition matrix:
- A right/row stochastic matrix is a real square matrix with each row summing to 1.
- A left/column stochastic matrix is a real square matrix with each column summing to 1.
- A doubly stochastic matrix is a real square matrix with each row and each column summing to 1. (Note that a doubly stochastic matrix is not necessarily symmetric.)
Eigenvalues and eigenvectors
For an $n \times n$ matrix $A$, if a non-zero column vector ${\bf x}$ and a scalar $\lambda$ satisfy $A{\bf x}=\lambda {\bf x}$, we call ${\bf x}$ and $\lambda$ the right eigenvector and right eigenvalue of the matrix $A$, respectively. On the other hand, a row vector ${\bf y}$ and a scalar $\lambda$ are respectively the left eigenvector and left eigenvalue of $A$ if ${\bf y}A=\lambda {\bf y}$. Geometrically, given the direct correspondence between $n \times n$ matrices and linear transformations in an $n$-dimensional vector space, the eigenvalue equation describes the case where a linear transformation ($A$) scales a vector (${\bf x}$) by a factor of $\lambda$ without rotating it, except that the direction of the vector is reversed when $\lambda<0$. Traditionally, the set of eigenvalues of a matrix is called the spectrum of the matrix, and the largest of the absolute values of the eigenvalues is termed the spectral radius.
Supplementary note: The left and right eigenvalues of a matrix are equivalent.

For a square matrix $A$, we define a right eigenvalue $\lambda_r$ and the corresponding right eigenvector ${\bf x}_r$. This gives us $A{\bf x}_r=\lambda_r {\bf x}_r \Rightarrow (A-\lambda_r I){\bf x}_r={\bf 0} \Rightarrow \det (A-\lambda_r I)=0$, given that ${\bf x}_r$ must be a non-zero vector. Similarly, for a left eigenvalue $\lambda_l$ and its corresponding left eigenvector ${\bf x}_l$, we have ${\bf x}_l A=\lambda_l {\bf x}_l \Rightarrow ({\bf x}_l A)^T = A^T {\bf x}_l^T = \lambda_l {\bf x}_l^T \Rightarrow (A^T-\lambda_l I){\bf x}_l^T={\bf 0}$, i.e., $\det(A^T-\lambda_l I)=0$. Since the determinant of a matrix is equal to the determinant of its transpose, we have $\det (A^T-\lambda_l I) = \det\left((A-\lambda_l I)^T\right)=\det (A-\lambda_l I)=0$. That is, the left and right eigenvalues satisfy the same characteristic equation $\det(A-\lambda I)=0$, so the set of left eigenvalues is identical to the set of right eigenvalues. Notably, while the left and right eigenvalues are equivalent, this is not necessarily true for the left and right eigenvectors.
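As a quick numerical sanity check (a sketch in NumPy; the example matrix is arbitrary and my own), we can compare the spectrum of a matrix with that of its transpose, since the left eigenvalues of $A$ are exactly the right eigenvalues of $A^T$:

```python
import numpy as np

# Left eigenvalues of A are the right eigenvalues of A^T, so comparing the
# sorted spectra of A and A^T checks that the two sets coincide.
A = np.array([[2.0, 1.0, 0.0],
              [0.5, 1.0, 3.0],
              [1.0, 0.0, 4.0]])

right_eigs = np.sort_complex(np.linalg.eigvals(A))    # right eigenvalues
left_eigs = np.sort_complex(np.linalg.eigvals(A.T))   # left eigenvalues

print(np.allclose(right_eigs, left_eigs))  # True
```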
The largest eigenvalue of a transition matrix is always 1.
To prove the statement in the section title, let’s first prove that any transition matrix has an eigenvalue of 1, then prove that 1 is always the largest eigenvalue of a transition matrix.
For a right stochastic matrix $P$, it is obvious that 1 is a right eigenvalue associated with a right eigenvector of ${\bf 1}$, given that $P{\bf 1}={\bf 1}$. We could end the proof here by noting that the left and right eigenvalues are equal. (See the Supplementary note above.) Alternatively, we could also consider the left eigenvalues by writing out each component of ${\bf x}P=\lambda {\bf x}$, i.e., $\sum_i x_i p_{ij}=\lambda x_j$: $$\begin{align}x_1p_{11} + x_2p_{21} + &... + x_np_{n1}=\lambda x_1 \\ x_1p_{12} + x_2p_{22} + &... + x_np_{n2}=\lambda x_2 \\ &... \\ x_1p_{1n} + x_2p_{2n} + &... + x_np_{nn}=\lambda x_n \end{align}$$ Adding up all the equations above, we get $$x_1\sum_j p_{1j} + x_2 \sum_j p_{2j} + ... + x_n \sum_j p_{nj}=\lambda (x_1 + x_2 + ... + x_n)$$ For any row $i$ of a right stochastic matrix, $\sum_j p_{ij}=1$. Therefore, we have $x_1 + x_2 + ... + x_n = \lambda(x_1 + x_2 + ... + x_n)$, which implies that $\lambda=1$ whenever the components of ${\bf x}$ do not sum to 0.
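We can verify both observations numerically. The sketch below (using NumPy, with a made-up $3 \times 3$ row stochastic matrix) checks that ${\bf 1}$ is a right eigenvector with eigenvalue 1 and that 1 indeed appears in the computed spectrum:

```python
import numpy as np

# A small right (row) stochastic matrix: each row sums to 1.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.05, 0.15, 0.8]])

# The all-ones vector is a right eigenvector with eigenvalue 1: P @ 1 = 1.
ones = np.ones(3)
print(np.allclose(P @ ones, ones))  # True

# 1 also appears among the eigenvalues returned by eigvals.
eigvals = np.linalg.eigvals(P)
print(np.isclose(eigvals, 1.0).any())  # True
```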
Proving that 1 is always an eigenvalue of a transition matrix is not sufficient, as we still need to prove that it is the largest eigenvalue, for which I found the Gershgorin circle theorem quite handy:
Gershgorin circle theorem: Every eigenvalue of a square matrix lies within at least one of the Gershgorin discs.
The Gershgorin circle theorem is useful for bounding the spectrum of a square matrix. To understand the definition of a Gershgorin disc, let's consider a complex $n \times n$ matrix $A$ with entries $a_{ij}$. For $i \in \{1, \dots, n\}$, let $R_i$ be the sum of the moduli of the off-diagonal entries in the $i$-th row, i.e., $R_i=\sum_{j\neq i} |a_{ij}|$; then a Gershgorin disc $D(a_{ii}, R_i)$ is a closed disc centered at $a_{ii}$ with radius $R_i$ on the complex plane. As an example, Figure 2 shows five Gershgorin discs for a $5 \times 5$ complex matrix, whose eigenvalues all lie in at least one of the discs.
To prove the theorem, we let $\lambda$ be an eigenvalue of $A$ and let $x_i$ be the element with the largest absolute value in the corresponding eigenvector ${\bf x}=(x_j)$. Then, the $i$-th component of the eigenvalue equation $A{\bf x}=\lambda{\bf x}$ satisfies: $$\sum_j a_{ij}x_j = \lambda x_i \Rightarrow \sum_{j\neq i} a_{ij}x_j=(\lambda - a_{ii})x_i$$
By applying the triangle inequality and recalling that $|x_j|\leq|x_i|$, we can prove the theorem: $$|\lambda - a_{ii}| = \left|\sum_{j\neq i} \frac{a_{ij}x_j}{x_i}\right|\leq\sum_{j\neq i} |a_{ij}|=R_i$$ For a stochastic matrix, where $a_{ii}$ and $R_i$ are nonnegative real numbers with $a_{ii} + R_i = \sum_j a_{ij} = 1$, the triangle inequality further gives $$|\lambda| \leq |\lambda - a_{ii}| + |a_{ii}| \leq R_i + a_{ii} = 1$$ This means that the modulus of every eigenvalue of a transition matrix is at most 1. As we have proven that an eigenvalue of 1 always exists, this altogether proves that 1 is always the largest eigenvalue of a transition matrix. In probability theory, this is often expressed as $\pi P=\pi$ and called the balance equation. Notably, the eigenvector $\pi=(\pi_i)_{i\in \mathcal{S}}$ is called the Perron-Frobenius vector, which is associated with the largest eigenvalue $\lambda_{\text{pf}}$ of a matrix (not limited to a stochastic matrix), called the Perron-Frobenius eigenvalue.
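The bound above is easy to check numerically. In this NumPy sketch (the matrix is a made-up example), we compute the Gershgorin disc centers and radii of a row stochastic matrix and confirm that every eigenvalue lies in at least one disc and has modulus at most 1:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.05, 0.15, 0.8]])

centers = np.diag(P)                 # a_ii
radii = P.sum(axis=1) - centers      # R_i: off-diagonal row sums (entries >= 0)

eigvals = np.linalg.eigvals(P)

# Every eigenvalue lies in at least one Gershgorin disc...
in_some_disc = [(np.abs(lam - centers) <= radii + 1e-12).any() for lam in eigvals]
print(all(in_some_disc))  # True

# ...and since a_ii + R_i = 1 for a stochastic matrix, |lambda| <= 1.
print(np.all(np.abs(eigvals) <= 1 + 1e-12))  # True
```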
Interestingly, the expression $\pi P = \pi$ implies that there exists a vector whose components do not change upon the application of the transition matrix $P$ but remain stationary. This eigenvector $\pi$ represents the stationary distribution of the Markov chain if $\pi$ also satisfies the following two conditions: (1) $\pi_i \geq 0$ for all $i\in \mathcal{S}$ and (2) $\sum_{i \in \mathcal{S}}\pi_i=1$. For a transition matrix, the first condition is guaranteed by the famous Perron-Frobenius theorem:
Perron-Frobenius theorem: A positive square matrix has a unique maximal eigenvalue, which corresponds to a positive eigenvector.
In a Markov chain, where the transition matrix is frequently not positive but only nonnegative, a variation of the Perron-Frobenius theorem is more relevant: For a nonnegative square matrix, the Perron-Frobenius eigenvalue $\lambda_{\text{pf}}$ is associated with nonnegative left and right eigenvectors. Finally, to satisfy the second condition, we can always scale the Perron-Frobenius vector such that the sum of its components is 1, as one can easily prove that a scaled eigenvector is still an eigenvector.
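Putting the pieces together, a common numerical recipe (sketched below with NumPy on a made-up transition matrix) is to extract the left eigenvector associated with eigenvalue 1, i.e., the right eigenvector of $P^T$, and rescale it so its components sum to 1:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.05, 0.15, 0.8]])

# Left eigenvectors of P are right eigenvectors of P^T.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))   # index of the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                     # rescale so the components sum to 1

print(np.allclose(pi @ P, pi))         # True: pi satisfies pi P = pi
print(np.all(pi >= 0))                 # True: nonnegative, as Perron-Frobenius guarantees
```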
Supplementary note: The existence and uniqueness of stationary distributions
A Markov chain could have:

- No stationary distribution: A classic example is an infinite-state Markov chain with state space $\mathcal{S}=\{0, 1, 2, \dots\}$ and $P(n+1|n)=1$.
- One unique stationary distribution: This is true for any finite-state, irreducible Markov chain, where all states are positive recurrent. (A Markov chain is irreducible if every state can be reached from every other state within a finite number of steps.)
- Multiple stationary distributions: This occurs in a finite-state, reducible Markov chain with more than one closed set of recurrent states. For example, if in Figure 1 we set $p_{11}=1$, $p_{21}=p_{23}=0.5$, and $p_{33}=1$, then there are infinitely many stationary distributions $\pi=[p, 0, 1-p]$ with $0 \leq p \leq 1$.

In fact, the principle for distinguishing these cases is that a Markov chain has at least one stationary distribution if and only if it has at least one positive recurrent state. It follows that any finite-state Markov chain has at least one stationary distribution. (Check the next Supplementary note for the definition of recurrence.)
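The reducible example above is easy to verify numerically. In this NumPy sketch, $\pi = [p, 0, 1-p]$ satisfies $\pi P = \pi$ for any choice of $p$:

```python
import numpy as np

# Reducible 3-state chain: states 1 and 3 are absorbing (p11 = 1, p33 = 1)
# and state 2 jumps to either neighbor with probability 0.5.
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])

# Every pi = [p, 0, 1 - p] with 0 <= p <= 1 is stationary.
for p in (0.0, 0.25, 0.9):
    pi = np.array([p, 0.0, 1.0 - p])
    print(np.allclose(pi @ P, pi))  # True each time
```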
Supplementary note: The transience and recurrence of Markov chains
A state in a Markov chain is either transient or recurrent. Specifically, a Markov chain starting from a transient state has a nonzero probability of never returning to that state, so the state is visited only a finite number of times. On the other hand, a Markov chain starting from a recurrent state is guaranteed (with probability 1) to return to that state. Recurrent states can be further divided into positive recurrent and null recurrent states. To understand this, let's take a look at the example in Figure S1.
In Figure S1, if $p<0.5$, state 0 will be transient because the chain tends to drift towards states with larger values of $i$. On the other hand, state 0 will be recurrent if $p>0.5$, because the chain tends to drift back toward state 0. Interestingly, the larger the value of $p$, the faster we expect a return to state 0. When $p=0.5$, we expect a return to state 0 to occur (the probability of recurrence is 1), but the mean recurrence time is infinite. In this case, state 0 is null recurrent.
Additionally, here are some facts about the transience and recurrence of Markov chains:
- Only infinite-state Markov chains can have null recurrent states.
- The states in an irreducible Markov chain are either all transient, all null recurrent, or all positive recurrent; in a finite-state irreducible chain, all states are positive recurrent. That is, an irreducible Markov chain has at most one stationary distribution.
- Any finite-state Markov chain must have at least one (positive) recurrent state. (If all states were transient, each of them would be visited only finitely many times, which is impossible when infinitely many steps are distributed over finitely many states.)
For any reversible Markov chain, all eigenvalues of its transition matrix $P$ are real and $P$ is diagonalizable.
Before we explore the relation between the spectral gap and mixing time, I want to insert this supplementary section about reversible Markov chains, which is the type of Markov chain that most relevant discussions revolve around.
Before we prove the statement in the section title, we need to first understand the reversibility of a Markov chain: A Markov chain is reversible if and only if for all $i$ and $j$, it satisfies the detailed balance equation $\pi_{i}p_{ij}=\pi_{j}p_{ji}$. Physically, the detailed balance condition implies that the probability flux between every pair of states is equal in both directions, which is a stricter criterion than the balance condition (which only requires the total flux into and out of each state to be equal).
To prove that all eigenvalues of a reversible transition matrix $P$ are real, we consider a similar matrix of $P$, namely, $Q=DPD^{-1}$, where $D=\text{diag}(\sqrt{\pi_1}, \dots, \sqrt{\pi_n})$. As such, $Q_{ij}=(DPD^{-1})_{ij}=\sqrt{\pi_{i}}p_{ij}\frac{1}{\sqrt{\pi_j}}$. (Note that the inverse of a diagonal matrix is obtained by replacing the main diagonal elements of the matrix with their reciprocals.) Now, given $\pi_{i}p_{ij}=\pi_{j}p_{ji}$, we have $$Q_{ji}=\sqrt{\pi_{j}}p_{ji}\frac{1}{\sqrt{\pi_i}}=\frac{1}{\sqrt{\pi_j}}\pi_{j}p_{ji}\frac{1}{\sqrt{\pi_i}}=\frac{1}{\sqrt{\pi_j}}\pi_ip_{ij}\frac{1}{\sqrt{\pi_i}}=\sqrt{\pi_{i}}p_{ij}\frac{1}{\sqrt{\pi_j}}=Q_{ij}$$ That is, $Q$ is symmetric and its eigenvalues are hence real. (See Theorem 1 here.) Given that $P$ is similar to $Q$, all eigenvalues of $P$ are also real. Additionally, since $Q$ is a symmetric matrix, it is always diagonalizable, which implies that $P$, a reversible transition matrix, is also diagonalizable.
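The symmetrization trick can be checked numerically. The sketch below uses a made-up birth-death chain (which is always reversible; its stationary distribution follows from detailed balance) and confirms that $Q = DPD^{-1}$ is symmetric and that the eigenvalues of $P$ are real:

```python
import numpy as np

# A birth-death chain is always reversible; this one has stationary
# distribution pi = [0.25, 0.5, 0.25] (obtained from detailed balance).
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])

D = np.diag(np.sqrt(pi))
D_inv = np.diag(1.0 / np.sqrt(pi))
Q = D @ P @ D_inv

print(np.allclose(Q, Q.T))                        # True: Q is symmetric
print(np.allclose(np.linalg.eigvals(P).imag, 0))  # True: eigenvalues of P are real
```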
So, why is a large spectral gap indicative of faster mixing?
Finally, we are approaching the core question of this article: Why do Markov chains with larger spectral gaps mix faster?
Typically, discussions for answering this question are restricted to ergodic (meaning irreducible and aperiodic) and reversible Markov chains, which are not uncommon in the context of molecular dynamics. (See the Supplementary note below.) As a reminder, such Markov chains have the following properties:
- An irreducible finite-state Markov chain always has one unique stationary distribution $\pi$. (Note that the mixing time is only meaningful when a stationary distribution exists.)
- If a Markov chain is irreducible and aperiodic, the eigenvalue of 1 is unique (i.e., has a multiplicity of 1) and the moduli of all other eigenvalues are strictly less than 1. (This is not elaborated in this article due to limited space.)
- For any reversible Markov chain, all eigenvalues of its transition matrix $P$ are real and $P$ is diagonalizable.
Since such a transition matrix $P$ is diagonalizable, we can expand any initial distribution ${\bf p}_0$ in the basis of the left eigenvectors ${\bf v}_k$ of $P$, i.e., ${\bf p}_0 = \pi + \sum_{k=2}^n c_k {\bf v}_k$, so that $${\bf p}_0 P^t = \pi + \sum_{k=2}^n c_k \lambda_k^t {\bf v}_k$$ As $t$ increases, each term in the summation decays exponentially. (Also note that $\pi=\lim_{t \rightarrow \infty} {\bf p}_0P^t$.) Since $|\lambda_2|\geq|\lambda_3|\geq \dots \geq|\lambda_n|$, the difference between ${\bf p}_0P^t$ and the stationary distribution $\pi$ is dominated by $\lambda_2$. That is, given a large spectral gap $1-|\lambda_2|$ (small $|\lambda_2|$), the deviation decays to 0 faster, leading to a shorter mixing time!
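To make the statement concrete, the NumPy sketch below compares two made-up two-state chains of the form $[[1-a, a], [a, 1-a]]$, whose eigenvalues are 1 and $1-2a$: the chain with the larger spectral gap converges to $\pi$ much faster:

```python
import numpy as np

def deviation_after(P, p0, pi, t):
    """L1 deviation of p0 P^t from the stationary distribution pi."""
    return np.abs(p0 @ np.linalg.matrix_power(P, t) - pi).sum()

pi = np.array([0.5, 0.5])   # stationary distribution of both chains
p0 = np.array([1.0, 0.0])   # start concentrated on state 1

# Eigenvalues are 1 and 1 - 2a: the "fast" chain has lambda_2 = 0.2
# (spectral gap 0.8), the "slow" chain has lambda_2 = 0.9 (gap 0.1).
P_fast = np.array([[0.6, 0.4], [0.4, 0.6]])
P_slow = np.array([[0.95, 0.05], [0.05, 0.95]])

for t in (1, 5, 20):
    print(t, deviation_after(P_fast, p0, pi, t), deviation_after(P_slow, p0, pi, t))
```

For these two-state chains, the $L_1$ deviation after $t$ steps is exactly $|\lambda_2|^t$, so the decay rate is governed entirely by the second eigenvalue, i.e., by the spectral gap.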
Supplementary note: Transition matrices in molecular dynamics

Notably, reversible, irreducible, and aperiodic Markov chains are not uncommon in molecular dynamics.
This is the end of the article! 🎉🎉 If you enjoyed this article, you are welcome to share it or leave a comment below, so I will be more motivated to write more! Thank you for reading this far! 😃