Why do Markov chains with larger spectral gaps mix faster?

Starting from the basics of Markov chains, this article will help you develop intuitions about why Markov chains with large spectral gaps mix faster.

Wei-Tse Hsu

Nov 7, 2022 15 min read Mathematics

A Markov chain is a stochastic model describing a sequence of state transitions in which the probability of each transition depends only on the current state. Markov chains have proven useful in a wide range of fields, such as biology (e.g., DNA evolution models), chemistry (e.g., chemical reaction networks), finance (e.g., stock market forecasting), speech recognition, and even music composition. In my research field, researchers use Markov state models to analyze molecular dynamics simulations to get insights into biomolecular systems. There are a number of classes of Markov chains or Markov models. In this article, we will restrict our discussion to finite-state Markov chains, which are more common than their infinite-state analogs.

Figure 1. A 3-state Markov chain and its transition matrix. $p_{i j}$ denotes the transition probability from state $i$ to state $j$ .

As exemplified in Figure 1, the state transitions in a Markov chain are frequently described by a transition/stochastic matrix, which encodes key characteristics of the Markov chain. For example, the spectral gap of a transition matrix is closely related to the mixing time of the corresponding Markov chain, which is useful for one of the topics in my Ph.D. research: assessing the sampling efficiencies of different algorithms in molecular dynamics. Roughly, the mixing time is the number of steps it takes for the deviation from the stationary distribution to drop by a factor of $e$ . A larger spectral gap generally means a shorter mixing time. Below we define the spectral gap:

Definition: A spectral gap is the difference between the moduli of the two largest eigenvalues of a matrix, i.e., $| λ_{1} | - | λ_{2} |$ given eigenvalues $λ_{i}$ of an $n \times n$ matrix with $| λ_{1} | \geq | λ_{2} | \geq \dots \geq | λ_{n} |$ . In particular, for a transition matrix, the spectral gap is $1 - | λ_{2} |$ .
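To make the definition concrete, here is a minimal sketch in Python, assuming NumPy is available. The 3-state matrix `P` and the helper name `spectral_gap` are hypothetical choices for illustration, not something taken from Figure 1.

```python
import numpy as np

# Hypothetical 3-state row-stochastic matrix (values chosen for illustration).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

def spectral_gap(matrix):
    """Return |lambda_1| - |lambda_2| with eigenvalues sorted by modulus."""
    moduli = np.sort(np.abs(np.linalg.eigvals(matrix)))[::-1]
    return moduli[0] - moduli[1]

eigenvalue_moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
gap = spectral_gap(P)
print("moduli of the eigenvalues:", eigenvalue_moduli)
print("spectral gap:", gap)
```

For this particular matrix, the eigenvalues are 1, 0.4 and 0.3 (their sum equals the trace, 1.7, and their product equals the determinant, 0.12), so the spectral gap is 0.6.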

Notably, the definition above implies that for any transition matrix, the largest eigenvalue (in modulus) is always 1. While this statement, along with the definition and implication of a spectral gap, is almost common knowledge in relevant fields, introductory materials tend not to delve into the theory behind the relation between the spectral gap and mixing time. Still, some mathematical details that are generally missing from introductory materials could help beginners develop intuitions about why and how the spectral gap and mixing time are related. To fill this gap, this article aims to provide the mathematical background required to develop intuitions about the following statement:

Markov chains with larger spectral gaps mix faster.

In the following sections, I will first review some conventions and definitions of stochastic matrices and the eigenvalue equation. Then, I will prove that 1 is always the largest eigenvalue of a transition matrix and provide intuitions about why Markov chains with larger spectral gaps mix faster. Finally, I will conclude the article with a note about transition matrices in molecular dynamics. If you are already familiar with Markov chains and just curious about intuitions about the spectral gap and mixing time, feel free to jump to the last section of the article. For a deep dive into mixing times in Markov chains, I recommend the textbook written by Montenegro and Tetali or this handout by Dabbs.

Different definitions of stochastic/transition matrices

A stochastic matrix, or a transition matrix, is a square matrix frequently used to describe the transitions of a Markov chain. There are several definitions of a stochastic/transition matrix:

A right/row stochastic matrix is a real square matrix with each row summing to 1.
A left/column stochastic matrix is a real square matrix with each column summing to 1.
A doubly stochastic matrix is a real square matrix with each row and column summing to 1. It is not necessarily symmetric, but every symmetric stochastic matrix is doubly stochastic.

In this article, I will adopt the row-stochastic convention with $p_{i j}$ denoting the probability of a state transition from state $i$ to state $j$ . Since the total of transition probabilities from state $i$ to all states (including itself) must be 1, we have $\sum_{j = 1}^{n} p_{i j} = 1$ , or $P \mathbf{1} = \mathbf{1}$ , where $\mathbf{1}$ is an $n$ -dimensional column vector of all ones.
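The row-stochastic convention is easy to check numerically. The snippet below is a small sketch assuming NumPy; the matrix values and the helper `is_row_stochastic` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Hypothetical 3-state transition matrix; entry [i, j] is the probability of i -> j.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

def is_row_stochastic(matrix, tol=1e-12):
    """True if all entries are nonnegative and every row sums to 1."""
    matrix = np.asarray(matrix, dtype=float)
    rows_sum_to_one = np.allclose(matrix.sum(axis=1), 1.0, atol=tol)
    return bool(np.all(matrix >= 0) and rows_sum_to_one)

ones = np.ones(P.shape[0])
print(is_row_stochastic(P))            # each row sums to 1
print(np.allclose(P @ ones, ones))     # equivalent statement: P 1 = 1
print(is_row_stochastic(P.T))          # the transpose is column-stochastic instead
```

Under the column-stochastic convention, the same chain would be described by the transpose of `P`, and distributions would be column vectors multiplied from the right.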

Eigenvalues and eigenvectors

For an $n \times n$ matrix $A$ , if a non-zero column vector $x$ and a scalar $λ$ satisfy $A x = λ x$ , we call $x$ and $λ$ the right eigenvector and right eigenvalue of the matrix $A$ , respectively. On the other hand, a non-zero row vector $y$ and a scalar $λ$ are respectively the left eigenvector and left eigenvalue of $A$ if $y A = λ y$ . Geometrically, given the direct correspondence between $n \times n$ matrices and linear transformations in an $n$ -dimensional vector space, the eigenvalue equation describes the case where a linear transformation ( $A$ ) scales a vector ( $x$ ) with a scaling factor of $λ$ without rotating it, except that the direction of the vector is reversed when $λ < 0$ . Traditionally, the set of eigenvalues of a matrix is called the spectrum of the matrix, with the largest modulus among the eigenvalues termed the spectral radius.

Supplementary note: The left and right eigenvalues of a matrix are equivalent.

For a square matrix $A$ , consider a right eigenvalue $λ_{r}$ and the corresponding right eigenvector $x_{r}$ . This gives us $A x_{r} = λ_{r} x_{r} \Rightarrow (A - λ_{r} I) x_{r} = 0 \Rightarrow det (A - λ_{r} I) = 0$ given that $x_{r}$ must be a non-zero vector. Similarly, for a left eigenvalue $λ_{l}$ and its corresponding left eigenvector $x_{l}$ , we have $x_{l} A = λ_{l} x_{l} \Rightarrow (x_{l} A)^{T} = A^{T} x_{l}^{T} = λ_{l} x_{l}^{T} \Rightarrow (A^{T} - λ_{l} I) x_{l}^{T} = 0$ , i.e., $det (A^{T} - λ_{l} I) = 0$ . Since the determinant of a matrix is equal to the determinant of its transpose, we have $det (A^{T} - λ_{l} I) = det ((A - λ_{l} I)^{T}) = det (A - λ_{l} I) = 0$ . That is, every right eigenvalue and every left eigenvalue is a root of the same characteristic polynomial $det (A - λ I)$ , so the two sets of eigenvalues coincide. Notably, while the left and right eigenvalues are equivalent, this is not necessarily true for the left and right eigenvectors.
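As a quick numerical illustration of this note, the sketch below (assuming NumPy, with a hypothetical matrix `A`) obtains left eigenvectors as the right eigenvectors of the transpose and compares the two spectra and the two eigenvectors that belong to the eigenvalue 1.

```python
import numpy as np

# Hypothetical row-stochastic matrix used only for illustration.
A = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

right_vals, right_vecs = np.linalg.eig(A)     # columns x satisfy A x = lambda x
left_vals, left_vecs = np.linalg.eig(A.T)     # columns y^T satisfy y A = lambda y

# The two spectra agree (sorting removes any difference in ordering).
same_spectrum = np.allclose(np.sort_complex(right_vals), np.sort_complex(left_vals))

# Eigenvectors for the largest eigenvalue, rescaled so that their components sum to 1.
x = right_vecs[:, np.argmax(right_vals.real)].real
y = left_vecs[:, np.argmax(left_vals.real)].real
x, y = x / x.sum(), y / y.sum()

print("same spectrum:", same_spectrum)
print("right eigenvector:", x)
print("left eigenvector: ", y)
```

Here the right eigenvector is uniform while the left one is not, which illustrates that the eigenvalues are shared but the eigenvectors need not be.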

The largest eigenvalue of a transition matrix is always 1.

To prove the statement in the section title, let’s first prove that any transition matrix has an eigenvalue of 1, then prove that 1 is always the largest eigenvalue of a transition matrix.

For a right stochastic matrix $P$ , it is obvious that 1 is a right eigenvalue associated with the all-ones right eigenvector $\mathbf{1}$ given that $P \mathbf{1} = \mathbf{1}$ . We could end the proof here, since the left and right eigenvalues are equal. (See the Supplementary note above.) Alternatively, we could also consider the left eigenvalues by writing out each component of $x P = λ x$ , i.e., $\sum_{i} x_{i} p_{i j} = λ x_{j}$ : $\begin{aligned} x_{1} p_{11} + x_{2} p_{21} + \dots + x_{n} p_{n 1} & = λ x_{1} \\ x_{1} p_{12} + x_{2} p_{22} + \dots + x_{n} p_{n 2} & = λ x_{2} \\ & \vdots \\ x_{1} p_{1 n} + x_{2} p_{2 n} + \dots + x_{n} p_{n n} & = λ x_{n} \end{aligned}$ Adding up all the equations above, we get $x_{1} \sum_{j} p_{1 j} + x_{2} \sum_{j} p_{2 j} + \dots + x_{n} \sum_{j} p_{n j} = λ (x_{1} + x_{2} + \dots + x_{n})$ . For any row $i$ of a right stochastic matrix, $\sum_{j} p_{i j} = 1$ . Therefore, we have $x_{1} + x_{2} + \dots + x_{n} = λ (x_{1} + x_{2} + \dots + x_{n})$ , which implies that $λ = 1$ for any left eigenvector whose components do not sum to zero, such as a probability distribution.
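Summing the component equations gives $(1 - λ) \sum_{i} x_{i} = 0$ , so each left eigenvector either belongs to $λ = 1$ or has components that sum to zero. The following sketch, assuming NumPy and a hypothetical matrix `P`, checks this for every left eigenvector.

```python
import numpy as np

# Hypothetical row-stochastic matrix used only for illustration.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# Left eigenvectors of P are the right eigenvectors of its transpose (one per column).
eigenvalues, vectors = np.linalg.eig(P.T)
component_sums = vectors.sum(axis=0)

# (1 - lambda) * sum(x) should vanish for every eigenpair.
residuals = (1 - eigenvalues) * component_sums

for lam, total, res in zip(eigenvalues, component_sums, residuals):
    print(f"lambda = {lam:.3f}  sum(x) = {total:+.2e}  (1 - lambda) * sum(x) = {res:+.2e}")
```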

Proving that 1 is always an eigenvalue of a transition matrix is not sufficient, as we still need to prove that it is the largest eigenvalue, where I found the Gershgorin circle theorem quite handy:

Gershgorin circle theorem: Every eigenvalue of a square matrix lies within at least one of the Gershgorin discs.

The Gershgorin circle theorem is a useful tool for bounding the spectrum of a square matrix. To understand the definition of a Gershgorin disc, let’s consider a complex $n \times n$ matrix $A$ with entries $a_{i j}$ . For $i \in \{1, \dots, n\}$ , let $R_{i}$ be the sum of the moduli of the off-diagonal entries in the $i$ -th row, i.e., $R_{i} = \sum_{j \neq i} | a_{i j} |$ . Then a Gershgorin disc $D (a_{i i}, R_{i})$ is a closed disc centered at $a_{i i}$ with radius $R_{i}$ on the complex plane. As an example, Figure 2 shows the five Gershgorin discs of a $5 \times 5$ complex matrix, each of whose eigenvalues lies in at least one of the discs.

Figure 2. The 5 Gershgorin discs of a $5 \times 5$ matrix and its eigenvalues.

To prove the theorem, we let $λ$ be an eigenvalue of $A$ and $x_{i}$ be the component with the largest absolute value in the corresponding eigenvector $x = (x_{j})$ . Then, the $i$ -th component of the eigenvalue equation $A x = λ x$ satisfies: $\sum_{j} a_{i j} x_{j} = λ x_{i} \Rightarrow \sum_{j \neq i} a_{i j} x_{j} = (λ - a_{i i}) x_{i}$

By applying the triangle inequality and recalling that $| x_{j} | \leq | x_{i} |$ (with $x_{i} \neq 0$ because $x$ is a non-zero vector), we can prove the theorem: $| λ - a_{i i} | = | \sum_{j \neq i} \frac{a_{i j} x_{j}}{x_{i}} | \leq \sum_{j \neq i} | a_{i j} | = R_{i}$ . For a stochastic matrix, in which case $a_{i i}$ and $R_{i}$ are nonnegative real numbers with $a_{i i} + R_{i} = 1$ , applying the triangle inequality once more gives $| λ | \leq | λ - a_{i i} | + | a_{i i} | \leq R_{i} + a_{i i} = 1$ . (For a real eigenvalue, this reads $- R_{i} + a_{i i} \leq λ \leq a_{i i} + R_{i} = 1$ .) This means that no eigenvalue of a transition matrix has a modulus larger than 1. As we have proven that an eigenvalue of 1 always exists, this altogether proves that 1 is always the largest eigenvalue of a transition matrix. In probability theory, the eigenvalue equation for this eigenvalue is often expressed as $π P = π$ and called the balance equation. Notably, the eigenvector $π = (π_{i})_{i \in S}$ , where $S$ is the state space, is called the Perron-Frobenius vector, which is associated with the largest eigenvalue of a matrix (not limited to a stochastic matrix), $λ_{pf}$ , called the Perron-Frobenius eigenvalue.
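The disc construction and the resulting bound can be checked numerically. This is a sketch assuming NumPy; the helper `gershgorin_discs` and the matrix `P` are hypothetical names and values chosen for illustration.

```python
import numpy as np

def gershgorin_discs(matrix):
    """Return (centers, radii): a_ii and R_i = sum over j != i of |a_ij|."""
    matrix = np.asarray(matrix)
    centers = np.diag(matrix)
    radii = np.abs(matrix).sum(axis=1) - np.abs(centers)
    return centers, radii

# Hypothetical row-stochastic matrix used only for illustration.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

centers, radii = gershgorin_discs(P)
eigenvalues = np.linalg.eigvals(P)

# Each eigenvalue must lie in at least one closed disc |z - a_ii| <= R_i.
# A small tolerance covers eigenvalues that sit exactly on a disc boundary.
in_some_disc = [bool(np.any(np.abs(lam - centers) <= radii + 1e-12)) for lam in eigenvalues]

print("centers:", centers)
print("radii:  ", radii)
print("every eigenvalue is covered:", all(in_some_disc))
print("largest modulus:", np.max(np.abs(eigenvalues)))
```

For a row-stochastic matrix, each center plus its radius equals 1, so every disc touches the point 1 on the real axis and stays inside the unit disc.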

Interestingly, the expression $π P = π$ implies that there exists a vector whose components do not change upon the application of the transition matrix $P$ but remain stationary. This eigenvector $π$ represents the stationary distribution of the Markov chain if $π$ also satisfies the following two conditions: (1) $π_{i} \geq 0$ for all $i \in S$ and (2) $\sum_{i \in S} π_{i} = 1$ . For a transition matrix, the first condition is guaranteed by the famous Perron-Frobenius theorem:

Perron-Frobenius theorem: A positive square matrix has a unique maximal eigenvalue, which corresponds to a positive eigenvector.

In a Markov chain, where the transition matrix is frequently not positive but only nonnegative, a variation of the Perron-Frobenius theorem is more relevant: For a nonnegative square matrix, the Perron-Frobenius eigenvalue $λ_{pf}$ is associated with nonnegative left and right eigenvectors. Finally, to satisfy the second condition, we can always scale the Perron-Frobenius vector such that the sum of its components is 1, as one can easily prove that a scaled eigenvector is still an eigenvector.
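In practice, the two conditions can be satisfied by taking the left eigenvector for the eigenvalue 1 and rescaling it. Below is a minimal sketch, assuming NumPy and assuming the chain has a unique stationary distribution; the function name `stationary_distribution` and the matrix `P` are hypothetical.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to 1.

    Assumes the eigenvalue 1 has multiplicity 1 (a unique stationary distribution).
    """
    values, vectors = np.linalg.eig(np.asarray(P, dtype=float).T)
    k = np.argmin(np.abs(values - 1.0))
    pi = vectors[:, k].real
    return pi / pi.sum()      # dividing by the sum also fixes an overall sign flip

# Hypothetical row-stochastic matrix used only for illustration.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

pi = stationary_distribution(P)
print("pi =", pi)
print("pi P == pi:", np.allclose(pi @ P, pi))
```

For this matrix, solving $π P = π$ by hand gives $π = (5/21, 9/21, 7/21)$ , which the numerical result should match.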

Supplementary note: The existence and uniqueness of stationary distributions

A Markov chain could have

(1) No stationary distribution: A classic example is an infinite-state Markov chain with state space $S = \{0, 1, 2, \dots\}$ and $P (n + 1 | n) = 1$ .
(2) One unique stationary distribution: This is true for any finite-state, irreducible Markov chain, where all states are positive recurrent. (A Markov chain is irreducible if every state can be reached from every other state within a finite number of steps.)
(3) Multiple stationary distributions: This is true for any finite-state Markov chain with more than one closed communicating class (such a chain is necessarily reducible). For example, if in Figure 1, $p_{11} = 1$ , $p_{21} = p_{23} = 0.5$ and $p_{33} = 1$ , then there are infinitely many stationary distributions $π = [p, 0, 1 - p]$ with $0 \leq p \leq 1$ .

In fact, the principle for distinguishing these three cases is that a Markov chain has at least one stationary distribution if and only if at least one state is positive recurrent. Also, we can say that any finite-state Markov chain has at least one stationary distribution. (Check the next Supplementary note for the definition of recurrence.)
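The reducible example in case (3) can be verified directly. The sketch below assumes NumPy and sets all transition probabilities not mentioned in the example to 0; the helper `is_stationary` is a hypothetical name.

```python
import numpy as np

# Figure 1 chain with p11 = 1, p21 = p23 = 0.5, p33 = 1 (all other entries are 0).
P = np.array([
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
])

def is_stationary(pi, P):
    """True if pi is a probability vector satisfying pi P = pi."""
    pi = np.asarray(pi, dtype=float)
    is_distribution = np.all(pi >= 0) and np.isclose(pi.sum(), 1.0)
    return bool(is_distribution and np.allclose(pi @ P, pi))

# Every vector of the form [p, 0, 1 - p] should be stationary.
results = {p: is_stationary([p, 0.0, 1.0 - p], P) for p in (0.0, 0.25, 0.5, 1.0)}
counterexample = is_stationary([0.2, 0.3, 0.5], P)

# The eigenvalue 1 is repeated, reflecting the two absorbing states.
multiplicity = int(np.isclose(np.linalg.eigvals(P), 1.0).sum())

print(results)
print("a vector with weight on state 2 is stationary:", counterexample)
print("multiplicity of eigenvalue 1:", multiplicity)
```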

Supplementary note: The transience and recurrence of Markov chains

A state in a Markov chain is either transient or recurrent. Specifically, a Markov chain starting from a transient state has a nonzero probability of never returning to that state, so the state is visited only finitely many times. On the other hand, a Markov chain starting from a recurrent state returns to that state with probability 1. Recurrent states can be further divided into positive recurrent states, for which the mean return time is finite, and null recurrent states, for which it is infinite. To understand this, let’s take a look at the example in Figure S1.

Figure S1. An infinite-state Markov chain.

In Figure S1, if $p < 0.5$ , state 0 will be transient because the chain tends to drift towards states with larger values of $i$ . On the other hand, state 0 will be recurrent if $p > 0.5$ , because the chain tends to drift back toward state 0. Interestingly, the larger the value of $p$ , the faster we expect a return to state 0. When $p = 0.5$ , we expect a return to state 0 to occur (the probability of recurrence is 1), but the mean recurrence time is infinite. In this case, state 0 is null recurrent.

Additionally, here are some facts about the transience and recurrence of Markov chains:

Only infinite Markov chains can have null recurrent states.
The states in an irreducible Markov chain are all of the same type: all transient or all null recurrent (both possible only in an infinite chain), or all positive recurrent (always the case in a finite chain). That is, an irreducible Markov chain can have at most one stationary distribution.
Any finite Markov chain must have at least one (positive) recurrent state. (If all states are transient, then each of them is either not visited at all or visited only finitely many times, which is not possible.)

For any reversible Markov chain, all eigenvalues of its transition matrix $P$ are real and $P$ is diagonalizable.

Before we explore the relation between the spectral gap and mixing time, I want to insert this supplementary section about reversible Markov chains, which is the type of Markov chain that most relevant discussions revolve around.

To prove the statement in the section title, we first need to understand the reversibility of a Markov chain: A Markov chain is reversible if and only if for all $i$ and $j$ , it satisfies the detailed balance equation $π_{i} p_{i j} = π_{j} p_{j i}$ . Physically, the detailed balance condition means that the probability flux from state $i$ to state $j$ equals the flux from $j$ to $i$ for every pair of states. This is a stricter criterion than the balance condition, which only requires the total flux into a state to equal the total flux out of it.

To prove that all eigenvalues of a reversible transition matrix $P$ are real, we consider a matrix similar to $P$ , namely, $Q = D P D^{- 1}$ , where $D = diag (\sqrt{π_{1}}, \dots, \sqrt{π_{n}})$ (which is invertible as long as $π_{i} > 0$ for all $i$ ). As such, $Q_{i j} = (D P D^{- 1})_{i j} = \sqrt{π_{i}} p_{i j} \frac{1}{\sqrt{π_{j}}}$ . (Note that the inverse of a diagonal matrix is obtained by replacing the main diagonal elements of the matrix with their reciprocals.) Now, given $π_{i} p_{i j} = π_{j} p_{j i}$ , we have $Q_{j i} = \sqrt{π_{j}} p_{j i} \frac{1}{\sqrt{π_{i}}} = \frac{1}{\sqrt{π_{j}}} π_{j} p_{j i} \frac{1}{\sqrt{π_{i}}} = \frac{1}{\sqrt{π_{j}}} π_{i} p_{i j} \frac{1}{\sqrt{π_{i}}} = \sqrt{π_{i}} p_{i j} \frac{1}{\sqrt{π_{j}}} = Q_{i j}$ . That is, $Q$ is real and symmetric, and its eigenvalues are hence real. (See Theorem 1 here.) Given that $P$ is similar to $Q$ , all eigenvalues of $P$ are also real. Additionally, since $Q$ is a real symmetric matrix, it is always diagonalizable, which implies that $P$ , a reversible transition matrix, is also diagonalizable.
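The symmetrization argument can be reproduced on a small example. The sketch below assumes NumPy and uses a random walk on a weighted undirected graph, which is one standard way to obtain a reversible chain; the weight matrix `W` is a hypothetical choice.

```python
import numpy as np

# Hypothetical symmetric edge weights of an undirected graph (W[i, j] == W[j, i]).
W = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 3.0],
    [0.0, 3.0, 1.0],
])
degrees = W.sum(axis=1)
P = W / degrees[:, None]           # p_ij = W_ij / (sum over k of W_ik)
pi = degrees / degrees.sum()       # stationary distribution of this random walk

# Detailed balance: flux[i, j] = pi_i p_ij must equal flux[j, i].
flux = pi[:, None] * P

# Similarity transform Q = D P D^{-1} with D = diag(sqrt(pi_i)).
D = np.diag(np.sqrt(pi))
Q = D @ P @ np.linalg.inv(D)

eigenvalues_P = np.linalg.eigvals(P)
eigenvalues_Q = np.linalg.eigvalsh(Q)   # solver for symmetric matrices, ascending order

print("detailed balance holds:", np.allclose(flux, flux.T))
print("Q is symmetric:", np.allclose(Q, Q.T))
print("eigenvalues of P:", np.sort(eigenvalues_P.real))
print("eigenvalues of Q:", eigenvalues_Q)
```

Because `Q` is symmetric, a dedicated symmetric eigensolver can be used on it, and its eigenvalues coincide with those of `P`.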

So, why is a large spectral gap indicative of faster mixing?

Finally, we are approaching the core question of this article: Why do Markov chains with larger spectral gaps mix faster?

Typically, discussions for answering this question are restricted to ergodic (meaning irreducible and aperiodic) and reversible Markov chains, which are not uncommon in the context of molecular dynamics. (See the Supplementary note below.) As a reminder, such Markov chains have the following properties:

An irreducible finite-state Markov chain always has one unique stationary distribution $π$ . (Note that the mixing time is only meaningful when a stationary distribution exists.)
If a Markov chain is irreducible and aperiodic, the eigenvalue of 1 is unique (i.e., has a multiplicity of 1) and the moduli of all other eigenvalues are strictly less than 1. (This is not elaborated in this article due to limited space.)
For any reversible Markov chain, all eigenvalues of its transition matrix $P$ are real and $P$ is diagonalizable.

Now, let's consider the distribution of an ergodic, reversible Markov chain after $t$ timesteps: $p_{0} P^{t}$ , where $p_{0} = [p_{1}, p_{2}, \dots, p_{n}]$ is the starting distribution. We diagonalize $P$ as $P = A D A^{- 1}$ , where $D$ is a diagonal matrix whose main diagonal is composed of $λ_{i}$ , the eigenvalues of $P$ . Notably, the column vectors of $A = [a_{i j}]_{n \times n}$ are the right eigenvectors of $P$ , while the row vectors of $A^{- 1} = [a_{i j}^{'}]_{n \times n}$ are the left eigenvectors of $P$ . Accordingly, we have $p_{0} P^{t} = p_{0} (A D A^{- 1}) (A D A^{- 1}) \dots (A D A^{- 1}) = p_{0} A D^{t} A^{- 1}$ . In matrix form, this can be expressed as follows:

$\begin{aligned} p_{0} P^{t} & = \begin{bmatrix} p_{1} & p_{2} & \dots & p_{n} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1 n} \\ a_{21} & a_{22} & \dots & a_{2 n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n 1} & a_{n 2} & \dots & a_{n n} \end{bmatrix} \begin{bmatrix} λ_{1}^{t} & 0 & \dots & 0 \\ 0 & λ_{2}^{t} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & λ_{n}^{t} \end{bmatrix} \begin{bmatrix} a_{11}^{'} & a_{12}^{'} & \dots & a_{1 n}^{'} \\ a_{21}^{'} & a_{22}^{'} & \dots & a_{2 n}^{'} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n 1}^{'} & a_{n 2}^{'} & \dots & a_{n n}^{'} \end{bmatrix} \\ & = \begin{bmatrix} c_{1} & c_{2} & \dots & c_{n} \end{bmatrix} \begin{bmatrix} λ_{1}^{t} & 0 & \dots & 0 \\ 0 & λ_{2}^{t} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & λ_{n}^{t} \end{bmatrix} \begin{bmatrix} a_{11}^{'} & a_{12}^{'} & \dots & a_{1 n}^{'} \\ a_{21}^{'} & a_{22}^{'} & \dots & a_{2 n}^{'} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n 1}^{'} & a_{n 2}^{'} & \dots & a_{n n}^{'} \end{bmatrix} \end{aligned}$

where $c_{i} = p_{1} a_{1 i} + p_{2} a_{2 i} + \dots + p_{n} a_{n i} = \sum_{j = 1}^{n} p_{j} a_{j i}$ . Multiplying the row vector $(c_{i})$ with the diagonal matrix $diag (λ_{1}^{t}, λ_{2}^{t}, \dots, λ_{n}^{t})$ , we get a row vector whose $i$ -th component is $c_{i} λ_{i}^{t}$ . Again, multiplying this row vector with the matrix $[a_{i j}^{'}]$ , we get $p_{0} P^{t} = (w_{i})$ , with $w_{i} = \sum_{j = 1}^{n} c_{j} λ_{j}^{t} a_{j i}^{'}$ . Finally, denoting the $j$ -th left eigenvector of $P$ as $q_{j} = \begin{bmatrix} a_{j 1}^{'} & a_{j 2}^{'} & \dots & a_{j n}^{'} \end{bmatrix} = (a_{j i}^{'})$ , we have $p_{0} P^{t} = \sum_{i = 1}^{n} c_{i} λ_{i}^{t} q_{i}$ . Notably, the term corresponding to $i = 1$ is $c_{1} λ_{1}^{t} q_{1} = c_{1} π$ , given that $λ_{1}$ is 1 and its corresponding left eigenvector $q_{1}$ is the stationary distribution $π$ . Additionally, $c_{1} = \sum_{j = 1}^{n} p_{j} a_{j 1}$ , where $a_{j 1}$ are the entries of the right eigenvector of $P$ corresponding to the eigenvalue 1, i.e., the all-ones vector $\mathbf{1}$ . That is, $a_{11} = a_{21} = \dots = a_{n 1} = 1$ and $c_{1} = \sum_{j = 1}^{n} p_{j} = 1$ , which gives

$p_{0} P^{t} = π + \sum_{i = 2}^{n} c_{i} λ_{i}^{t} q_{i}$
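The decomposition above can be checked numerically. The sketch below assumes NumPy and reuses a hypothetical reversible chain built from symmetric weights `W`; the first right eigenvector is rescaled to the all-ones vector so that the first row of $A^{- 1}$ equals $π$ .

```python
import numpy as np

# Hypothetical reversible, ergodic chain: random walk on a weighted undirected graph.
W = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 3.0],
    [0.0, 3.0, 1.0],
])
P = W / W.sum(axis=1, keepdims=True)

values, A = np.linalg.eig(P)
order = np.argsort(-np.abs(values))        # put lambda_1 = 1 first
values, A = values[order].real, A[:, order].real
A[:, 0] /= A[0, 0]                         # scale the first right eigenvector to all ones
A_inv = np.linalg.inv(A)                   # row i of A_inv is the left eigenvector q_i

p0 = np.array([1.0, 0.0, 0.0])             # start in the first state
t = 5
c = p0 @ A                                 # c_i = sum over j of p_j a_ji

# p0 P^t written as the sum over i of c_i lambda_i^t q_i.
spectral_sum = sum(c[i] * values[i] ** t * A_inv[i] for i in range(len(values)))
direct = p0 @ np.linalg.matrix_power(P, t)
pi = A_inv[0]

print("direct       :", direct)
print("spectral sum :", spectral_sum)
print("c_1 =", c[0], " pi =", pi)
```

With this scaling, `c[0]` equals 1 for any starting distribution, so the first term of the sum is exactly the stationary distribution.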

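As $t$ increases, each term in the summation decays exponentially because $| λ_{i} | < 1$ for $i \geq 2$ . (Also note that $π = lim_{t \to \infty} p_{0} P^{t}$ .) Since $| λ_{2} | \geq | λ_{3} | \geq \dots \geq | λ_{n} |$ , the difference between $p_{0} P^{t}$ and the stationary distribution $π$ is dominated at long times by the term with $λ_{2}$ , which decays as $| λ_{2} |^{t} = e^{t \ln | λ_{2} |}$ . The deviation therefore drops by a factor of $e$ roughly every $- 1 / \ln | λ_{2} |$ steps, which is approximately $1 / (1 - | λ_{2} |)$ when the spectral gap is small. That is, given a large spectral gap $1 - | λ_{2} |$ (small $| λ_{2} |$ ), the deviation decays faster to 0, leading to a shorter mixing time!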
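To see the effect of the spectral gap on convergence, here is a small experiment, assuming NumPy. It uses a hypothetical symmetric two-state chain that switches state with probability `a` , whose eigenvalues are 1 and $1 - 2 a$ , and counts the steps needed for the total variation distance to $π$ to fall below a tolerance. The helper names and the tolerance are illustrative choices, not a standard definition of the mixing time.

```python
import numpy as np

def two_state_chain(a):
    """Symmetric two-state chain that switches state with probability a."""
    return np.array([[1.0 - a, a], [a, 1.0 - a]])

def steps_to_mix(P, p0, pi, tol=1e-3, max_steps=100_000):
    """Number of steps until the total variation distance to pi drops below tol."""
    p = np.array(p0, dtype=float)
    for t in range(max_steps + 1):
        if 0.5 * np.abs(p - pi).sum() < tol:
            return t
        p = p @ P
    raise RuntimeError("did not mix within max_steps")

pi = np.array([0.5, 0.5])      # stationary distribution of any symmetric two-state chain
p0 = np.array([1.0, 0.0])      # start entirely in the first state

gaps, mixing_steps = {}, {}
for a in (0.05, 0.25):
    P = two_state_chain(a)
    second_modulus = np.sort(np.abs(np.linalg.eigvals(P)))[-2]
    gaps[a] = 1.0 - second_modulus
    mixing_steps[a] = steps_to_mix(P, p0, pi)
    print(f"a = {a}: spectral gap = {gaps[a]:.2f}, steps to mix = {mixing_steps[a]}")
```

For this chain, the distance to $π$ after $t$ steps is $0.5 | λ_{2} |^{t}$ , so the chain with the larger gap needs far fewer steps to reach the same tolerance.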

Supplementary note: Transition matrices in molecular dynamics

Notably, a reversible, irreducible, aperiodic Markov chain is not uncommon in molecular dynamics given the following:

Reversibility: Many advanced sampling methods (e.g., replica exchange molecular dynamics) strictly enforce detailed balance by governing state transitions in the space of interest using the Metropolis-Hastings algorithm or other similar methods, in which case the transition matrix of the state space does satisfy $π_{i} p_{i j} = π_{j} p_{j i}$ . Additionally, for an irreducible Markov chain, finding a set of positive numbers $π_{i}$ that sum to 1 and satisfy the detailed balance equation is enough to guarantee that $π$ is the stationary distribution and that the chain is reversible, a situation that is also not uncommon in molecular dynamics. (See Theorem 5.3.2 in this post by LibreTexts for the proof of this.)
Irreducibility: Straightforwardly, a tridiagonal matrix $[a_{i j}]$ with $a_{i j} a_{j i} \neq 0$ for all $j = i + 1$ is irreducible. This condition is generally satisfied unless the sampling in the state space is extremely slow.
Aperiodicity: A transition matrix must be aperiodic if it is irreducible and contains at least one self-loop (i.e., $P_{i i} > 0$ ). Intuitively, this is almost always true in molecular dynamics.
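The three properties above can be checked on a toy example. The sketch below, assuming NumPy, builds a nearest-neighbor Metropolis chain for a hypothetical target distribution `pi` . It is only an illustration of the criteria listed here, not the transition matrix of any particular molecular dynamics method.

```python
import numpy as np

def metropolis_chain(pi):
    """Nearest-neighbor Metropolis chain targeting pi.

    From state i, propose i - 1 or i + 1 with probability 1/2 each and accept
    with probability min(1, pi_j / pi_i); otherwise stay at i.
    """
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                P[i, j] = 0.5 * min(1.0, pi[j] / pi[i])
        P[i, i] = 1.0 - P[i].sum()   # rejected or out-of-range proposals stay at i
    return P

pi = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical target distribution
P = metropolis_chain(pi)

flux = pi[:, None] * P               # flux[i, j] = pi_i p_ij
reversible = bool(np.allclose(flux, flux.T))
# Tridiagonal criterion: p_{i,i+1} p_{i+1,i} != 0 for all neighboring pairs.
irreducible = all(P[i, i + 1] * P[i + 1, i] != 0 for i in range(len(pi) - 1))
# An irreducible chain with at least one self-loop is aperiodic.
aperiodic = irreducible and bool(np.any(np.diag(P) > 0))

print("reversible:", reversible)
print("irreducible:", irreducible)
print("aperiodic:", aperiodic)
```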

This is the end of the article! 🎉🎉 If you enjoyed this article, you are welcome to share it or leave a comment below, so I will be more motivated to write more! Thank you for reading this far! 😃


Wei-Tse Hsu

Postdoctoral Research Associate in Drug Design
