A simple Sequential Monte Carlo algorithm for posterior approximation

August 6, 2021

Introduction #

Sequential Monte Carlo (SMC) can be used to take samples from posterior distributions, as an alternative to popular methods such as Markov Chain Monte Carlo (MCMC) like Metropolis-Hastings, Gibbs Sampling or more state-of-the-art Hamiltonian Monte Carlo (HMC) and No-U-Turn Sampler (NUTS).

Instead of creating a single Markov Chain, SMC works by keeping track of a large population of samples, which, in a similar spirit to Genetic Algorithms can be “selected” and “mutated” to evolve a better “fit” to the posterior distribution.

Experience suggests SMC is particularly well fit to the following settings:

Multi-modal posterior distributions
Online inference when data accumulates over time
Approximate Bayesian Inference when the likelihood function cannot be easily written down

In this article I introduce the basic components that make up a simple SMC algorithm. In future articles, I present more complex variations that build up towards the specific algorithm that is employed in the PyMC3 library.

Toy 2D Gaussian example #

As with MCMC, SMC approximates the posterior distribution via careful weighted sampling. However, while MCMC approximates a distribution through a single random walk that unrolls over time, SMC does this in a parallel manner. At each step, it keeps track of a series of samples (also known as particles) that approximate the distribution as a whole.

This picture tries to illustrate this distinction for a simple 2D Gaussian posterior that will be used throughout this article:

\(\)

\[ \text{pdf}(x, y) = \text{exp}(-\begin{bmatrix}x\\ y\end{bmatrix}^T \begin{bmatrix}1&0.5\\ 0.5&1\end{bmatrix}^{-1} \begin{bmatrix}x\\ y\end{bmatrix}) \]

MCMC vs SMC approximation to the posterior

Approximation to the 2D Gaussian posterior shown on the left plot. The middle plot shows the approximation from a single MCMC chain, with the samples connected sequentially, emphasizing the autocorrelated random walk behavior. The right plot shows the approximation from the SMC algorithm, where each sample is independent from the rest.

Why are we even doing this?
The example above with a 2D Gaussian is deceptively simple and there is no reason why we would need to use Monte Carlo approximations (be it MCMC or SMC) to understand it. Usually in the context of Bayesian data analysis we are working with a model that describes a complicated and highly dimensional posterior distribution that cannot be written down mathematically (or as mathematicians prefer, analytically).
In addition, even if the posterior distribution could be described mathematically, it is often very difficult to work with. We are usually interested in chopping the posterior distribution into a smaller number of components that we care about (marginalization) or summarizing it into metrics that are understandable such as correlations, averages and standard deviations (expectations). Surprisingly, these sort of operations are trivial if we can get a hold of a representative set of samples from the distribution in question. All we need is to do some smart indexing and call functions such as numpy.mean() and numpy.std()!

The core of the SMC algorithm is very simple:

We start by creating a large population of particles, usually by sampling from the prior distribution.
We then compute a weight for each particle, which is proportional to the posterior probability at that “point”.
These weights are used to sample, with replacement, a new population of particles. This will lead to a population that is more representative of the posterior distribution. On average, a particle that is 3x more probable than another will be resampled 3x as often.
Finally, we introduce some “mutations” in the population. This could be as simple as adding a random uniform jittering to each particle dimension, but, we can also try to “guide” the particles in a more clever way towards areas of larger posterior mass.

The pseudocode for the SMC algorithm is given below:

# hyperparameters
n_particles = 2000
n_steps = 10

# initialize population of particles from prior
particles = sample_prior(n_particles)
# main loop
for step in range(n_steps):
    # compute particle weights in proportion
    # to the posterior distribution probability
    weights = compute_weights(particles, posterior)
    # resample particles (with replacement)
    # based on their weights
    particles = resample(particles, weights)
    # mutate particles to reintroduce variation
    # and better explore the posterior distribution
    particles = mutate(particles)
# final approximation is given by the final particles
posterior = particles

Let’s flesh out these steps in detail.

A particle is born #

What is a particle?
A particle (or a sample in conventional MCMC) is a numerical vector with one entry for each parameter in the posterior function that we are interested in. In the 2D Gaussian example above it was a 2D-vector with {x, y} coordinates.
particle = {"x": -0.5, "y": 0.2}
In a simple linear regression model it might be an \(N^D\)-vector containing a value for the intercept, a coefficient for each predictor in our dataset and a noise term for the observations.
particle = {"intercept": 1.0, "coef_1": 0.5, "coef_2": 0.2, "noise": 3.0}

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("dark")
sns.set(font_scale = 1.4)

def create_plot(rows=1, cols=1):
    fig, ax = plt.subplots(rows, cols, figsize=(6*cols, 6*rows), sharex=True, sharey=True)
    
    # single plot
    if rows + cols == 2:
        ax.set_aspect('equal', 'box')
        ax.set_xticks([])
        ax.set_yticks([])
        
    # multiple plot
    else:
        for axi in ax.ravel():
            axi.set_aspect('equal', 'box')
            axi.set_xticks([])
            axi.set_yticks([])
            
    return fig, ax

def make_arrow(a, b, axA, axB):
    from matplotlib.patches import ConnectionPatch
    
    return ConnectionPatch(
        a,
        b,
        coordsA="axes fraction",
        coordsB="axes fraction",
        axesA=axA,
        axesB=axB,
        color="k",
        fc="k",
        arrowstyle="-|>",
        mutation_scale=30,
        lw=2,
        zorder=10,
    )

Let’s create our very first particles:

class Particle:
    def __init__(self, x, y):
        self.x = x
        self.y = y

adam, eve = Particle(x=1, y=0), Particle(x=0, y=1)

_, ax = create_plot()
ax.scatter(adam.x, adam.y)
ax.annotate("Adam", (adam.x, adam.y+.1), ha="center", fontsize=16)
ax.scatter(eve.x, eve.y)
ax.annotate("Eve", (eve.x, eve.y+.1), ha="center", fontsize=16)
ax.set_xlim([-0.5, 1.5])
ax.set_ylim([-0.5, 1.5]);

png

Kidding aside, we need to get a whole population of particles. How should we choose?

Making the effort to pick a good initial population can be very useful. Remember that the goal is to obtain a representative sample from the posterior distribution, and so the closer our initial population is to the posterior, the closer we are to being done.

Regardless, and to keep things simple with our toy example, we will just take a random sample from the uniform grid:

def sample_uniform_grid(min_x, max_x, min_y, max_y, n):
    particles = []
    for i in range(n):
        x = np.random.uniform(min_x, max_x)
        y = np.random.uniform(min_y, max_y)
        particle = Particle(x=x, y=y)
        particles.append(particle)
    return particles

particles = sample_uniform_grid(-3, 3, -3, 3, n=1000)
xs = [particle.x for particle in particles]
ys = [particle.y for particle in particles]

_, ax = create_plot()
ax.scatter(xs, ys, alpha=0.3)
ax.set_title("The prior population");

png

Picking favorites #

Now that we have a population of particles, it’s time to evolve it into a proper approximation of the posterior distribution! To do this, we weight each particle by its posterior probability, normalize all the population weights so that they add up to 1.0 and use them to resample with replacement.

def posterior(particle):
    """ 
    Unnormalized 2D Gaussian with [0, 0] mean 
    and [[0.5, 1.0], [1.0, 0.5]] covariance matrix 
    """
    xy = np.array([particle.x, particle.y])
    cov = np.array([[1.0, 0.5], [0.5, 1.0]])
    return np.exp(-(xy) @ np.linalg.inv(cov) @ (xy))

def compute_weights(particles, posterior):
    weights = np.array([posterior(particle) for particle in particles])
    weights /= np.sum(weights)  # normalize weights
    return weights

def resample(particles, weights):
    idxs = np.arange(len(particles))
    sampled_idxs = np.random.choice(idxs, p=weights, replace=True, size=len(particles))
    resampled_particles = []
    for idx in sampled_idxs:
        resampled_particles.append(particles[idx])
    return resampled_particles

_, ax = create_plot(1, 3)

xs = [particle.x for particle in particles]
ys = [particle.y for particle in particles]
ax[0].scatter(xs, ys, alpha=0.2)
ax[0].set_title("Initial particles")

weights = compute_weights(particles, posterior)
rescaled_weights = 0.95 * (weights - np.min(weights)) / (np.max(weights) - np.min(weights)) + 0.05
ax[1].scatter(xs, ys, s=150 * rescaled_weights, alpha=0.2, color="C3")
ax[1].set_title("Particle weights")

resampled_particles = resample(particles, weights)
xs = [particle.x for particle in resampled_particles]
ys = [particle.y for particle in resampled_particles]
ax[2].scatter(xs, ys, s=50, alpha=0.2)
ax[2].set_title("Resampled particles")

ax[1].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[0], ax[1]));
ax[2].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[1], ax[2]));

png

The initial population of particles is shown on the left plot. In the middle plot each particle is given a weight proportional to its posterior probability. The weights are illustrated by the particle size. In the right plot, the particles were resampled with replacement based on their weights. Each particle is plotted with the same transparency, and those resampled multiple times appear to be darker.

Introducing variation #

The example above was a favorable one, where the initial population of particles happened to cover the bulk of the posterior distribution. We may not always be this lucky, for instance when working with very complex distributions or when our choice for initial population of particles is not great, as in the exaggerated example below:

_, ax = create_plot(1, 3)
ax[0].set_xlim([-3, 3])
ax[0].set_ylim([-3, 3])

bad_particles = sample_uniform_grid(-3, 0, -3, 0, 500)
xs = [particle.x for particle in bad_particles]
ys = [particle.y for particle in bad_particles]
ax[0].scatter(xs, ys, alpha=0.2)
ax[0].set_title("Initial particles")

weights = compute_weights(bad_particles, posterior)
rescaled_weights = 0.95 * (weights - np.min(weights)) / (np.max(weights) - np.min(weights)) + 0.05
ax[1].scatter(xs, ys, s=150 * rescaled_weights, alpha=0.2, color="C3")
ax[1].set_title("Particle weights")

bad_resampled_particles = resample(bad_particles, weights)
xs = [particle.x for particle in bad_resampled_particles]
ys = [particle.y for particle in bad_resampled_particles]
ax[2].scatter(xs, ys, s=50, alpha=0.2)
ax[2].set_title("Resampled particles")

ax[1].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[0], ax[1]));
ax[2].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[1], ax[2]));

png

To obtain a robust algorithm we need some way for the population to “evolve” towards the target distribution. This is achieved by mutating the particles.

The mutation step is also important for reintroducing variability after the resampling step. Because resampled particles are exact copies of each other, if we were to rerun the weighting and resampling steps several times, the population would quickly converge to a singular particle.

The simplest mutation strategy is to add a random offset to each particle parameter:

def random_mutation(particles):   
    new_particles = []
    for particle in particles:
        # Add Gaussian noise around each particle dimension
        new_x = particle.x + np.random.normal(0, .2)
        new_y = particle.y + np.random.normal(0, .2)
        new_particle = Particle(x=new_x, y=new_y)
        new_particles.append(new_particle)
    return new_particles

_, ax = create_plot(1, 4)
ax[0].set_xlim([-3, 3])
ax[0].set_ylim([-3, 3])

particles = bad_resampled_particles
xs = [particle.x for particle in particles]
ys = [particle.y for particle in particles]
ax[0].scatter(xs, ys, alpha=0.2)
ax[0].set_title("Resampled particles")

particles = random_mutation(particles)
new_xs = [particle.x for particle in particles]
new_ys = [particle.y for particle in particles]
ax[1].scatter(new_xs, new_ys, alpha=0.2)
ax[1].set_title("Mutated particles")

weights = compute_weights(particles, posterior)
particles = resample(particles, weights)
xs = [particle.x for particle in particles]
ys = [particle.y for particle in particles]
ax[2].scatter(xs, ys, alpha=0.2)
ax[2].set_title("Resampled particles")

particles = random_mutation(particles)
new_xs = [particle.x for particle in particles]
new_ys = [particle.y for particle in particles]
ax[3].scatter(new_xs, new_ys, alpha=0.2)
ax[3].set_title("Mutated particles")

ax[1].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[0], ax[1]));
ax[2].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[1], ax[2]));
ax[3].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[2], ax[3]));

png

Intelligent design #

Creationists might have gotten it wrong when it comes to natural selection, but their premise makes a lot of sense: It would all have been much easier if someone had taken the driving seat. Fortunately that is something we can do in SMC, and in particular when it comes to how we mutate the population of particles.

Instead of using a blind form of “mutation” as above, we can intelligently drive the particles towards areas of high probability in the posterior distribution. One way of doing this is to run each particle through a traditional MCMC algorithm for a specific number of steps and take the final position as the new mutated particle.

def metropolis_hastings_mutation(particles): 
    # This should be parallelized
    new_particles = []
    for particle in particles:
        new_particle = single_particle_mh(particle, steps=10)
        new_particles.append(new_particle)
    return new_particles

def single_particle_mh(particle, steps):
    """
    Run a small Metropolis-Hastings MCMC chain around a single particle
    """
    # The proposal distribution should be tuned
    proposal_distribution = st.norm(0, .2)
    for step in range(steps):
        
        x_offset, y_offset = proposal_distribution.rvs(2)
        new_particle = Particle(
            x=particle.x + x_offset,
            y=particle.y + y_offset,
        )

        # Compute Metropolis-Hastings jump probability given the ratio
        # between current and proposed particle probability
        current_probability = posterior(particle)
        proposed_probability = posterior(new_particle)
        jump_probability = min(1, proposed_probability / current_probability)

        if jump_probability >= np.random.rand():
            particle = new_particle
        else:
            particle = particle

    return particle

_, ax = create_plot(1, 4)
ax[0].set_xlim([-3, 3])
ax[0].set_ylim([-3, 3])

particles = bad_resampled_particles
xs = [particle.x for particle in particles]
ys = [particle.y for particle in particles]
ax[0].scatter(xs, ys, alpha=0.2)
ax[0].set_title("Resampled particles")

particles = metropolis_hastings_mutation(particles)
new_xs = [particle.x for particle in particles]
new_ys = [particle.y for particle in particles]
ax[1].scatter(new_xs, new_ys, alpha=0.2)
ax[1].set_title("Mutated particles")

weights = compute_weights(particles, posterior)
particles = resample(particles, weights)
xs = [particle.x for particle in particles]
ys = [particle.y for particle in particles]
ax[2].scatter(xs, ys, alpha=0.2)
ax[2].set_title("Resampled particles")

particles = random_mutation(particles)
new_xs = [particle.x for particle in particles]
new_ys = [particle.y for particle in particles]
ax[3].scatter(new_xs, new_ys, alpha=0.2)
ax[3].set_title("Mutated particles")

ax[1].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[0], ax[1]));
ax[2].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[1], ax[2]));
ax[3].add_artist(make_arrow((0.85, 0.5), (0.15, 0.5), ax[2], ax[3]));

png

The population progressed much faster towards the target distribution.

Summary #

To summarize, here is the basic SMC algorithm again:

# hyperparameters
n_particles = 2000
n_steps = 10

# initialize population of particles from prior
particles = sample_prior(n_particles)
# main loop
for step in range(n_steps):
    # compute particle weights in proportion
    # to the posterior distribution probability
    weights = compute_weights(particles, posterior)
    # resample particles (with replacement)
    # based on their weights
    particles = resample(particles, weights)
    # mutate particles to reintroduce variation
    # and better explore the posterior distribution
    particles = mutate(particles)
# final approximation is given by the final particles
posterior = particles

What about the prior? #

You might have noticed a large repetition of the word “posterior”, and a strange absence of “prior”. To keep things simple I chose to work with an implicitly uniform prior which does not affect the posterior calculations. Indeed the two are needed for a complete Bayesian inference. And in fact there is something very interesting that can be done when combining the two into a general SMC algorithm. That’s the topic of the [next article]!