Metropolis-Hastings with Gaussian drift proposal on bounded support

Professor Darren Wilkinson has an excellent blog post on a common implementation issue that arises when the standard random-walk Metropolis-Hastings algorithm, with a Gaussian proposal, is modified to sample from a target distribution $p(x)$ where $$\text{supp}(p) = \{ x : p(x) > 0 \} = (0, \infty).$$ In this blog post, we briefly review the setup and the problem, and then generalize the solution proposed by Wilkinson to the situation where $\text{supp}(p) = (a,b)$ (instead of the positive reals). We also allow the Gaussian proposal to have a general variance $\sigma^2$ (rather than just $\sigma^2 = 1$). The purpose of this blog post is to

  • illustrate that small programmatic changes in a probabilistic algorithm can violate correctness;
  • provide a general derivation that someone implementing this sampler may find informative and useful for their own purposes.

Sampling from a Gamma using MH

To reuse the example from Wilkinson, suppose we are interested in sampling from a gamma distribution with shape $\alpha = 2$ and scale $\theta = 1$, $$p(x) = \frac{1}{Z} x\exp(-x), \quad x > 0,$$ by implementing random-walk MH. The proposal $N(\cdot \mid x_i, \sigma^2)$ is a Gaussian with mean at the current sample $x_{i}$ and known variance $\sigma^2$ (aside: the gamma can be sampled much more efficiently by specialized methods). The standard MH algorithm proceeds as follows; a code sketch is given after the listing:

  1. initialize $x_0$ to some value in $(0, \infty)$.

  2. for $i = 0,1,2,\dots$

    2.1 sample from the proposal, $x' \sim N(\cdot \mid x_i,\sigma^2)$

    2.2 compute the acceptance ratio $A(x', x_i) = \frac{ p(x')N(x_i \mid x', \sigma^2) }{ p(x_{i})N(x' \mid x_i, \sigma^2) } = \frac{p(x')}{p(x_{i})}$, where the proposal terms cancel because the Gaussian is symmetric in its mean and argument

    2.3 sample $u \sim U[0,1]$; if $u \le A(x', x_{i})$ then $x_{i+1} \leftarrow x'$ (accept), else $x_{i+1} \leftarrow x_{i}$ (reject)
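
To make the algorithm concrete, here is a minimal sketch in Python (the code and names such as `rw_mh` are ours, not from Wilkinson's post), targeting the unnormalized $Ga(2,1)$ density $x e^{-x}$:

```python
import numpy as np

def p_unnorm(x):
    """Unnormalized Ga(2, 1) density x * exp(-x); only meaningful for x > 0."""
    return x * np.exp(-x)

def rw_mh(p, x0, sigma, n_steps, rng):
    """Standard random-walk Metropolis-Hastings (steps 1 through 2.3 above)."""
    x = x0
    samples = np.empty(n_steps)
    for i in range(n_steps):
        x_prop = rng.normal(loc=x, scale=sigma)   # step 2.1
        a = p(x_prop) / p(x)                      # step 2.2 (symmetric proposal)
        if rng.uniform() <= a:                    # step 2.3
            x = x_prop
        samples[i] = x
    return samples

rng = np.random.default_rng(0)
samples = rw_mh(p_unnorm, x0=1.0, sigma=1.0, n_steps=50_000, rng=rng)
print(samples.mean())  # Ga(2, 1) has mean 2
```

Note that when $x' < 0$, `p_unnorm` returns a negative value, so the test in step 2.3 rejects automatically; as we discuss next, we cannot rely on this in general.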

In the sampling step 2.1, it is possible that we obtain $x' < 0$ (for instance, if $x_{i}$ is near 0 or $\sigma^2$ is large). One intuitive approach is to automatically reject samples that fall outside $\text{supp}(p)$ and repeat until $x' > 0$. In the case of $Ga(2,1)$, this rejection would be enforced anyway, since the acceptance ratio $A(x',x_{i})$ is negative when $x' < 0$ (if we ignore the fact that $p$ is defined on the positive reals and evaluate $p(x')$ anyway). However, Wilkinson observes that if we were sampling from $Ga(3,1)$, for instance, then $$p(x) = \frac{1}{Z} x^2\exp(-x), \quad x > 0,$$ which is positive even when $x < 0$ (again, when ignoring the positivity constraint on $x$). Had we kept the bad sample $x' < 0$ in step 2.1, the rejection would not have been guaranteed to occur in step 2.3. So we were just lucky with the $Ga(2,1)$ example, and we need a more robust sampling strategy for the general case.
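
A quick numerical check (ours, not from the original post) makes this concrete: the wrongly extended $Ga(3,1)$ density is strictly positive at a negative point, so nothing in step 2.3 would force a rejection there.

```python
import numpy as np

# Unnormalized Ga(3, 1) density, ignoring the x > 0 constraint.
p3 = lambda x: x**2 * np.exp(-x)

# The (wrongly extended) density is strictly positive at x = -0.5,
# so the acceptance test in step 2.3 can accept a proposal x' < 0.
print(p3(-0.5))  # ~0.41 > 0
```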

Rejection sampling changes the proposal distribution

By discarding samples outside the target support in step 2.1, we are rejection sampling the Gaussian proposal $N(\cdot \mid x_i,\sigma^2)$, discarding all values $x' < 0$. This strategy is equivalent to sampling from a truncated Gaussian, which is not symmetric in its mean and argument. Therefore, we can no longer cancel the proposal terms in the acceptance ratio (step 2.2).
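
In code, this equivalence is direct: the resampling loop below (a sketch of ours, with `sample_truncated_normal` an illustrative name) draws exactly from the truncated Gaussian $N_{(a,b)}(\cdot \mid x_i, \sigma^2)$ defined in the next section.

```python
import numpy as np

def sample_truncated_normal(mean, sigma, a, b, rng):
    """Draw from N(mean, sigma^2) restricted to (a, b) by resampling.

    This is exactly what step 2.1 does when proposals outside the support
    are discarded: the kept draw follows the truncated Gaussian, not the
    plain Gaussian, so the proposal density has changed.
    """
    while True:
        x = rng.normal(loc=mean, scale=sigma)
        if a < x < b:
            return x
```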

A Correct Sampler Needs A Correct Acceptance Ratio

Generalizing the results from Wilkinson, suppose that the target $p(x)$ has $\text{supp}(p) = (a,b)$ and the Gaussian has variance $\sigma^2$ (typically, this variance is adapted at runtime to achieve a desired acceptance rate). We will still reject samples $x' \notin (a,b)$ in step 2.1, and keep sampling until $x' \in (a,b)$, but we need to account for all the previous rejections. The proposal density is now $$N_{(a,b)}(x' \mid x_i, \sigma^2) = \frac{ \frac{1}{\sigma} \phi\left(\frac{x'-x_i}{\sigma}\right) }{ \Phi\left(\frac{b-x_i}{\sigma}\right) - \Phi\left(\frac{a-x_i}{\sigma}\right) } \mathbf{I}_{(a,b)}(x'),$$ where $\phi$ is the standard Gaussian density and $\Phi$ is the standard Gaussian cumulative distribution function. While the numerator is symmetric in $(x', x_i)$, the denominator is not. It follows that, with our new strategy, the correct acceptance ratio for step 2.2 is $$A(x',x_i) = \frac{ p(x') \left( \Phi\left(\frac{b-x_i}{\sigma}\right) - \Phi\left(\frac{a-x_i}{\sigma}\right) \right) }{ p(x_{i}) \left( \Phi\left(\frac{b-x'}{\sigma}\right) - \Phi\left(\frac{a-x'}{\sigma}\right) \right) }.$$

A good sanity check is to verify that we agree with Wilkinson in the special case $a=0$, $b=\infty$, $\sigma^2=1$. In this case, the term in the numerator becomes $\Phi(b-x_i) - \Phi(a-x_i) = 1-\Phi(-x_i) = \Phi(x_i)$, and similarly the term in the denominator becomes $\Phi(x')$, so $A(x',x_i) = \frac{p(x')\Phi(x_i)}{p(x_i)\Phi(x')}$, which cross-checks with Wilkinson. Also remember to change the initialization in step 1 to choose $x_0 \in (a,b)$.
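
Putting the pieces together, here is a sketch of the corrected sampler (ours; the function name `mh_bounded` and the use of SciPy are illustrative choices, not from Wilkinson's post):

```python
import numpy as np
from scipy.stats import norm

def mh_bounded(p, a, b, x0, sigma, n_steps, rng):
    """Random-walk MH on supp(p) = (a, b) with a truncated-Gaussian proposal.

    p is the (possibly unnormalized) target density; the acceptance ratio
    carries the Phi((b-x)/sigma) - Phi((a-x)/sigma) correction derived above.
    """
    def mass(x):
        # Normalizer of the truncated proposal centered at x.
        return norm.cdf((b - x) / sigma) - norm.cdf((a - x) / sigma)

    x = x0
    samples = np.empty(n_steps)
    for i in range(n_steps):
        # Step 2.1: resample until the proposal lands in (a, b).
        while True:
            x_prop = rng.normal(loc=x, scale=sigma)
            if a < x_prop < b:
                break
        # Step 2.2: corrected acceptance ratio.
        acc = (p(x_prop) * mass(x)) / (p(x) * mass(x_prop))
        # Step 2.3: accept or reject.
        if rng.uniform() <= acc:
            x = x_prop
        samples[i] = x
    return samples

# Sanity check on the Ga(3, 1) example, where supp(p) = (0, inf):
rng = np.random.default_rng(1)
p3 = lambda x: x**2 * np.exp(-x)
samples = mh_bounded(p3, a=0.0, b=np.inf, x0=1.0, sigma=1.0,
                     n_steps=50_000, rng=rng)
print(samples.mean())  # Ga(3, 1) has mean 3
```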

The rejection rate and numerical stability

The good news for this sampler is that the mean $x_i$ of the underlying Gaussian is always a previously accepted sample, and hence always lies inside the truncation interval $(a,b)$. This means the rejection rate in step 2.1 will not be problematic (unless $\sigma^2$ is unusually large relative to the width of $\text{supp}(p)$; the two should be designed in tandem to avoid this issue). The rejection rate becomes a problem when the mean of the underlying, non-truncated normal lies outside the interval $(a,b)$; for that regime, more elaborate algorithms exist (or one can use a different proposal). A related critical issue in that regime is that $\Phi(\frac{b-x_i}{\sigma}) - \Phi(\frac{a-x_i}{\sigma})$ tends to zero and becomes numerically unstable. We could get away with $x/0$ or $0/x$ in $A$, but we might also hit the numerically disastrous $0/0$. Thankfully, because $x_i \in (a,b)$ in our sampler, we do not have these problems.
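
A small illustration (ours) of the instability: as the Gaussian mean moves away from the interval $(a,b)$, the proposal's normalizer underflows to zero in double precision.

```python
from scipy.stats import norm

# If the Gaussian mean sat far outside (a, b), the proposal normalizer
# would underflow and the acceptance ratio could hit 0/0. In our sampler
# the mean x_i is always a kept sample inside (a, b), so this cannot occur.
a, b, sigma = 0.0, 1.0, 1.0
for mean in (0.5, 5.0, 40.0):
    mass = norm.cdf((b - mean) / sigma) - norm.cdf((a - mean) / sigma)
    print(mean, mass)  # ~0.38, ~3.1e-05, then 0.0 (underflow)
```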

Conclusion

Samplers are delicate probabilistic objects; treat them with care.

Hello world!

Welcome to Random Seed. A good proportion of implementors of probabilistic computation fail to address two key issues relating to entropy control in their software:

  • Keeping track of the state of the random number generator.
  • Permuting the random seed.

We will look into these two issues, as well as many other topics, throughout future blog posts.