On statistical two-sample testing

Statistical two-sample testing concerns the following question: given observations $X = \lbrace x_1, x_2, \dots, x_m \rbrace$ and $Y = \lbrace y_1, y_2, \dots, y_n \rbrace$ drawn from distributions $p$ and $q$, respectively, can we decide whether $p=q$?

Two-sample testing is an active field of research with an enormous amount of literature. Asymptotic theory of two-sample tests is typically concerned with establishing consistency, meaning that the probability of declaring $p=q$, when in fact $p \ne q$, goes to zero in the large sample limit.

Distinguishing distributions with finite number of samples

A very nice (negative) result is that it is generally impossible to distinguish two distributions with high probability, by only observing a finite pair of samples of fixed size. The result is also somewhat disappointing for those who are particularly interested in theoretical guarantees of statistical procedures in applied inference, where the ideal of asymptotia is a far reach the chaotic array of csv files and other inhomogeneous data sources sitting on the analyst’s hard drive. Anyway, consider the following scenario, which is from Gretton, et al. “A Kernel Two-Sample Test.” JMLR. 2012. (a highly recommended paper to read):

Assume we have a distribution $p$ from which have drawn $m$ iid observations. Construct a distribution $q$ by drawing $m^2$ iid observations from $p$ and define a discrete distribution over these $m^2$ observations with probability $m^{-2}$ each. It is easy to check that if we now draw $m$ observations from $q$, there is at least a ${m^2 \choose m}\frac{m!}{m^{2m}} > 1 – e^{-1} > 0.63$ probability that we thereby obtain an $m$ sample from $p$. Hence, no test will be able to distinguish samples from $p$ and $q$ in this case. The probability of detecting can be made arbitrarily small by increasing the size $m$ of the sample from which we construct $q$.

To understand why this setup implies the impossibility of two-sample testing with high probability in the finite setting, we need to think about the frequentist properties of the procedure. Suppose we repeat the experiment a very large number of times, say 900,000 times. Marginalizing over all experiments, the samples drawn from $q$ will follow a different distribution than samples drawn from $p$. However, in roughly 2/3 or 600,000 of the experiments, the observation set drawn from $q$ is a random sample from $p$. Conditioned on being in one of the 600,000 trials where the $m$ samples are from the same distribution, an ideal two-sample test should report $p=q$, which is committing a Type II error with arbitrarily large probability.

Now, the authors left the derivation of the combinatorial bound ${m^2 \choose m}\frac{m!}{m^{2m}}$, for the probability that the $m$ draws from $q$ are an $m$ sample from $p$, as an exercise for the reader. It actually makes a nice homework or exam question for a 101 probability course (as do many other byproducts of academic papers, so I have learned). For those interested in elementary probability and combinatorics, here is a candidate derivation.

First note ambiguities in the setup of the experiment: (i) Is $p$ intended to be a continuous distribution, so that there are no duplicates in the $m^2$ samples? (ii) What exactly is the definition of the event “thereby obtain an $m$ sample from $p$”? To make progress, we assume that (i) $p$ is continuous and the $m^2$ samples in $q$ are unique, and (ii) an “$m$ sample from $p$” means that the resampled sequence contains no duplicates (which occurs with probability zero when sampling directly from $p$). With these assumptions, the bound becomes trivial. The size of the sample space is equal to the number of length $m$ strings from $\lbrace x_1, \dots, x_{m^2} \rbrace$, which is $(m^2)^m$. The number of sequences of length $m$ with unique entries is ${m^2 \choose m}m!$. Dividing the latter quantity by the former gives the bound.

If the assumptions above are too strong and $p$ is discrete, then I am no longer sure what the definition of an “$m$ sample from $p$” is. Perhaps someone can let me know if alternative meanings are clear to them. Discreteness of $p$ would indeed provide a much more compelling example, since one can arrive at arbitrary results based on pathologies of real number being uncomputable.

A simpler counter-example with infinite number of samples, and a paradox?

To the computer scientist, real numbers are uninteresting. Let us suppose $p$ is a discrete distribution with finite support, and additionally restrict its probability density at each point in the support to be a rational number. One might simplify the above experiment by defining $q_m$ to:

  • generate a size $m$ sample from $p$, with probability $1-1/m$;

  • generate a size $m$ sample from an arbitrary distribution, with probability $1/m$.

Here is an interesting question — is there any test which can have non-zero power for testing $p$ versus $q_m$? Note that it is always the case that $p \ne q_m$. The probability of detection is $1/m$, which decays to zero as $m \to \infty$. Therefore, the Type II error necessarily goes to one, irrespective of the testing procedure. More confusing is the fact that $q_m$ converges to $p$ as the sample size increases. There seems to be a paradox underlying this experiment — how can we resolve it? (Hint: consider whether the frequentist/large-sample intuition of two-sample testing is coherent with our construction of the limiting process for testing $p$ against the sequence of distributions $(q_m)$).

A topic “I want” to discuss

Having wants and expressing them to others is a very natural thing to do. Discussions about plans or ambitions typically make use of the phrase “I want”, and will summarize a position in terms of “wanting” X or Y. In discussions that involve conflict between two parties, however, the conflicting wants are left unstated, and each side will instead argue in favor of the consequences of their wants. Given how central “wants” are to dictating our behavior (economists formally treat the combination of “wants” and “needs” as core motivators for decision-making), it is worth thinking about how we can reason about our own wants, and how we can communicate them effectively to others.

The term “want” is vague. It conveys a primitive emotion rooted somewhere unknown in our cognitive processes. In cognitive science, the notion of “want” is related to a concept named motivational salience (from Wikipedia):

Motivational salience is a cognitive process and a form of attention that motivates, or propels, an individual’s behavior towards or away from a particular object, perceived event, or outcome.

Motivational salience regulates the intensity of behaviors that facilitate the attainment of a particular goal, the amount of time and energy that an individual is willing to expend to attain a particular goal, and the amount of risk that an individual is willing to accept while working to attain a particular goal.

In particular, “want” relates to incentive salience, which is an attractive form of motivational salience that causes approach behavior toward an objective:

Incentive salience is a cognitive process which confers a “desire” or “want” attribute, which includes a motivational component, to a rewarding stimulus. Reward is the attractive and motivational property of a stimulus that induces appetitive behavior — also known as approach behavior — and consummatory behavior.

The “wanting” of incentive salience differs from “liking” in the sense that liking is the pleasure that is immediately gained from the acquisition or consumption of a rewarding stimulus; the “wanting” of incentive salience serves a “motivational magnet” quality of a rewarding stimulus that makes it a desirable and attractive goal, transforming it from a mere sensory experience into something that commands attention, induces approach, and causes it to be sought out.

The above description of wants in terms of stimuli, goals, sensory experiences, risk, time, energy, attention, and rewards shows why summarizing an idea simply as “I want X” carries low information content. It leaves the main aspects that characterize the “want” almost entirely unaddressed. And while many of these aspects may be implicit — surely some wants are just obvious — the question then becomes, why are they obvious?

Proposition: when expressing an idea (usually some combination of a request, objective, and plan) which can be loosely phrased as “I want X”, we can avoid the term “I want” and draw out an explanation as follows:

  • the suggestion is X;
  • the objective is Y;
  • the investment effort is W;
  • the reason is Z.

This form of thinking may appear rather mechanical, and perhaps even frustrating, to express a simple statement such as “I want X”. More so, the terminal “reason Z” can itself be a “want”, which then necessitates a recursive evaluation of the rule, up to some point of satisfaction or satiation.

The exercise of investigating a “want”, which may resemble a form of Socratic dialogue with onesself, will often bottom out at the qualities at the core of our human characteristics, most of which are poorly understood — on the positive side: benevolence, empathy, security, rationality, or harmony; on the negative side: fear, insecurity, greed, jealousy, irrationality, or selfishness; on the neutral side: progression, efficiency, aesthetic, ambition, or curiosity.

A methodological unpacking of the bases of our “wants” is therefore fundamental for arriving at carefully examined intentions and motivations. It is not uncommon to fixate on a given desire, yet struggle to express why the fixation exists in the first place. Does the “want” arise from a need, from external pressures, from a projection, from an illusion? How much resources are to be allocated in pursuit of a want, and which resources need to be traded-off? What principles will need to be reinforced, and which will have to be relaxed? What is the fall-back for the event that the “want” turns out to be not as desired as originally anticipated?

One might say that the propositions above are uninterestingly obvious, or a mere list of truisms. However, there is a strong distinction between knowing these ideas in the safety of the abstract, and deploying these techniques with full force and discipline in the danger of practice: personal disputes, workplace rivalries, corporate warfare, partisan politics, international conflict.

We should be ready to accept that the examination of a “want” may lead to uncomfortable resolutions. At the same time, this discomfort is necessary. It is necessary to be maximally honest with ourselves; necessary to build emotional and intellectual stability; and necessary to foster cooperation with other agents who have conflicting and competing wants of their own.

The view from outside: reading international newspapers

It is natural to consume media products from the country we live in. For example, in the United States, sources such as CNN, Fox News, The Washington Post, and The New York Times, serve as the main sources of news about global affairs. Google News, a popular and powerful news aggregator, by default will present top results for the “World” section from only a collection of the most popular news outlets from the country of the detected IP address (as an exercise, visit Google News, click on “World” in the left panel, and see for yourself).

It is not surprising that US news outlets present world events (not just analysis, but also reporting) from a largely American perspective (a term which happens to have a name: Americentrism). But in the age of the internet, there is no excuse to read about world news from a restricted set of US-based media houses. It is unfortunate that reading, let alone being able to name, English newspapers from around the world is quite uncommon. The outcome is a blind-spot to the perspectives and sentiments of cultures and communities we read and make judgments about.

Reading a news piece from an international newspaper often evokes one of the two responses for me:

  • An appreciation of the neutrality of the presentation of world events. As an example, US reporting on foreign nations regularly contains assessments about which regimes are democratic and friendly, versus which are autocratic and antagonistic. Furthrmore, most international news will tie the event to US foreign policy. Both these patterns are less pronounced at non-US news desks, especially those which are not global players, and therefore have less direct relationship to a particular foreign event.

  • A surprise by how propagandist their depiction of world events can appear. (One common pattern to look out for is observing how, and in reference to which groups, a news source will use the word “terrorist” versus “rebel” or “resistance”.) We are accustomed to reading about world events from a fixed perspective, that foreign ones often come across strangely worded and manipulative. It takes effort to recognize that the issue goes both ways, and US media is likely subject to the same pitfalls when evaluated from outside.

Below are a collection of some English, international media sources which I enjoy reading during the early morning rounds. These have helped me appreciate the variety of perspectives and interpretations of supposed “factual news reports”, which are not apparent from reading one source alone:

There are obviously thousands of news sources (and the above list misses African or Latin American spheres, parts of the world I remain woefully uniformed about), but even a small sample from outside our bubble can broaden one-dimensional views of the greater world outside.

Can you read my post?

Is the phrase “can you [do X]?” considered a polite way to request something in written communication on the internet, or perhaps even in American English more generally?

It is a common phrase I encounter in semi-professional communication through e-mails, Slack, etc. It almost always appears without an associated “please”. The most confusing aspect is that, in its typical usage “can you [do X]” is hardly a question with a reasonable yes/no answer (e.g. “can you send me this file?”), it is instead a straightforward request.

I propose the following rule of thumb: ask yourself, is it easy to replace “can you” by “are you able to” without sounding overly sarcastic? If not, then a more accurate way to ask “can you [do X]?” is instead to say “please [do X]”.

I hypothesize that “can you [do X]?” is indeed intended to sound like a less demanding way to request something, by situating the request as a question rather than a command. However, the phrase has a few shortcomings:

  • regardless of intent, it still is missing “please”, (which is important when many small requests add up and eat away at time!);

  • in most contexts it is not really a question.

Can you leave a comment below?

A thought experiment with the Bayesian posterior predictive distribution

Let $\pi(\theta)$ be a prior for parameter $\Theta$, and $p(x|\theta)$ a likelihood which generates an exchangeable sequence of random variables $(X_1,X_2,X_3\dots)$.

Given a set of observations $D := \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$, the posterior predictive distribution for the next random variable in the sequence $X_N$ is defined as $$p(X_{N}=s | D) = \int p(X_{N}=s | D,\theta) \pi(\theta|D)d\theta = \int p(X_{N}=s|\theta) \pi(\theta|D)d\theta, $$

where the second equality follows from assuming the data is exchangeable (or i.i.d conditioned on latent parameter $\theta$). The posterior predictive density evaluated at ${X_{N}=s}$ is an expectation under the posterior distribution $\pi(\theta|D)$. Define the function $g(s,\theta) := p(X_{N}=s|\theta)$, (note that $g$ does not depend on the index $N$ of the random variable since $\theta$ is known), and then compute the expectation of the random variable $g(s,\Theta)$ under $\pi(\theta|D)$, $$ p(X_N=s | D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,\Theta) \right]. $$

Now consider the case where each random variable $X_i$ is a two-dimensional vector $X_i = (X_{[i,1]}, X_{[i,2]}).$ The data $D = \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$ is thus an exchangeable sequence of bivariate observations. (Assume for simplicity that marginalizing and conditioning the joint distribution $p(X_{[i,1]},X_{[i,2]}|\theta)$ are easy operations.) We again perform inference to obtain the posterior $\pi(\theta|D)$.

Suppose we wish to evaluate the probability (density) of the event $\lbrace X_{[N,1]}=s \mid X_{[N,2]}=r \rbrace$ under the posterior predictive. I am in two minds about what this quantity could mean:

Approach 1

Define the conditional probability density again as an expectation of a function of $\Theta$ under the posterior distribution. In particular, let the probe function $g(s,r,\theta) := p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ (recalling that $g$ does not depend on $N$ when $\theta$ is known) and then compute the expectation of $g(s,r,\Theta)$ under $\pi(\theta|D)$, $$ p_{\text{approach 1}}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,r,\Theta) \right]. $$

Approach 2

Define the desired conditional probability density by application of the Bayes Rule. Namely, separately compute two quantities

joint density: $ p(X_{[N,1]}=s,X_{[N,2]}=r|D) = \int p(X_{[N,1]}=s,X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $

marginal density: $ p(X_{[N,2]}=r|D) = \int p(X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $

and then return their ratio, $$ p_{\text{approach 2}}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \frac{p(X_{[N,1]}=s,X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)}. $$

Note that Approach 2 is equivalent to appending the condition $\lbrace X_{[n,2]}=r \rbrace$ to the observation set $D$ so that $D’ := D \cup \lbrace X_{[N,2]}=r \rbrace$ and the new posterior distribution is $\pi(\theta|D’)$. It then computes the expectation of $g(s,r,\Theta)$ under $\pi(\cdot|D’)$, $$ p_{\text{approach 2}}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D’)}\left[ g(s,r,\Theta) \right] $$

Exercise: Show why the two expressions for $p_{\text{approach 2}}$ are equivalent (or let me know if I made a mistake!)


The question is thus, does the Bayesian reasoner update their beliefs about $\theta$ based on the condition ${X_{[N,2]}=r}$? I think both approaches can make sense:

In Approach 1, we do not treat $\lbrace X_{[N,2]}=r \rbrace$ as a new element of the observation sequence $D$; instead we define the probe function $g(s,r,\theta)$ based on the conditional probability (which is a function of the population parameter), and then compute its expectation.

Approach 2 follows more directly from the “laws of probability” but is less interpretable from the Bayesian paradigm. Why? Because if ${\Theta = \theta}$ is known, then $p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ is just a real-number — since the Bayesian does not know $\theta$, they marginalize over it. But it is unclear why the probe function $g(s,r,\theta)$ should influence the distribution of $\pi(\theta|D)$, regardless of whether it happens to represent a density parameterized by $\theta$.

Next Steps

Perhaps I should numerically/analytically compute the difference between Approach 1 and Approach 2 for a bivariate Gaussian with known covariance and unknown mean. For simplicity, just use the prior predictive, letting $D=\varnothing$.

Trump knew it, Hillary blew it — the failures of polling

Election season has come and gone. Donald Trump pulled off what has been repeatedly characterized as a “stunning upset” over bitter rival Hillary Clinton. The web has gone rampant with postmortem analysis about the failures of election polling. But was it really that stunning?

Several polls conducted in the lead up to the election reported on the virtually deadlocked race, all well-within any reasonable margin of error:

Quinnipiac University reported on the situation in key battleground states (Nov 2)

Democrat Hillary Clinton’s October momentum comes to a halt as she clings to a small lead in Pennsylvania, while Republican Donald Trump moves ahead in Ohio, leaving Florida and North Carolina too close to call.

Probability forecast models on the other hand were remarkably off-mark and predicted Clinton well-ahead just the night before the election:

  • New York Times Upshot: Clinton 84%, Trump 16%
  • FiveThirtyEight: Clinton 66.9%, Trump 33%
  • PredictWise: Clinton 89%, Trump 11%

I was watching the blitz of last-ditch rallies held by Trump and Clinton the night before election day, to learn about the sentiments they were expressing about their chances. Here is a revealing segment from Trump’s penultimate war cry in New Hampshire (8pm on Nov 7):

We are going right after this to Michigan, because Michigan is in play… The polls just came out: we are leading in Michigan; we are leading in New Hampshire; we are leading in Ohio; we are leading in Iowa; leading in North Carolina; I think we are doing really, really well in Pennsylvania; and I do believe we are leading in Florida.

In the meantime, according to the New York Times:

Mrs. Clinton’s campaign was so confident in her victory that her aides popped open Champagne on the campaign plane early Tuesday.

Either way, each candidate and their popular base was clearly happy to live in their own reality right up to the wire. Some personal take-aways from this whole affair:

  • Polling and forecasting is a messy, complex, empirical problem which next-to-nobody understands. I doubt it is statistical.

  • Well-informed voters do not derive the bulk of their information from cursory reading of social or national news media. They are vigilant about critiquing every aspect of information they consume. Otherwise, they are very, very sad.

Everything you need to know about … headlines!

There used to be a time when journalism was a profession. In the information age, however, the production and consumption of public media has become a long gone art. Data streams chaotically and continuously from all directions through the social networks — Facebook, Witter, Slacks, SMS — and it is a mystery how anyone can distill a meaningful signal from the noise. One might rather call it the mis-information age.

One can also write a volume, critiquing the sloppiness in today’s written content. For this post, let us focus right at the start; headlines. The purpose of a headline is to serve as a useful, succinct summary for the content of an article.

Many headlines on the web are instead extremely predictable, repetitive, and often appear to be pulled right out of a book called “Click Bait 101”. I recently browsed through the front page of Google News, and selected an assortment of representative headlines that echo some of the most recurring motifs.

Everything you need to know about …

These articles are a hold-my-hand guide through a dangerously oversimplified presentation of some complex issue. They usually receive very high number of comments and page views. I attribute most the blame for the success of this headline to the intellectual laziness of readers, who wish to quickly be knowledgeable and form opinions about a topic that society agrees is important.

It is also troubling that the writer is confident that they are telling you “everything you need to know about…” It would be more honest and accurate to rephrase as “some things we want you to know about…” Here are some examples.

  • Everything You Need to Know About Britain’s New Prime Minister. ABC
  • Everything You Need to Know About K2, the Drug Linked to Mass Overdose. NBC
  • Everything You Need to Know About Today’s GOP Rules Committee Meeting. ABC
  • Everything you need to know about the net neutrality debate in India. India Times
  • Everything you need to know about Theresa May’s Brexit nightmare in five minutes. Politics UK

The last one is the exemplar — promising a comprehensive coverage of the Brexit in five minutes of your valuable time.

[someone] just did [something shocking]!

Intended for maximum shock and urgency. The [something shocking] event is typically a straight-out-deception of something trivial. The primary purpose of the headline is rather to make a statement about, or build a persona for, the [someone]. Examples:

  • Bernie Sanders Just Made Jill Stein The Most Powerful Woman In American Politics. Huffington Post
  • The FDA Just Declared War on Cookie Dough. Smithsonian
  • Putin Just Created His Own Personal Army. Daily Caller
  • Vladimir Putin just invited Kim Jong Un to visit Russia. Really. Washington Post

Even though a given reader may not read much about Putin, glancing through enough headlines like these work to build the negative, dictatorial character that is widespread in the West today. And who has the remotest idea about who is Jill Stein? Maybe she is too busy working in the Earth’s core and shifting tectonic plates…

[Yes, No, Sorry], …

These headlines sound like bitter responses in an argument on a Youtube comment forum. For each of these examples, consider how much more readable and professional the title would be, without the silly bolded words.

  • No, Bernie Sanders Did Not Sell Out by Endorsing Hillary Clinton — Just the Opposite. Forward
  • Sorry, You’re Just Going To Have to Save More Money.​ Wall St. Journal
  • Sorry 538, La Taqueria is delicious — but it’s also fatally flawed.​ Vox
  • Yes, Clinton is sinking in the polls. No, you should not panic. Here’s why.​ Washington Post

Some similar prefixes to look out for the future are “Right,” and “Wait,”. It is pretty surprising they have not been picked up already.


The next time you come across a headline along these lines, think about why the writer chose that particular pattern, what emotional or intellectual reaction they are trying to evoke, and how it contributes to their propaganda.

May I borrow your name?

Although quite simple, my name is not easy to pronounce in English (even for me). Older people sometimes say “ferris” like a ferris wheel (perhaps in reference to the famous 80s movie Ferris Bueller’s Day Off). I am not particularly bothered by the different permutations either way, but it can be inconvenient at times.

When visiting a coffee shop, or engaging in any other unimportant event where a name is required for reference, I typically give the simplest two-letter name: “Jo”. It is impossible to cause any confusion or unnecessary back-and-forth exchange. The strategy usually works fine (although on rare occasion I stare blank-faced at the poor barista, yelling repetitively for “Jo” and wondering why I am ignoring them).

More recently I have decided to have a bit more fun with using names. The last few times I was grabbing a coffee with a friend, I would use their name at the counter instead. What I found most surprising is the unexpected response it stirred — a mix of confusion, defensiveness, and a feeling of being insulted:

Friend: “That is my name! What is wrong with you?

Me: Relax. (Always fun to tell someone getting worked up to relax)

Friend: “No! You cannot just use my name!

A name is a highly personal matter. I wonder what set of experiments one can design to formally study the question: to what extent do people feel their names are related to their “identity”? Why does casually using someone else’s name (not even posing as that person, which is creepy, but using a general first name which they happen to have) suddenly make them uncomfortable?

I think Shakespeare said it right. Try it with your friends, and see what reactions it invokes.

Enough urban legends about QWERTY

Over the past few years there has been much on and off talk about the origins of the QWERTY keyboard. Here is a Google Trends plot showing the “interest” (based on search volume?) of the term qwerty vs dvorak.  Screenshot from 2016-02-14 21:51:47 I am not sure why Google believe that internet users were suddenly interested in this topic for one day in August 2010 after zero search traffic for half a decade. But the key point is, news headlines (and sometimes “tidbits” at a dinner conversation) now and again keep reminding us of some fables about the much maligned keyboard layout. Some of the best anecdotes are
* Bars collided and jammed together in early typewriters, so qwerty designer Christopher Sholes arranged the most common letters in the worst possible locations to slow down typists.

  • The qwerty keyboard arose from telegraph operators who used morse code (a Japanese study looks into the whole prehistory) and found it to be the best arrangement, in comparison to earlier keyboards based on alphabetical order.

  • Salespeople pushed for qwerty because they could type the word T-Y-P-E-W-R-I-T-E-R without having to cross their fingers over one another (a key selling point, of course, because typewriter is a very common English word and our tools need to be specially optimized to type it …)

  • The Illuminati popularized qwerty, in collaboration with the CIA, to slow down Russian spies infiltrating NASA during in the Space Race. Regardless of the “true” history (which probably is not that interesting) the worst part about all this coverage is the portrayal of qwerty as an historic tragedy which we are now doomed to suffer with forever. Some studies look into the

efficacy of keyboard layout and suggest that the current design is not optimal. College students who are experts with dvorak smirk at their classmates with pride for their superior typing abilities. It is time to put an end to all this boring droll. So what if we type 3 fewer words per minute? Let us eliminate this great inefficiency for once and for all, and watch as labor productivity, youth employment, and gender equality rise to unprecedented levels, while crime rates, the GINI coefficient and CO2 emissions plummet. The next time Google News suggest an article about qwerty I am changing news sources.

Metropolis-Hastings with Gaussian drift proposal on bounded support

Professor Darren Wilkinson has an excellent blog post on a common implementation issue that arises when the standard random-walk Metropolis Hastings algorithm, with a Gaussian proposal, is modified to sample from a target distribution $p(x)$ where $$\text{supp}(p) = { x : p(x) > 0 } = (0, \infty).$$ In this blog post, we are going to briefly review the setup, the problem, and then generalize the solution proposed by Wilkinson to the situation where $\text{supp}(p) = (a,b)$ (instead of the non-negative reals). We will also allow the Gaussian proposal to have a general variance $\sigma^2$ (rather than just $\sigma^2 = 1$). The purpose of this blog post is to

  • illustrate that small programmatic changes in a probabilistic algorithm can violate correctness;
  • help someone implementing this sampler will find the general derivation informative and useful for their own purposes.

Sampling from a Gamma using MH

To reuse the example from Wilkinson, suppose we are interested in sampling from a gamma with shape $\alpha = 2$ and scale $\theta =1$, $$p(x) = \frac{1}{Z} x\exp(-x), x > 0$$ by implementing random walk MH. The proposal $N(\cdot | x_i, \sigma^2)$ is a Gaussian with mean at the current sample $x_{i}$, and known variance $\sigma^2$ (aside: the gamma can be sampled much more efficiently). The standard MH algorithm proceeds as:

  1. initialize $x_0$ to some value in $(0, \infty)$.

  2. for $i = 0,1,2,\dots$

    2.1 sample from the proposal, $x’ \sim N(\cdot|x_i,\sigma^2)$

    2.2 compute the acceptance ratio $A(x’, x_i) = \frac{ p(x’)N(x_i| x’, \sigma^2) }{ p(x_{i})N(x’| x_i, \sigma^2) } = \frac{p(x’)}{p(x_{i})}$

    2.3 if $U[0,1] \le A(x’, x_{i})$ then $ x_{i+1} \leftarrow x’$ (accept) else $ x_{i+1} \leftarrow x_{i}$ (reject)

In the sample step 2.1, it is possible that we obtain $x’ < 0$ (for instance, if $x_{i}$ is near 0 or $\sigma^2$ is large). One intuitive approach is to automatically reject samples that are outside $\text{supp}(p)$ and repeat until $x>0$. In the case of $Ga(2,1)$, this rejection would be enforced anyway since the acceptance ratio $A(x’,x_{i})$ will be negative when $x’ < 0$ (if we ignore the fact that $p$ is defined on the positive reals and evaluate $p(x’)$ anyway). However, Wilkinson observes that if we were sampling from $Ga(3,1)$ for instance, then $$p(x) = \frac{1}{Z} x^2\exp(-x), x > 0$$ which is positive even when $x < 0$ (again, when ignoring the positive constraint on $x$). Our rejection step would not have been guaranteed to occur in 2.3 had we decided to keep the bad sample $x’ < 0$ in step 2.1. So we were just lucky with the $Ga(2,1)$ example, and need a more robust sampling strategy for the general case.

Rejection sampling changes the proposal distribution

By discarding samples outside the target support in step 2.1, we are rejection sampling the Gaussian proposal *$N(\cdot|x_i,\sigma^2)$ for all values $x’ < 0$. This strategy is equivalent to sampling from a truncated Gaussian, which is not symmetric in its mean and argument. Therefore, we cannot cancel terms in the acceptance ratio (step 2.2) anymore.

A Correct Sampler Needs A Correct Acceptance Ratio

Generalizing the results from Wilkinson, suppose that the target $p(x)$ has $\text{supp}(p) = (a,b)$ and the Gaussian has variance $\sigma^2$ (typically, this variance is adapted during runtime to achieve a desired acceptance rate). We will still 

reject samples $x’ \notin (a,b)$ in 2.1, and keep sampling until $x’ \in (a,b)$. We need to account for all the previous rejections. The proposal density is now $$N_{(a,b)}(x’ | x_i, \sigma^2) = \frac{ \frac{1}{\sigma} \phi (\frac{x’-x_i}{\sigma}) } {\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma})} \mathbf{I}_{(a,b)}(x’),$$ where $\phi$ is the standard Gaussian density, and $\Phi$ is the standard Gaussian cumulative distribution function. While the numerator is symmetric in $(x’,x_i)$, the denominator is not. It follows that with our new strategy, the correct acceptance ratio for step 2.2 $$A(x’,x_i) = \frac{p(x’) ( \Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma}) ) }{ p(x_{i}) ( \Phi(\frac{b-x’}{\sigma}) – \Phi(\frac{a-x’}{\sigma})) } $$ A good step is to check we agree with Wilkinson in the special case that $a=0,b=\infty,\sigma^2=1$. In this case, the term $\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma}) = 1-\Phi(-x_i) = \Phi(x_i)$, and similarly for the term in the numerator. This result cross-checks with Wilkinson. Also remember to change the initialize step 1 to choose $x_0$ from the range of interest.

The rejection rate and numerical stability

The good news for this sampler is that the mean of the truncated normal $N_{(a,b)}(\cdot|x_i,\sigma^2)$ is guaranteed to remain in the truncation interval, which means the rejection rate is not going to be problematic (unless for some reason $\sigma^2$ is unusually large, which should be designed in tandem with $\text{supp}(p)$ to avoid this issue). The rejection rate is a problem when the mean $x_i$ of the underlying, non-truncated normal is not in the interval $(a,b)$, and more elaborate algorithms exist (or use a different proposal). Another critical issue in this case is that $\Phi(\frac{b-x_i}{\sigma}) – \Phi(\frac{a-x_i}{\sigma})$ becomes numerically unstable, tends to zero. We would be able to get away with $x/0$ or $0/x$ in $A$, but we might also have the numerically disastrous $0/0$. Thankfully, we do not have these problems.


Samplers are delicate probabilistic objects, treat them with care.