Quick tip on arXiv LaTeX submissions

While posting a submission to the arXiv, some of my figure filenames contained an @ character and were referenced in \includegraphics commands. This caused the arXiv LaTeX build system to fail noisily: the failure log typically says that the offending file with the @ character cannot be found. The solution is to replace the @ characters in such filenames with - (dash).
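A minimal shell sketch of the rename (the directory and filenames here are hypothetical, and any \includegraphics references in the .tex source must be updated to match; `${f//@/-}` is a bash parameter expansion):

```shell
# demo setup with a hypothetical figure name containing '@'
mkdir -p figures
touch "figures/plot@final.png"

# rename every file under figures/ whose name contains '@'
for f in figures/*@*; do
  mv -- "$f" "${f//@/-}"   # bash: replace all '@' with '-'
done
```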

Notating “conditional” probabilities

Let $p$ denote a probability density with support $x \in \mathcal{X}$, and let $\theta \in \mathbf{T}$ parametrize $p$. It is common in the literature to express this situation using the notation $p(x\mid \theta)$. However, this “conditional” notation using the vertical bar ($\mid$ in LaTeX) is often used ambiguously.

Notation with the conditional symbol $p(x \mid \theta)$ should only be used when there exists a joint density $p(x,\theta)$ with normalized marginals that properly define a conditional density; i.e.

  1. $p(x) = \int_{\theta\in \mathbf{T}}p(x,\theta)\mathrm{d}\theta$
  2. $p(\theta) = \int_{x \in \mathcal{X}}p(x,\theta)\mathrm{d}x$
  3. $p(x\mid\theta) := p(x,\theta)/p(\theta)$, for $p(\theta) \ne 0$.

If there is no underlying joint density $p(x,\theta)$ in the probabilistic setup, then the symbol $p(x\mid\theta)$ is not well-defined, which leads to ambiguity or confusion.
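As a concrete illustration, the three conditions are mechanical to check for a discrete toy joint (every number in the table below is invented for illustration):

```python
# toy discrete joint over x in {0,1} and theta in {0,1}
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# conditions 1 and 2: marginals are obtained by summing out the other variable
p_x = {x: sum(v for (xx, t), v in joint.items() if xx == x) for x in (0, 1)}
p_theta = {t: sum(v for (x, tt), v in joint.items() if tt == t) for t in (0, 1)}

# condition 3: p(x | theta) is only well-defined because the joint above exists
p_x_given_theta = {(x, t): joint[(x, t)] / p_theta[t]
                   for x in (0, 1) for t in (0, 1)}
```

Each conditional slice sums to one, e.g. $p(0 \mid 0) + p(1 \mid 0) = 1$; without the joint table, the entries of `p_x_given_theta` would have no definition to appeal to.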

One place this issue arises is in variational inference. We have a generative model given by $p(x,z) = p(x \mid z)p(z)$ and an intractable posterior $p(z\mid x)$ which we need to estimate. The variational (or approximating) density is written as $q(z \mid \phi)$, and we minimize the KL divergence $D_{\mathrm{KL}}(q(z\mid \phi) || p(z \mid x))$ over $\phi \in \mathcal{S}$. The reason that $q(z \mid \phi)$ does not make sense, however, is that the variational distribution $q$ does not have a prior over its parameter $\phi$, so $q(z,\phi)$ is not a well-defined joint density.

Bayesians who use this notation (of which there are many) may argue that the conditioning sign just means “parametrized by” — regardless of whether there is a joint probability model over all terms on both sides of the conditioning sign. However, this convention is problematic. Reusing the variational inference example from the previous paragraph, our variational distribution $q$ may additionally be parametrized by the observations $x$, so that one is minimizing $D_{\rm KL}(q(z\mid \phi, x) || p(z \mid x))$ over $\phi \in \mathcal{S}$ with $x$ (a semantically distinct object from $\phi$) held fixed at the observed values.

The situation becomes even more complicated in settings such as variational auto-encoders, where $p$ also has some additional parameters $\theta$. The true posterior density is notated as $p_\theta(z \mid x)$ and the variational density is $q_\phi(z \mid x)$. There is a type mismatch in this notation, since $x$ from the perspective of $p$ (a random variable with a generative prior) is of a different “type” than the $x$ from the perspective of $q$ (an exogenous parameter). Using the convention that exogenous parameters appear as subscripts, the type-correct notation would be $p_\theta(z\mid x)$ for the Bayesian posterior density, and $q_{\phi,x}(z)$ for the variational density.

An alternative convention, which appears in frequentist statistics, is to use semi-colons for exogenous parameters — this approach yields $p(z \mid x; \theta)$ and $q(z; x, \phi)$ for the VAE. A third convention, which is not widely used, is to remember that densities are nothing more than non-negative functions in some arguments, and to use different symbols for the different densities. More precisely, the joint density would be $p(x,z)$; the marginal density $p(x)$ becomes $m(x)$; and the prior $p(z)$ becomes $\pi(z)$. If there are exogenous parameters $\theta$ which do not represent realizations of random variables, they can simply be added as additional arguments, which gives $p(x,z,\theta)$, $m(x,\theta)$, and $\pi(z,\theta)$. The third approach quickly grows unwieldy, however, and scales poorly as the number of random variables grows.

The purpose of notation in mathematical writing is to use symbols as an aid for conveying technical ideas. Imprecise notation that is inconsistent (or asymmetric) actively works against this goal by obfuscating the key ideas. Clear notation is not only essential for understanding a piece of writing at hand, but also essential for making connections between related concepts across many writings (e.g., the relationship between EM, variational EM, variational Bayes, variational auto-encoders, …). Our brains are particularly good at “pattern matching” when we see familiar notation in unfamiliar contexts, which makes consistent, symmetric, and formally-defined symbols a highly effective technique for extrapolating ideas to new settings.

The fine print: This post has dealt loosely with conditional densities for real-valued random variables. We have avoided measure-theoretic questions regarding the existence of densities when conditioning on events of probability zero, a topic for a future post.

A pair of probability problems

Let $w = (w_1,w_2,\dots)$ be an infinite sequence of nonnegative numbers which sums to unity. Consider the function $$f_w(u) = \sum_{j=1}^{\infty}\mathbf{I}(u < w_j).$$

  1. Prove that $f_w$ is a probability density in $u > 0$.
  2. Describe a simulation technique for generating a random variable $U \sim f_w(u)$.
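One candidate scheme for part 2 (a sketch, not a full solution): draw an index $J$ with probability $w_J$, then draw $U \mid J \sim \mathrm{Uniform}(0, w_J)$, since mixing the uniforms gives density $\sum_j w_j \cdot w_j^{-1}\mathbf{I}(0 < u < w_j) = f_w(u)$. A quick empirical sanity check with geometric weights $w_j = 2^{-j}$, for which $\mathbb{E}[U] = \sum_j w_j^2/2 = 1/6$:

```python
import random

def sample_fw(weights, rng):
    # pick index j with probability w_j, then U | j ~ Uniform(0, w_j)
    u = rng.random()
    acc = 0.0
    for w in weights:
        acc += w
        if u < acc:
            return rng.uniform(0.0, w)
    return rng.uniform(0.0, weights[-1])  # guard against float round-off

rng = random.Random(42)
w = [0.5 ** j for j in range(1, 30)]  # geometric weights, summing to ~1
draws = [sample_fw(w, rng) for _ in range(100_000)]
```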

On statistical two-sample testing

Statistical two-sample testing concerns the following question: given observations $X = \lbrace x_1, x_2, \dots, x_m \rbrace$ and $Y = \lbrace y_1, y_2, \dots, y_n \rbrace$ drawn from distributions $p$ and $q$, respectively, can we decide whether $p=q$?

Two-sample testing is an active field of research with an enormous amount of literature. Asymptotic theory of two-sample tests is typically concerned with establishing consistency, meaning that the probability of declaring $p=q$, when in fact $p \ne q$, goes to zero in the large sample limit.

Distinguishing distributions with finite number of samples

A very nice (negative) result is that it is generally impossible to distinguish two distributions with high probability by observing only a pair of finite samples of fixed size. The result is also somewhat disappointing for those who are particularly interested in theoretical guarantees of statistical procedures in applied inference, where the ideal of asymptotia is a far reach from the chaotic array of csv files, and other inhomogeneous data sources and formats, sitting on the analyst’s hard drive. Anyway, consider the following scenario, which is from Gretton et al., “A Kernel Two-Sample Test,” JMLR 2012 (a highly recommended paper to read):

Assume we have a distribution $p$ from which we have drawn $m$ iid observations. Construct a distribution $q$ by drawing $m^2$ iid observations from $p$ and defining a discrete distribution over these $m^2$ observations with probability $m^{-2}$ each. It is easy to check that if we now draw $m$ observations from $q$, there is at least a ${m^2 \choose m}\frac{m!}{m^{2m}} > 1 - e^{-1} > 0.63$ probability that we thereby obtain an $m$ sample from $p$. Hence, no test will be able to distinguish samples from $p$ and $q$ in this case. The probability of detection can be made arbitrarily small by increasing the size $m$ of the sample from which we construct $q$.

To understand why this setup implies the impossibility of two-sample testing with high probability in the finite setting, we need to think about the frequentist properties of the procedure. Suppose we repeat the experiment a very large number of times, say 900,000 times. Marginalizing over all experiments, the samples drawn from $q$ will follow a different distribution than samples drawn from $p$. However, in roughly 2/3, or 600,000, of the experiments, the observation set drawn from $q$ is a random sample from $p$. Conditioned on being in one of these 600,000 trials where the $m$ samples are from the same distribution, an ideal two-sample test should report $p=q$, thereby committing a Type II error with large probability.

Now, the authors left the derivation of the combinatorial bound ${m^2 \choose m}\frac{m!}{m^{2m}}$, for the probability that the $m$ draws from $q$ are an $m$ sample from $p$, as an exercise for the reader. It actually makes a nice homework or exam question for a 101 probability course (as do many other byproducts of academic papers, so I have learned). For those interested in elementary probability and combinatorics, here is a candidate derivation.

First note ambiguities in the setup of the experiment: (i) Is $p$ intended to be a continuous distribution, so that there are no duplicates in the $m^2$ samples? (ii) What exactly is the definition of the event “thereby obtain an $m$ sample from $p$”? To make progress, we assume that (i) $p$ is continuous and the $m^2$ samples in $q$ are unique, and (ii) an “$m$ sample from $p$” means that the resampled sequence contains no duplicates (which occurs with probability zero when sampling directly from $p$). With these assumptions, the bound becomes trivial. The size of the sample space is equal to the number of length $m$ strings from $\lbrace x_1, \dots, x_{m^2} \rbrace$, which is $(m^2)^m$. The number of sequences of length $m$ with unique entries is ${m^2 \choose m}m!$. Dividing the latter quantity by the former gives the bound.
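Under these assumptions, the quantity is also easy to evaluate numerically (a quick sanity check; `p_unique` is a name chosen here for illustration):

```python
from math import comb, factorial

def p_unique(m):
    # probability that m draws (with replacement) from m^2 equally likely,
    # distinct atoms contain no duplicates: C(m^2, m) * m! / m^(2m)
    return comb(m * m, m) * factorial(m) / (m * m) ** m

for m in range(2, 8):
    print(m, p_unique(m))
```

For instance, $m=2$ gives ${4 \choose 2} \cdot 2!/2^4 = 12/16 = 0.75$, comfortably above $0.63$.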

If the assumptions above are too strong and $p$ is discrete, then I am no longer sure what the definition of an “$m$ sample from $p$” is. Perhaps someone can let me know if alternative meanings are clear to them. Discreteness of $p$ would indeed provide a much more compelling example, since one can arrive at arbitrary results based on pathologies of real numbers being uncomputable.

A simpler counter-example with an infinite number of samples

To the computer scientist, real numbers are uninteresting. Let us suppose $p$ is a discrete distribution with finite support, and additionally restrict its probability mass at each point in the support to be a rational number. One might simplify the above experiment by defining $q_m$ to:

  • generate a size $m$ sample from $p$, with probability $1-1/m$;

  • generate a size $m$ sample from an arbitrary distribution, with probability $1/m$.

Here is an interesting question — is there any test which can have non-zero power for testing $p$ versus $q_m$? Note that it is always the case that $p \ne q_m$. The probability of detection, also known as the power $\beta$ of the statistical test, is $1/m$, which decays to zero as $m \to \infty$. Therefore, the Type II error $(1-\beta)$ necessarily goes to one, irrespective of the testing procedure. Stated otherwise, for any testing procedure $F$ and tolerance $\epsilon > 0$, there exists a pair of distributions $(p,q_m)$ such that $\Pr\left[F\; \textrm{declares}\; p \ne q_m \right] = \beta < \epsilon$, where the probability is taken under the sampling distribution of datasets $D_p \sim p, D_q \sim q_m$.
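A minimal sketch of drawing datasets from $q_m$ (the choice of $p$ as a fair coin and the particular alternative distribution are assumptions made purely for illustration):

```python
import random

def sample_qm(m, rng, alt_support=(100, 101)):
    """Draw one size-m dataset from q_m: with probability 1 - 1/m the whole
    sample comes from p (here: fair coin flips), otherwise from an
    arbitrary alternative distribution (here: uniform on alt_support)."""
    if rng.random() < 1.0 - 1.0 / m:
        return [rng.randint(0, 1) for _ in range(m)]    # sample from p
    return [rng.choice(alt_support) for _ in range(m)]  # sample from the alternative

rng = random.Random(0)
datasets = [sample_qm(20, rng) for _ in range(10_000)]
# fraction of datasets that betray q_m != p is about 1/m = 0.05
contaminated = sum(any(v > 1 for v in d) for d in datasets) / len(datasets)
```

Only the contaminated datasets carry any evidence that $p \ne q_m$, which is the sense in which no test can have power above roughly $1/m$ here.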

Also note that $q_m$ converges in distribution to $p$ as $m \to \infty$. This construction is rather different than standard Type II error analysis, since our distribution $q_m$ is itself a function of the sample size $m$, so we cannot formally take the limit as $m \to \infty$ to compute the Type II error. The counter-example above should instead be seen as a “meta counter-example”, showing a single adversarial pair of distributions $(p,q_m)$ and finite sample size $m$ which together will trick an arbitrary testing procedure $F$ into obtaining arbitrarily low power.

A thought experiment with the Bayesian posterior predictive distribution

Let $\pi(\theta)$ be a prior for parameter $\Theta$, and $p(x|\theta)$ a likelihood which generates an exchangeable sequence of random variables $(X_1,X_2,X_3\dots)$.

Given a set of observations $D := \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$, the posterior predictive distribution for the next random variable in the sequence $X_N$ is defined as $$p(X_{N}=s | D) = \int p(X_{N}=s | D,\theta) \pi(\theta|D)d\theta = \int p(X_{N}=s|\theta) \pi(\theta|D)d\theta, $$

where the second equality follows from assuming the data is exchangeable (or i.i.d conditioned on latent parameter $\theta$). The posterior predictive density evaluated at ${X_{N}=s}$ is an expectation under the posterior distribution $\pi(\theta|D)$. Define the function $g(s,\theta) := p(X_{N}=s|\theta)$ (note that $g$ does not depend on the index $N$ of the random variable since $\theta$ is known), and then compute the expectation of the random variable $g(s,\Theta)$ under $\pi(\theta|D)$, $$ p(X_N=s | D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,\Theta) \right]. $$

Now consider the case where each random variable $X_i$ is a two-dimensional vector $X_i = (X_{[i,1]}, X_{[i,2]}).$ The data $D = \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$ is thus an exchangeable sequence of bivariate observations. (Assume for simplicity that marginalizing and conditioning the joint distribution $p(X_{[i,1]},X_{[i,2]}|\theta)$ are easy operations.) We again perform inference to obtain the posterior $\pi(\theta|D)$.

Suppose we wish to evaluate the probability (density) of the event $\lbrace X_{[N,1]}=s \mid X_{[N,2]}=r \rbrace$ under the posterior predictive. I am of two minds about what this quantity could mean:

Approach 1

Define the conditional probability density again as an expectation of a function of $\Theta$ under the posterior distribution. In particular, let the probe function $g(s,r,\theta) := p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ (recalling that $g$ does not depend on $N$ when $\theta$ is known) and then compute the expectation of $g(s,r,\Theta)$ under $\pi(\theta|D)$, $$ p_{\text{A}1}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,r,\Theta) \right]. $$

Approach 2

Define the desired conditional probability density by application of the Bayes Rule. Namely, separately compute two quantities:

  1. joint: $ p(X_{[N,1]}=s,X_{[N,2]}=r|D) = \int p(X_{[N,1]}=s,X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $
  2. marginal: $ p(X_{[N,2]}=r|D) = \int p(X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $

and then return their ratio, $$ p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \frac{p(X_{[N,1]}=s,X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)}. $$

Note that Approach 2 is equivalent to appending the condition $\lbrace X_{[N,2]}=r \rbrace$ to the observation set $D$ so that $D’ := D \cup \lbrace X_{[N,2]}=r \rbrace$ and the new posterior distribution is $\pi(\theta|D’)$. It then computes the expectation of $g(s,r,\Theta)$ under $\pi(\cdot|D’)$, $$ p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D’)}\left[ g(s,r,\Theta) \right]. $$

Exercise: Show why the two expressions for $p_{\text{A}2}$ are equivalent.
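One way to see the equivalence (a sketch): by Bayes rule the updated posterior is $\pi(\theta|D’) = p(X_{[N,2]}=r|\theta)\,\pi(\theta|D) / p(X_{[N,2]}=r|D)$, so $$ \mathbb{E}_{\pi(\cdot|D’)}\left[ g(s,r,\Theta) \right] = \frac{\int p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)\, p(X_{[N,2]}=r|\theta)\, \pi(\theta|D)\, d\theta}{p(X_{[N,2]}=r|D)} = \frac{p(X_{[N,1]}=s,X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)}, $$ where the last step multiplies the conditional by the marginal inside the integral to recover the joint.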


The question is thus, does the Bayesian reasoner update their beliefs about $\theta$ based on the condition ${X_{[N,2]}=r}$? I think both approaches can make sense:

In Approach 1, we do not treat $\lbrace X_{[N,2]}=r \rbrace$ as a new element of the observation sequence $D$; instead we define the probe function $g(s,r,\theta)$ based on the conditional probability (which is a function of the population parameter), and then compute its expectation.

Approach 2 follows more directly from the “laws of probability” but is less interpretable from the Bayesian paradigm. Why? Because if ${\Theta = \theta}$ is known, then $p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ is just a real number — since the Bayesian does not know $\theta$, they marginalize over it. But it is unclear why the probe function $g(s,r,\theta)$ should influence the distribution $\pi(\theta|D)$, regardless of whether it happens to represent a density parameterized by $\theta$.

Next Steps

Perhaps I should numerically/analytically compute the difference between Approach 1 and Approach 2 for a bivariate Gaussian with known covariance and unknown mean. For simplicity, just use the prior predictive, letting $D=\varnothing$.
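Here is a Monte Carlo sketch of that comparison for the prior predictive ($D=\varnothing$); all numerical settings (prior scale $\tau$, correlation $\rho$, probe point $(s,r)$) are illustrative assumptions:

```python
import numpy as np

# illustrative setup: prior mu ~ N(0, tau^2 I) and likelihood
# X | mu ~ N(mu, [[1, rho], [rho, 1]]); with D = {} the posterior is the prior
tau, rho, r, s = 1.0, 0.5, 1.0, 0.5

rng = np.random.default_rng(0)
mu = rng.normal(0.0, tau, size=(200_000, 2))  # draws from pi(. | D) = prior

def npdf(x, mean, var):
    # univariate Gaussian density
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Approach 1: expectation of the probe g(s, r, mu) = p(X1 = s | X2 = r, mu)
cond = npdf(s, mu[:, 0] + rho * (r - mu[:, 1]), 1 - rho ** 2)
a1 = cond.mean()

# Approach 2: ratio of the predictive joint to the predictive marginal,
# using p(s, r | mu) = p(s | r, mu) * p(r | mu)
a2 = (cond * npdf(r, mu[:, 1], 1.0)).mean() / npdf(r, mu[:, 1], 1.0).mean()

print(a1, a2)
```

With these settings the two answers genuinely differ: Approach 1 works out to a Gaussian in $s$ with mean $\rho r$ and variance $\tau^2(1+\rho^2)+1-\rho^2$, while Approach 2 conditions the prior predictive Gaussian $N(0, \Sigma + \tau^2 I)$, giving mean $\rho r/(1+\tau^2)$ and variance $(1+\tau^2)-\rho^2/(1+\tau^2)$, so the gap is not a Monte Carlo artifact.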

Trump, Clinton, and the failures of election forecasting

Election season has come and gone. Donald Trump pulled off what has been repeatedly characterized as a “stunning upset” over Hillary Clinton. The web has gone rampant with postmortem analysis about the failures of election polling and forecasting. But was it really that surprising?

Several polls conducted in the lead up to the election reported on the virtually deadlocked race, all well-within any reasonable margin of error:

Quinnipiac University reported on the situation in key battleground states (Nov 2)

Democrat Hillary Clinton’s October momentum comes to a halt as she clings to a small lead in Pennsylvania, while Republican Donald Trump moves ahead in Ohio, leaving Florida and North Carolina too close to call.

Probability forecast models on the other hand were remarkably off-mark and predicted Clinton well-ahead just the night before the election:

  • New York Times Upshot: Clinton 84%, Trump 16%
  • FiveThirtyEight: Clinton 66.9%, Trump 33%
  • PredictWise: Clinton 89%, Trump 11%

I was watching the blitz of last-ditch rallies held by Trump and Clinton the night before election day. The candidates expressed incompatible sentiments about their chances to win. Here is a revealing segment from Trump’s penultimate rally in New Hampshire (8pm on Nov 7):

We are going right after this to Michigan, because Michigan is in play… The polls just came out: we are leading in Michigan; we are leading in New Hampshire; we are leading in Ohio; we are leading in Iowa; leading in North Carolina; I think we are doing really, really well in Pennsylvania; and I do believe we are leading in Florida.

In the meantime, according to the New York Times:

Mrs. Clinton’s campaign was so confident in her victory that her aides popped open Champagne on the campaign plane early Tuesday.

Either way, each candidate and their popular base was clearly happy to live in their own reality right down to the wire. Some personal take-aways from this whole affair:

  • The law of total variance for polling. Uncertainty about polling results is best understood not only by assessing the variance within a single poll, but also by studying the variance across multiple competing polls.

  • Forecasting is an enormously complex empirical problem. Formal statistical methods are unable to accurately quantify the uncertainty around the web of non-statistical effects that ultimately influence a voter’s decision at the ballot box.

  • A common reaction was for people to blame journalists and cancel their newspaper subscriptions. Well-informed voters should not derive the bulk of their information from a cursory reading of social or national news media; they should be conscious in their critique of information, especially of numbers and “facts” that confirm their own hopes and beliefs.

Metropolis-Hastings with Gaussian drift proposal on bounded support

Professor Darren Wilkinson has an excellent blog post on a common implementation issue that arises when the standard random-walk Metropolis-Hastings algorithm, with a Gaussian proposal, is modified to sample from a target distribution $p(x)$ where $$\text{supp}(p) = \lbrace x : p(x) > 0 \rbrace = (0, \infty).$$ In this blog post, we are going to briefly review the setup and the problem, and then generalize the solution proposed by Wilkinson to the situation where $\text{supp}(p) = (a,b)$ (instead of the positive reals). We will also allow the Gaussian proposal to have a general variance $\sigma^2$ (rather than just $\sigma^2 = 1$). The purpose of this blog post is to

  • illustrate that small programmatic changes in a probabilistic algorithm can violate correctness;
  • help anyone implementing this sampler find the general derivation informative and useful for their own purposes.

Sampling from a Gamma using MH

To reuse the example from Wilkinson, suppose we are interested in sampling from a gamma with shape $\alpha = 2$ and scale $\theta =1$, $$p(x) = \frac{1}{Z} x\exp(-x), x > 0$$ by implementing random walk MH. The proposal $N(\cdot | x_i, \sigma^2)$ is a Gaussian with mean at the current sample $x_{i}$, and known variance $\sigma^2$ (aside: the gamma can be sampled much more efficiently). The standard MH algorithm proceeds as:

  1. initialize $x_0$ to some value in $(0, \infty)$.

  2. for $i = 0,1,2,\dots$

    2.1 sample from the proposal, $x’ \sim N(\cdot|x_i,\sigma^2)$

    2.2 compute the acceptance ratio $A(x’, x_i) = \frac{ p(x’)N(x_i| x’, \sigma^2) }{ p(x_{i})N(x’| x_i, \sigma^2) } = \frac{p(x’)}{p(x_{i})}$

    2.3 sample $u \sim U[0,1]$; if $u \le A(x’, x_{i})$ then $ x_{i+1} \leftarrow x’$ (accept) else $ x_{i+1} \leftarrow x_{i}$ (reject)

In the sample step 2.1, it is possible that we obtain $x’ < 0$ (for instance, if $x_{i}$ is near 0 or $\sigma^2$ is large). One intuitive approach is to automatically reject samples that fall outside $\text{supp}(p)$ and repeat until $x’ > 0$. In the case of $Ga(2,1)$, this rejection would be enforced anyway, since the acceptance ratio $A(x’,x_{i})$ will be negative when $x’ < 0$ (if we ignore the fact that $p$ is defined on the positive reals and evaluate $p(x’)$ anyway). However, Wilkinson observes that if we were sampling from $Ga(3,1)$ for instance, then $$p(x) = \frac{1}{Z} x^2\exp(-x), x > 0$$ which is positive even when $x < 0$ (again, when ignoring the positive constraint on $x$). Had we decided to keep a bad sample $x’ < 0$ in step 2.1, the rejection in step 2.3 would not have been guaranteed to occur. So we were just lucky with the $Ga(2,1)$ example, and need a more robust sampling strategy for the general case.

Rejection sampling changes the proposal distribution

By discarding samples outside the target support in step 2.1, we are rejection sampling the Gaussian proposal $N(\cdot|x_i,\sigma^2)$ for all values $x’ < 0$. This strategy is equivalent to using a truncated Gaussian as the proposal itself, which is not symmetric in its mean and argument. Therefore, we can no longer cancel terms in the acceptance ratio (step 2.2).

A correct sampler needs a correct acceptance ratio

Generalizing the results from Wilkinson, suppose that the target $p(x)$ has $\text{supp}(p) = (a,b)$ and the Gaussian has variance $\sigma^2$ (typically, this variance is adapted during runtime to achieve a desired acceptance rate). We will still reject samples $x’ \notin (a,b)$ in step 2.1, and keep sampling until $x’ \in (a,b)$, but we must account for all the previous rejections. The proposal density is now

$$ N_{(a,b)}(x’ | x_i, \sigma^2) = \frac{ \frac{1}{\sigma} \phi (\frac{x’-x_i}{\sigma}) } {\Phi(\frac{b-x_i}{\sigma}) - \Phi(\frac{a-x_i}{\sigma})} \mathbf{I}_{(a,b)}(x’), $$

where $\phi$ is the standard Gaussian density, and $\Phi$ is the standard Gaussian cumulative distribution function. While the numerator is symmetric in $(x’,x_i)$, the denominator is not. It follows that with our new strategy, the correct acceptance ratio for step 2.2 is $$A(x’,x_i) = \frac{p(x’) ( \Phi(\frac{b-x_i}{\sigma}) - \Phi(\frac{a-x_i}{\sigma}) ) }{ p(x_{i}) ( \Phi(\frac{b-x’}{\sigma}) - \Phi(\frac{a-x’}{\sigma})) }. $$ A good step is to check that we agree with Wilkinson in the special case $a=0,b=\infty,\sigma^2=1$. In this case, the term $\Phi(\frac{b-x_i}{\sigma}) - \Phi(\frac{a-x_i}{\sigma}) = 1-\Phi(-x_i) = \Phi(x_i)$, and similarly the term in the denominator becomes $\Phi(x’)$. This result cross-checks with Wilkinson. Also remember to change the initialization in step 1 to choose $x_0$ from the interval $(a,b)$.
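A minimal Python sketch of the corrected sampler (the function names and settings are mine; the target is the $Ga(3,1)$ example from above):

```python
import math
import random

def phi_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mh_truncated(log_p, a, b, sigma, x0, n_iter, seed=0):
    """Random-walk MH whose Gaussian proposal is rejection-sampled to (a, b),
    with the acceptance ratio corrected for the truncated proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_iter):
        # step 2.1: resample until the proposal lands inside the support
        while True:
            xp = rng.gauss(x, sigma)
            if a < xp < b:
                break
        # step 2.2: normalizing constants of the two truncated proposals
        z_x = phi_cdf((b - x) / sigma) - phi_cdf((a - x) / sigma)
        z_xp = phi_cdf((b - xp) / sigma) - phi_cdf((a - xp) / sigma)
        log_accept = log_p(xp) - log_p(x) + math.log(z_x) - math.log(z_xp)
        # step 2.3: accept or reject
        if rng.random() < math.exp(min(0.0, log_accept)):
            x = xp
        samples.append(x)
    return samples

# target: Ga(3, 1) with density p(x) proportional to x^2 exp(-x) on (0, inf)
samples = mh_truncated(lambda x: 2.0 * math.log(x) - x,
                       a=0.0, b=float("inf"), sigma=2.5, x0=1.0, n_iter=50_000)
```

In the special case $a=0, b=\infty$, `z_x` reduces to $\Phi(x/\sigma)$, matching the cross-check above; the chain's sample mean should settle near the $Ga(3,1)$ mean of 3.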

The rejection rate and numerical stability

The good news for this sampler is that the mean of the truncated normal $N_{(a,b)}(\cdot|x_i,\sigma^2)$ is guaranteed to remain in the truncation interval, which means the rejection rate is not going to be problematic (unless $\sigma^2$ is unusually large; the proposal variance should be chosen in tandem with $\text{supp}(p)$ to avoid this issue). The rejection rate does become a problem when the mean $x_i$ of the underlying, non-truncated normal is not in the interval $(a,b)$; more elaborate algorithms exist for that regime (or one can use a different proposal). Another critical issue in that case is that $\Phi(\frac{b-x_i}{\sigma}) - \Phi(\frac{a-x_i}{\sigma})$ becomes numerically unstable as it tends to zero. We could get away with $x/0$ or $0/x$ in $A$, but we might also encounter the numerically disastrous $0/0$. Thankfully, since our chain never leaves $(a,b)$, we do not have these problems.


Samplers are delicate probabilistic objects; treat them with care.

Hello world!

Welcome to Random Seed. A good proportion of implementors of probabilistic computation fail to address two key issues relating to entropy control in their software:

  • Keeping track of the state of the random number generator.
  • Permuting the random seed.

We will look into these two issues, as well as many other topics, throughout future blog posts.