Fun with two-sample testing

Statistical two-sample testing concerns the following question: given observations $X = \lbrace x_1, x_2, \dots, x_m \rbrace$ and $Y = \lbrace y_1, y_2, \dots, y_n \rbrace$ drawn from distributions $p$ and $q$, respectively, can we decide whether $p=q$?

Two-sample testing is an active field of research with an enormous amount of literature. Asymptotic theory of two-sample tests is typically concerned with establishing consistency, meaning that the probability of declaring $p=q$, when in fact $p \ne q$, goes to zero in the large sample limit.
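For concreteness, here is what a two-sample test looks like in practice. The snippet below is an illustrative sketch using scipy's two-sample Kolmogorov-Smirnov test on Gaussian samples that differ in mean; the distributions and sample sizes are arbitrary choices.

```python
# Illustrative two-sample test: compare a sample from N(0, 1) against a sample
# from N(0.5, 1) using the two-sample Kolmogorov-Smirnov test. A small p-value
# is evidence against the null hypothesis p = q.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)  # m observations from p
y = rng.normal(loc=0.5, scale=1.0, size=500)  # n observations from q
result = stats.ks_2samp(x, y)
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")
```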

Distinguishing distributions with a finite number of samples

A very nice (negative) result is that it is generally impossible to distinguish two distributions with high probability by observing only a pair of finite samples of fixed size. The result is also somewhat disappointing for those who are particularly interested in theoretical guarantees of statistical procedures in applied inference, where the ideal of asymptotia is a far cry from the chaotic array of CSV files and other inhomogeneous data sources sitting on the analyst’s hard drive. Anyway, consider the following scenario, which is from Gretton et al., “A Kernel Two-Sample Test,” JMLR, 2012 (a highly recommended paper to read):

Assume we have a distribution $p$ from which we have drawn $m$ iid observations. Construct a distribution $q$ by drawing $m^2$ iid observations from $p$ and defining a discrete distribution over these $m^2$ observations with probability $m^{-2}$ each. It is easy to check that if we now draw $m$ observations from $q$, there is at least a ${m^2 \choose m}\frac{m!}{m^{2m}} > 1 - e^{-1} > 0.63$ probability that we thereby obtain an $m$ sample from $p$. Hence, no test will be able to distinguish samples from $p$ and $q$ in this case. The probability of detection can be made arbitrarily small by increasing the size of the sample from which we construct $q$.

To understand why this setup implies the impossibility of two-sample testing with high probability in the finite setting, we need to think about the frequentist properties of the procedure. Suppose we repeat the experiment a very large number of times, say 900,000 times. Marginalizing over all experiments, the samples drawn from $q$ will follow a different distribution than samples drawn from $p$. However, in roughly 2/3, or around 600,000, of the experiments, the observation set drawn from $q$ is a genuine random sample from $p$. Conditioned on being in one of those trials, no test can distinguish the two samples, so any sensible two-sample test will report $p=q$ and thereby commit a Type II error; since such trials occur with probability at least 0.63 (a fraction that can be pushed arbitrarily close to one by enlarging the sample from which $q$ is constructed), the Type II error occurs with at least that probability.
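For concreteness, here is a small simulation sketch of the construction, taking $p$ to be a standard normal (an arbitrary choice) and using the two-sample KS test as a stand-in for a generic test; the rejection rate stays near the nominal level, i.e., the test cannot tell $p$ from $q$ at this sample size.

```python
# Simulation sketch of the construction above, with p taken to be N(0, 1):
# q is the empirical distribution of m^2 draws from p; we then test a fresh
# m-sample from p against an m-sample from q. The rejection rate stays near
# the nominal level, so the test cannot tell the two apart.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, trials, alpha = 50, 1000, 0.05

rejections = 0
for _ in range(trials):
    atoms = rng.standard_normal(m**2)            # the m^2 draws from p that define q
    x = rng.standard_normal(m)                   # a fresh m-sample from p
    y = rng.choice(atoms, size=m, replace=True)  # an m-sample from q
    if stats.ks_2samp(x, y).pvalue < alpha:
        rejections += 1

print(f"rejection rate: {rejections / trials:.3f} (nominal level {alpha})")
```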

Now, the authors left the derivation of the combinatorial bound ${m^2 \choose m}\frac{m!}{m^{2m}}$, for the probability that the $m$ draws from $q$ are an $m$ sample from $p$, as an exercise for the reader. It actually makes a nice homework or exam question for an introductory probability course (as do many other byproducts of academic papers, so I have learned). For those interested in elementary probability and combinatorics, here is a candidate derivation.

First, note two ambiguities in the setup of the experiment: (i) Is $p$ intended to be a continuous distribution, so that there are no duplicates among the $m^2$ samples? (ii) What exactly is the definition of the event “thereby obtain an $m$ sample from $p$”? To make progress, we assume that (i) $p$ is continuous and the $m^2$ samples defining $q$ are unique, and (ii) an “$m$ sample from $p$” means that the resampled sequence contains no duplicates (duplicates occur with probability zero when sampling directly from $p$). With these assumptions, the bound follows immediately. The size of the sample space is the number of length-$m$ strings over $\lbrace x_1, \dots, x_{m^2} \rbrace$, which is $(m^2)^m$. The number of length-$m$ sequences with distinct entries is ${m^2 \choose m}m!$. Dividing the latter quantity by the former gives the bound.
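As a quick sanity check of this derivation (under the same “no duplicates” reading of the event), the following sketch compares the exact expression against a Monte Carlo estimate of the probability that $m$ uniform draws with replacement from $m^2$ distinct atoms contain no repeats.

```python
# Sanity check of the combinatorial expression: compare C(m^2, m) * m! / (m^2)^m
# against a Monte Carlo estimate of the probability that m uniform draws with
# replacement from m^2 distinct atoms contain no duplicates.
import math
import numpy as np

rng = np.random.default_rng(0)

def exact(m):
    return math.comb(m**2, m) * math.factorial(m) / (m**2) ** m

def simulated(m, trials=100_000):
    draws = rng.integers(0, m**2, size=(trials, m))
    return np.mean([len(set(row)) == m for row in draws])

for m in [2, 3, 4, 5, 6]:
    print(f"m={m}: exact={exact(m):.4f}, simulated={simulated(m):.4f}")
```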

If the assumptions above are too strong and $p$ is discrete, then I am no longer sure what the definition of an “$m$ sample from $p$” is. Perhaps someone can let me know if alternative meanings are clear to them. Discreteness of $p$ would indeed provide a much more compelling example, since one can arrive at arbitrary results based on pathologies of real numbers being uncomputable.

A simpler counter-example with an infinite number of samples, and a paradox?

To the computer scientist, real numbers are uninteresting. Let us suppose $p$ is a discrete distribution with finite support, and additionally restrict its probability mass at each point of the support to be a rational number. One might simplify the above experiment by defining $q_m$ to:

  • generate a size $m$ sample from $p$, with probability $1-1/m$;

  • generate a size $m$ sample from an arbitrary distribution, with probability $1/m$.

Here is an interesting question: is there any test which can have non-zero power for testing $p$ versus $q_m$? Note that it is always the case that $p \ne q_m$. However, the probability that the sample from $q_m$ carries any evidence of the difference is at most $1/m$, which decays to zero as $m \to \infty$. Therefore, the power of any level-$\alpha$ test is at most $\alpha + 1/m$, and the Type II error cannot be driven to zero, irrespective of the testing procedure. More confusing is the fact that $q_m$ converges to $p$ as the sample size increases. There seems to be a paradox underlying this experiment: how can we resolve it? (Hint: consider whether the frequentist/large-sample intuition of two-sample testing is coherent with our construction of the limiting process for testing $p$ against the sequence of distributions $(q_m)$.)
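Here is a small simulation sketch of this setup; the particular $p$, the “arbitrary distribution” used in the $1/m$ branch, and the chi-square test of homogeneity are all illustrative choices. The rejection rate sits below roughly $\alpha + 1/m$ and decays toward the nominal level as $m$ grows.

```python
# Simulation sketch of testing p versus q_m. The specific p, the "arbitrary
# distribution" r, and the chi-square test of homogeneity are illustrative
# choices; the rejection rate is bounded by roughly alpha + 1/m.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
p = np.array([0.50, 0.25, 0.25])  # p: rational masses on the support {0, 1, 2}
r = np.array([0.05, 0.05, 0.90])  # the "arbitrary distribution" used with probability 1/m
alpha, trials = 0.05, 2000

for m in [20, 100, 500]:
    rejections = 0
    for _ in range(trials):
        x = rng.choice(3, size=m, p=p)                # m-sample from p
        source = r if rng.random() < 1.0 / m else p   # q_m mixes at the level of whole samples
        y = rng.choice(3, size=m, p=source)           # m-sample from q_m
        table = np.vstack([np.bincount(x, minlength=3),
                           np.bincount(y, minlength=3)])
        table = table[:, table.sum(axis=0) > 0]       # drop categories unseen in both samples
        _, pvalue, _, _ = chi2_contingency(table)
        if pvalue < alpha:
            rejections += 1
    print(f"m={m}: rejection rate ~ {rejections / trials:.3f}, alpha + 1/m = {alpha + 1 / m:.3f}")
```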

A thought experiment with the Bayesian posterior predictive distribution

Let $\pi(\theta)$ be a prior for the parameter $\Theta$, and $p(x|\theta)$ a likelihood which generates an exchangeable sequence of random variables $(X_0, X_1, X_2, \dots)$.

Given a set of observations $D := \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$, the posterior predictive distribution for the next random variable in the sequence $X_N$ is defined as $$p(X_{N}=s | D) = \int p(X_{N}=s | D,\theta) \pi(\theta|D)d\theta = \int p(X_{N}=s|\theta) \pi(\theta|D)d\theta, $$

where the second equality follows from assuming the data is exchangeable (or i.i.d. conditioned on the latent parameter $\theta$). The posterior predictive density evaluated at $\lbrace X_{N}=s \rbrace$ is thus an expectation under the posterior distribution $\pi(\theta|D)$. Define the function $g(s,\theta) := p(X_{N}=s|\theta)$ (note that $g$ does not depend on the index $N$ of the random variable since $\theta$ is known), and then compute the expectation of the random variable $g(s,\Theta)$ under $\pi(\theta|D)$, $$ p(X_N=s | D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,\Theta) \right]. $$
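For instance, with a Bernoulli likelihood $p(x|\theta) = \theta^x (1-\theta)^{1-x}$ and a $\mathrm{Beta}(a,b)$ prior, we have $g(1,\theta) = \theta$, and the posterior predictive is just the posterior mean of $\Theta$, $$ p(X_N=1 | D) = \mathbb{E}_{\pi(\cdot|D)}\left[\Theta\right] = \frac{a + \sum_{i=0}^{N-1} x_i}{a+b+N}. $$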

Now consider the case where each random variable $X_i$ is a two-dimensional vector $X_i = (X_{[i,1]}, X_{[i,2]}).$ The data $D = \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$ is thus an exchangeable sequence of bivariate observations. (Assume for simplicity that marginalizing and conditioning the joint distribution $p(X_{[i,1]},X_{[i,2]}|\theta)$ are easy operations.) We again perform inference to obtain the posterior $\pi(\theta|D)$.

Suppose we wish to evaluate the probability (density) of the event $\lbrace X_{[N,1]}=s \mid X_{[N,2]}=r \rbrace$ under the posterior predictive. I am in two minds about what this quantity could mean:

Approach 1

Define the conditional probability density again as an expectation of a function of $\Theta$ under the posterior distribution. In particular, let the probe function $g(s,r,\theta) := p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ (recalling that $g$ does not depend on $N$ when $\theta$ is known) and then compute the expectation of $g(s,r,\Theta)$ under $\pi(\theta|D)$, $$ p_{\text{approach 1}}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,r,\Theta) \right]. $$

Approach 2

Define the desired conditional probability density by application of Bayes' rule. Namely, separately compute two quantities:

joint density: $ p(X_{[N,1]}=s,X_{[N,2]}=r|D) = \int p(X_{[N,1]}=s,X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $

marginal density: $ p(X_{[N,2]}=r|D) = \int p(X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $

and then return their ratio, $$ p_{\text{approach 2}}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \frac{p(X_{[N,1]}=s,X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)}. $$

Note that Approach 2 is equivalent to appending the condition $\lbrace X_{[N,2]}=r \rbrace$ to the observation set $D$, so that $D' := D \cup \lbrace X_{[N,2]}=r \rbrace$ and the new posterior distribution is $\pi(\theta|D')$. It then computes the expectation of $g(s,r,\Theta)$ under $\pi(\cdot|D')$, $$ p_{\text{approach 2}}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D')}\left[ g(s,r,\Theta) \right]. $$

Exercise: Show why the two expressions for $p_{\text{approach 2}}$ are equivalent (or let me know if I made a mistake!)
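For what it is worth, here is one way to see it. By the chain rule, $p(X_{[N,1]}=s, X_{[N,2]}=r|\theta) = g(s,r,\theta)\, p(X_{[N,2]}=r|\theta)$, so $$ \frac{p(X_{[N,1]}=s, X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)} = \frac{\int g(s,r,\theta)\, p(X_{[N,2]}=r|\theta)\, \pi(\theta|D)\, d\theta}{\int p(X_{[N,2]}=r|\theta)\, \pi(\theta|D)\, d\theta} = \int g(s,r,\theta)\, \pi(\theta|D')\, d\theta = \mathbb{E}_{\pi(\cdot|D')}\left[ g(s,r,\Theta) \right], $$ since $\pi(\theta|D') \propto p(X_{[N,2]}=r|\theta)\, \pi(\theta|D)$ by Bayes' rule, with the denominator supplying the normalizing constant.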

Thoughts

The question is thus: does the Bayesian reasoner update their beliefs about $\theta$ based on the condition $\lbrace X_{[N,2]}=r \rbrace$? I think both approaches can make sense:

In Approach 1, we do not treat $\lbrace X_{[N,2]}=r \rbrace$ as a new element of the observation sequence $D$; instead we define the probe function $g(s,r,\theta)$ based on the conditional probability (which is a function of the population parameter), and then compute its expectation.

Approach 2 follows more directly from the “laws of probability” but is less interpretable within the Bayesian paradigm. Why? Because if $\lbrace \Theta = \theta \rbrace$ is known, then $p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ is just a real number; since the Bayesian does not know $\theta$, they marginalize over it. But it is unclear why the probe function $g(s,r,\theta)$ should influence the posterior $\pi(\theta|D)$, regardless of whether it happens to represent a density parameterized by $\theta$.

Next Steps

Perhaps I should numerically/analytically compute the difference between Approach 1 and Approach 2 for a bivariate Gaussian with known covariance and unknown mean. For simplicity, just use the prior predictive, letting $D=\varnothing$.
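Here is a minimal numerical sketch of that comparison in Python, assuming a Gaussian prior on the unknown mean; the particular prior, covariance, and query point below are arbitrary illustrative choices. Approach 1 is estimated by Monte Carlo over the prior, while Approach 2 conditions the prior predictive directly (for this model the prior predictive is $N(\mu_0, \Sigma + \Sigma_0)$).

```python
# Numerical sketch: bivariate Gaussian likelihood X | mu ~ N(mu, Sigma) with known
# Sigma, prior mu ~ N(mu0, Sigma0), and D = {} so the posterior equals the prior.
# We compare the two approaches at a single query point (s, r).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])   # known likelihood covariance
Sigma0 = np.array([[2.0, 0.0], [0.0, 2.0]])  # prior covariance of the unknown mean
mu0 = np.zeros(2)                            # prior mean
s, r = 2.0, 3.0                              # evaluate p(X1 = s | X2 = r)

# Approach 1: E_prior[ p(X1 = s | X2 = r, mu) ], estimated by Monte Carlo over mu.
a = Sigma[0, 1] / Sigma[1, 1]                     # regression slope of X1 on X2 given mu
v = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]  # conditional variance of X1 | X2, mu
mus = rng.multivariate_normal(mu0, Sigma0, size=200_000)
cond_means = mus[:, 0] + a * (r - mus[:, 1])
approach1 = stats.norm.pdf(s, loc=cond_means, scale=np.sqrt(v)).mean()

# Approach 2: condition the prior predictive N(mu0, Sigma + Sigma0) directly.
C = Sigma + Sigma0
a2 = C[0, 1] / C[1, 1]
v2 = C[0, 0] - C[0, 1] ** 2 / C[1, 1]
approach2 = stats.norm.pdf(s, loc=mu0[0] + a2 * (r - mu0[1]), scale=np.sqrt(v2))

print(f"Approach 1: {approach1:.4f}")
print(f"Approach 2: {approach2:.4f}")
```

In this setup the two approaches generally disagree: Approach 1 effectively regresses $X_1$ on $X_2$ with slope $\Sigma_{12}/\Sigma_{22}$, while Approach 2 uses slope $(\Sigma+\Sigma_0)_{12}/(\Sigma+\Sigma_0)_{22}$.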