A thought experiment with the Bayesian posterior predictive distribution

Let $\pi(\theta)$ be a prior for the parameter $\Theta$, and let $p(x|\theta)$ be a likelihood that generates an exchangeable sequence of random variables $(X_0, X_1, X_2, \dots)$.

Given a set of observations $D := \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$, the posterior predictive distribution for the next random variable in the sequence $X_N$ is defined as $$p(X_{N}=s | D) = \int p(X_{N}=s | D,\theta) \pi(\theta|D)d\theta = \int p(X_{N}=s|\theta) \pi(\theta|D)d\theta, $$

where the second equality follows from assuming the data is exchangeable (i.e., i.i.d. conditioned on the latent parameter $\theta$). The posterior predictive density evaluated at $X_{N}=s$ is therefore an expectation under the posterior distribution $\pi(\theta|D)$. Define the probe function $g(s,\theta) := p(X_{N}=s|\theta)$ (note that $g$ does not depend on the index $N$, since the $X_i$ are identically distributed given $\theta$), and compute the expectation of the random variable $g(s,\Theta)$ under $\pi(\theta|D)$, $$ p(X_N=s | D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,\Theta) \right]. $$
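For concreteness, here is a minimal Monte Carlo sketch of this expectation. It assumes we already have draws of $\Theta$ from $\pi(\theta|D)$ (e.g., from an MCMC run) and a routine that evaluates the likelihood; `posterior_samples` and `likelihood_pdf` are hypothetical placeholder names, not any particular library's API.

```python
import numpy as np

def posterior_predictive(s, posterior_samples, likelihood_pdf):
    """Monte Carlo estimate of p(X_N = s | D) = E_{pi(.|D)}[ g(s, Theta) ],
    where g(s, theta) = p(X_N = s | theta) is evaluated at each posterior draw."""
    return np.mean([likelihood_pdf(s, theta) for theta in posterior_samples])
```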

Now consider the case where each random variable $X_i$ is a two-dimensional vector $X_i = (X_{[i,1]}, X_{[i,2]}).$ The data $D = \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$ is thus an exchangeable sequence of bivariate observations. (Assume for simplicity that marginalizing and conditioning the joint distribution $p(X_{[i,1]},X_{[i,2]}|\theta)$ are easy operations.) We again perform inference to obtain the posterior $\pi(\theta|D)$.

Suppose we wish to evaluate the conditional probability (density) of the event $\lbrace X_{[N,1]}=s \rbrace$ given $\lbrace X_{[N,2]}=r \rbrace$ under the posterior predictive. I am in two minds about what this quantity could mean:

Approach 1

Define the conditional probability density again as an expectation of a function of $\Theta$ under the posterior distribution. In particular, let the probe function $g(s,r,\theta) := p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ (recalling that $g$ does not depend on $N$ when $\theta$ is known) and then compute the expectation of $g(s,r,\Theta)$ under $\pi(\theta|D)$, $$ p_{\text{A}1}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,r,\Theta) \right]. $$
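Approach 1 translates into the same kind of Monte Carlo average as before, just with the conditional density as the probe function. This is only a sketch; `cond_pdf(s, r, theta)` is an assumed helper that evaluates $p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$.

```python
import numpy as np

def approach_1(s, r, posterior_samples, cond_pdf):
    """E_{pi(.|D)}[ g(s, r, Theta) ]: average the conditional density over draws
    from pi(theta | D), which is left unchanged by the condition X_[N,2] = r."""
    return np.mean([cond_pdf(s, r, theta) for theta in posterior_samples])
```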

Approach 2

Define the desired conditional probability density by application of the Bayes Rule. Namely, separately compute two quantities:

  1. joint: $ p(X_{[N,1]}=s,X_{[N,2]}=r|D) = \int p(X_{[N,1]}=s,X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $
  2. marginal: $ p(X_{[N,2]}=r|D) = \int p(X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $

and then return their ratio, $$ p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \frac{p(X_{[N,1]}=s,X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)}. $$
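A corresponding sketch of Approach 2 estimates the joint and marginal predictive densities separately and returns their ratio; `joint_pdf(s, r, theta)` and `marg_pdf(r, theta)` are assumed helpers for $p(X_{[N,1]}=s, X_{[N,2]}=r|\theta)$ and $p(X_{[N,2]}=r|\theta)$.

```python
import numpy as np

def approach_2(s, r, posterior_samples, joint_pdf, marg_pdf):
    """Ratio of two posterior expectations, p(s, r | D) / p(r | D),
    each estimated by averaging over the same draws from pi(theta | D)."""
    joint = np.mean([joint_pdf(s, r, theta) for theta in posterior_samples])
    marginal = np.mean([marg_pdf(r, theta) for theta in posterior_samples])
    return joint / marginal
```

Since $p(s,r|\theta) = p(s|r,\theta)\,p(r|\theta)$, dividing the two averages amounts to a weighted average of the conditional density with weights proportional to `marg_pdf(r, theta)`, which is one way to see the equivalence noted next.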

Note that Approach 2 is equivalent to appending the condition $\lbrace X_{[N,2]}=r \rbrace$ to the observation set $D$, so that $D' := D \cup \lbrace X_{[N,2]}=r \rbrace$ and the new posterior distribution is $\pi(\theta|D')$. It then computes the expectation of $g(s,r,\Theta)$ under $\pi(\cdot|D')$, $$ p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D')}\left[ g(s,r,\Theta) \right]. $$

Exercise: Show why the two expressions for $p_{\text{A}2}$ are equivalent.

Thoughts

The question is thus: does the Bayesian reasoner update their beliefs about $\theta$ based on the condition $\lbrace X_{[N,2]}=r \rbrace$? I think both approaches can make sense:

In Approach 1, we do not treat $\lbrace X_{[N,2]}=r \rbrace$ as a new element of the observation sequence $D$; instead we define the probe function $g(s,r,\theta)$ based on the conditional probability (which is a function of the population parameter), and then compute its expectation.

Approach 2 follows more directly from the “laws of probability” but is less interpretable within the Bayesian paradigm. Why? If $\lbrace\Theta = \theta\rbrace$ were known, then $p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ would just be a real number; since the Bayesian does not know $\theta$, they marginalize over it. But it is unclear why the probe function $g(s,r,\theta)$ should alter the posterior $\pi(\theta|D)$, regardless of whether it happens to represent a density parameterized by $\theta$.

Next Steps

Perhaps I should numerically/analytically compute the difference between Approach 1 and Approach 2 for a bivariate Gaussian with known covariance and unknown mean. For simplicity, just use the prior predictive, letting $D=\varnothing$.
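Here is a minimal numerical sketch of that experiment, not a definitive implementation. All concrete settings ($\Sigma$, $m_0$, $V_0$, and the query point $(s,r)$) are illustrative choices; the model is $X_i \mid \mu \sim \mathcal{N}(\mu, \Sigma)$ with known $\Sigma$ and prior $\mu \sim \mathcal{N}(m_0, V_0)$, so with $D=\varnothing$ the prior predictive is $\mathcal{N}(m_0, \Sigma + V_0)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])  # known observation covariance (illustrative)
m0 = np.zeros(2)                            # prior mean of mu (illustrative)
V0 = np.eye(2)                              # prior covariance of mu (illustrative)
s, r = 0.5, 1.5                             # query: density of X_[N,1] = s given X_[N,2] = r

# Conditional density p(X_[N,1] = s | X_[N,2] = r, mu) for a fixed mean mu.
a = Sigma[0, 1] / Sigma[1, 1]
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Approach 1: average the conditional density over draws of mu from the prior.
mus = rng.multivariate_normal(m0, V0, size=200_000)
locs = mus[:, 0] + a * (r - mus[:, 1])
p_a1 = np.mean(norm.pdf(s, loc=locs, scale=np.sqrt(cond_var)))

# Approach 2: condition the prior predictive N(m0, Sigma + V0) on X_[N,2] = r.
S = Sigma + V0
b = S[0, 1] / S[1, 1]
p_a2 = norm.pdf(s, loc=m0[0] + b * (r - m0[1]),
                scale=np.sqrt(S[0, 0] - S[0, 1] ** 2 / S[1, 1]))

print(p_a1, p_a2)  # the two densities differ in general
```

In this Gaussian case both quantities are themselves Gaussian in $s$ and available in closed form, so the Monte Carlo average is only for illustration: Approach 1 regresses $X_{[N,1]}$ on $X_{[N,2]}$ using $\Sigma$ alone (coefficient $\Sigma_{12}/\Sigma_{22}$), while Approach 2 uses the prior predictive covariance $\Sigma + V_0$ (coefficient $(\Sigma+V_0)_{12}/(\Sigma+V_0)_{22}$), so the two answers disagree even with no data.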

2 thoughts on “A thought experiment with the Bayesian posterior predictive distribution”

  1. Did you mean to define $g(s,r,\theta) := p(X_{[N,1]}{=}s|X_{[N,2]}{=}r,\theta)$, that is, the conditional probability rather than the joint probability? It seems to me that Approach 2 is the more "correct"/Bayesian one. Although Approach 1 evaluates the expectation of $g(s,r,\theta)$ with respect to $\pi(\theta|D)$, I think this does not give the desired conditional probability: in general you have to add the condition $\lbrace X_{[N,2]}{=}r \rbrace$ to all probabilities on the right hand side, including the posterior over $\theta$. Approach 1 might still be justified for computational reasons, if you have reason to believe that it's "close enough". In particular, I think it is exact when the posterior factorizes as $\pi(\theta_1|D) \pi(\theta_2|D)$, where $p(X_N = [s,r]|\theta) = p(X_{[N,1]}{=}s|X_{[N,2]}{=}r,\theta_1) p(X_{[N,2]}{=}r|\theta_2)$.

    • Thanks for correcting the definition of $g(s,r,\theta)$. I agree that “Approach 2” feels more Bayesian, and the computational approximation might be justified in the limit of infinite data. One attribute of “Approach 1” that I find appealing, though perhaps not Bayesian, is that a “post-learning” query about a hypothetical member $N$ does not cause one to re-evaluate their beliefs about the global state $\theta$ learned from the actual data $D$.

      From the perspective of Bayesian cognition, it seems rather reasonable to separate the ways we update beliefs into two categories: (i) actual experiences, captured by the dataset $D$ and its induced posterior $\pi(\theta\mid D)$, versus (ii) “what-if?” queries, captured by the hypothetical member $X_N$ and the posterior distribution over $X_{[N,1]}$ given $\lbrace X_{[N,2]} = r \rbrace$, where $\theta$ is held at its posterior from actual observations $D$.

      From the statistical perspective, no standard taxonomy exists that differentiates between cases (i) and (ii). Borrowing terminology from Tenenbaum, I think these types of questions are characteristic of the Bayesian theme of “learning as inference”. Directly applying the Bayes Rule makes it easy to conflate “structure learning” with “strength” or “parameter estimation”. The question in this post can then be reduced to (i) taxonomizing the types of observations a Bayesian reasoner can encounter, and (ii) deciding which parts of the hierarchy (structure/parameters/etc.) are updated in light of the different observation types.
