Let $\pi(\theta)$ be a prior for parameter $\Theta$, and $p(x|\theta)$ a likelihood which generates an exchangeable sequence of random variables $(X_1,X_2,X_3\dots)$.

Given a set of observations $D := \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$, the *posterior predictive* distribution for the next random variable in the sequence $X_N$ is defined as $$p(X_{N}=s | D) = \int p(X_{N}=s | D,\theta) \pi(\theta|D)d\theta = \int p(X_{N}=s|\theta) \pi(\theta|D)d\theta, $$

where the second equality follows from assuming the data is exchangeable (or i.i.d conditioned on latent parameter $\theta$). The posterior predictive density evaluated at ${X_{N}=s}$ is an expectation under the posterior distribution $\pi(\theta|D)$. Define the function $g(s,\theta) := p(X_{N}=s|\theta)$, (note that $g$ does not depend on the index $N$ of the random variable since $\theta$ is known), and then compute the expectation of the random variable $g(s,\Theta)$ under $\pi(\theta|D)$, $$ p(X_N=s | D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,\Theta) \right]. $$

Now consider the case where each random variable $X_i$ is a two-dimensional vector $X_i = (X_{[i,1]}, X_{[i,2]}).$ The data $D = \lbrace X_0=x_0, X_1=x_1, \dots, X_{N-1}=x_{N-1}\rbrace$ is thus an exchangeable sequence of bivariate observations. (Assume for simplicity that marginalizing and conditioning the joint distribution $p(X_{[i,1]},X_{[i,2]}|\theta)$ are easy operations.) We again perform inference to obtain the posterior $\pi(\theta|D)$.

Suppose we wish to evaluate the probability (density) of the event $\lbrace X_{[N,1]}=s \mid X_{[N,2]}=r \rbrace$ under the posterior predictive. I am in two minds about what this quantity could mean:

**Approach 1**

Define the conditional probability density again as an expectation of a function of $\Theta$ under the posterior distribution. In particular, let the probe function $g(s,r,\theta) := p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ (recalling that $g$ does not depend on $N$ when $\theta$ is known) and then compute the expectation of $g(s,r,\Theta)$ under $\pi(\theta|D)$, $$ p_{\text{A}1}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D)}\left[ g(s,r,\Theta) \right]. $$

**Approach 2**

Define the desired conditional probability density by application of the Bayes Rule. Namely, separately compute two quantities:

- joint: $ p(X_{[N,1]}=s,X_{[N,2]}=r|D) = \int p(X_{[N,1]}=s,X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $
- marginal: $ p(X_{[N,2]}=r|D) = \int p(X_{[N,2]}=r|\theta) \pi(\theta|D)d\theta $

and then return their ratio, $$ p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \frac{p(X_{[N,1]}=s,X_{[N,2]}=r|D)}{p(X_{[N,2]}=r|D)}. $$

Note that Approach 2 is equivalent to appending the condition $\lbrace X_{[n,2]}=r \rbrace$ to the observation set $D$ so that $D’ := D \cup \lbrace X_{[N,2]}=r \rbrace$ and the new posterior distribution is $\pi(\theta|D’)$. It then computes the expectation of $g(s,r,\Theta)$ under $\pi(\cdot|D’)$, $$ p_{\text{A}2}(X_{[N,1]}=s|X_{[N,2]}=r,D) = \mathbb{E}_{\pi(\cdot|D’)}\left[ g(s,r,\Theta) \right] $$

Exercise: Show why the two expressions for $p_{\text{A}2}$ are equivalent.

**Thoughts**

The question is thus, does the Bayesian reasoner update their beliefs about $\theta$ based on the condition ${X_{[N,2]}=r}$? I think both approaches can make sense:

In Approach 1, we do not treat $\lbrace X_{[N,2]}=r \rbrace$ as a new element of the observation sequence $D$; instead we define the probe function $g(s,r,\theta)$ based on the conditional probability (which is a function of the population parameter), and then compute its expectation.

Approach 2 follows more directly from the “laws of probability” but is less interpretable from the Bayesian paradigm. Why? Because if ${\Theta = \theta}$ is known, then $p(X_{[N,1]}=s|X_{[N,2]}=r,\theta)$ is just a real-number — since the Bayesian does not know $\theta$, they marginalize over it. But it is unclear why the probe function $g(s,r,\theta)$ should influence the distribution of $\pi(\theta|D)$, regardless of whether it happens to represent a density parameterized by $\theta$.

**Next Steps**

Perhaps I should numerically/analytically compute the difference between Approach 1 and Approach 2 for a bivariate Gaussian with known covariance and unknown mean. For simplicity, just use the prior predictive, letting $D=\varnothing$.