Local Exchangeability

Trevor Campbell, Saifuddin Syed, Chiao-Yu Yang, Michael I. Jordan, and Tamara Broderick

arXiv:1906.09507·math.ST·July 25, 2022

TL;DR

This paper introduces the concept of local exchangeability, a relaxation of exchangeability allowing bounded distributional changes under local data swaps, with implications for Bayesian nonparametrics and permutation tests.

Contribution

It formalizes local exchangeability, proves its connection to measure-valued processes, and demonstrates practical applications in Bayesian inference and hypothesis testing.

Findings

01. Local empirical measures approximate underlying processes.

02. Local exchangeability characterizes certain stochastic processes.

03. Applications include Bayesian nonparametrics and covariate-dependent tests.

Equations

$$X_{1},X_{2},\dots\overset{d}{=}X_{\pi(1)},X_{\pi(2)},\dots.$$

$$\mathbb{P}(X\in\cdot\mid G)\overset{a.s.}{=}G^{\infty},$$

$$\mathbb{P}\left(X\in\cdot\mid(G_{t})_{t\in\mathcal{T}}\right)\overset{a.s.}{=}\prod_{n=1}^{\infty}G_{t_{n}}.$$

$$\forall t\in T,\qquad(X_{T})_{t}:=X_{t}\qquad(X_{\pi,T})_{t}:=X_{\pi(t)}.$$

$$d_{\mathrm{TV}}(X_{T},X_{\pi,T})\leq\sum_{t\in T}d(t,\pi(t)).$$

$$d_{\mathrm{TV}}(Y,Z):=\sup_{A\in\Xi}\left|\mathbb{P}(Y\in A)-\mathbb{P}(Z\in A)\right|.$$

$$\theta\sim\mathcal{N}(0,1),\qquad\forall t\in\mathbb{R},\quad X_{t}\overset{\text{indep}}{\sim}\mathcal{N}(\theta t^{2},1).$$

$$d_{\mathrm{TV}}(X_{T},X_{\pi,T})$$

$$\mathbb{E}\left[d_{\mathrm{TV}}\left(\mathcal{N}(\theta t^{2},1),\mathcal{N}(\theta\pi(t)^{2},1)\right)\right]\leq\frac{\mathbb{E}|\theta|\,|t^{2}-\pi(t)^{2}|}{\sqrt{2\pi}}\leq\frac{|t^{2}-\pi(t)^{2}|}{\sqrt{2\pi}}.$$

$$\mathbb{P}(X_{T}\in\cdot\mid G)$$

$$d_{c}(t,t^{\prime})$$

$$G_{t}=\mathcal{N}(\theta t^{2},1),\qquad t\in\mathcal{T}.$$

$$\sup_{A}\mathbb{E}\left|G_{t}(A)-G_{t^{\prime}}(A)\right|$$

$$d_{\mathrm{P}}(\widehat{G}_{N},G)\overset{a.s.}{\to}0,\qquad N\to\infty,$$

$$M_{\tau}$$

$$\widehat{G}_{\tau}$$

$$\|\nu-\eta\|_{\mathcal{A}}=\sum_{i=1}^{\infty}c_{i}\left|\nu(A_{i})-\eta(A_{i})\right|,\qquad\nu,\eta\text{ probability measures},$$

$$\forall\tau\in\mathcal{T},\qquad\sup_{\mathcal{A}}\mathbb{E}\left[\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}^{2}\right]$$

$$\sup_{\mathcal{A}}\mathbb{P}\left(\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}>\delta+\sqrt{2\mu_{\tau}+1/M_{\tau}}\right)$$

$$\sup_{\mathcal{A}}\mathbb{E}\left[\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}^{2}\right]=O(|T|^{-1}),\qquad\sup_{\mathcal{A}}\mathbb{P}\left(\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}>\delta+|T|^{-1/2}\right)$$

$$d_{\mathrm{P}}(\widehat{G}_{\tau},G_{\tau})\overset{p}{\to}0\quad\text{and}\quad d_{\mathrm{P}}\left(\widehat{G}_{\tau},\mathbb{P}(X_{\tau}\in\cdot\mid X_{T_{n}})\right)\overset{p}{\to}0,\qquad n\to\infty.$$

$$\ell(t_{n},t_{n}^{\prime})\to 0\implies d(t_{n},t_{n}^{\prime})\to 0,\qquad n\to\infty.$$

$$p(x_{\tau},x_{T})$$

$$X_{\tau}\sim\mathcal{N}\left(\frac{\tau^{2}\sum_{t\in T}X_{t}t^{2}}{1+\sum_{t\in T}t^{4}},\ \frac{1+\tau^{4}+\sum_{t\in T}t^{4}}{1+\sum_{t\in T}t^{4}}\right),$$

$$X_{\tau}\sim\mathcal{N}(Y,1),\quad\text{where}\quad Y\sim\mathcal{N}(0,\tau^{4}).$$

$$G_{\tau_{1}}=\mathcal{N}(\theta\tau_{1}^{2},1)\qquad G_{\tau_{2}}=\mathcal{N}(\theta\tau_{2}^{2},1),\qquad\theta\sim\mathcal{N}(0,1).$$

$$\lim_{\eta\to 0}\sup_{t:d(t,t_{0})\leq\eta}\mathbb{P}\left(\left|G_{t}(A)-G_{t_{0}}(A)\right|>\epsilon\right)=0.$$

$$\mathbb{E}\left[h(X_{1},\dots,X_{N})\mid G,\widehat{G}\right]=\mathbb{E}\left[h(X_{1},\dots,X_{N})\mid\widehat{G}\right].$$

$$\mathbb{E}\left|\mathbb{E}\left[h(X_{T})\mid G,\widehat{G}\right]-\mathbb{E}\left[h(X_{T})\mid\widehat{G}\right]\right|\leq 4\|h\|_{\infty}\,\mathbb{E}\left[\sum_{t\in T}d(t,\pi(t))\right],$$

Code & Models

Repositories

trevorcampbell/localexch (Official)

Full text

Local Exchangeability

Trevor Campbell, Saifuddin Syed, Chiao-Yu Yang, Michael I. Jordan, and Tamara Broderick

Department of Statistics, University of British Columbia, Vancouver, Canada.

Department of Electrical Engineering and Computer Science, University of California Berkeley, Berkeley, USA.

Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, USA.

Abstract

Exchangeability—in which the distribution of an infinite sequence is invariant to reorderings of its elements—implies the existence of a simple conditional independence structure that may be leveraged in the design of statistical models and inference procedures. In this work, we study a relaxation of exchangeability in which this invariance need not hold precisely. We introduce the notion of local exchangeability—where swapping data associated with nearby covariates causes a bounded change in the distribution. We prove that locally exchangeable processes correspond to independent observations from an underlying measure-valued stochastic process. Using this main probabilistic result, we show that the local empirical measure of a finite collection of observations provides an approximation of the underlying measure-valued process and Bayesian posterior predictive distributions. The paper concludes with applications of the main theoretical results to a model from Bayesian nonparametrics and covariate-dependent permutation tests.

Keywords: exchangeability, local, representation, de Finetti, Bayesian nonparametrics

1 Introduction

Let $X=X_{1},X_{2},\dots$ be an infinite sequence of random elements in a standard Borel space $(\mathcal{X},\Sigma)$ . The sequence is said to be exchangeable if for any finite permutation $\pi$ of $\mathbb{N}$ ,

$$X_{1},X_{2},\dots\overset{d}{=}X_{\pi(1)},X_{\pi(2)},\dots.$$

At first sight this assumption appears innocent; intuitively, it suggests only that the order in which observations appear provides no information about those or future observations. But despite its apparent innocence, exchangeability has a powerful implication. In particular, de Finetti’s well-known theorem (e.g. Kallenberg, 2002, Theorem 11.10) states that an infinite sequence is exchangeable if and only if it is a mixture of i.i.d. sequences, i.e., there exists a unique random probability measure $G$ on $\mathcal{X}$ such that

$$\mathbb{P}(X\in\cdot\mid G)\overset{a.s.}{=}G^{\infty},\tag{3}$$

where $G^{\infty}$ is the countable infinite product measure constructed from $G$ . Thus, exchangeability provides a strong justification for the Bayesian approach to modeling (Jordan, 2010), and guarantees a latent conditional independence structure of $X$ useful in the design of computationally efficient inference algorithms. Exchangeability is also the basis of well-known nonparametric permutation testing procedures (Pitman, 1937a, b, c; Fisher, 1966, Ch. 3; Ernst, 2004; Lehmann and Romano, 2005, Ch. 15).
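As a small illustration of the mixture-of-i.i.d. structure, the following sketch computes the exact joint probability of a binary sequence whose entries are i.i.d. Bernoulli($p$) given $p$, with $p$ drawn uniformly on $[0,1]$. The uniform prior and the function names are illustrative assumptions, not part of the paper; here the random measure $G$ is simply Bernoulli($p$).

```python
from itertools import permutations, product
from math import factorial


def joint_pmf(xs):
    """P(X_1 = x_1, ..., X_n = x_n) when X_i | p are i.i.d. Bernoulli(p), p ~ Uniform(0, 1).

    Integrating p^k (1 - p)^(n - k) over [0, 1] gives k! (n - k)! / (n + 1)!.
    """
    n, k = len(xs), sum(xs)
    return factorial(k) * factorial(n - k) / factorial(n + 1)


def is_exchangeable(n, tol=1e-12):
    """Check that every reordering of every length-n binary sequence has the same probability."""
    for xs in product([0, 1], repeat=n):
        base = joint_pmf(xs)
        for perm in permutations(range(n)):
            if abs(joint_pmf([xs[i] for i in perm]) - base) > tol:
                return False
    return True


print(is_exchangeable(4))
# The entries are exchangeable but not independent:
print(joint_pmf((1, 1)), joint_pmf((1,)) ** 2)
```

The joint probability depends on the sequence only through its number of ones, which is why reordering leaves it unchanged, while the two printed values differ because the entries share the latent $p$.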

However, although exchangeability may be a useful idealization in modeling and analysis, many data come with covariates that preclude an honest belief in its validity. For example, given a corpus of documents tagged by publication date, one might reasonably expect the data to exhibit a time-dependence that is incompatible with exchangeability. Nevertheless, one might still expect the distribution not to change too much if we permuted documents published only one day apart; i.e., observations with similar covariates are intuitively “nearly exchangeable.” In this work, we investigate how to codify this intuition.

One option is to use a kind of partial exchangeability (de Finetti, 1938; Lauritzen, 1974; Diaconis and Freedman, 1978; Camerlenghi et al., 2019) in which the distribution is invariant to permutations within equivalence classes. Formally, we endow each observation $X_{n}$ with a covariate $t_{n}$ from a set $\mathcal{T}$ , and assert that the sequence distribution is invariant only to reordering observations with equivalent covariate values. Under this assumption as well as the availability of infinitely many observations at each covariate value, we have a similar representation of $X$ as a mixture of independent sequences given random probability measures $(G_{t})_{t\in\mathcal{T}}$ ,

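$$\mathbb{P}\left(X\in\cdot\mid(G_{t})_{t\in\mathcal{T}}\right)\overset{a.s.}{=}\prod_{n=1}^{\infty}G_{t_{n}}.\tag{4}$$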

The random probability measures $(G_{t})_{t\in\mathcal{T}}$ can have an arbitrary dependence on one another; partially exchangeable sequences encompass those that are exchangeable (where the covariate does not matter), decoupled (where subsequences for each different covariate value are mutually independent), and the full range of models in between. In particular, partial exchangeability does not enforce the desideratum that observations with nearby covariates should have a similar law, and is too weak to be useful for restricting the class of underlying mixing measures for the data.

In this work, we introduce a new notion of local exchangeability—lying between partial and exact exchangeability—in which swapping data associated with nearby covariates causes a bounded change in total variation distance. We begin by studying probabilistic properties of locally exchangeable processes in Sections 2.1 and 2.2. The main result from this section is in the spirit of de Finetti’s theorem: we prove that locally exchangeable processes correspond to independent observations from a unique underlying smooth measure-valued stochastic process. To the best of our knowledge, this representation theorem is the first to arise from an approximate probabilistic symmetry. Further, the existence of such an underlying process not only shows that de Finetti’s theorem is robust to perturbations away from exact exchangeability, justifying the Bayesian analysis of real data, but also imposes a useful constraint on the space of models one should consider when dealing with data that one suspects follows a locally exchangeable random process. Next in Section 2.3, we use this result to show that the local empirical measure of a finite collection of observations can be used to provide an approximation of the underlying measure-valued process, Bayesian predictive posterior distributions, and the premetric that governs local exchangeability. These results rely heavily on the intuition that locally exchangeable observations from nearby covariates behave essentially like exchangeable observations. Finally, in Section 3, we provide example applications in two statistical models exhibiting local exchangeability—Gaussian processes (Rasmussen and Williams, 2006) and dependent Dirichlet processes (MacEachern, 1999, 2000)—as well as grouped permutation tests in the presence of covariates. The paper concludes with a discussion of directions for future work. Proofs of all results are provided in the appendix.

1.1 Related work

Beyond de Finetti’s original result for infinite binary sequences (de Finetti, 1931) and its extensions to more general range spaces (de Finetti, 1937; Hewitt and Savage, 1955) and finite sequences (Diaconis, 1977; Diaconis and Freedman, 1980a)—see Aldous (1985) for an in-depth introduction—correspondences between probabilistic invariances and conditional latent structure (known as representation theorems) have been studied extensively. Notions of exchangeability and corresponding latent conditional structure now exist for a wide variety of probabilistic models, such as arrays (Aldous, 1981; Hoover, 1979; Austin and Panchenko, 2014; Jung et al., 2021), Markov processes (Diaconis and Freedman, 1980b), networks (Caron and Fox, 2017; Veitch and Roy, 2015; Borgs et al., 2018; Crane and Dempsey, 2016; Cai, Campbell and Broderick, 2016; Janson, 2017), combinatorial structures (Kingman, 1978; Pitman, 1995; Broderick, Pitman and Jordan, 2013; Campbell, Cai and Broderick, 2018; Crane and Dempsey, 2019), random measures (Kallenberg, 1990), and more (Diaconis, 1988; Kallenberg, 2005; Orbanz and Roy, 2015). Furthermore, weaker notions of exchangeability such as conditionally identical distributions (Berti, Pratelli and Rigo, 2004; Kallenberg, 1988) have been developed. All past work on probabilistic invariance and its consequences has pertained to exact invariance.

2 Local exchangeability

2.1 Definition

Let $X=(X_{t})_{t\in\mathcal{T}}$ be a stochastic process on an index (or covariate) set $\mathcal{T}$ taking values in a standard Borel space $(\mathcal{X},\Sigma)$ . To encode distance between covariates, we endow the set $\mathcal{T}$ with a premetric $d:\mathcal{T}\times\mathcal{T}\to[0,1]$ satisfying $d(t,t^{\prime})=d(t^{\prime},t)$ and $d(t,t)=0$ for $t,t^{\prime}\in\mathcal{T}$ . We will formalize local exchangeability based on the finite dimensional projections of $X$ . For any subset $T\subset\mathcal{T}$ and injection $\pi:T\to\mathcal{T}$ , let $X_{T}$ and $X_{\pi,T}$ denote stochastic processes on index set $T$ such that

$$\forall t\in T,\qquad(X_{T})_{t}:=X_{t}\qquad(X_{\pi,T})_{t}:=X_{\pi(t)}.$$

In other words, $X_{T}$ is the restriction of $X$ to index set $T$ , while $X_{\pi,T}$ is the restriction to $T$ under the mapping $\pi$ . Definition 1 captures the notion that observations with similar covariates should be close to exchangeable, i.e., the total variation between $X_{T}$ and $X_{\pi,T}$ is small as long as the distances between $t$ and $\pi(t)$ are small for all $t\in T$ .

Definition 1.

The process $X$ is locally exchangeable with respect to a premetric $d$ if for any finite subset $T\subset\mathcal{T}$ and injection $\pi:T\to\mathcal{T}$ ,

$$d_{\mathrm{TV}}(X_{T},X_{\pi,T})\leq\sum_{t\in T}d(t,\pi(t)).\tag{6}$$

Definition 1 generalizes both exchangeability and partial exchangeability among equivalence classes. In particular, the zero premetric where $d(t,t^{\prime})=0$ identically yields classical exchangeability, while the premetric $d(t,t^{\prime})=1-\mathds{1}[t\sim t^{\prime}]$ for equivalence relation $\sim$ yields partial exchangeability. Further, any process is locally exchangeable with respect to the discrete premetric $d(t,t^{\prime})=1-\mathds{1}[t=t^{\prime}]$ ; in order to say something of value about a process $X$ , it must satisfy Eq. 6 for a tighter premetric.
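The three special premetrics mentioned above, and the right-hand side of the bound in Definition 1, can be sketched in a few lines. The function names and the parity-based equivalence relation are hypothetical choices made only for illustration.

```python
def zero_premetric(t, s):
    """d(t, s) = 0 for all pairs: recovers classical exchangeability."""
    return 0.0


def discrete_premetric(t, s):
    """d(t, s) = 1 - 1[t == s]: satisfied by every process, hence uninformative."""
    return 0.0 if t == s else 1.0


def class_premetric(label):
    """d(t, s) = 1 - 1[t ~ s], where t ~ s iff label(t) == label(s)."""
    return lambda t, s: 0.0 if label(t) == label(s) else 1.0


def swap_bound(d, pi):
    """Sum of d(t, pi(t)) over t in T, with the injection pi given as a dict on T."""
    if len(set(pi.values())) != len(pi):
        raise ValueError("pi must be an injection")
    return sum(d(t, s) for t, s in pi.items())


parity = class_premetric(lambda t: t % 2)
within = {0: 2, 2: 0, 1: 3, 3: 1}  # swaps inside the even and odd classes
across = {0: 1, 1: 0, 2: 2, 3: 3}  # swaps one even covariate with one odd covariate

print(swap_bound(parity, within), swap_bound(parity, across))
```

Swaps within an equivalence class cost nothing, while the cross-class swap gives a bound of at least 1, which is vacuous because total variation never exceeds 1.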

To quantify differences in distributions, Definition 1 employs the total variation distance, which for random elements $Y,Z$ in a measurable space $(\mathcal{Y},\Xi)$ is defined as

$$d_{\mathrm{TV}}(Y,Z):=\sup_{A\in\Xi}\left|\mathbb{P}(Y\in A)-\mathbb{P}(Z\in A)\right|.$$

The choice of total variation distance (as opposed to other metrics and divergences; see, e.g., Gibbs and Su, 2002) is motivated by its symmetry and generality. We make $d$ a premetric, as opposed to a (pseudo)metric, say, because the triangle inequality and positive definiteness are unused in the theory below. Further, we use a premetric with range $[0,1]$ because total variation always lies in this range, and so any valid bound in Eq. 6 for a premetric $d:\mathcal{T}\times\mathcal{T}\to\mathbb{R}_{+}$ can be improved by replacing $d$ with $\min(d,1)$. And although Definition 1 imposes a total variation bound only for all finite sets of covariates, it is equivalent to do so for all countable sets of covariates, as shown in Proposition 2.
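For distributions on a finite set, the supremum over events in the definition of total variation is attained and equals half the $\ell_1$ distance between the probability mass functions. The sketch below checks this by brute force on a hypothetical three-point example; it is meant only to make the definition concrete.

```python
from itertools import combinations


def tv_distance(p, q):
    """Total variation between two pmfs (dicts) on a finite set: half the L1 distance."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def tv_by_sup(p, q):
    """Brute-force supremum over all events A of |P(A) - Q(A)|."""
    support = sorted(set(p) | set(q))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            gap = abs(sum(p.get(x, 0.0) for x in event) - sum(q.get(x, 0.0) for x in event))
            best = max(best, gap)
    return best


p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(tv_distance(p, q), tv_by_sup(p, q))
```

Both computations agree and the value lies in $[0,1]$, which is why restricting the premetric to that range loses nothing.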

Proposition 2.

If $X$ is locally exchangeable with respect to $d$ , then for any countable subset $T\subset\mathcal{T}$ and injection $\pi:T\to\mathcal{T}$ ,

$$d_{\mathrm{TV}}(X_{T},X_{\pi,T})\leq\sum_{t\in T}d(t,\pi(t)).$$

Example 3.

A simple example of local exchangeability that we will return to throughout the paper is the process of observable measurements $X$ from a Bayesian linear regression model on $\mathcal{T}=\mathbb{R}$ with a quadratic trend,

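$$\theta\sim\mathcal{N}(0,1),\qquad\forall t\in\mathbb{R},\quad X_{t}\overset{\text{indep}}{\sim}\mathcal{N}(\theta t^{2},1).\tag{9}$$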

By Lemma 14, since the $X_{t}$ are independent conditioned on $\theta$ ,

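$$d_{\mathrm{TV}}(X_{T},X_{\pi,T})\leq\sum_{t\in T}\mathbb{E}\left[d_{\mathrm{TV}}\left(\mathcal{N}(\theta t^{2},1),\mathcal{N}(\theta\pi(t)^{2},1)\right)\right].$$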

We bound the terms in the sum using the Lipschitz continuity of the standard normal CDF $\Phi$ ,

$$\mathbb{E}\left[d_{\mathrm{TV}}\left(\mathcal{N}(\theta t^{2},1),\mathcal{N}(\theta\pi(t)^{2},1)\right)\right]\leq\frac{\mathbb{E}|\theta|\,|t^{2}-\pi(t)^{2}|}{\sqrt{2\pi}}\leq\frac{|t^{2}-\pi(t)^{2}|}{\sqrt{2\pi}}.$$

Therefore the process $X$ in the Bayesian linear regression model Eq. 9 is locally exchangeable with respect to the premetric $d(t,t^{\prime})=\min(|t^{2}-t^{\prime 2}|/\sqrt{2\pi},1)$. Note that we are free to take $\min(\cdot,1)$ because the total variation is bounded above by 1. This example illustrates why we opt for the generality of a premetric; here, observations at points $t$ and $-t$ are exactly exchangeable since $d(t,-t)=0$, which does not generally hold for a metric, and $|t^{2}-t^{\prime 2}|$ does not satisfy the triangle inequality. Also note that the marginal distribution of $X_{T}$ is a multivariate Gaussian with off-diagonal covariance terms $\mathbb{E}\left[X_{t}X_{t^{\prime}}\right]\propto t^{2}t^{\prime 2}$, which varies with $t,t^{\prime}$; multivariate Gaussians with exchangeable components must have constant off-diagonal covariance terms. Therefore this example also shows that there exist processes that are locally exchangeable but not exchangeable.
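The per-term bound in this example can be checked numerically. The sketch below relies on the standard closed form for the total variation between two unit-variance normals, $2\Phi(|a-b|/2)-1=\operatorname{erf}(|a-b|/(2\sqrt{2}))$, and estimates the expectation over $\theta$ by Monte Carlo. The function names, sample size, and seed are illustrative assumptions.

```python
import math
import random


def tv_unit_normals(a, b):
    """Exact total variation between N(a, 1) and N(b, 1): 2 * Phi(|a - b| / 2) - 1."""
    return math.erf(abs(a - b) / (2.0 * math.sqrt(2.0)))


def premetric(t, s):
    """d(t, s) = min(|t^2 - s^2| / sqrt(2 * pi), 1) from the regression example."""
    return min(abs(t * t - s * s) / math.sqrt(2.0 * math.pi), 1.0)


def expected_tv(t, s, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[d_TV(N(theta t^2, 1), N(theta s^2, 1))], theta ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 1.0)
        total += tv_unit_normals(theta * t * t, theta * s * s)
    return total / n_draws


for t, s in [(0.0, 1.0), (1.0, 1.1), (2.0, -2.0), (0.5, 3.0)]:
    print(t, s, expected_tv(t, s), premetric(t, s))
```

For each pair the estimated expectation stays below the premetric, and the pair $(2,-2)$ gives exactly zero on both sides, matching the remark that observations at $t$ and $-t$ are exactly exchangeable.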

2.2 de Finetti representation

In the previous example, we used the fact that the variables $X_{t}$ were conditionally independent given a latent random variable $\theta$ to demonstrate their local exchangeability. A natural question to ask is whether all locally exchangeable processes exhibit a similar structure. Theorem 5 answers this question in the affirmative, by providing a de Finetti-like representation of locally exchangeable processes similar to Eq. 3 and Eq. 4. This representation guarantees the existence of a simple conditional structure that can be leveraged in the design of statistical inference procedures, and justifies a Bayesian approach when dealing with covariate-dependent data. We first require a weak assumption on the space $\mathcal{T}$ .

Definition 4 (Infinitely-separable space).

A premetric space $(d,\mathcal{T})$ is infinitely separable if there exists a countable subset $\mathfrak{T}\subseteq\mathcal{T}$ such that for all $t\in\mathcal{T}$ , there exists a Cauchy sequence $(t_{n})_{n\in\mathbb{N}}$ in $\mathfrak{T}$ such that $t_{n}\to t$ and $|\{t_{n}:n\in\mathbb{N}\}|=\infty$ .

When $d$ is a metric, infinite separability is equivalent to $\mathcal{T}$ being separable with no isolated points. When $d$ is a pseudometric, it is equivalent to the existence of a countable dense subset $\mathfrak{T}\subseteq\mathcal{T}$ such that for all $t\in\mathcal{T}$ and $\epsilon>0$ , $|\{t^{\prime}\in\mathfrak{T}:d(t,t^{\prime})<\epsilon\}|=\infty$ . In general, infinite separability ensures that there are infinitely many elements to swap “nearby” each covariate value of interest $t\in\mathcal{T}$ . This assumption precludes the situation where observations satisfy finite exchangeability (Diaconis, 1977; Diaconis and Freedman, 1980a) but not infinite exchangeability.

Theorem 5 shows that under infinite separability, the desired de Finetti-like representation does indeed exist. In particular, we show that there is a unique probability measure-valued process $G$ that renders $X$ conditionally independent, and that $G$ satisfies a continuity property with the same “smoothness” as the observed process. For the precise statement of the result in Theorem 5, recall that a modification of a stochastic process $G$ on $\mathcal{T}$ is any other process $G^{\prime}$ on $\mathcal{T}$ such that $\forall t\in\mathcal{T},\,\mathbb{P}\left(G_{t}=G^{\prime}_{t}\right)=1$.

Theorem 5.

Suppose $(d,\mathcal{T})$ is infinitely separable. Then the process $X$ is locally exchangeable with respect to $d$ if and only if there exists a random measure-valued stochastic process $G=(G_{t})_{t\in\mathcal{T}}$ (unique up to modification) such that for any finite subset of covariates $T\subset\mathcal{T}$ and $t,t^{\prime}\in\mathcal{T}$ ,

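$$\mathbb{P}(X_{T}\in\cdot\mid G)\overset{a.s.}{=}\prod_{t\in T}G_{t}\quad\text{and}\quad\sup_{A}\mathbb{E}\left|G_{t}(A)-G_{t^{\prime}}(A)\right|\leq d(t,t^{\prime}).\tag{13}$$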

For example, given $\mathcal{T}=\mathbb{N}$ and the zero premetric $d(t,t^{\prime})=0$ , one recovers the de Finetti representation of exchangeable sequences; the smoothness condition asserts that $G_{t}$ must be constant for all $t\in\mathcal{T}$ as expected. Similarly, suppose we are given an equivalence relation $\sim$ on $\mathbb{N}$ where each equivalence class has infinite cardinality. Then setting $\mathcal{T}=\mathbb{N}$ and $d(t,t^{\prime})=1-\mathds{1}[t\sim t^{\prime}]$ recovers the de Finetti representation of partially exchangeable sequences under permutation within equivalence classes; here the smoothness condition asserts that $G_{t}$ must be constant within each equivalence class, but allows for general dependence between $G_{t}$ across the equivalence classes. Thus, in the same way that Definition 1 generalizes (partial) exchangeability, Theorem 5 generalizes the de Finetti representation theorem.

Note that we still obtain the “if” direction of Theorem 5 without imposing the infinite separability assumption on $(d,\mathcal{T})$ . In particular, if we are given a process $G$ satisfying Eq. 13, then the process $X$ is locally exchangeable with respect to both

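[Equation truncated in extraction; it defines the premetrics $d_{c}(t,t^{\prime})$ and $d_{sc}(t,t^{\prime})$.]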

We refer to $d_{c}$ as the canonical premetric and $d_{sc}$ as the strong canonical premetric. Note that $X$ is locally exchangeable with respect to any premetric $d$ satisfying $d\geq d_{c}$, and in particular, $d_{sc}\geq d_{c}$. Given a particular $G$, one can use Lemma 14 to derive an upper bound on these two premetrics (as demonstrated in Example 3), which then provides insight into the extent to which data $X$ generated from $G$ are exchangeable. Note that $(d_{c},\mathcal{T})$ and $(d_{sc},\mathcal{T})$ may or may not be infinitely separable, depending on the characteristics of the process $G$.

Example (continued).

In the linear regression example, the underlying measure-valued process is the collection of normal distributions

$$G_{t}=\mathcal{N}(\theta t^{2},1),\qquad t\in\mathcal{T}.$$

Theorem 5 guarantees that this process is unique up to modification. In this case, the randomness in $G$ is entirely due to the latent variable $\theta\sim\mathcal{N}(0,1)$ ; in general $G$ need not be determined by a finite-dimensional quantity. We can also verify that $G$ satisfies the required smoothness condition with respect to $d$ , although it is not surprising in this case given that we originally derived the premetric using the same technique:

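$$\sup_{A}\mathbb{E}\left|G_{t}(A)-G_{t^{\prime}}(A)\right|\leq\mathbb{E}\left[d_{\mathrm{TV}}\left(\mathcal{N}(\theta t^{2},1),\mathcal{N}(\theta t^{\prime 2},1)\right)\right]\leq\min\left(\frac{|t^{2}-t^{\prime 2}|}{\sqrt{2\pi}},1\right)=d(t,t^{\prime}).$$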

2.3 Local empirical measure process

The de Finetti result in Theorem 5 guarantees the existence of a unique underlying measure-valued process $G$ , but does not provide any direct insight into the distribution of $G$ or whether it is identifiable given only (countably many) measurements of the process $X$ . In the classical setting of an exchangeable sequence $X_{1},X_{2},\dots$ , the empirical measure $\widehat{G}_{N}=\frac{1}{N}\sum_{n=1}^{N}\delta_{X_{n}}$ of a finite collection of observations $(X_{n})_{n=1}^{N}$ serves this purpose, as it converges weakly to $G$ almost surely (Varadarajan, 1958), i.e.,

$$d_{\mathrm{P}}(\widehat{G}_{N},G)\overset{a.s.}{\to}0,\qquad N\to\infty,\tag{17}$$

where $d_{\mathrm{P}}$ denotes the Lévy-Prokhorov metric. In the setting of local exchangeability more generally, however, the usual empirical measure does not provide a result similar to Eq. 17. If we are interested in understanding the distribution of $G_{\tau}$ for some $\tau\in\mathcal{T}$ , and we collect measurements $(X_{t})_{t\in T}$ of $X$ at a finite set of covariates $T\subset\mathcal{T}$ , the presence of far-away covariates in $T$ from $\tau$ can result in a non-vanishing bias in the empirical measure. To address this issue, for each $\tau\in\mathcal{T}$ , let $t_{i}(\tau)$ , $i=1,\dots,|T|$ be an ordering of the set $T$ such that the values $d_{i}(\tau)=d(t_{i}(\tau),\tau)$ are ordered from smallest to largest. Then define

[Equation truncated in extraction; it defines the quantities $M_{\tau}$ and $\mu_{\tau}$ used below.]

We construct the local empirical measure process $(\widehat{G}_{\tau})_{\tau\in\mathcal{T}}$ via

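$$\widehat{G}_{\tau}:=\sum_{t\in T}\xi_{t}(\tau)\,\delta_{X_{t}},\qquad\xi_{t}(\tau):=\max\left\{0,\frac{1}{M_{\tau}}+2\left(\mu_{\tau}-d(t,\tau)\right)\right\}.$$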

The local empirical measure process $\widehat{G}$ serves as an approximation of the measure-valued process $G$ underlying the locally exchangeable process $X$. Note that $\sum_{t\in T}\max\{0,\frac{1}{M_{\tau}}+2(\mu_{\tau}-d(t,\tau))\}=1$, so $\widehat{G}_{\tau}$ is a probability measure for each $\tau\in\mathcal{T}$. Further note that $(\widehat{G}_{\tau})_{\tau\in\mathcal{T}}$ is measurable with respect to $(X_{t})_{t\in T}$. Intuitively, $\widehat{G}$ includes only those observations at covariates sufficiently close to the point of interest $\tau\in\mathcal{T}$ such that the decrease in variance associated with adding another observation outweighs the potential increase in bias. The value $M_{\tau}$ represents how many observations are included in the local empirical measure at that location, and $\mu_{\tau}$ represents the average distance of their covariates to $\tau$.
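A minimal sketch of the weights $\max\{0,\frac{1}{M_{\tau}}+2(\mu_{\tau}-d(t,\tau))\}$ follows. The exact definitions of $M_{\tau}$ and $\mu_{\tau}$ are not reproduced in this text, so the code makes an explicit assumption: $M_{\tau}$ is taken to be the largest $M$ for which the $M$-th nearest observation still receives a positive weight when $\mu$ is the mean of the $M$ smallest distances. This choice is consistent with the normalization and interpretation stated above, but it should be read as a hypothetical reconstruction rather than the paper's definition. All function names are illustrative.

```python
def local_weights(distances):
    """Weights max(0, 1/M + 2 * (mu - d_t)) under an ASSUMED rule for M and mu.

    Assumption (not taken from the paper): M is the largest m such that the m-th
    smallest distance d_(m) satisfies 1/m + 2 * (mean of m smallest - d_(m)) > 0.
    """
    if not distances:
        raise ValueError("need at least one observation")
    d_sorted = sorted(distances)
    m_tau, mu_tau, running = 1, d_sorted[0], 0.0
    for m, d_m in enumerate(d_sorted, start=1):
        running += d_m
        mu = running / m
        if 1.0 / m + 2.0 * (mu - d_m) <= 0.0:
            break  # the m-th nearest observation would get a non-positive weight
        m_tau, mu_tau = m, mu
    weights = [max(0.0, 1.0 / m_tau + 2.0 * (mu_tau - d)) for d in distances]
    return weights, m_tau, mu_tau


def local_empirical_measure(event, observations, weights):
    """Mass that the weighted empirical measure assigns to `event` (a predicate on x)."""
    return sum(w for x, w in zip(observations, weights) if event(x))


weights, m_tau, mu_tau = local_weights([0.1, 0.2, 0.6])
print(weights, m_tau, mu_tau, sum(weights))
print(local_empirical_measure(lambda x: x > 0, [1.0, -1.0, 5.0], weights))
```

Under this rule the far-away observation at distance 0.6 is excluded, the remaining weights sum to one, and closer observations receive larger weight; when all distances are zero the weights reduce to the uniform $1/|T|$ of the usual empirical measure.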

Our goal now is to provide a weak convergence result for the local empirical measure process $\widehat{G}$ in the limit of many observations, similar to that of Eq. 17. As a key step towards that goal, Theorem 6 provides bounds on both the expected squared estimation error (Eq. 21) as well as error tail probabilities (Eq. 22) when using the local empirical measure process $\widehat{G}_{\tau}$ in place of $G_{\tau}$ or $\mathbb{P}\left(X_{\tau}\in\cdot\,|\,X_{T}\right)$ , for all $\tau\in\mathcal{T}$ . Each bound in Theorem 6 has two terms: the first is related to the variance incurred by estimation via independent sampling, and the second is related to the bias incurred by using observations from $t\neq\tau$ . Note that Theorem 6 quantifies the approximation error using the metric

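$$\|\nu-\eta\|_{\mathcal{A}}=\sum_{i=1}^{\infty}c_{i}\left|\nu(A_{i})-\eta(A_{i})\right|,\qquad\nu,\eta\text{ probability measures},$$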

where $\mathcal{A}=\left\{c_{i},A_{i}\right\}_{i=1}^{\infty}$ , $A_{i}$ are measurable subsets of $\mathcal{X}$ , $c_{i}\geq 0$ , and $\sum_{i}c_{i}=1$ . We work with $\|\cdot\|_{\mathcal{A}}$ rather than standard metrics because it simplifies the analysis substantially. Although the properties of $\|\cdot\|_{\mathcal{A}}$ depend on the choice of $\mathcal{A}$ in general, there exists a choice such that $\|\cdot\|_{\mathcal{A}}\to 0$ implies weak convergence (see Lemma 16 in the appendix), and the bounds below in Theorem 6 are valid for any choice of $\mathcal{A}$ , as indicated by the supremum. We will use the metric $\|\cdot\|_{\mathcal{A}}$ and the results in Theorem 6 as a stepping stone to obtain weak convergence in Corollary 7 below.

Theorem 6.

Let $(d,\mathcal{T})$ be infinitely separable and $X$ be locally exchangeable with respect to $d$ . Then

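[Eq. 21, truncated in extraction; the surviving left-hand side is $\forall\tau\in\mathcal{T},\ \sup_{\mathcal{A}}\mathbb{E}\left[\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}^{2}\right]$.]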

and for all $\delta>0$ , $\tau\in\mathcal{T}$ ,

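[Eq. 22, truncated in extraction; the surviving left-hand side is $\sup_{\mathcal{A}}\mathbb{P}\left(\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}>\delta+\sqrt{2\mu_{\tau}+1/M_{\tau}}\right)$.]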

Furthermore, the same bounds in Eqs. 21 and 22 apply when $G_{\tau}$ is replaced with $\mathbb{P}\left(X_{\tau}\in\cdot\,|\,X_{T}\right)$ .

When all of the covariates in the observed set $T$ are close to $\tau$ , the bounds in Theorem 6 provide essentially the same guarantees as one would expect for exchangeable random variables. In particular, suppose for all $t\in T$ , $d(t,\tau)\lesssim\exp(-|T|)$ , and so $\xi_{t}(\tau)\approx 1/|T|$ . In this situation the bounds above reduce to

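$$\sup_{\mathcal{A}}\mathbb{E}\left[\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}^{2}\right]=O(|T|^{-1}),$$

[The accompanying bound on $\sup_{\mathcal{A}}\mathbb{P}\left(\|\widehat{G}_{\tau}-G_{\tau}\|_{\mathcal{A}}>\delta+|T|^{-1/2}\right)$ was truncated in extraction.]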

Corollary 7 uses the results in Theorem 6 to obtain a weak convergence result for $\widehat{G}_{\tau}$ similar to Eq. 17. In particular, if we collect measurements of $X$ from a sequence of sets that concentrate around $\tau$ —for example, $T_{n}=\{t_{i}\}_{i=1}^{n}$ such that there exists a subsequence $t_{i_{k}}\to\tau$ —then the local empirical measure $\widehat{G}_{\tau}$ converges weakly to both $G_{\tau}$ and the Bayesian posterior predictive distribution in probability. Recall that $d_{\mathrm{P}}$ denotes the Lévy-Prokhorov metric.

Corollary 7.

Fix $\tau\in\mathcal{T}$ . Suppose we make observations at a sequence of finite sets $T_{n}\subset\mathcal{T}$ , $n\in\mathbb{N}$ of covariates such that for all $\epsilon>0$ , $\left|\{t\in T_{n}:d(t,\tau)\leq\epsilon\}\right|\to\infty$ . Then

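$$d_{\mathrm{P}}(\widehat{G}_{\tau},G_{\tau})\overset{p}{\to}0\quad\text{and}\quad d_{\mathrm{P}}\left(\widehat{G}_{\tau},\mathbb{P}(X_{\tau}\in\cdot\mid X_{T_{n}})\right)\overset{p}{\to}0,\qquad n\to\infty.$$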

A byproduct of Corollary 7 is that one can characterize the distribution of $G_{\tau}$ by analyzing the distribution of $X_{\tau}$ conditioned on $X_{T_{n}}$ for a sequence of sets of covariates $T_{n}$ that concentrate around $\tau$ , i.e., $|T_{n}|\to\infty$ and $\max\{d(t,\tau):t\in T_{n}\}\to 0$ as $n\to\infty$ . Note that it is not required to know the premetric $d$ governing local exchangeability in order to identify $G$ using this technique; one can instead construct the set of covariates $T_{n}$ such that $\max\{\ell(t,\tau):t\in T_{n}\}\to 0$ for any premetric $\ell:\mathcal{T}\times\mathcal{T}\to[0,1]$ that dominates $d$ in the sense that for any two sequences of covariates $t_{n},t^{\prime}_{n}$ , $n\in\mathbb{N}$ ,

$$\ell(t_{n},t_{n}^{\prime})\to 0\implies d(t_{n},t_{n}^{\prime})\to 0,\qquad n\to\infty.\tag{25}$$

The requirement in Eq. 25 is typically not stringent; it states only that when covariates get close under $\ell$ , they must also get close under $d$ , with no other stipulation about relative rates, bounds, etc. In the following linear regression example, we will use the usual metric $\ell(t,t^{\prime})=|t-t^{\prime}|$ on $\mathbb{R}$ .

Example (continued).

We return to the linear regression example to show how the distribution of $G_{\tau}$ can be recovered from the process $X$ via Corollary 7. The joint density of $X_{T},X_{\tau}$ is

[Equation truncated in extraction; only the left-hand side $p(x_{\tau},x_{T})$ survives.]

Therefore the conditional distribution of $X_{\tau}$ given $X_{T}$ is given by

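$$X_{\tau}\sim\mathcal{N}\left(\frac{\tau^{2}\sum_{t\in T}X_{t}t^{2}}{1+\sum_{t\in T}t^{4}},\ \frac{1+\tau^{4}+\sum_{t\in T}t^{4}}{1+\sum_{t\in T}t^{4}}\right).$$

If we then consider a sequence of sets $T_{n}$ of covariates that grows in size and concentrates quickly around $\tau$ (e.g., $T_{n}=\{\tau+i\exp(-n):i=1,\dots,n\}$), we find that the conditional distribution of $X_{\tau}$ given $X_{T_{n}}$ converges to

$$X_{\tau}\sim\mathcal{N}(Y,1),\quad\text{where}\quad Y\sim\mathcal{N}(0,\tau^{4}).$$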

By setting $\theta=Y\tau^{-2}$ , we recover the fact that $X_{\tau}$ is generated from $G_{\tau}=\mathcal{N}(\theta\tau^{2},1)$ , $\theta\sim\mathcal{N}(0,1)$ , i.e., the marginal of the original Bayesian linear regression model. Note that one can repeat essentially the same analysis for multiple covariates $\tau_{1},\dots,\tau_{K}$ to recover finite marginal distributions. For example, if we consider the bivariate distribution of $G_{\tau_{1}},G_{\tau_{2}}$ , we find that $X_{\tau_{1}},X_{\tau_{2}}$ are generated independently from

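$$G_{\tau_{1}}=\mathcal{N}(\theta\tau_{1}^{2},1)\qquad G_{\tau_{2}}=\mathcal{N}(\theta\tau_{2}^{2},1),\qquad\theta\sim\mathcal{N}(0,1).$$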

The analysis from the example in Section 2.1 can then be used to bound the strong canonical premetric $d_{sc}(t,t^{\prime})=d_{\mathrm{TV}}(G_{t},G_{t^{\prime}})\leq\min\left(|t^{2}-t^{\prime 2}|/\sqrt{2\pi},1\right)$. Thus, given only the process $X$, we have identified a premetric $d$ under which $X$ is locally exchangeable as well as the measure-valued process $G$.
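The limiting behavior in this example can be explored numerically. The sketch below evaluates the Gaussian conditional of $X_{\tau}$ given $X_{T}$, with mean $\tau^{2}\sum_{t}X_{t}t^{2}/(1+\sum_{t}t^{4})$ and variance $(1+\tau^{4}+\sum_{t}t^{4})/(1+\sum_{t}t^{4})$, on the design $T_{n}=\{\tau+i\exp(-n)\}$. The noise-free observations and the particular values of $\theta$ and $\tau$ are assumptions made purely so the trend is easy to see; the function names are hypothetical.

```python
import math


def predictive_params(tau, covariates, observations):
    """Mean and variance of X_tau given X_T in the quadratic-trend regression example."""
    s4 = sum(t ** 4 for t in covariates)
    sxt = sum(x * t ** 2 for t, x in zip(covariates, observations))
    mean = tau ** 2 * sxt / (1.0 + s4)
    var = (1.0 + tau ** 4 + s4) / (1.0 + s4)
    return mean, var


def concentrating_design(tau, n):
    """T_n = {tau + i * exp(-n) : i = 1, ..., n}, the design used in the example."""
    return [tau + i * math.exp(-n) for i in range(1, n + 1)]


theta, tau = 0.7, 2.0
for n in (5, 50, 500):
    T_n = concentrating_design(tau, n)
    xs = [theta * t ** 2 for t in T_n]  # noise-free observations, for illustration only
    print(n, predictive_params(tau, T_n, xs))
```

As $n$ grows, the predictive variance decreases toward 1 and the predictive mean approaches $\theta\tau^{2}$, in line with $X_{\tau}\sim\mathcal{N}(Y,1)$ with $Y=\theta\tau^{2}$.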

2.4 Regularity

The smoothness property of $G$ in Eq. 13 may seem unsatisfying at first glance; it bounds the absolute difference in the underlying mixing measure process at nearby locations only in expectation, leaving room for the possibility of sample discontinuities in $G_{t}$ as a function of $t$ . However, there are many probabilistic models that, intuitively, generate observations that should be considered locally exchangeable but which have discontinuous latent mixing measures. For example, some dynamic nonparametric mixture models (Lin and Fisher, 2010; Chen et al., 2013) have components that are created and destroyed over time, causing discrete jumps in the mixing measure. As long as the jumps happen at diffuse random times, the probability of a jump occurring between two times decreases as the difference in time decreases, and the observations may still be locally exchangeable. By contrast, if there is a fixed location $t_{0}$ with a nonzero probability of a discrete jump in the mixing measure process, the observations $X$ cannot be locally exchangeable. Corollary 8 provides the precise statement.

Corollary 8.

Suppose $(d,\mathcal{T})$ is infinitely separable and $X$ is locally exchangeable with respect to $d$ . Then for all $A\in\Sigma$ , $t_{0}\in\mathcal{T}$ , and $\epsilon>0$ ,

[TABLE]

That being said, it is worth examining whether different guarantees on properties of the underlying measure process $G$ result as a consequence of different properties of the premetric $d$ . Theorem 9 answers this question in the affirmative for processes on $\mathcal{T}=\mathbb{R}$ ; in particular, the faster the decay of $d(t,t^{\prime})$ relative to $|t-t^{\prime}|$ as $t\to t^{\prime}$ , the stronger the guarantees on the behavior of the mixing measure $G$ . Note that while this result is presented for covariate space $\mathbb{R}$ , the result can be extended to processes on $\mathbb{R}\times\mathbb{N}$ and more general separable spaces (Pothoff, 2009, Theorems 2.8, 2.9, 4.5).

Theorem 9.

Let $\mathcal{T}=\mathbb{R}$ , $\gamma\geq 0$ , and $X$ be locally exchangeable with respect to a premetric $d$ satisfying $d(t,t^{\prime})=O(|t-t^{\prime}|^{1+\gamma})$ as $|t-t^{\prime}|\to 0$ . Then:

1. ( $\gamma>1$ ): $X$ is exchangeable and $G$ is a constant process.

2. ( $0<\gamma\leq 1$ ): $X$ is stationary and for any $A\in\Sigma$ and $\alpha\in(0,\gamma)$ , $(G_{t}(A))_{t\in\mathbb{R}}$ is weak-sense stationary with an $\alpha$ -Hölder continuous modification.

3. ( $\gamma=0$ ): $G$ may have no continuous modification.

Remark.

A rough converse of the first point holds: $X$ exchangeable implies constant $G$ , and $d(t,t^{\prime})=0$ is trivially $O(|t-t^{\prime}|^{1+\gamma})$ for $\gamma>1$ . But a similar claim for the second point is not true in general: $X$ stationary and locally exchangeable does not necessarily imply that $d(t,t^{\prime})=O(|t-t^{\prime}|^{1+\gamma})$ for $0<\gamma\leq 1$ . For a counterexample, consider a square wave shifted by a uniform random variable, i.e., the process $X_{t}=\mathrm{sign}\left(\sin(2\pi(t-U))\right)$ for $U\sim{\sf{Unif}}[0,1]$ . Here $X_{t}$ is stationary and locally exchangeable with $d(t,t^{\prime})=\min(|t-t^{\prime}|,1)$ , but $|t-t^{\prime}|\neq O(|t-t^{\prime}|^{1+\gamma})$ for any $\gamma>0$ as $|t-t^{\prime}|\to 0$ .
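The linear (rather than superlinear) behaviour in this counterexample can be seen numerically. The sketch below evaluates the shifted square wave on a deterministic midpoint grid for $U$ and reports the disagreement probability $\mathbb{P}(X_{t}\neq X_{t^{\prime}})$ for shrinking gaps. This probability is used only as a simple illustrative quantity and is not itself the premetric $d$; the grid size and function names are assumptions of the sketch.

```python
import numpy as np

def square_wave(t: float, u: np.ndarray) -> np.ndarray:
    """X_t = sign(sin(2*pi*(t - U))) evaluated for an array of shifts U."""
    return np.sign(np.sin(2.0 * np.pi * (t - u)))

def disagreement_probability(t: float, t_prime: float, n_grid: int = 200_000) -> float:
    """Approximate P(X_t != X_t') by averaging over a midpoint grid for U ~ Unif[0, 1]."""
    u = (np.arange(n_grid) + 0.5) / n_grid  # deterministic stand-in for uniform draws
    return float(np.mean(square_wave(t, u) != square_wave(t_prime, u)))

for gap in (0.2, 0.1, 0.05, 0.025):
    p = disagreement_probability(0.3, 0.3 + gap)
    print(f"gap={gap:<6} P(X_t != X_t') ~ {p:.4f}  ratio p/gap ~ {p / gap:.3f}")
```

The ratio of the disagreement probability to the gap stays roughly constant as the gap shrinks, which is what one expects from a rate that is linear in $|t-t^{\prime}|$ and not $O(|t-t^{\prime}|^{1+\gamma})$ for any $\gamma>0$.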

2.5 Approximate conditional independence

In the classical setting of exchangeable sequences $X_{1},X_{2},\dots$ , the empirical measure $\widehat{G}=\frac{1}{N}\sum_{n=1}^{N}\delta_{X_{n}}$ satisfies the following property: for all bounded measurable functions $h:\mathcal{X}^{N}\to\mathbb{R}$ ,

[TABLE]

Thus $G$ and $(X_{1},\dots,X_{N})$ are conditionally independent given $\widehat{G}$ . In other words, the fact that $(X_{1},\dots,X_{N})$ corresponds to covariate values $(1,\dots,N)$ provides no additional information about $G$ beyond $\widehat{G}$ itself.

In the setting of local exchangeability, the question of how important the covariate values are in inferring the measure-valued process $G$ is relevant in practice: we do not often get to observe the true covariate values $\{t_{1},\dots,t_{N}\}=T\subset\mathcal{T}$ , but rather we observe discretized versions that are grouped into “bins.” For example, if $X_{T}$ corresponds to observed document data with timestamps $T$ , we may know those timestamps up to only a certain precision (e.g. days, months, years). This section shows that a “binned” version of the empirical measure $\widehat{G}$ provides an approximate conditional independence similar to Eq. 31, where the error of approximation decays smoothly by an amount corresponding to the uncertainty in covariate values.

Formally, suppose we partition our covariate space $\mathcal{T}$ into disjoint bins $\{\mathcal{T}_{k}\}_{k=1}^{\infty}$ , where each bin has observations $T_{k}=\mathcal{T}_{k}\cap T$ . We may use a finite partition by setting all but finitely many $\mathcal{T}_{k}$ to the empty set. Although we know the number of points in each bin (i.e., the cardinality of $T_{k}$ ), we will encode our lack of knowledge of their positions as randomness: $T_{k}\sim\mu_{k}$ , where $\mu_{k}$ is a probability distribution capturing our belief of how the unobserved covariates are generated within each bin. Following the intuition from the classical de Finetti’s theorem, we define the binned empirical measures $\widetilde{G}_{k}=\sum_{t\in T_{k}}\delta_{X_{t}}$ , $\widetilde{G}:=(\widetilde{G}_{1},\widetilde{G}_{2},\dots)$ , and let $\mathcal{G}$ denote the subgroup of permutations $\pi:T\to T$ that permute observations only within each bin, i.e., such that $\forall k\in\mathbb{N}$ , $\pi(T_{k})=T_{k}$ . Note that $|\mathcal{G}|=\prod_{k=1}^{\infty}|T_{k}|!<\infty$ since there are only finitely many observations in total. Unlike classical exchangeability, $\widetilde{G}$ does not provide exact conditional independence of $X_{T}$ and $G$ ; but Theorem 10 guarantees that it provides a form of approximate conditional independence, with error that depends on $(\mu_{k})_{k=1}^{\infty}$ .

Theorem 10.

Suppose $(d,\mathcal{T})$ is infinitely separable. If $X$ is locally exchangeable with respect to $d$ , and $h:\mathcal{X}^{T}\to\mathbb{R}$ is a bounded measurable function,

[TABLE]

where $\pi\sim{\sf{Unif}}\left(\mathcal{G}\right)$ and $T_{k}\overset{\textrm{\tiny{indep}}}{\sim}\mu_{k}$ .

Remark.

Note that the expectation on the right hand side averages over the randomness both in the uncertain covariates $T$ and the permutation $\pi$ .

If $X$ is exchangeable within each bin $\mathcal{T}_{k}$ , Theorem 10 states that $X_{T}$ and $G$ are conditionally independent given $\widetilde{G}$ , as desired. Further, the deviation from independence is controlled by the deviation from exchangeability within each bin. In particular,

[TABLE]

where $\operatorname{diam}{\mathcal{T}_{k}}:=\sup_{t,t^{\prime}\in\mathcal{T}_{k}}d(t,t^{\prime})$ . Both bounds in Eq. 33 are independent of $\mu_{k}$ ; thus the result holds even if we are unwilling to express our uncertainty in the binned covariates via a distribution.
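The binned empirical measures $\widetilde{G}_{k}=\sum_{t\in T_{k}}\delta_{X_{t}}$ and the size of the within-bin permutation group, $|\mathcal{G}|=\prod_{k}|T_{k}|!$, defined earlier in this section can be sketched as follows. The toy data, the bin labels, and the function names are hypothetical and serve only to make the definitions concrete.

```python
from collections import Counter
from math import factorial

def binned_empirical_measures(observations, bin_ids):
    """Return {bin k: Counter of values}, the unnormalised measure sum_{t in T_k} delta_{X_t}."""
    measures = {}
    for x, k in zip(observations, bin_ids):
        measures.setdefault(k, Counter())[x] += 1
    return measures

def within_bin_group_size(bin_ids):
    """|G| = prod_k |T_k|!, the number of permutations that only move points within bins."""
    size = 1
    for count in Counter(bin_ids).values():
        size *= factorial(count)
    return size

# Hypothetical toy data: five observations whose timestamps are known only up to the day.
observations = ["a", "b", "a", "c", "b"]
bin_ids = ["day1", "day1", "day1", "day2", "day2"]

print(binned_empirical_measures(observations, bin_ids))
print(within_bin_group_size(bin_ids))
```

Because each $\widetilde{G}_{k}$ records only which values fall in bin $k$ and how often, rearranging observations inside a bin leaves $\widetilde{G}$ unchanged.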

3 Examples

In this section, we provide example applications of the theory in Section 2. First, we use a case study of Gaussian processes to show how one can use posterior predictive distributions to analyze the local exchangeability of a process. In particular, we show how to derive the underlying measure process $G$ , as well as an appropriate premetric $d$ governing local exchangeability, using only finite marginals of the process $X$ . Second, we use a case study of dependent Dirichlet processes to show that one can use local empirical measures as a surrogate for otherwise intractable posterior predictive distributions in discrete Bayesian nonparametric models. See the appendix for other examples of Bayesian nonparametric models exhibiting local exchangeability—e.g., kernel beta process feature models (Hjort, 1990; Ren et al., 2011) and dynamic topic models (Blei and Lafferty, 2006; Wang, Blei and Heckerman, 2008), among others. Finally, we demonstrate a usage of local exchangeability as a tool to analyze the inflation of type-I error in matched permutation tests involving covariates.

3.1 Obtaining the underlying measure-valued process and premetric

We will first provide an example of how one can use the Bayesian posterior predictive distributions of a locally exchangeable process $X$ to derive the distribution of the underlying measure-valued process $G$ as well as the premetric of local exchangeability $d$ . This example applies the same strategy as in the running example from Section 2.3, albeit in a more sophisticated nonparametric model.

Consider a Gaussian process $X\sim{\mathrm{GP}}(m,\kappa)$ on $\mathcal{T}=\mathbb{R}^{d}$ with continuous mean function $m:\mathbb{R}^{d}\to\mathbb{R}$ , and covariance function $\kappa(x,y)=\sigma^{2}(x)\mathds{1}[x=y]+k(x,y)$ for continuous nonnegative $\sigma^{2}:\mathbb{R}^{d}\to\mathbb{R}_{+}$ and continuous symmetric positive-definite kernel $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}_{+}$ . Define a set of $k$ unique covariate values $\tau_{1},\dots,\tau_{k}\in\mathcal{T}$ , and consider the Euclidean metric on $\mathcal{T}$ . For each $n\in\mathbb{N}$ and $i=1,\dots,k$ , let $T_{in}$ be a finite subset of covariates such that $|T_{in}|=n$ and $\max\{\|\tau_{i}-t\|:t\in T_{in}\}=o(1/n)$ . Direct analysis of the conditional density yields that as $n\to\infty$ , the conditional distribution of $X_{\tau_{1}},\dots,X_{\tau_{k}}$ given $X_{T_{1n}},\dots,X_{T_{kn}}$ converges to

[TABLE]

where

[TABLE]

Eqs. 34 and 35 demonstrate that $X$ is conditionally independently drawn from the process $G$ where

[TABLE]
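As a numerical sanity check of this limit, the sketch below computes the conditional variance of $X_{\tau}$ given $n$ observations whose covariates lie within $1/n^{2}$ of $\tau$. It assumes a one-dimensional covariate, a squared-exponential kernel for $k$, and constant noise variance; these choices and all names are illustrative and not taken from the text. Under the covariance $\kappa$ defined above, the nearby observations pin down the smooth part of the process at $\tau$, so the conditional variance should approach the noise variance $\sigma^{2}(\tau)$.

```python
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """k(x, y) = exp(-(x - y)^2 / 2) for 1-D inputs; a stand-in for the smooth kernel k."""
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2)

def conditional_variance(tau: float, n: int, noise_var: float) -> float:
    """Var(X_tau | X_T) when T holds n distinct covariates within 1/n^2 of tau."""
    t = tau + np.arange(1, n + 1) / n**3               # max distance is 1/n^2 = o(1/n)
    k_tt = rbf_kernel(t, t) + noise_var * np.eye(n)    # kappa on T: kernel plus diagonal noise
    k_tau_t = rbf_kernel(np.array([tau]), t)           # tau is not in T, so no noise term here
    prior_var = noise_var + 1.0                        # kappa(tau, tau) = sigma^2 + k(tau, tau)
    reduction = k_tau_t @ np.linalg.solve(k_tt, k_tau_t.T)
    return float(prior_var - reduction[0, 0])

for n in (5, 20, 80, 200):
    print(n, conditional_variance(tau=0.0, n=n, noise_var=0.5))
```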

We now derive the strong canonical premetric of local exchangeability. In this setting,

[TABLE]

By Devroye, Mehrabian and Reddad (2020, Theorem 1.3),

[TABLE]

Applying Jensen’s inequality $\mathbb{E}|Y_{t}-Y_{t^{\prime}}|\leq\sqrt{\mathbb{E}(Y_{t}-Y_{t^{\prime}})^{2}}$ , then evaluating the expectation and using the bounds $|\sigma^{2}(t)-\sigma^{2}(t^{\prime})|\leq 2\max\{\sigma(t),\sigma(t^{\prime})\}|\sigma(t)-\sigma(t^{\prime})|$ and $\sqrt{x^{2}+y^{2}}\leq x+y$ yields

[TABLE]

In the usual setting with zero mean $m(t)=0$ , constant noise variance $\sigma(t)=\sigma$ for some $\sigma>0$ , and stationary kernel $k(t,t^{\prime})=r(\|t-t^{\prime}\|)$ for some $r:\mathbb{R}_{+}\to\mathbb{R}_{+}$ , Eq. 39 reduces to

[TABLE]

This example demonstrates that Gaussian processes are locally exchangeable in the presence of measurement noise, i.e. where $\sigma(t)>0$ . However, note that $\sigma(t)>0$ is not strictly necessary for local exchangeability; to obtain a necessary and sufficient characterization of local exchangeability in Gaussian processes, we could instead analyze the canonical metric $d_{c}$ per Theorem 5.

3.2 Approximate predictive distributions in discrete Bayesian nonparametrics

Next, we demonstrate that the local empirical measure can serve as a useful surrogate for otherwise intractable posterior predictive distributions in discrete Bayesian nonparametric models. The Dirichlet process (Ferguson, 1973) is a popular prior for the weights and component parameters in nonparametric mixture models. Draws from a Dirichlet process are discrete probability measures,

[TABLE]

where $(w_{k})_{k=1}^{\infty}$ are weights satisfying $w_{k}\geq 0$ , $\sum_{k}w_{k}=1$ , and $(\theta_{k})_{k=1}^{\infty}$ are component parameters, each with distribution given by (Sethuraman, 1994)

[TABLE]

for some distribution $H$ and concentration parameter $\alpha>0$ . Given draws $X_{n}\overset{\textrm{\tiny{i.i.d.}}}{\sim}G$ , the posterior predictive distribution of $X_{N+1}$ given the first $N$ draws $X_{1},\dots,X_{N}$ is

[TABLE]

The fact that one can marginalize the (infinitely many) weights and parameters to arrive at Eq. 43 is critical in tractable computational inference for models involving the Dirichlet process (Neal, 2000).
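A minimal sketch of sampling from a predictive rule of this kind, assuming the standard Pólya urn form of the Dirichlet process predictive: a fresh draw from $H$ with probability $\alpha/(\alpha+N)$, and otherwise a uniformly chosen past observation. The base sampler (a rounded standard normal), the seed, and the function names are hypothetical choices made for illustration.

```python
import random
from collections import Counter

def dp_predictive_weights(draws, alpha):
    """Return (probability of a fresh draw from H, {observed value: probability})."""
    n = len(draws)
    p_new = alpha / (alpha + n)
    p_old = {x: c / (alpha + n) for x, c in Counter(draws).items()}
    return p_new, p_old

def sample_next(draws, alpha, base_sampler, rng):
    """Draw X_{N+1} given X_1..X_N without ever instantiating weights or atoms."""
    if rng.random() < alpha / (alpha + len(draws)):
        return base_sampler(rng)          # new value from the base distribution H
    return rng.choice(draws)              # uniform over past draws = proportional to counts

rng = random.Random(0)
draws = []
for _ in range(20):
    draws.append(sample_next(draws, 1.0, lambda r: round(r.gauss(0.0, 1.0), 3), rng))

print(Counter(draws).most_common(3))
print(dp_predictive_weights(draws, 1.0)[0])
```

The sampler never touches the infinite sequences $(w_{k})$ and $(\theta_{k})$, which is the practical benefit of the marginalization described above.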

When the observations come with additional covariate information, the dependent Dirichlet process mixture model (MacEachern, 1999, 2000) may be used instead. There are many instantiations of the dependent Dirichlet process; for simplicity we consider a model where the weights are a function of a covariate but the component parameters are constant across covariate values, i.e.,

[TABLE]

where $w_{x,k}=v_{x,k}\prod_{i=1}^{k-1}(1-v_{x,i})$ , and the stick variables $v_{x,k}$ are now i.i.d. stochastic processes on $\mathbb{R}$ . The marginal distributions of $v_{x,k}$ at $x\in\mathbb{R}$ are designed to be ${\sf{Beta}}(1,\alpha)$ so that the dependent Dirichlet process is marginally a Dirichlet process for each covariate value. But even for simple stochastic processes $v_{x,k}$ , the posterior predictive distribution is not tractable to obtain in closed-form. However, we can note that the process $X$ is locally exchangeable with strong canonical premetric

[TABLE]

where $t=(x,n)$ and $t^{\prime}=(x^{\prime},n^{\prime})$ . Since $w_{x,k}$ is a product of independent variables, Lemma 13 yields

[TABLE]

The infinite sum converges to some $0<C<\infty$ , and so

[TABLE]

Therefore, as long as the stochastic process $v_{x,1}$ is smooth enough, and we condition on $X_{T}$ , where $T$ concentrates closely around $\tau\in\mathcal{T}$ , the posterior predictive distribution of $X_{\tau}$ given $X_{T}$ is approximately equal to the local empirical measure $\widehat{G}_{\tau}$ , by Theorem 6; the latter has a tractable closed-form expression.
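To illustrate why the local empirical measure is tractable, the sketch below computes weights of the form $\xi_{t}(\tau)=\max\{0,1/M_{\tau}+2(\mu_{\tau}-d(t,\tau))\}$, the expression used in the proof of Theorem 6 in Appendix A, and assembles the weighted empirical measure. Two points are assumptions of this sketch and not statements from the text: $\mu_{\tau}$ is taken to be the mean of the $M_{\tau}$ smallest distances, and $M_{\tau}$ is taken to be the largest $M$ for which the $M$-th nearest covariate still receives positive weight. With that choice the weights are nonnegative and sum to one.

```python
import numpy as np

def local_weights(distances: np.ndarray) -> np.ndarray:
    """Weights xi_t = max(0, 1/M + 2*(mu_M - d_t)) over the observed covariates."""
    d_sorted = np.sort(distances)
    m = np.arange(1, len(d_sorted) + 1)
    mu = np.cumsum(d_sorted) / m                      # mu_M: mean of the M smallest distances
    positive = 1.0 / m + 2.0 * (mu - d_sorted) > 0    # always true at M = 1
    M = int(m[positive].max())                        # assumed selection rule for M_tau
    return np.maximum(0.0, 1.0 / M + 2.0 * (mu[M - 1] - distances))

def local_empirical_measure(values, distances):
    """Return {value: mass} for the weighted empirical measure sum_t xi_t * delta_{X_t}."""
    weights = local_weights(np.asarray(distances, dtype=float))
    measure = {}
    for x, w in zip(values, weights):
        if w > 0:
            measure[x] = measure.get(x, 0.0) + float(w)
    return measure

# Hypothetical observations and their premetric distances d(t, tau) to the query point.
values = ["topic_a", "topic_b", "topic_a", "topic_c"]
distances = [0.0, 0.1, 0.2, 1.0]
print(local_empirical_measure(values, distances))
```

Observations far from $\tau$ receive zero weight, so only a neighbourhood of the query covariate contributes to the estimate.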

3.3 Type-I error inflation in grouped permutation tests

One of the key applications of exchangeability in statistical data analysis is in the design of nonparametric permutation tests with exact type-I error bounds (Pitman, 1937a, b, c; Fisher, 1966, Ch. 3). In the notation of this work, we are given observations of a stochastic process $X$ at a finite set of covariates $T\subset\mathcal{T}$ , a subgroup $\mathcal{G}$ of permutations $\pi:T\to T$ , and a test statistic $S:\mathcal{X}^{T}\to\mathbb{R}$ . The null hypothesis is that $X_{T}$ is exchangeable; so we set a desired threshold $\alpha\in[0,1]$ , and reject the null with type-I error at most $\alpha$ if

[TABLE]

where $X_{\pi,T}$ is defined as in Eq. 5. This setup is commonly used in observational studies with a control group and treatment group, where $\mathcal{G}$ consists of permutations that swap matched pairs of elements in the control and treatment groups. However, a typical problem is that elements in the two groups are not exactly comparable due to the presence of covariates. In this case, a standard approach is to construct $\mathcal{G}$ to permute only those elements with similar covariates from the control and treatment groups, under some metric $d$ (Cochran, 1965; Rubin, 1973a, b; Rosenbaum, 1989, 2002; Lu and Rosenbaum, 2004; Greevy et al., 2004; Hansen, 2004; Hansen and Klopfer, 2006; Baiocchi et al., 2010; Lu et al., 2011). Local exchangeability provides a general way to analyze the type-I error of these methods; Proposition 11 shows that for a locally exchangeable process, the type-I error $\alpha$ may potentially be increased by the average distance between pairs of covariates permuted by $\pi\in\mathcal{G}$ . Eq. 49 also incidentally provides a rigorous justification for past work that formulates the construction of $\mathcal{G}$ as the minimization of this penalty (e.g., Rosenbaum (1989)).

Proposition 11.

Let $X$ be locally exchangeable with respect to $d$ . For $\alpha\in[0,1]$ ,

[TABLE]
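The matched-pair setup can be sketched as follows, assuming the rejection rule takes the usual permutation-test form: compute the proportion of permutations in $\mathcal{G}$ whose statistic is at least the observed one, and reject when that proportion is at most $\alpha$. Here $\mathcal{G}$ is taken to be all within-pair swaps, and the statistic (mean treated-minus-control difference), the toy data, and the function names are hypothetical.

```python
from itertools import product

def matched_pair_p_value(pairs, statistic):
    """Fraction of within-pair swaps whose statistic is at least the observed one."""
    observed = statistic(pairs)
    count = 0
    total = 0
    # Enumerate the group G: each matched pair is either left alone or swapped.
    for swaps in product([False, True], repeat=len(pairs)):
        permuted = [(b, a) if s else (a, b) for (a, b), s in zip(pairs, swaps)]
        count += statistic(permuted) >= observed
        total += 1
    return count / total

def mean_difference(pairs):
    """Average of (treated - control) over pairs stored as (control, treated)."""
    return sum(treated - control for control, treated in pairs) / len(pairs)

# Hypothetical matched pairs (control outcome, treated outcome).
pairs = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
p_value = matched_pair_p_value(pairs, mean_difference)
print(p_value)
```

When the matched covariates are not identical, the exactness of this procedure relies on exchangeability within pairs, which is the assumption that Proposition 11 relaxes.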

4 Discussion

The major question posed in this paper is what we can do with data when we do not believe that they are exchangeable, but are willing to believe that they are nearly exchangeable. This paper answers the question with a relaxed notion of local exchangeability in which swapping data associated with nearby covariates causes a bounded change in total variation distance. We have demonstrated that classical results for exchangeable processes are “robust to the real world;” indeed, locally exchangeable processes have a de Finetti representation that may be leveraged in the design of statistical models and inference procedures. Finally, many popular covariate-dependent statistical models—which violate the assumptions of exchangeability—satisfy local exchangeability, extending the reach of exchangeability-based analyses to these models.

One limitation of local exchangeability is the infinite separability assumption. There are applications in which the covariate space $\mathcal{T}$ has isolated points that violate this condition, e.g., discrete time series where the covariate space is $\mathcal{T}=\mathbb{N}$ endowed with the Euclidean metric. However, if $X$ can be extended to a process on $\mathcal{S}\supseteq\mathcal{T}$ such that $(d,\mathcal{S})$ is infinitely separable and $(X_{s})_{s\in\mathcal{S}}$ is locally exchangeable with respect to $d$ , then the theoretical results from this work hold for the marginal process $(X_{t})_{t\in\mathcal{T}}$ . Another limitation is that the total variation bound in the definition of local exchangeability is quite weak, which has downstream consequences for the tightness of the error bounds in Section 2.3. Further study on alternate definitions of local exchangeability is warranted to strengthen these guarantees.

As a final note, it is also possible that an analogue of the theory of finite exchangeability (Diaconis and Freedman, 1980a) holds in the local setting; but it is not yet clear whether this is indeed true or what form it would take. It would also be of interest to investigate more general notions of local exchangeability under group actions, e.g., permutations that preserve some statistic of the data, which have been used in past work on randomization testing in the presence of covariates (Rosenbaum, 1984).

Acknowledgements

The authors thank Jonathan Huggins for illuminating discussions. T. Campbell is supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and Discovery Launch Supplement. T. Broderick is supported in part by an NSF CAREER Award, an ARO YIP Award, ONR, and a Sloan Research Fellowship.

Appendix A Proofs

Proof of Proposition 2.

Choose some ordering of the countable set $T=(t_{1},t_{2},\dots)$ . We note that $(X_{t_{n}})_{n=1}^{\infty}$ and $(X_{\pi(t_{n})})_{n=1}^{\infty}$ are $\mathcal{X}^{\infty}$ -valued random variables that are measurable with respect to $\Sigma^{\infty}$ , which is generated by the algebra of cylinder sets of the form $U\times\mathcal{X}^{\infty}$ for $U\in\Sigma^{N}$ . Therefore,

[TABLE]

where we have replaced $\Sigma^{\infty}$ with its generator by the fact that for any algebra of sets $\mathcal{A}$ , $\epsilon>0$ , $B\in\sigma(\mathcal{A})$ , and probability measures $\mu,\nu$ on $\sigma(\mathcal{A})$ , there exists an $A\in\mathcal{A}$ such that $\frac{1}{2}(\mu+\nu)(B\triangle A)<\epsilon$ . So by the definition of local exchangeability for finite sets of covariates,

[TABLE]

∎

Proof of Theorem 5.

We start with the reverse direction. Define the two product measures $G_{T}=\prod_{t\in T}G_{t}$ and $G_{\pi,T}=\prod_{t\in T}G_{\pi(t)}$ . Then since $\mathbb{P}(X_{T}\in A)=\mathbb{E}\left[G_{T}(A)\right]$ and $\mathbb{P}(X_{\pi,T}\in A)=\mathbb{E}\left[G_{\pi,T}(A)\right]$ , by Jensen’s inequality,

[TABLE]

Finally, the proof technique of Sendler (1975, Lem. 2.1) and the smoothness of $G$ yield the conclusion,

[TABLE]

For the forward direction, suppose $X$ is locally exchangeable. Let $(t_{n})_{n=1}^{\infty}$ be any ordering of the countable set $\mathfrak{T}$ from Definition 4, and let $\mathcal{F}$ be the tail $\sigma$ -algebra of $(X_{t_{n}})_{n=1}^{\infty}$ . We will show that for any two covariates $r,s\in\mathcal{T}\setminus\mathfrak{T}$ , $r\neq s$ , $X_{r}$ and $X_{s}$ are conditionally independent given $\mathcal{F}$ . The argument extends via standard methods to $r,s$ that may be elements of $\mathfrak{T}$ , and then to any finite subset of $\mathcal{T}$ .

By infinite separability (Definition 4), there exists a subsequence $i_{1}<i_{2}<\dots$ of indices such that $t_{i_{n}}$ is Cauchy and converges to $s$ . By taking another subsequence we can assume without loss of generality for all $N\in\mathbb{N}$ , $i_{N}>N$ and $d(s,t_{i_{N}})+\sum_{n=N}^{\infty}d(t_{i_{n}},t_{i_{n+1}})<1/N$ . Let $\pi_{N}$ be the mapping that takes $s\to t_{i_{N}}$ , $t_{i_{n}}\to t_{i_{n+1}}$ for all $n\geq N$ , and leaves all other $t\in\mathcal{T}$ fixed. Then denote $Y_{N}=(X_{s},X_{t_{N}},X_{t_{N+1}},\dots)$ , and let $Z_{N}$ be the sequence with covariates mapped under $\pi_{N}$ . By reverse martingale convergence, for any bounded measurable $\phi:\mathcal{X}\to\mathbb{R}$ ,

[TABLE]

as $N\to\infty$ . Next, by local exchangeability and Proposition 2,

[TABLE]

and by Lemma 12(2), we have that the Wasserstein distance between $\mathbb{E}\left[\phi(X_{r})\,|\,Y_{N}\right]$ and $\mathbb{E}\left[\phi(X_{r})\,|\,Z_{N}\right]$ converges to 0 as $N\to\infty$ . Together, the Wasserstein distance bound and reverse martingale above yield

[TABLE]

By Aldous (1985, Lemma 3.4),

[TABLE]

and thus $X_{r}$ and $X_{s}$ are conditionally independent given $\mathcal{F}$ . As mentioned earlier this argument extends to any finite subset $T$ of covariates, by considering subsequences of $(t_{n})_{n=1}^{\infty}$ converging to each $t\in T$ . Since $X$ takes values in a standard Borel space, there is a random measure $G_{t}$ for each $t\in\mathcal{T}$ for which $G_{t}(A)\overset{a.s.}{=}\mathbb{E}\left[\mathds{1}[X_{t}\in A]\,|\,\mathcal{F}\right]$ (e.g. Kallenberg, 2002, Theorem 6.3). The collection of these random measures forms the desired stochastic process $G=(G_{t})_{t\in\mathcal{T}}$ .

Next, we develop the smoothness property of $G$ . By both reverse and forward martingale convergence, we have that

[TABLE]

Using dominated convergence to move the limits out of the expectation, local exchangeability to bound the total variation between $(X_{t},X_{t_{n:n+m}})$ and $(X_{t^{\prime}},X_{t_{n:n+m}})$ , and Lemma 12(1),

[TABLE]

Finally, we show that $G$ is approximated by empirical averages of the observations $X$ ; this property will be used below to show that $G$ is unique up to modification. Consider any $A\in\Sigma$ and any sequence $(t^{\prime}_{n})_{n=1}^{\infty}$ converging to $s\in\mathcal{T}$ such that $d(t^{\prime}_{n},s)\leq 2^{-n}$ for each $n\in\mathbb{N}$ . Define $S_{s,N}=\frac{1}{N}\sum_{n=1}^{N}\mathds{1}[X_{t^{\prime}_{n}}\in A]$ . Then

[TABLE]

Noting that the right term is $\mathcal{F}$ -measurable and applying Hoeffding’s inequality to the left,

[TABLE]

Splitting the above expectation across two events—one where the measures satisfy

[TABLE]

and the other its complement—yields

[TABLE]

Applying Markov’s inequality, the triangle inequality, and Eq. 63,

[TABLE]

Thus, $S_{s,N}\overset{p}{\to}G_{s}(A)$ . We now show that $G$ is unique. Suppose there is another measure process $G^{\prime}$ that satisfies Eq. 13, from which $X$ is generated conditionally independently given some $\sigma$ -algebra $\mathcal{F}^{\prime}$ . By repeating the steps above, one can show that $S_{s,N}\overset{p}{\to}G^{\prime}_{s}(A)$ . Therefore,

[TABLE]

Since $(\mathcal{X},\Sigma)$ is a standard Borel space, $\Sigma=\sigma(\mathcal{A})$ for some countable algebra of sets $\mathcal{A}$ (Preston, 2008, Prop. 3.1, 3.3). By noting that the countable intersection of unit-measure sets is also unit-measure,

[TABLE]

Finally by Carathéodory’s extension theorem (Kallenberg, 2002, Theorem 2.5), the probability measures $G_{s}$ and $G^{\prime}_{s}$ are almost surely equal. The extension of this argument to any finite subset of covariates $T\subset\mathcal{T}$ is straightforward, implying that $(G_{t})_{t\in\mathcal{T}}$ is uniquely determined up to modification. ∎

Proof of Theorem 6.

First, since $c_{i}\geq 0$ and $\sum_{i}c_{i}=1$ , by Jensen’s inequality,

[TABLE]

We will focus on a single term in the sum for some $A\in\Sigma$ and drop the $i$ subscript, as the bound for all terms will be identical. Adding and subtracting $\sum_{t\in T}\xi_{t}(\tau)G_{t}(A)$ ,

[TABLE]

Since $X$ is locally exchangeable, by Theorem 5, it is conditionally independently drawn from $G$ . Therefore $\mathbb{E}\left[\widehat{G}_{\tau}(A)\,|\,G\right]=\sum_{t\in T}\xi_{t}(\tau)G_{t}(A)$ . Hence we can use the tower property and expand the square to find that

[TABLE]

The first term can be bounded by using the same conditional independence property again—in particular, that $\mathbb{E}\left[\mathds{1}\left[X_{t}\in A\right]\,|\,G\right]=G_{t}(A)$ —followed by Popoviciu’s inequality:

[TABLE]

For the second term, we first apply Jensen’s inequality by noting that $\xi_{t}(\tau)\geq 0$ , $\sum_{t\in T}\xi_{t}(\tau)=1$ ,

[TABLE]

Since $0\leq|G_{t}(A)-G_{\tau}(A)|\leq 1$ , we have that $(G_{t}(A)-G_{\tau}(A))^{2}\leq|G_{t}(A)-G_{\tau}(A)|$ . Finally by Theorem 5, we know that $\mathbb{E}\left[\left|G_{t}(A)-G_{\tau}(A)\right|\right]\leq d(t,\tau)$ . Hence

[TABLE]

We can combine the bounds on the first and second terms for each set $A_{i}$ , $i\in\mathbb{N}$ , since $\sum_{i=1}^{\infty}c_{i}=1$ :

[TABLE]

Before proceeding further with this bound by substituting the definition of $\xi_{t}(\tau)$ , we will obtain a similar result for the tail bound. We add and subtract $\sum_{t\in T}\xi_{t}(\tau)G_{t}$ and use the triangle inequality:

[TABLE]

By Lemma 15, we have

[TABLE]

For the first term in the sum, note that $\widehat{G}_{\tau}$ is a function of $X_{T}$ , which are conditionally independent given $G$ . Further note that for each $t\in T$ , the value of $\left\|\widehat{G}_{\tau}-\sum_{t\in T}\xi_{t}(\tau)G_{t}\right\|_{\mathcal{A}}$ can change by at most $\xi_{t}(\tau)$ when varying the value of $X_{t}$ . Therefore by McDiarmid’s inequality,

[TABLE]

whenever $\delta/2\geq\mathbb{E}\left\|\widehat{G}_{\tau}-\sum_{t\in T}\xi_{t}(\tau)G_{t}\right\|_{\mathcal{A}}$ . Expanding the definition of the norm and using Jensen’s inequality yields

[TABLE]

at which point the same logic as in Eq. 79 yields

[TABLE]

and hence for all $\delta\geq\sqrt{\sum_{t\in T}\xi_{t}(\tau)^{2}}$ ,

[TABLE]

For the second term in the sum, we apply Markov’s inequality to find that

[TABLE]

Noting that $\sum_{t\in T}\xi_{t}(\tau)=1$ , we can apply Jensen’s inequality,

[TABLE]

Finally by local exchangeability and Theorem 5,

[TABLE]

Combining the bounds for the first and second term and shifting $\delta$ yields, for all $\delta>0$ ,

[TABLE]

We now substitute the definition of $\xi_{t}(\tau)=\max\{0,1/M_{\tau}+2(\mu_{\tau}-d(t,\tau))\}$ into both results in Eqs. 84 and 97. First, note that (suppressing $(\tau)$ notation in the remainder of the proof for brevity),

[TABLE]

where

[TABLE]

Further, by Bhatia and Davis (2000, Theorem 1) and the definition of $M_{\tau}$ ,

[TABLE]

Therefore

[TABLE]

and

[TABLE]

Finally, because neither upper bound depends explicitly on $\mathcal{A}$ , we can take the supremum. To obtain the same results for $\mathbb{P}\left(X_{\tau}\in\cdot\,|\,X_{T}\right)$ , we apply the same proof technique, noting that (1) $\widehat{G}_{\tau}(A)=\mathbb{E}\left[\widehat{G}_{\tau}(A)\,|\,X_{T}\right]$ and (2) by the tower property and de Finetti result in Theorem 5, $\mathbb{E}\left[\mathds{1}[X_{\tau}\in A]\,|\,X_{T}\right]=\mathbb{E}\left[G_{\tau}(A)\,|\,X_{T}\right]$ . ∎

Proof of Corollary 7.

For each $M\in[|T_{n}|]$ , denote $\mu_{M}=\frac{1}{M}\sum_{m=1}^{M}d_{m}$ . Note that for any $M<M_{\tau}$ ,

[TABLE]

Therefore, for all $M<M_{\tau}$ ,

[TABLE]

We iterate this bound from $m=M$ to $m=M_{\tau}-1$ to find that for all $M<M_{\tau}$ ,

[TABLE]

Finally, we rearrange this bound to obtain an upper bound on $\frac{1}{2M_{\tau}}+\mu_{\tau}$ for any $M<M_{\tau}$ :

[TABLE]

We also have that $M_{\tau}\to\infty$ as $n\to\infty$ : by definition of $M_{\tau}$ ,

[TABLE]

so if $\liminf_{n\to\infty}M_{\tau}=C<\infty$ , then there would exist a subsequence such that $\frac{1}{C+1}\leq 2d_{C+1}$ for all $n$ sufficiently large. But this is not possible, since for any fixed $C\in\mathbb{N}$ , $d_{C}\to 0$ as $T_{n}$ concentrates around $\tau$ . Therefore $M_{\tau}\to\infty$ , so that for any $M\in\mathbb{N}$ ,

[TABLE]

and hence

[TABLE]

Theorem 6 implies that both $\mathbb{E}\left[\|\widehat{G}_{\tau}-G_{\tau}\|^{2}_{\mathcal{A}}\right]\to 0$ and $\mathbb{E}\left[\|\widehat{G}_{\tau}-\mathbb{P}\left(X_{\tau}\in\cdot\,|\,X_{T_{n}}\right)\|^{2}_{\mathcal{A}}\right]\to 0$ as $n\to\infty$ . By Markov’s inequality,

[TABLE]

Finally, note that Eq. 122 implies that any subsequence likewise satisfies $\|\cdot\|_{\mathcal{A}}\overset{p}{\to}0$ , and hence any subsequence has a further subsequence such that $\|\cdot\|_{\mathcal{A}}\overset{a.s.}{\to}0$ . Since $\mathcal{A}$ was arbitrary, Lemma 16 asserts that we can choose $\mathcal{A}$ such that $\|\cdot\|_{\mathcal{A}}\to 0$ implies weak convergence, i.e., $d_{\mathrm{P}}(\cdot,\cdot)\to 0$ . Thus any subsequence has a further subsequence that satisfies $d_{\mathrm{P}}(\cdot,\cdot)\overset{a.s.}{\to}0$ . Hence $d_{\mathrm{P}}(\cdot,\cdot)\overset{p}{\to}0$ by Durrett (2010, Theorem 2.3.2). ∎

Proof of Corollary 8.

By Markov’s inequality,

[TABLE]

and by Theorem 5,

[TABLE]

∎

Proof of Theorem 9.

First, note that by assumption, the space $(d,\mathbb{R})$ is infinitely separable. By local exchangeability and Theorem 5, for any $t,\Delta\in\mathbb{R}$ , finite subset $T\subset\mathbb{R}$ , and $A\in\Sigma$ ,

[TABLE]

where $T+\Delta$ denotes the translation of all covariates in $T$ by $\Delta$ . The Kolmogorov continuity theorem (Kallenberg, 2002, Theorem 3.23) implies that for all $\alpha\in(0,\gamma)$ , $(G_{t}(A))_{t\in\mathbb{R}}$ has an $\alpha$ -Hölder continuous modification. Note that an $\alpha$ -Hölder continuous function for $\alpha>1$ is constant.

First, assume $\gamma>1$ . If we select $\alpha\in(1,\gamma)$ , we have that for any $A\in\Sigma$ , $(G_{t}(A))_{t\in\mathbb{R}}$ has a constant modification. In other words, for all $t,t^{\prime}\in\mathbb{R}$ , $A\in\Sigma$ , $\mathbb{P}\left(G_{t}(A)=G_{t^{\prime}}(A)\right)=1$ . Since $\Sigma=\sigma(\mathcal{A})$ for a countable algebra $\mathcal{A}$ (Preston, 2008, Prop. 3.1, 3.3), we have that $\mathbb{P}\left(\forall A\in\mathcal{A},\,\,G_{t}(A)=G_{t^{\prime}}(A)\right)=1$ , and hence by Carathéodory’s extension theorem (Kallenberg, 2002, Theorem 2.5), $G_{t}$ and $G_{t^{\prime}}$ are almost surely equal probability measures. This implies that $G$ is a constant process (up to modification) and $X$ is exchangeable.

Next, suppose $\gamma\in(0,1]$ . Then by Eq. 125,

[TABLE]

showing that $X$ is stationary. Next, since $X$ is stationary, for any $t,t^{\prime}\in\mathbb{R}$ and $A\in\Sigma$ , the mean of $G_{t}(A)$ satisfies

[TABLE]

Similarly, the autocovariance satisfies

[TABLE]

Hence $(G_{t}(A))_{t\in\mathbb{R}}$ is weak-sense stationary.

Finally, consider the process $X_{t}=\mathds{1}(t\geq U)$ for $U\sim{\sf{Unif}}[0,1]$ , which is locally exchangeable with $d(t,t^{\prime})=\min(|t-t^{\prime}|,1)$ and hence $\gamma=0$ . The underlying random measure process is specified by $G_{t}=\mathds{1}(t<U)\delta_{\{0\}}+\mathds{1}(t\geq U)\delta_{\{1\}}$ where $\delta_{x}$ is the Dirac measure at $x$ ; this has no sample-continuous modification. ∎
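The counterexample can be checked by hand. The following is a minimal sketch (the function names are illustrative and not from the paper): it evaluates one sample path of $X_{t}=\mathds{1}(t\geq U)$ for a fixed draw of $U$ , and computes the exact probability that the observations at two covariates differ, which is the length of the part of the interval between them that lies in $[0,1]$ .

```python
def step_path(u, ts):
    """Evaluate X_t = 1(t >= u) at each covariate in ts for a fixed draw u of U."""
    return [1 if t >= u else 0 for t in ts]


def disagreement_prob(t, s):
    """P(X_t != X_s) for U ~ Unif[0, 1]: length of (min, max] intersected with [0, 1]."""
    lo, hi = min(t, s), max(t, s)
    lo_clipped = min(max(lo, 0.0), 1.0)
    hi_clipped = min(max(hi, 0.0), 1.0)
    return hi_clipped - lo_clipped


print(step_path(0.5, [0.2, 0.5, 0.9]))
print(disagreement_prob(0.2, 0.7))
```

The disagreement probability never exceeds $\min(|t-t^{\prime}|,1)$ , while every sample path has a single jump at $U$ .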

Proof of Theorem 10.

Let $\mathfrak{T}$ be the countable subset provided by infinite separability in Definition 4. Let $(t_{n})_{n=1}^{\infty}$ be any ordering of $\mathfrak{T}\setminus T$ , and $Y_{N}=\left(X_{t_{N}},X_{t_{N+1}},\dots\right)$ . Reverse martingale convergence implies that

[TABLE]

where $\mathcal{F}$ is the tail $\sigma$ -algebra of $\{X_{t_{i}}\}_{i=1}^{\infty}$ . Defining $g(X_{T})=\frac{1}{|\mathcal{G}|}\sum_{\pi\in\mathcal{G}}h(X_{\pi,T})$ , we have that $g(X_{T})$ is invariant to $\mathcal{G}$ and thus $g(X_{T})$ is $\sigma(\widetilde{G},Y_{N})$ -measurable. Therefore

[TABLE]

By Lemma 12(1) and Proposition 2,

[TABLE]

Taking the limit as $N\to\infty$ , moving it into the expectation in Eq. 134 via dominated convergence, and using the limit from Eq. 133 yields

[TABLE]

Identical reasoning to the above also shows that

[TABLE]

Finally, we add and subtract $g(X_{T})$ in the left-hand side of Eq. 32, apply the triangle inequality with the above bounds, and note that the sum over $\pi$ is the expectation over a uniformly random permutation to obtain the result. ∎

Proof of Proposition 11.

We rewrite the probability as an expectation,

[TABLE]

By local exchangeability, we can remap under any bijection $\pi^{\prime}:T\to T$ , so that

[TABLE]

Finally, note that the outer indicator function tests whether $S(X_{\pi^{\prime}T})$ is strictly greater than $(1-\alpha)|\mathcal{G}|$ of the statistics across all $\pi\in\mathcal{G}$ . At most $\alpha|\mathcal{G}|$ such indicator functions can equal one, so

[TABLE]

Rearranging the bound yields the result. ∎
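The counting step in this proof can be illustrated with a small permutation test. This is a sketch under stated assumptions, not the paper's procedure: the decision rule ("reject when the observed statistic strictly exceeds at least $(1-\alpha)|\mathcal{G}|$ of the permuted statistics"), the statistic `mean_difference`, and the choice of the full symmetric group on four indices are all illustrative.

```python
from itertools import permutations


def permutation_test(x, statistic, perms, alpha):
    """Reject (return True) when statistic(x) strictly exceeds at least
    (1 - alpha) * |G| of the statistics of the permuted data."""
    perms = list(perms)
    observed = statistic(x)
    n_below = sum(observed > statistic([x[i] for i in p]) for p in perms)
    return n_below >= (1 - alpha) * len(perms)


def mean_difference(x):
    """Illustrative statistic: mean of the second half minus mean of the first half."""
    half = len(x) // 2
    return sum(x[half:]) / (len(x) - half) - sum(x[:half]) / half


G = list(permutations(range(4)))   # assumed group: all 24 permutations of 4 indices
shifted = [0.0, 0.1, 5.0, 5.2]     # second half clearly larger than the first
print(permutation_test(shifted, mean_difference, G, alpha=0.2))

# Number of relabelled datasets on which the test rejects; at most alpha * |G|.
n_reject = sum(
    permutation_test([shifted[i] for i in p], mean_difference, G, alpha=0.2) for p in G
)
print(n_reject, 0.2 * len(G))
```

Whatever the data, only a fraction of at most $\alpha$ of the relabellings in $\mathcal{G}$ can lead to rejection, which is the fact the proof uses.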

Appendix B Technical lemmata

Lemma 12.

Let $X,Y$ be bounded random variables in $[a,b]$ for some $a,b\in\mathbb{R}$ , $a\leq b$ , and $U,V$ be random elements in some probability space.

1. If $\|(X,U)-(Y,U)\|_{\mathrm{TV}}\leq\epsilon$ , then

[TABLE]

2. If $\|(X,U)-(X,V)\|_{\mathrm{TV}}\leq\epsilon$ , then for any 1-Lipschitz function $h:\mathbb{R}\to\mathbb{R}$ ,

[TABLE]

Proof.

Denoting $Q:=\mathds{1}\left[\mathbb{E}\left[X\,|\,U\right]>\mathbb{E}\left[Y\,|\,U\right]\right]$ ,

[TABLE]

Using the fact that $Q$ is measurable with respect to $U$ and the tower property yields

[TABLE]

Since the difference is between the expectation of a function bounded in $[0,1]$ evaluated at $(X,U)$ and at $(Y,U)$ , the assumed total variation bound provides the result.

For the second statement, first note that $\sup_{x,y\in[a,b]}\left|h(x)-h(y)\right|\leq b-a$ by 1-Lipschitz continuity. Then defining $A(U):=\mathbb{E}\left[X\,|\,U\right]$ and $B(V):=\mathbb{E}\left[X\,|\,V\right]$ , the triangle inequality yields

[TABLE]

The right hand term is bounded by $(b-a)\epsilon$ by the assumed total variation bound and 1-Lipschitz continuity. Defining $Q(u)=\mathds{1}\left[A(u)\geq B(u)\right]$ ,

[TABLE]

The first term in the expression can be bounded by $(b-a)\epsilon$ via substitution of the conditional expectation formulae for $A,B$ , using the tower property, and controlling the difference in expectations with the assumed total variation bound. The second term is again a difference in expectation of a bounded function under $U$ and $V$ with the same bound $(b-a)\epsilon$ . ∎

Lemma 13.

For any two sequences of real numbers $(a_{i})_{i=1}^{\infty}$ , $(b_{i})_{i=1}^{\infty}$ ,

[TABLE]

Proof.

The proof follows by adding and subtracting $b_{1}\prod_{i=2}^{\infty}a_{i}$ , then $b_{1}b_{2}\prod_{i=3}^{\infty}a_{i}$ , etc., and then using the triangle inequality. ∎
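For a finite product, the add-and-subtract step can be written out and checked numerically. The sketch below is illustrative (the helper name and the example sequences are assumptions): it forms the terms $(b_{1}\cdots b_{i-1})(a_{i}-b_{i})(a_{i+1}\cdots a_{n})$ , whose sum telescopes to $\prod_{i}a_{i}-\prod_{i}b_{i}$ , and then applies the triangle inequality.

```python
from math import prod


def telescoping_terms(a, b):
    """Terms (b_1...b_{i-1}) * (a_i - b_i) * (a_{i+1}...a_n); they sum to prod(a) - prod(b)."""
    n = len(a)
    return [prod(b[:i]) * (a[i] - b[i]) * prod(a[i + 1:]) for i in range(n)]


a = [0.9, 0.5, 0.7, 0.2]
b = [0.8, 0.6, 0.4, 0.3]
terms = telescoping_terms(a, b)
lhs = abs(prod(a) - prod(b))          # |prod a_i - prod b_i|
rhs = sum(abs(t) for t in terms)      # triangle-inequality bound
print(lhs, rhs)
```

When all entries lie in $[0,1]$ , as for the stick variables in Appendix C.1, each term is bounded by $|a_{i}-b_{i}|$ .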

Lemma 14 (Reiss, 1981).

For any two finite product probability measures $\mu=\mu_{1}\times\dots\times\mu_{N}$ and $\nu=\nu_{1}\times\dots\times\nu_{N}$ ,

[TABLE]
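A standard bound of this kind is that total variation is subadditive over product measures, i.e., the distance between the products is at most the sum of the distances between the factors. Whether this matches the lemma's exact form and constant is an assumption here; the sketch below only checks the subadditive bound on a small discrete example with illustrative marginals.

```python
from itertools import product
from math import prod


def tv(p, q):
    """Total variation distance between two pmfs on the same finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def product_pmf(factors):
    """Flattened pmf of the product measure of the given marginal pmfs."""
    return [prod(ps) for ps in product(*factors)]


mus = [[0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]   # illustrative marginals mu_1, mu_2, mu_3
nus = [[0.4, 0.6], [0.3, 0.7], [0.6, 0.4]]   # illustrative marginals nu_1, nu_2, nu_3
joint_tv = tv(product_pmf(mus), product_pmf(nus))
sum_tv = sum(tv(m, n) for m, n in zip(mus, nus))
print(joint_tv, sum_tv)
```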

Lemma 15.

For any two real-valued random variables $U,V$ and constants $a,b\in\mathbb{R}$ ,

[TABLE]

Proof.

[TABLE]

∎

Lemma 16.

Let $(\mathcal{X},\Sigma)$ be a standard Borel space. There exists a countable collection of measurable subsets $(A_{i})_{i=1}^{\infty}$ , $A_{i}\subseteq\mathcal{X}$ such that for all $\mathcal{A}=\{c_{i},A_{i}\}_{i=1}^{\infty}$ , $c_{i}>0$ , $\sum_{i}c_{i}=1$ , and probability measures $\mu,(\mu_{n})_{n=1}^{\infty}$ ,

[TABLE]

and for all $\mu$ such that each $A_{i}$ is a continuity set of $\mu$ ,

[TABLE]

Proof.

Since $(\mathcal{X},\Sigma)$ is a standard Borel space, we know that $\Sigma$ is generated by a topology with a countable base $(B_{i})_{i=1}^{\infty}$ . Any open set $U\subseteq\mathcal{X}$ can be expressed as a countable union of these sets. Consider the collection of all possible unions $\mathcal{B}_{n}$ of $\{B_{1},\dots,B_{n}\}$ , and construct a countable sequence of sets $(A_{i})_{i=1}^{\infty}$ by ordering $\mathcal{B}_{1}$ , then $\mathcal{B}_{2}$ , and so on. Then for any open set $U\subseteq\mathcal{X}$ , there exists a subsequence $(U_{k})_{k=1}^{\infty}$ of $(A_{i})_{i=1}^{\infty}$ such that $U_{k}\uparrow U$ .

Assume $\|\mu_{n}-\mu\|_{\mathcal{A}}\to 0$ ; then for any open set $U$ and $k\in\mathbb{N}$ ,

[TABLE]

But $\|\mu_{n}-\mu\|_{\mathcal{A}}\to 0$ if and only if $\forall i\in\mathbb{N}$ , $\mu_{n}(A_{i})\to\mu(A_{i})$ . Hence

[TABLE]

Since this holds for all $k\in\mathbb{N}$ and $U_{k}\uparrow U$ , by the continuity of measures,

[TABLE]

Hence $\mu_{n}\overset{d}{\to}\mu$ . If each $A_{i}$ is a continuity set of $\mu$ , then $\mu_{n}\overset{d}{\to}\mu$ implies that $|\mu_{n}(A_{i})-\mu(A_{i})|\to 0$ for each $i$ , which then implies $\|\mu_{n}-\mu\|_{\mathcal{A}}\to 0$ . ∎

Appendix C Additional Examples

In this section, we show that many popular covariate-dependent models from Bayesian nonparametrics exhibit local exchangeability.

C.1 Dependent Dirichlet process mixtures

In a typical mixture model setting, we have observations generated via

[TABLE]

where $(w_{k})_{k=1}^{\infty}$ are the mixture weights satisfying $w_{k}\geq 0$ , $\sum_{k}w_{k}=1$ ; $(\theta_{k})_{k=1}^{\infty}$ are the component parameters; $F(\cdot;\theta)$ is the mixture component likelihood; and $(X_{n})_{n=1}^{\infty}$ are the observations. A popular nonparametric prior for the weights and component parameters is the Dirichlet process (Ferguson, 1973), defined by (Sethuraman, 1994)

[TABLE]

for some distribution $H$ . When the observations come with additional covariate information, the dependent Dirichlet process mixture model (MacEachern, 1999, 2000) may be used to capture similarities between related mixture population data. Here, observations are generated via

[TABLE]

where the component parameters $\theta_{x,k}$ and stick variables $v_{x,k}$ are now i.i.d. stochastic processes on $\mathbb{R}$ , and $w_{x,k}=v_{x,k}\prod_{i=1}^{k-1}(1-v_{x,i})$ . The marginal distributions of $\theta_{x,k}$ and $v_{x,k}$ at $x\in\mathbb{R}$ are $H$ and ${\sf{Beta}}(1,\alpha)$ , respectively. Thus, the dependent Dirichlet process is marginally a Dirichlet process for each covariate value, but can exhibit a wide range of dependencies across covariates. In this setting, we have $\mathcal{T}=\mathbb{R}\times\mathbb{N}$ and strong canonical premetric

[TABLE]

where $t=(x,n)$ and $t^{\prime}=(x^{\prime},n^{\prime})$ . We add and subtract $\sum_{k=1}^{\infty}w_{x^{\prime},k}F(\cdot;\theta_{x,k})$ and apply the triangle inequality to find that

[TABLE]

Since $w_{x,k}$ is a product of independent random variables, Lemma 13 yields

[TABLE]

The infinite sum converges to some $0<C<\infty$ , and so

[TABLE]

Therefore, if the stochastic processes for the parameters and stick variables are both smooth enough such that

[TABLE]

for some premetric $\tilde{d}:\mathcal{T}\times\mathcal{T}\to\mathbb{R}_{+}$ , then $X$ is locally exchangeable with respect to $\min(1,\tilde{d})$ . Many dependent processes (e.g., Foti and Williamson, 2015) similar to the dependent Dirichlet process (and kernel beta process below) can be shown to exhibit local exchangeability using similar techniques.
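The stick-breaking map $w_{x,k}=v_{x,k}\prod_{i=1}^{k-1}(1-v_{x,i})$ used above can be sketched at a single fixed covariate value $x$ as follows. This is a minimal illustration, not the paper's code: the truncation at 50 sticks, the seed, and the concentration value are assumptions, and the dependence of the sticks across covariates is not modelled.

```python
import random


def stick_breaking_weights(sticks):
    """Map stick variables v_1..v_K to weights w_k = v_k * prod_{i<k} (1 - v_i)."""
    weights, remaining = [], 1.0
    for v in sticks:
        weights.append(v * remaining)   # break off a fraction v of what is left
        remaining *= 1.0 - v            # length of stick still unbroken
    return weights


rng = random.Random(0)
alpha = 2.0  # illustrative concentration parameter
sticks = [rng.betavariate(1.0, alpha) for _ in range(50)]  # v_k ~ Beta(1, alpha)
weights = stick_breaking_weights(sticks)
print(sum(weights))
```

The truncated weights are nonnegative and sum to one minus the remaining stick length, so the sum approaches 1 as more sticks are added.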

C.2 Kernel beta processes

Another example of a model exhibiting local exchangeability from the Bayesian nonparametrics literature is the kernel beta process latent feature model (Ren et al., 2011). In a typical nonparametric latent feature modelling setting, we have observations generated via

[TABLE]

where $(w_{k})_{k=1}^{\infty}$ are the feature frequencies satisfying $w_{k}\in[0,1]$ , $\sum_{k=1}^{\infty}w_{k}<\infty$ ; $(\theta_{k})_{k=1}^{\infty}$ are the feature parameters; $\mathrm{BeP}$ is the Bernoulli process that sets $Z_{n}(\{\theta_{k}\})=1$ with probability $w_{k}$ and $0$ otherwise, independently across $k\in\mathbb{N}$ ; and $F$ is the likelihood for each observation. A popular nonparametric prior for the weights and feature parameters is the beta process (Hjort, 1990), defined by

[TABLE]

where $\mathrm{PP}$ is a Poisson point process parametrized by its mean measure, $c$ is some positive function, $H$ is a probability distribution, and $\gamma>0$ . When the observations come with covariate information, the kernel beta process (Ren et al., 2011) may be used to capture similarities in the latent features of related populations. In particular, we replace $Z_{n}$ with

[TABLE]

where $\kappa(x,x_{k};\psi_{k})$ is a kernel function with range in $[0,1]$ centered at $x_{k}$ with parameters $\psi_{k}$ , and

[TABLE]

where $Q$ and $R$ are probability distributions. In other words, the kernel beta process endows each atom with i.i.d. covariates $x_{k}$ and parameters $\psi_{k}$ , and makes the likelihood that an observation with covariate $x$ selects a feature with covariate $x_{k}$ depend on both $x$ and $x_{k}$ . Taking $\mathbb{R}$ to be the space of covariates for simplicity, again we have $\mathcal{T}=\mathbb{R}\times\mathbb{N}$ and (marginalizing $Z_{x,n}$ ) strong canonical premetric

[TABLE]

where $t=(x,n)$ and $t^{\prime}=(x^{\prime},n^{\prime})$ . Suppose $F$ is $\gamma$ -Hölder continuous in total variation for $0<\gamma\leq 1$ , $C\geq 0$ in the sense that

[TABLE]

for any collection of points $\{\theta_{k}\}_{k=1}^{\infty}$ , where $Z(\{\theta_{k}\})=1$ , $Z^{\prime}(\{\theta_{k}\})=1$ independently with probability $p_{k}$ and $p^{\prime}_{k}$ , respectively, and both assign 0 mass to all other sets. Then

[TABLE]

Finally, if the kernel $\kappa$ is $\alpha$ -Hölder continuous with constant $C^{\prime}(\psi)$ depending on $\psi$ , the independence of $\theta_{k}$ , $w_{k}$ , and $\psi_{k}$ may be used to show that

[TABLE]

Therefore the observations are locally exchangeable with $d(t,t^{\prime})=\min\left(1,C^{\prime\prime}|x-x^{\prime}|^{\alpha\gamma}\right)$ , where $C^{\prime\prime}$ collects the product of constants from the previous expression.

C.3 Dynamic topic model

The dynamic topic model (Blei and Lafferty, 2006; Wang, Blei and Heckerman, 2008) is a model for text data that extends latent Dirichlet allocation (Blei, Ng and Jordan, 2003) to incorporate timestamp covariate information. In a continuous version of the model, observations are generated via

[TABLE]

where $x\in\mathbb{R}$ represents timestamps, $\alpha_{x}\in\mathbb{R}^{K}$ is a vector of $K$ independent Wiener processes representing the popularity of $K$ topics at time $x$ , $\beta_{x,k}\in\mathbb{R}^{V}$ is a vector of $V$ independent Wiener processes representing the word frequencies for a vocabulary of size $V$ in topic $k$ , $\pi_{J}$ is any $L$ -Lipschitz mapping from $\mathbb{R}^{J}$ to the probability simplex $\pi_{J}:\mathbb{R}^{J}\to\Delta^{J-1}$ for any $J\in\mathbb{N}$ , $\mu$ is the mean number of words per document, $D_{n,x}\in\mathbb{N}^{V}$ is the vector of counts of each vocabulary word in the $n^{\text{th}}$ document observed at time $x$ , and $W$ is the number of words in each document, taken to be the same across documents for simplicity. Here the covariate space is $\mathcal{T}=\mathbb{R}\times\mathbb{N}$ , and the observations are count vectors in $\mathbb{N}^{V}$ . In this setting, the strong canonical premetric is

[TABLE]

where $t=(x,n)$ and $t^{\prime}=(x^{\prime},n^{\prime})$ . But since multinomial variables are a function (in particular, a sum) of independent categorical random variables, Lemma 14 yields the bound

[TABLE]

We evaluate the total variation between two categorical distributions and apply the triangle inequality to find that

[TABLE]

Since $\sum_{v=1}^{V}\pi_{V}(\beta_{x,k})_{v}=\sum_{k=1}^{K}\theta_{x^{\prime},k}=1$ , the components of $\theta_{x,k}$ and $\beta_{x,k}$ are i.i.d. across $k$ , and $\pi_{V}$ is $L$ -Lipschitz,

[TABLE]

where the last line follows by Jensen’s inequality. Therefore the observations are locally exchangeable with $d(t,t^{\prime})=\min\left(1,\frac{1}{2}\mu L\left(K+V\right)\sqrt{|x-x^{\prime}|}\right)$ .
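The $\sqrt{|x-x^{\prime}|}$ rate comes from the increments of the Wiener processes. As a hedged illustration, assuming standard (unit-variance) Wiener processes and that Jensen's inequality is applied to the mean absolute increment, the sketch below compares the exact value $\mathbb{E}|B_{x}-B_{x^{\prime}}|=\sqrt{2|x-x^{\prime}|/\pi}$ with the Jensen bound $\sqrt{\mathbb{E}(B_{x}-B_{x^{\prime}})^{2}}=\sqrt{|x-x^{\prime}|}$ . The function names are illustrative.

```python
from math import pi, sqrt


def mean_abs_increment(x, x_prime):
    """E|B_x - B_x'| for a standard Wiener process; the increment is N(0, |x - x'|)."""
    return sqrt(2.0 * abs(x - x_prime) / pi)


def jensen_bound(x, x_prime):
    """Jensen's bound sqrt(E[(B_x - B_x')^2]) = sqrt(|x - x'|)."""
    return sqrt(abs(x - x_prime))


for lag in (0.01, 0.25, 1.0, 4.0):
    print(lag, mean_abs_increment(0.0, lag), jensen_bound(0.0, lag))
```

The bound is loose only by the constant factor $\sqrt{2/\pi}$ , so the square-root dependence on $|x-x^{\prime}|$ in the premetric is not an artifact of the inequality.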

Bibliography

Aldous, D. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis 11, 581–598.
Aldous, D. (1985). Exchangeability and related topics. École d’été de probabilités de Saint-Flour, XIII. Springer, Berlin.
Austin, T. and Panchenko, D. (2014). A hierarchical version of the de Finetti and Aldous–Hoover representations. Probability Theory and Related Fields 159, 809–823.
Baiocchi, M., Small, D., Lorch, S. and Rosenbaum, P. (2010). Building a stronger instrument in an observational study of perinatal care for premature infants. Journal of the American Statistical Association 105, 1285–1296.
Berti, P., Pratelli, L. and Rigo, P. (2004). Limit theorems for a class of identically distributed random variables. The Annals of Probability 32, 2029–2052.
Bhatia, R. and Davis, C. (2000). A better bound on the variance. The American Mathematical Monthly 107, 353–357.
Blei, D. and Lafferty, J. (2006). Dynamic topic models. In International Conference on Machine Learning.
Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
