All Adapted Topologies are Equal

Julio Backhoff-Veraguas; Daniel Bartl; Mathias Beiglböck; Manu Eder

arXiv:1905.00368·math.PR·September 30, 2020

TL;DR

This paper demonstrates that various topologies on the set of stochastic process laws, developed for different purposes, are actually equivalent in finite discrete time, unifying their theoretical framework.

Contribution

It proves that all these different adapted topologies coincide in finite discrete time, providing a unified understanding of their structure and properties.

Findings

1. All adapted topologies are equivalent in finite discrete time.

2. The weak adapted topology is characterized by continuity of optimal stopping problems.

3. Different approaches to defining topologies on stochastic laws unify under this framework.

Abstract

A number of researchers have introduced topological structures on the set of laws of stochastic processes. A unifying goal of these authors is to strengthen the usual weak topology in order to adequately capture the temporal structure of stochastic processes. Aldous defines an extended weak topology based on the weak convergence of prediction processes. In the economic literature, Hellwig introduced the information topology to study the stability of equilibrium problems. Bion-Nadal and Talay introduce a version of the Wasserstein distance between the laws of diffusion processes. Pflug and Pichler consider the nested distance (and the weak nested topology) to obtain continuity of stochastic multistage programming problems. These distances can be seen as a symmetrization of Lassalle's causal transport problem, but there are also further natural ways to derive a topology from causal…

Equations

$$\mathcal{W}_{p}(\mu,\nu):=\inf\big\{\mathbb{E}^{\pi}[\rho(X,Y)^{p}]^{1/p}:\pi\in\operatorname{Cpl}(\mu,\nu)\big\}.$$

$$\big(T_{1}(X_{1},\ldots,X_{N}),\ldots,T_{N}(X_{1},\ldots,X_{N})\big)\sim\nu,$$

$$\pi\big((Y_{1},\dots,Y_{t})\in A\mid X\big)$$

$$\mathcal{CW}_{p}(\mu,\nu)$$

$$\mathcal{SCW}_{p}(\mu,\nu)$$

$$\mathcal{AW}_{p}(\mu,\nu)$$

$$\mathcal{I}_{t}$$

$$\mathcal{I}_{t}(\mu)$$

$$k^{t}(x_{1},\dots,x_{N})$$

$$X_{1},\dots,X_{t},\ \mathscr{L}^{\mu}(X_{t+1},\dots,X_{N}\mid X_{1},\dots,X_{t})$$

$$Z_{0}^{\mu}:=\mathscr{L}(X)=\mu,\quad Z_{1}^{\mu}:=\mathscr{L}^{\mu}(X\mid X_{1}),\ \dots,\ Z_{N}^{\mu}:=\mathscr{L}^{\mu}(X\mid X_{1},\dots,X_{N}).$$

$$\mathcal{E}:\mathscr{P}(\Omega)\to\mathscr{P}\big(\Omega\times\mathscr{P}(\Omega)^{N+1}\big)$$

$$(X,Z^{\mu})=\big(X_{1},\dots,X_{N},\ \mu,\ \mathscr{L}^{\mu}(X\mid X_{1}),\ \mathscr{L}^{\mu}(X\mid X_{1},X_{2}),\ \dots,\ \mathscr{L}^{\mu}(X\mid X_{1},\dots,X_{N})\big)$$

$$\mu\mapsto\int f\,\mathrm{d}\mu$$

$$v^{L}(\mu):=\inf\big\{\mathbb{E}^{\mu}[L_{\tau}]:\tau\leq N\text{ is a stopping time}\big\}.$$

$$\mu\mapsto v^{L}(\mu)$$

$$\rho_{\mathscr{P}_{p}(\mathcal{X}^{t}\times\mathscr{P}_{p}(\mathcal{X}^{N-t}))}(\mu,\nu)$$

$$\phantom{:=}\quad+\mathcal{W}_{p}(\hat{\mu},\hat{\nu})^{p}\,\mathrm{d}\gamma\big((x_{i})_{i\leq t},\hat{\mu},(y_{i})_{i\leq t},\hat{\nu}\big)\Big)^{1/p}.$$

$$V_{t}^{p}(x_{1},\dots,x_{t},y_{1},\dots,y_{t}):=\inf_{\gamma^{t+1}\in\operatorname{Cpl}(\mu_{x_{1},\dots,x_{t}},\nu_{y_{1},\dots,y_{t}})}\iint\Big(V_{t+1}^{p}(x_{1},\dots,x_{t+1},y_{1},\dots,y_{t+1})+\rho(x_{t+1},y_{t+1})^{p}\Big)\,\mathrm{d}\gamma^{t+1}(x_{t+1},y_{t+1}).$$

$$\mathcal{ND}_{p}(\mu,\nu)^{p}=\inf_{\gamma^{1}\in\operatorname{Cpl}((\operatorname{proj}_{1})_{\#}(\mu),(\operatorname{proj}_{1})_{\#}(\nu))}\iint\big(V_{1}^{p}(x_{1},y_{1})+\rho(x_{1},y_{1})^{p}\big)\,\mathrm{d}\gamma^{1}(x_{1},y_{1}).$$

$$\mathcal{X}_{N:N},\quad\mathcal{X}_{N-1:N},\quad\ldots,\quad\mathcal{X}_{1:N}$$

$$\mathcal{N}(\mu):=\mathscr{L}\Big(X_{1},\mathscr{L}\big(X_{2},\dots\mathscr{L}\big(X_{N-1},\mathscr{L}(X_{N}\mid\bar{X}_{1}^{N-1})\,\big|\,\bar{X}_{1}^{N-2}\big)\dots\,\big|\,X_{1}\big)\Big)$$

$$\Omega=\mathcal{X}^{N}=\underbrace{\mathcal{X}^{t}}_{=:X_{1}^{(t)}}\times\underbrace{\mathcal{X}^{N-t}}_{=:X_{2}^{(t)}}=X_{1}^{(t)}\times X_{2}^{(t)}.$$

$$\mathcal{IW}_{p}(\mu,\nu):=\sum_{t=1}^{N}\mathcal{AW}_{p}^{(t)}(\mu,\nu),\quad p\geq 1,$$

$$\rho_{\mathcal{X}\times\mathcal{Y}}\big((x_{1},y_{1}),(x_{2},y_{2})\big):=\big(\rho_{\mathcal{X}}(x_{1},x_{2})^{p}+\rho_{\mathcal{Y}}(y_{1},y_{2})^{p}\big)^{1/p}.$$

$$\rho_{\mathscr{P}(\mathcal{X})}(\mu,\nu):=\inf_{\gamma\in\operatorname{Cpl}(\mu,\nu)}\Big(\int\rho(x_{1},x_{2})^{p}\,\mathrm{d}\gamma(x_{1},x_{2})\Big)^{1/p}.$$

$$\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathcal{B}\right):=\Big\{\mu\in\mathscr{P}(\mathcal{A}\times\mathcal{B})\,\Big|\,\exists f:\mathcal{A}\rightarrow\mathcal{B}\text{ measurable s.t. }\mu(\mathrm{graph}(f))=1\Big\}.$$

$$\mathcal{X}\overset{f}{\longrightarrow}\mathcal{Y}$$

$$\mathscr{P}(\mathcal{X})\overset{\mathscr{P}(f)}{\longrightarrow}\mathscr{P}(\mathcal{Y})$$

$$\mu_{\restriction(X_{i_{j}})_{j}}=\mathscr{P}\big((X_{i_{j}})_{j}\big)(\mu).$$

Full text

All adapted topologies are equal

Julio Backhoff-Veraguas, Daniel Bartl, Mathias Beiglböck and Manu Eder

Department of Mathematics, University of Vienna, Austria

Abstract.

A number of researchers have introduced topological structures on the set of laws of stochastic processes. A unifying goal of these authors is to strengthen the usual weak topology in order to adequately capture the temporal structure of stochastic processes.

Aldous defines an extended weak topology based on the weak convergence of prediction processes. In the economic literature, Hellwig introduced the information topology to study the stability of equilibrium problems. Bion-Nadal and Talay introduce a version of the Wasserstein distance between the laws of diffusion processes. Pflug and Pichler consider the nested distance (and the weak nested topology) to obtain continuity of stochastic multistage programming problems. These distances can be seen as a symmetrization of Lassalle’s causal transport problem, but there are also further natural ways to derive a topology from causal transport.

Our main result is that all of these seemingly independent approaches define the same topology in finite discrete time. Moreover we show that this ‘weak adapted topology’ is characterized as the coarsest topology that guarantees continuity of optimal stopping problems for continuous bounded reward functions.

J. Backhoff-Veraguas gratefully acknowledges financial support from the Austrian Science Fund (FWF) under grant P30750. D. Bartl has been funded by the Vienna Science and Technology Fund (WWTF) through projects VRG17-005 and MA16-021, as well as by the Austrian Science Fund (FWF) through project P28661. M. Beiglboeck gratefully acknowledges financial support by the FWF through grant Y782. M. Eder gratefully acknowledges financial support by the FWF through grant Y782 and by the WWTF through project MA16-021.

Keywords: Aldous’ extended weak topology, Hellwig’s information topology, nested distance, causal optimal transport, stability of optimal stopping, Vershik’s iterated Kantorovich distance

1. Introduction

1.1. Outline

If some type of natural phenomenon is modelled through a stochastic process, one might expect that the model does not describe reality in an entirely accurate way. To be able to study the impact of such inaccuracies on the problems one is trying to solve, it makes sense to equip the set of laws of stochastic processes with a suitable notion of distance or topology.

Denoting by $\Omega:=\mathcal{X}^{N}$ the path space (where $\mathcal{X}$ is some Polish space and $N\in\mathbb{N}$), the set of laws of stochastic processes is $\mathscr{P}\!\left(\Omega\right)$, i.e. the set of probability measures on $\Omega$.

Clearly, $\mathscr{P}\!\left(\Omega\right)$ carries the usual weak topology. However, this topology does not respect the time evolution of stochastic processes, which has a number of potentially inconvenient consequences: e.g., problems of optimal stopping / utility maximization / stochastic programming are not continuous, arbitrary processes can be approximated by processes which are deterministic after the first period, etc. In the following we describe a number of approaches which have been developed by different authors to deal with these (and related) problems. Our main result (Theorem 1.2) is that all of these approaches actually define the same topology in the present discrete time setup. Moreover, this topology is the weakest topology which allows for continuity of optimal stopping problems.

1.2. Adapted Wasserstein distances, nested distance

A number of authors have independently introduced variants of the Wasserstein distance which take the temporal structure of processes into account: the definition of ‘iterated Kantorovich distance’ by Vershik [60, 61] might be seen as a first construction in this direction. The topic is also considered by Rüschendorf [58]. Independently, Pflug and Pflug–Pichler [52, 56, 53, 54, 55, 30] introduce the nested distance and describe the concept’s rich potential for the approximation of stochastic multi-period optimization problems. Lassalle [46] considers the ‘causal transport problem’ that leads to a corresponding notion of distance. Once again independently of these developments, Bion-Nadal and Talay [16] define an adapted version of the Wasserstein distance between laws of solutions to SDEs. Gigli [28, Chapter 4] introduces a similar distance for measures whose first marginal agrees, see also [4, Section 12.4].

To set the stage for describing these ‘adapted’ variants, let us fix $p\geq 1$ and recall the definition of the usual $p$-Wasserstein distance.

$(\mathcal{X},\rho_{\mathcal{X}})$ is now a Polish metric space. On $\Omega=\mathcal{X}^{N}$ we use the Polish metric $\rho_{\Omega}((x_{t})_{t},(y_{t})_{t}):=(\sum_{t}\rho_{\mathcal{X}}(x_{t},y_{t})^{p})^{1/p}$ . Typically, when clear from the context we will omit the subscript for the metric. We use $(X_{t})_{t}$ to denote the canonical process on $\Omega$ , i.e. $X_{t}$ is the projection onto the $t$ -th factor of $\Omega=\mathcal{X}^{N}$ . On $\Omega\times\Omega$ call $X=(X_{t})_{t}$ the projection on the first factor and call $Y=(Y_{t})_{t}$ the projection on the second factor. For $\mu,\nu\in\mathscr{P}\!\left(\Omega\right)$ we denote by $\operatorname{Cpl}\left(\mu,\nu\right)$ the set of probability measures $\pi$ on $\Omega\times\Omega$ for which $X\sim\mu$ and $Y\sim\nu$ under $\pi$ , i.e. for which the distribution of $X$ under $\pi$ is $\mu$ and that of $Y$ under $\pi$ is $\nu$ . In applications, a particular role is played by Monge couplings. A Monge coupling from $\mu$ to $\nu$ is a coupling $\pi$ for which $Y=T(X)$ $\pi$ -a.s. for some Borel mapping $T:\Omega\to\Omega$ that transports $\mu$ to $\nu$ , i.e. satisfies $T_{\#}(\mu)=\nu$ .

For $\mu,\nu\in\mathscr{P}\!_{p}\!\left(\Omega\right)$, i.e. for probability measures on $\Omega$ with finite $p$-th moment, their $p$-Wasserstein distance is

$$\mathcal{W}_{p}(\mu,\nu):=\inf\big\{\mathbb{E}^{\pi}[\rho(X,Y)^{p}]^{1/p}:\pi\in\operatorname{Cpl}(\mu,\nu)\big\}.\tag{1}$$

Following [57], in many situations the infimum in (1) remains unchanged if one minimizes only over Monge couplings.

To motivate the formal definition of the adapted cousins in (5) and (6) below, we start with an informal discussion in terms of Monge mappings: In probabilistic terms, the preservation of mass assumption $T_{\#}(\mu)=\nu$ asserts

$$\big(T_{1}(X_{1},\ldots,X_{N}),\ldots,T_{N}(X_{1},\ldots,X_{N})\big)\sim\nu,\tag{2}$$

which ignores the evolution of $\mu$ and $\nu$ (resp.) in time. Rather it would appear more natural to restrict to mappings $(T_{k})_{k=1}^{N}$ which are adapted in the sense that $T_{k}$ depends only on $X_{1},\ldots,X_{k}$ . Adapted Wasserstein distances can be defined following precisely this intuition, relying on a suitable version of adaptedness on the level of couplings:

The set $\operatorname{Cpl}_{c}(\mu,\nu)$ of causal couplings (see Footnote 1) consists of all $\pi\in\operatorname{Cpl}(\mu,\nu)$ such that

$$\pi\big((Y_{1},\dots,Y_{t})\in A\mid X\big)=\pi\big((Y_{1},\dots,Y_{t})\in A\mid X_{1},\dots,X_{t}\big)\tag{3}$$

for all $t\leq N$ and $A\subseteq\mathcal{X}^{t}$ measurable, cf. [46]. The set of all bi-causal couplings $\operatorname{Cpl}_{bc}(\mu,\nu)$ consists of all $\pi\in\operatorname{Cpl}_{c}(\mu,\nu)$ such that the distribution of $(Y,X)$ under $\pi$ is also in $\operatorname{Cpl}_{c}(\nu,\mu)$, i.e. such that (3) also holds with the roles of $X$ and $Y$ reversed.

Footnote 1: Intuitively, at time $t$, given the past $(X_{1},\ldots,X_{t})$ of $X$, the distribution of $Y_{t}$ does not depend on the future $(X_{t+1},\ldots,X_{N})$ of $X$. For measures $\mu$ such that the first marginal of $\mu$ has no atoms, the weak closure of the set of adapted Monge couplings, i.e. of those $\pi\in\operatorname{Cpl}\left(\mu,\nu\right)$ for which $Y=T(X)$ $\pi$-a.s. for $T$ adapted, is precisely the set of all causal couplings, see [44].
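For finitely supported couplings, the causality condition (3) can be checked mechanically. The following is a minimal sketch, not taken from the paper: the function name `is_causal`, the dictionary encoding of a coupling, and the two-period toy example are all illustrative assumptions. It compares, for every $t$, the conditional law of $(Y_{1},\dots,Y_{t})$ given the full path $X$ with the conditional law given only $(X_{1},\dots,X_{t})$.

```python
from collections import defaultdict


def is_causal(coupling, tol=1e-12):
    """Check the causality condition for a finitely supported coupling.

    `coupling` maps pairs (x_path, y_path) of equal-length tuples to
    probabilities (hypothetical encoding chosen for this sketch).
    """
    n_periods = len(next(iter(coupling))[0])
    for t in range(1, n_periods + 1):
        given_full = defaultdict(lambda: defaultdict(float))    # X -> law of Y[:t]
        given_prefix = defaultdict(lambda: defaultdict(float))  # X[:t] -> law of Y[:t]
        for (x, y), prob in coupling.items():
            if prob <= 0.0:
                continue  # ignore null atoms
            given_full[x][y[:t]] += prob
            given_prefix[x[:t]][y[:t]] += prob
        for x, law_full in given_full.items():
            law_prefix = given_prefix[x[:t]]
            mass_full = sum(law_full.values())
            mass_prefix = sum(law_prefix.values())
            for y_head, q in law_prefix.items():
                # Both conditional probabilities must agree up to `tol`.
                if abs(law_full.get(y_head, 0.0) / mass_full - q / mass_prefix) > tol:
                    return False
    return True


# Toy example: X reveals its second coordinate at time 1, Y does not.
pi = {((0.1, 1.0), (0.0, 1.0)): 0.5, ((-0.1, -1.0), (0.0, -1.0)): 0.5}
pi_reversed = {(y, x): p for (x, y), p in pi.items()}

print(is_causal(pi))           # causal from X to Y
print(is_causal(pi_reversed))  # not causal from Y to X, so pi is not bi-causal
```

In this toy example the coupling is causal in one direction only: knowing the whole path of the second process reveals the first coordinate of the first process, which the past alone does not.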

The term causal was introduced by Lassalle [46], who considers a causal transport problem in which the usual set of couplings is replaced by the set of causal couplings. The resulting concept is not actually a metric as it lacks symmetry, but as suggested by Soumik Pal, this is easily mended, and we formally define the causal and symmetrized-causal $p$-Wasserstein distance, respectively, as follows:

For $\mu,\nu\in\mathscr{P}\!_{p}\!\left(\Omega\right)$ set

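$$\mathcal{CW}_{p}(\mu,\nu):=\inf\big\{\mathbb{E}^{\pi}[\rho(X,Y)^{p}]^{1/p}:\pi\in\operatorname{Cpl}_{c}(\mu,\nu)\big\},\tag{4}$$

$$\mathcal{SCW}_{p}(\mu,\nu):=\max\big(\mathcal{CW}_{p}(\mu,\nu),\mathcal{CW}_{p}(\nu,\mu)\big),\tag{5}$$

$$\mathcal{AW}_{p}(\mu,\nu):=\inf\big\{\mathbb{E}^{\pi}[\rho(X,Y)^{p}]^{1/p}:\pi\in\operatorname{Cpl}_{bc}(\mu,\nu)\big\}.\tag{6}$$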

Rüschendorf [58] refers to $\mathcal{AW}_{p}$ as ‘modified Wasserstein distance’. Pflug-Pichler [52, Definition 1] use the names multi-stage distance of order $p$ and nested distance. It can also be considered as a discrete time version of the ‘Wasserstein-type distance’ of Bion-Nadal and Talay [16]. In [5] we use a slightly modified definition of $\mathcal{AW}_{p}$ which scales better with the number of time-periods $N$ but leads to an equivalent metric (for fixed $p$ and $N$ ). We shall discuss further properties of $\mathcal{AW}_{p}$ (and in particular the connection with Vershik’s iterated Kantorovich distance) in Section 1.8 below.

1.3. Hellwig’s information topology

The information topology introduced by Hellwig in [31] (as well as Aldous’ extended weak topology which we discuss next) is based on the idea that an essential part of the structure of a process is the information that we may deduce about the future behaviour of the process given its behaviour up to current time $t$ . For a process whose law is $\mu$ , this information is captured by the conditional law $\mathscr{L}^{\mu}(X_{t+1},\dots,X_{N}|X_{1}=x_{1},\dots,X_{t}=x_{t})$ of $X_{t+1},\dots,X_{N}$ given $X_{1}=x_{1},\dots,X_{t}=x_{t}$ under $\mu$ .

$\mathscr{L}^{\mu}(X_{t+1},\dots,X_{N}|X_{1}=x_{1},\dots,X_{t}=x_{t})$ is also the disintegration $\mu_{x_{1},\dots,x_{t}}$ of $\mu\in\mathscr{P}\!\left(\Omega\right)$ w.r.t. the first $t$ coordinates.

Hellwig’s information topology is the initial topology w.r.t. a family of maps $(\mathcal{I}_{t})_{t=1}^{N-1}$ which are defined based on these disintegrations:

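$$\mathcal{I}_{t}:\mathscr{P}(\Omega)\to\mathscr{P}\big(\mathcal{X}^{t}\times\mathscr{P}(\mathcal{X}^{N-t})\big),\qquad\mathcal{I}_{t}(\mu):=k^{t}_{\#}(\mu),\quad\text{where}\quad k^{t}(x_{1},\dots,x_{N}):=(x_{1},\dots,x_{t},\mu_{x_{1},\dots,x_{t}}).$$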

Equivalently, $\mathcal{I}_{t}(\mu)$ is the joint law of

$$X_{1},\dots,X_{t},\ \mathscr{L}^{\mu}(X_{t+1},\dots,X_{N}\mid X_{1},\dots,X_{t})$$

under $\mu$ , and Hellwig’s information topology is therefore the coarsest topology which makes continuous for all $t$ the maps which send a probability $\mu$ to the joint law describing the evolution of the coordinate process up to time $t$ and the prediction about the future behaviour of the coordinate process after $t$ .

Remark 1.1.

All the topologies we consider in this paper are second countable. As such they can be characterized by saying which sequences converge. Restated in the language of sequences, the above definition says that a sequence $(\mu_{n})_{n}$ in $\mathcal{P}(\Omega)$ converges in Hellwig’s information topology to $\mu\in\mathcal{P}(\Omega)$ if and only if, for every $t$ , the sequence $(\mathcal{I}_{t}(\mu_{n}))_{n}$ converges to $\mathcal{I}_{t}(\mu)$ in the usual weak topology on $\mathscr{P}\!\left(\mathcal{X}^{t}\times\mathscr{P}\!\left(\mathcal{X}^{N-t}\right)\right)$ .
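For a finitely supported law, the image $\mathcal{I}_{t}(\mu)$ can be written down explicitly. The sketch below is an illustration under our own assumptions (the name `information_map`, the dictionary encoding of laws, and the toy laws `mu` and `mu_n` are not from the paper): it returns the joint law of the history $(X_{1},\dots,X_{t})$ and the conditional law of the future, with the conditional law encoded as a frozen set of (future path, probability) pairs.

```python
from collections import defaultdict


def information_map(law, t):
    """Joint law of (X_1..X_t, conditional law of X_{t+1}..X_N) for a finite law.

    `law` maps paths (tuples of length N) to probabilities; the result maps
    (history, conditional_law) to probabilities, where conditional_law is a
    frozenset of (future_path, probability) pairs.
    """
    mass = defaultdict(float)
    future = defaultdict(lambda: defaultdict(float))
    for path, prob in law.items():
        mass[path[:t]] += prob
        future[path[:t]][path[t:]] += prob
    image = {}
    for history, m in mass.items():
        conditional = frozenset((fut, q / m) for fut, q in future[history].items())
        image[(history, conditional)] = m
    return image


# Toy laws on two periods: mu_n is close to mu in the usual weak topology,
# but under mu_n the first coordinate already reveals the second one.
mu = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}
mu_n = {(0.1, 1.0): 0.5, (-0.1, -1.0): 0.5}

print(information_map(mu, 1))    # one atom: history 0 with a fair coin as prediction
print(information_map(mu_n, 1))  # two atoms: each history comes with a Dirac prediction
```

Shrinking the first coordinates of `mu_n` towards 0 moves the histories towards that of `mu`, but the attached predictions stay Dirac measures, so the images under $\mathcal{I}_{1}$ do not converge to the image of `mu`.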

The work of Hellwig [31] was motivated by questions of stability in dynamic economic models/games; see the related articles [40, 59, 32, 11].

1.4. Aldous’ extended weak topology

Aldous [3] introduces a type of convergence for pairs of filtrations and continuous time stochastic processes on them that he calls extended weak convergence [3, Definition 15.2]. Restricted to our current setting, his definition can be paraphrased in a similar manner as that of the information topology. Aldous’ idea is to represent a stochastic process with law $\mu$ through the associated prediction process (see Footnote 2), that is, the process given by

$$Z_{0}^{\mu}:=\mathscr{L}(X)=\mu,\quad Z_{1}^{\mu}:=\mathscr{L}^{\mu}(X\mid X_{1}),\ \dots,\ Z_{N}^{\mu}:=\mathscr{L}^{\mu}(X\mid X_{1},\dots,X_{N}).$$

Footnote 2: The definition of the prediction process goes back at least to Knight [41].

That is, $(Z^{\mu}_{t})_{t=0}^{N}$ is a measure-valued martingale that makes increasingly accurate predictions about the full trajectory of the process $X$ .

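Rather than comparing the laws of processes directly, the extended weak topology is derived from the weak topology on the corresponding prediction processes (plus the original processes). Formally, the extended weak topology on $\mathscr{P}\!\left(\Omega\right)$ is the initial topology w.r.t. the map

$$\mathcal{E}:\mathscr{P}(\Omega)\to\mathscr{P}\big(\Omega\times\mathscr{P}(\Omega)^{N+1}\big)$$

which sends $\mu$ to the joint distribution of

$$(X,Z^{\mu})=\big(X_{1},\dots,X_{N},\ \mu,\ \mathscr{L}^{\mu}(X\mid X_{1}),\ \mathscr{L}^{\mu}(X\mid X_{1},X_{2}),\ \dots,\ \mathscr{L}^{\mu}(X\mid X_{1},\dots,X_{N})\big)$$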

under $\mu$ .

Note that, to stay faithful to Aldous’ original definition, we defined $\mathcal{E}$ to map $\mu$ not just to the law of the prediction process but to the joint law of the original process and its prediction process. One easily checks that the original process may be omitted in our setting without changing the resulting topology.
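As a hedged illustration (a toy example of our own, assuming $N=2$ and $\mathcal{X}=\mathbb{R}$), consider

$$\mu:=\tfrac{1}{2}\delta_{(0,1)}+\tfrac{1}{2}\delta_{(0,-1)},\qquad\mu_{n}:=\tfrac{1}{2}\delta_{(1/n,1)}+\tfrac{1}{2}\delta_{(-1/n,-1)}.$$

Clearly $\mu_{n}\to\mu$ in the usual weak topology. Under $\mu$ the first coordinate carries no information, whereas under $\mu_{n}$ it determines the whole path, so the prediction processes at time 1 are

$$Z_{1}^{\mu}=\mu\ \text{ a.s.},\qquad Z_{1}^{\mu_{n}}\in\big\{\delta_{(1/n,1)},\delta_{(-1/n,-1)}\big\}\ \text{ with probability }\tfrac{1}{2}\text{ each}.$$

Hence the law of $Z_{1}^{\mu_{n}}$ converges weakly to $\tfrac{1}{2}\delta_{\delta_{(0,1)}}+\tfrac{1}{2}\delta_{\delta_{(0,-1)}}$, which differs from $\delta_{\mu}$, the law of $Z_{1}^{\mu}$. In this example $(\mu_{n})_{n}$ therefore does not converge to $\mu$ in the extended weak topology.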

1.5. The optimal stopping topology

The usual weak topology on $\mathscr{P}\!\left(\Omega\right)$ is the coarsest topology which makes continuous all the functions

$$\mu\mapsto\int f\,\mathrm{d}\mu$$

for $f:\Omega\rightarrow{\mathbb{R}}$ continuous and bounded.

One may follow a similar pattern and look at the coarsest topology which makes continuous the outcomes of all sequential decision procedures. Perhaps the easiest way to formalize this is to look at optimal stopping problems. In detail, write $AC(\Omega)$ for the set of all processes $(L_{t})_{t=1}^{N}$ which are adapted, bounded and satisfy that $x\mapsto L_{t}(x)$ is continuous for each $t\leq N$ . Write $v^{L}(\mu)$ for the corresponding value function, given that the process $X$ follows the law $\mu$ , i.e.

$$v^{L}(\mu):=\inf\big\{\mathbb{E}^{\mu}[L_{\tau}]:\tau\leq N\text{ is a stopping time}\big\}.$$

The optimal stopping topology on $\mathscr{P}\!\left(\Omega\right)$ is the coarsest topology which makes the functions

$$\mu\mapsto v^{L}(\mu)$$

continuous for all $(L_{t})_{t=1}^{N}\in AC(\Omega)$ .
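For finitely supported laws, the value $v^{L}(\mu)$ can be computed by backward induction (the Snell envelope). The following is a minimal sketch under our own assumptions: the function name `stopping_value`, the dictionary encoding of laws, and the toy laws and reward process are illustrative and not taken from the paper. It shows two laws that are close in the usual weak topology but have different optimal stopping values.

```python
from collections import defaultdict


def stopping_value(law, rewards):
    """Value inf E[L_tau] over stopping times tau <= N, by backward induction.

    `law` maps paths (tuples of length N) to probabilities; `rewards[t-1]`
    is the adapted reward L_t, a function of the history (x_1, ..., x_t).
    """
    n_periods = len(rewards)
    mass = dict(law)
    value = {path: rewards[n_periods - 1](path) for path in law}
    for t in range(n_periods - 1, 0, -1):
        agg_mass = defaultdict(float)
        agg_value = defaultdict(float)
        for history, prob in mass.items():
            agg_mass[history[:t]] += prob
            agg_value[history[:t]] += prob * value[history]
        mass = dict(agg_mass)
        # Stop now (reward L_t) or continue (conditional expected value).
        value = {h: min(rewards[t - 1](h), agg_value[h] / mass[h]) for h in mass}
    return sum(mass[h] * value[h] for h in mass)


# Bounded, continuous, adapted rewards: L_1 = 0, L_2 = X_2 clipped to [-1, 1].
rewards = [lambda h: 0.0, lambda h: max(-1.0, min(1.0, h[1]))]

mu = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}      # X_1 carries no information
mu_n = {(0.1, 1.0): 0.5, (-0.1, -1.0): 0.5}   # X_1 reveals the sign of X_2

print(stopping_value(mu, rewards))    # 0.0
print(stopping_value(mu_n, rewards))  # -0.5
```

Replacing 0.1 by any smaller positive number leaves the second value unchanged, so the value function is not continuous along this weakly convergent sequence.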

1.6. Main result

We can now state our main result:

Theorem 1.2.

Let $(\mathcal{X},\rho_{\mathcal{X}})$ be a Polish metric space, where $\rho_{\mathcal{X}}$ is a bounded metric, and set $\Omega:=\mathcal{X}^{N}$. Then the following topologies on $\mathscr{P}\!\left(\Omega\right)$ are equal:

(1) the topology induced by $\mathcal{A}\mathcal{W}_{p}$,

(2) the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$,

(3) Hellwig’s information topology,

(4) Aldous’ extended weak topology,

(5) the optimal stopping topology.

The assumption that $\rho_{\mathcal{X}}$ is bounded serves only to simplify the statement of the theorem, because in this case the topology induced by $\mathcal{W}_{p}$ coincides with the weak topology. For every Polish space there is a bounded complete metric which induces the topology (given any complete metric $\rho_{\mathcal{X}}$ , replace it by e.g. $\min(1,\rho_{\mathcal{X}})$ ).

1.6.1. $p$ -Wasserstein and unbounded metrics

There is an analogous statement, Theorem 1.3 below, which drops the assumption that $\rho_{\mathcal{X}}$ is bounded. To be able to state it, we introduce slight variations of Hellwig’s information topology, of Aldous’ extended weak topology and of the optimal stopping topology:

In [31] Hellwig equips the target spaces of $\mathcal{I}_{t}$ with the weak topology. More precisely, he equips $\mathscr{P}\!\left(\mathcal{X}^{N-t}\right)$ with the weak topology, $\mathcal{X}^{t}\times\mathscr{P}\!\left(\mathcal{X}^{N-t}\right)$ with the product topology and finally $\mathscr{P}\!\left(\mathcal{X}^{t}\times\mathscr{P}\!\left(\mathcal{X}^{N-t}\right)\right)$ with the weak topology based on this topology. One may easily define a $p$-Wasserstein version of Hellwig’s information topology by using the recipe ‘replace the weak topology by the $p$-Wasserstein metric everywhere’. Concretely, if we restrict $\mathcal{I}_{t}$ to $\mathscr{P}\!_{p}\!\left(\Omega\right)$, we may view it as a map into $\mathscr{P}\!_{p}\!\left({\mathcal{X}^{t}\times\mathscr{P}\!_{p}\!\left(\mathcal{X}^{N-t}\right)}\right)$, where the last space carries the metric

$$\rho_{\mathscr{P}_{p}(\mathcal{X}^{t}\times\mathscr{P}_{p}(\mathcal{X}^{N-t}))}(\mu,\nu):=\inf_{\gamma\in\operatorname{Cpl}(\mu,\nu)}\Big(\int\sum_{i=1}^{t}\rho_{\mathcal{X}}(x_{i},y_{i})^{p}+\mathcal{W}_{p}(\hat{\mu},\hat{\nu})^{p}\,\mathrm{d}\gamma\big((x_{i})_{i\leq t},\hat{\mu},(y_{i})_{i\leq t},\hat{\nu}\big)\Big)^{1/p}.$$

We will call the resulting variant of Hellwig’s information topology on $\mathscr{P}\!_{p}\!\left(\Omega\right)$ the $\mathcal{W}_{p}$-information topology.

Similarly, one may systematically replace every occurrence of the weak topology in the definition of the extended weak topology by the $p$ -Wasserstein metric. We call the resulting topology on $\mathscr{P}\!_{p}\!\left(\Omega\right)$ the extended $\mathcal{W}_{p}$ -topology.

Just like the weak topology is the coarsest topology which makes integration of continuous bounded functions continuous, the $p$ -Wasserstein topology is the coarsest topology which makes integration of continuous functions bounded by $c\cdot(1+\rho(x_{0},x)^{p})$ continuous. Following this analogy, we define $AC_{p}(\Omega)$ as the set of all processes $(L_{t})_{t=1}^{N}$ which are adapted, bounded by $x\mapsto c\cdot(1+\rho(x_{0},x)^{p})$ for some $c\in{\mathbb{R}}_{+}$ and satisfy that $x\mapsto L_{t}(x)$ is continuous for each $t\leq N$ .

The $\mathcal{W}_{p}$ -optimal stopping topology on $\mathscr{P}\!_{p}\!\left(\Omega\right)$ is the coarsest topology which makes the functions

$$\mu\mapsto v^{L}(\mu)$$

continuous for all $(L_{t})_{t=1}^{N}\in AC_{p}(\Omega)$ .

With these we may state the following generalization of Theorem 1.2:

Theorem 1.3.

Let $(\mathcal{X},\rho_{\mathcal{X}})$ be a Polish metric space and set $\Omega:=\mathcal{X}^{N}$. Then the following topologies on $\mathscr{P}\!_{p}\!\left(\Omega\right)$ are equal:

(1) the topology induced by $\mathcal{A}\mathcal{W}_{p}$,

(2) the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$,

(3) the $\mathcal{W}_{p}$-information topology,

(4) the extended $\mathcal{W}_{p}$-topology,

(5) the $\mathcal{W}_{p}$-optimal stopping topology.

Clearly, one recovers Theorem 1.2 from Theorem 1.3 by choosing a bounded metric on $\mathcal{X}$ , because the $\mathcal{W}_{p}$ -information topology for bounded $\rho_{\mathcal{X}}$ is just the information topology, the extended $\mathcal{W}_{p}$ -topology for bounded $\rho_{\mathcal{X}}$ is just the extended weak topology and the $\mathcal{W}_{p}$ -optimal stopping topology for bounded $\rho_{\mathcal{X}}$ is just the optimal stopping topology.

The relationship between the topologies listed in Theorem 1.2 and those listed in Theorem 1.3 is similar to the non-adapted case where we know that usual $p$ -Wasserstein convergence is equivalent to usual weak convergence plus convergence of the $p$ -th moments.

Lemma.

Convergence in any of the topologies of Theorem 1.3 is equivalent to convergence in any of the topologies of Theorem 1.2 (where for building $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ and $\mathcal{A}\mathcal{W}_{p}$, $\rho_{\mathcal{X}}$ is replaced by a bounded compatible complete metric, e.g. $\min(1,\rho_{\mathcal{X}})$) plus convergence of $p$-th moments on $\Omega$ w.r.t. (the original) $\rho_{\Omega}$.

We prove this lemma in Section 6, making use of (parts of) Theorem 1.2 and Theorem 1.3.
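The role of the moment condition is already visible in the non-adapted case. As a hedged illustration (a standard toy example, assuming $N=1$, $\mathcal{X}=\mathbb{R}$ with $\rho(x,y)=|x-y|$ and $p=1$), consider

$$\mu_{n}:=\big(1-\tfrac{1}{n}\big)\delta_{0}+\tfrac{1}{n}\delta_{n},\qquad\mu:=\delta_{0}.$$

Then $\mu_{n}\to\mu$ weakly, and with the bounded metric $\min(1,\rho)$ the transport cost is $\tfrac{1}{n}\min(1,n)=\tfrac{1}{n}\to 0$. However, the first moments do not converge,

$$\int|x|\,\mathrm{d}\mu_{n}(x)=\tfrac{1}{n}\cdot n=1\neq 0=\int|x|\,\mathrm{d}\mu(x),$$

and accordingly $\mathcal{W}_{1}(\mu_{n},\mu)=1$ for all $n$ with respect to the original metric, since $\delta_{0}$ admits only one coupling with $\mu_{n}$.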

1.7. Further remarks on related work

1.7.1. Some further articles of successors of Aldous

One of the original applications of Aldous’ extended weak topology concerned the stability of optimal stopping [3]. This corresponds to one half of (4)=(5) in Theorem 1.2, but in a much more general setting. This line of work has been continued by Lamberton and Pagès [45], Coquet and Toldo [20], among others.

Aldous’ extended weak topology was also inspiring and instrumental for the development of the theory of convergence of filtrations, and the associated questions of stability of the martingale representation property and Doob-Meyer decompositions. In this regard, see the works by Hoover et al [37, 35] and by Mémin et al [19, 48]. The related question of stability of stochastic differential equations (as well as their backwards version) with respect to the driving noise has particularly seen a burst of activity in the last two decades. For brevity’s sake we only refer to the recent article by Papapantoleon, Posamaï, and Saplaouras [50] for an overview of the many available works in this direction.

1.7.2. Previous applications of adapted Wasserstein distances.

Pflug, Pichler and co-authors [52, 56, 53, 54, 55, 30] have extensively developed and applied the notion of nested distances for the purpose of scenario generation, stability, sensitivity bounds, and distributionally robust stochastic optimization, in the context of operations research.

Acciaio, Zalashko, and one of the present authors consider in [2] the adapted Wasserstein distance in continuous time in connection with utility maximization, enlargement of filtrations and optimal stopping.

Causal couplings have appeared in the work by Yamada and Watanabe [62], Jacod and Mémin [38] as well as Kurtz [42, 43], concerning weak solutions of stochastic differential equations, and by Rüschendorf [58] concerning approximation theorems in probability theory. The term ‘causal’ is first used by Lassalle [46], who uses it in an additional constraint for the transport problem and gives an alternative derivation of the Talagrand inequality for the Wiener measure. Causal couplings are also present in the numerical scheme suggested in [1] for (extended mean-field) stochastic control.

The article [7] connects adapted Wasserstein distance (in continuous time) to martingale optimal transport (cf. [34, 13, 27, 23, 17, 33, 18, 12, 14] among many others). Several familiar objects appear as solutions to variational problems in this context. E.g. geometric Brownian motion is the martingale which is closest in $\mathcal{AW}_{2}$ to usual Brownian motion subject to having a log-normal distribution at the terminal time-point, and the local vol model is closest to Brownian motion subject to matching 1-d marginals.

Bion-Nadal and Talay [16] introduce an adapted Wasserstein-type distance on the set of diffusion SDEs and show that this distance corresponds to the computation of a tractable stochastic control problem. They also apply their results to the problem of fitting diffusion models to given marginals.

In [5] the present authors consider adapted Wasserstein distances in relation to stability in finance: Lipschitz continuity of utility maximization/hedging is established w.r.t. the underlying models in discrete and continuous time.

1.8. Another formulation of the adapted Wasserstein distance and of Hellwig’s information topology

Here we give an alternative formulation of the adapted Wasserstein distance / nested distance due to Pflug and Pichler.

Again, $\mathcal{X}$ is a Polish space and $\rho=\rho_{\mathcal{X}}$ is a compatible metric on $\mathcal{X}$ . Starting with $V_{N}^{p}:=0$ we define

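$$V_{t}^{p}(x_{1},\dots,x_{t},y_{1},\dots,y_{t}):=\inf_{\gamma^{t+1}\in\operatorname{Cpl}(\mu_{x_{1},\dots,x_{t}},\nu_{y_{1},\dots,y_{t}})}\iint\Big(V_{t+1}^{p}(x_{1},\dots,x_{t+1},y_{1},\dots,y_{t+1})+\rho(x_{t+1},y_{t+1})^{p}\Big)\,\mathrm{d}\gamma^{t+1}(x_{t+1},y_{t+1}).$$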

The nested distance is finally obtained in a backwards recursive way by

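$$\mathcal{ND}_{p}(\mu,\nu)^{p}=\inf_{\gamma^{1}\in\operatorname{Cpl}((\operatorname{proj}_{1})_{\#}(\mu),(\operatorname{proj}_{1})_{\#}(\nu))}\iint\big(V_{1}^{p}(x_{1},y_{1})+\rho(x_{1},y_{1})^{p}\big)\,\mathrm{d}\gamma^{1}(x_{1},y_{1}).$$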

Then $\mathcal{A}\mathcal{W}_{p}=\mathcal{ND}_{p}$ . We refer to [8] for the (straightforward) justification.
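The backward recursion can be sketched in code for very small examples. The block below is an illustration under stated assumptions, not the authors' implementation: all names are hypothetical, $\mathcal{X}=\mathbb{R}$ with $\rho(x,y)=|x-y|$, laws are finitely supported, the measures coupled at each step are read as the conditional laws of the next coordinate, and every such conditional law has at most two atoms, so that each transport problem has a one-parameter family of couplings and its optimum sits at an endpoint.

```python
def conditional(law, history):
    """Conditional law of the next coordinate given `history` (value -> prob)."""
    t = len(history)
    out = {}
    for path, prob in law.items():
        if path[:t] == history:
            out[path[t]] = out.get(path[t], 0.0) + prob
    total = sum(out.values())
    return {x: q / total for x, q in out.items()}


def ot_two_atoms(a, b, cost):
    """Exact transport cost between laws with at most two atoms each."""
    xs, ys = list(a), list(b)
    if len(xs) > 2 or len(ys) > 2:
        raise ValueError("this sketch only handles at most two atoms per law")
    if len(xs) == 1 or len(ys) == 1:
        # The product coupling is the only coupling.
        return sum(a[x] * b[y] * cost(x, y) for x in xs for y in ys)
    (x1, x2), (y1, y2) = xs, ys
    best = float("inf")
    # Couplings are parametrized by s = mass on (x1, y1); the cost is linear in s.
    for s in (max(0.0, a[x1] + b[y1] - 1.0), min(a[x1], b[y1])):
        total = (s * cost(x1, y1) + (a[x1] - s) * cost(x1, y2)
                 + (b[y1] - s) * cost(x2, y1)
                 + (1.0 - a[x1] - b[y1] + s) * cost(x2, y2))
        best = min(best, total)
    return best


def nested_cost(mu, nu, p=1.0, hx=(), hy=()):
    """V_t^p at histories (hx, hy); with empty histories this is ND_p^p."""
    n_periods = len(next(iter(mu)))
    if len(hx) == n_periods:
        return 0.0  # V_N^p := 0

    def cost(x, y):
        return abs(x - y) ** p + nested_cost(mu, nu, p, hx + (x,), hy + (y,))

    return ot_two_atoms(conditional(mu, hx), conditional(nu, hy), cost)


mu = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}
mu_n = {(0.1, 1.0): 0.5, (-0.1, -1.0): 0.5}

# Plain Wasserstein cost on path space (p = 1) versus the nested cost.
plain = ot_two_atoms(mu_n, mu, lambda x, y: sum(abs(u - v) for u, v in zip(x, y)))
print(plain)                      # small: the paths are close
print(nested_cost(mu_n, mu, 1.0))  # stays above 1: the information differs
```

In this toy example the plain cost shrinks with the first coordinates of `mu_n`, while the nested cost keeps a contribution of 1 from comparing a Dirac prediction with a fair coin.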

For $N>1$ the adapted Wasserstein distance is not complete. As was established in [6], a natural complete space into which $(\mathscr{P}\!_{p}\!\left(\Omega\right),\mathcal{AW}_{p})$ embeds is given by the space of nested distributions:

Consider the sequence of metric spaces

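$$\begin{aligned}\mathcal{X}_{N:N}&:=\mathcal{X},&\rho_{N:N}&:=\rho_{\mathcal{X}},\\ \mathcal{X}_{N-1:N}&:=\mathcal{X}\times\mathscr{P}_{p}(\mathcal{X}_{N:N}),&\rho_{N-1:N}&:=\big(\rho_{\mathcal{X}}^{p}+\mathcal{W}_{\rho_{N:N},p}^{p}\big)^{1/p},\\ &\ \,\vdots\\ \mathcal{X}_{1:N}&:=\mathcal{X}\times\mathscr{P}_{p}(\mathcal{X}_{2:N}),&\rho_{1:N}&:=\big(\rho_{\mathcal{X}}^{p}+\mathcal{W}_{\rho_{2:N},p}^{p}\big)^{1/p},\end{aligned}$$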

where at each stage $t$ , the space $\mathscr{P}\!_{p}\!\left(\mathcal{X}_{t:N}\right)$ is endowed with the $p$ -Wasserstein distance with respect to the metric $\rho_{t:N}$ on $\mathcal{X}_{t:N}$ , which we denote by $\mathcal{W}_{\rho_{t:N},p}$ . The space of nested distributions (of depth $N$ ) is defined as $\mathscr{P}\!_{p}\!\left(\mathcal{X}_{1:N}\right)$ . We endow $\mathscr{P}\!_{p}\!\left(\mathcal{X}_{1:N}\right)$ with the complete metric $\mathcal{W}_{\rho_{1:N},p}$ .

The space of nested distributions was defined by Pflug [51]. Notably the idea to iterate the formation of Wasserstein spaces and metrics goes back to Vershik [60, 61] who uses the name ‘iterated Kantorovich distance’. The main interest of Vershik (and his successors) lies in the classification of filtrations (in the language of ergodic theory). We refer to the work of Emery and Schachermayer [25] for a survey from a probabilistic perspective and to Janvresse, Laurent and de la Rue [39] for a contemporary article (again from a probabilistic viewpoint).

$\mathscr{P}\!_{p}\!\left(\Omega\right)$ is naturally embedded in the set of nested distributions of depth $N$ through the map $\mathcal{N}$ given by

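$$\mathcal{N}(\mu):=\mathscr{L}\Big(X_{1},\mathscr{L}\big(X_{2},\dots\mathscr{L}\big(X_{N-1},\mathscr{L}(X_{N}\mid\bar{X}_{1}^{N-1})\,\big|\,\bar{X}_{1}^{N-2}\big)\dots\,\big|\,X_{1}\big)\Big)\tag{12}$$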

where $(X_{1},\dots,X_{N})$ is a vector with law $\mu$ , $\mathscr{L}$ again denotes (conditional) law and we use $\bar{X}_{1}^{t}$ as a shorthand for the vector $X_{1},\dots,X_{t}$ .

Following [6], we have:

Theorem 1.4.

The map $\mathcal{N}$ defined in (12) embeds the metric space $(\mathscr{P}\!_{p}\!\left(\Omega\right),\mathcal{AW}_{p})$ isometrically into the complete separable metric space $(\mathscr{P}\!_{p}\!\left(\mathcal{X}_{1:N}\right),\mathcal{W}_{\rho_{1:N},p})$ .

Remark 1.5.

When $\mathcal{X}$ has no isolated points, $\mathscr{P}\!_{p}\!\left(\mathcal{X}_{1:N}\right)$ is actually the completion of $\mathscr{P}\!_{p}\!\left(\Omega\right)$ , i.e. $\mathscr{P}\!_{p}\!\left(\Omega\right)$ considered as a subset of $\mathscr{P}\!_{p}\!\left(\mathcal{X}_{1:N}\right)$ is dense.

1.8.1. Hellwig’s information topology in terms of adapted Wasserstein distances

We note that Hellwig’s definition of the information topology can also be rephrased using the concept of adapted Wasserstein distance: Assume that $\rho_{\mathcal{X}}$ is a bounded metric and for $t\leq N$ , set

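$$\Omega=\mathcal{X}^{N}=\underbrace{\mathcal{X}^{t}}_{=:X_{1}^{(t)}}\times\underbrace{\mathcal{X}^{N-t}}_{=:X_{2}^{(t)}}=X_{1}^{(t)}\times X_{2}^{(t)}.$$

I.e. for each $t$, we consider $\Omega$ as the product of two Polish spaces (which one might consider as ‘history’ and ‘future’). Extending the definition of $\mathcal{A}\mathcal{W}_{p}$ in the obvious way to products of not necessarily equal Polish spaces, we can then equip $\mathscr{P}\!_{p}\!\left(X_{1}^{(t)}\times X_{2}^{(t)}\right)$ with a one-period adapted Wasserstein distance $\mathcal{AW}_{p}^{(t)}$, $p\geq 1$. Setting for $\mu,\nu\in\mathscr{P}\!\left(\Omega\right)$

$$\mathcal{IW}_{p}(\mu,\nu):=\sum_{t=1}^{N}\mathcal{AW}_{p}^{(t)}(\mu,\nu),\quad p\geq 1,$$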

we obtain a compatible metric for the information topology. This is relatively straightforward (whereas the full version of Theorem 1.2 is not straightforward as far as we are concerned).

1.9. Preservation of Compactness

We close this section with a result about the preservation of relative compactness which we shall use in Sections 4 and 6, but which also might be of independent interest. Specifically, in [9, 10] the two-step version of Lemma 1.6 is used as a crucial tool in the investigation of the weak transport problem.

A more detailed investigation of compactness in $\mathscr{P}\!\left(\Omega\right)$ with the weak adapted topology is the topic of the companion paper to this one, [24].

Assume for simplicity that $\rho_{\mathcal{X}}$ is a bounded metric. Then we have

Lemma 1.6 (Compactness lemma).

$A\subseteq\mathscr{P}\!\left(\Omega\right)$ is relatively compact w.r.t. the usual weak topology iff $\mathcal{N}[A]\subseteq\mathscr{P}\!\left(\mathcal{X}_{1:N}\right)$ is relatively compact.

We note that Lemma 1.6 is essentially a consequence of the characterization of compact subsets in $\mathscr{P}\!\left(\mathscr{P}\!\left(X\right)\right)$ ; in a somewhat different framework it was first proved in [36]. The version stated here follows by repeated application of [24, Lemma 3.3]/[9, Lemma 2.6].

The implication that $\mathcal{N}[A]$ relatively compact implies $A$ relatively compact is rather easy to see, but the other direction that $A$ relatively compact implies $\mathcal{N}[A]$ relatively compact is nontrivial since the mapping $\mathcal{N}:\mathscr{P}\!\left(\Omega\right)\to\mathscr{P}\!\left(\mathcal{X}_{1:N}\right)$ is not continuous when $\mathscr{P}\!\left(\Omega\right)$ is endowed with the usual weak topology (except for trivial cases). Lemma 1.6 would not be true if we were to replace relative compactness by compactness.

The assumption that $\rho_{\mathcal{X}}$ is bounded is inessential. A version of Lemma 1.6 holds if we replace $\mathscr{P}\!\left(\Omega\right)$ by $\mathscr{P}\!_{p}\!\left(\Omega\right)$ and the weak topology by the one induced by the $p$ -Wasserstein metric.

A similar result based on Hellwig’s information topology, relating relative compactness in $\mathscr{P}\!\left(\Omega\right)$ to relative compactness in $\prod_{t=1}^{N-1}\mathscr{P}\!\left(\mathcal{X}^{t}\times\mathscr{P}\!\left(\mathcal{X}^{N-t}\right)\right)$, is also true.

2. Preparations

The rest of the paper will essentially be devoted to proving Theorem 1.2, or really its generalization Theorem 1.3.

In Section 3 we prove that Hellwig’s information topology equals the topology induced by $\mathcal{A}\mathcal{W}_{p}$, i.e. $\ref{it:Hellwig}=\ref{it:AW}$ in Theorem 1.3. In a sense, of all the topologies listed in Theorem 1.3, Hellwig’s information topology ‘looks’ the coarsest (or at least like one of the coarser ones), while the topology induced by $\mathcal{A}\mathcal{W}_{p}$ ‘looks’ the finest.

In Section 4 we sandwich the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ between Hellwig’s information topology and the topology induced by $\mathcal{A}\mathcal{W}_{p}$, i.e. we show $\ref{it:Hellwig}\leq\ref{it:SCW}\leq\ref{it:AW}$ in Theorem 1.3.

In Section 5 we show that Aldous’ extended weak topology is equal to Hellwig’s information topology, i.e. $\ref{it:Aldous}=\ref{it:Hellwig}$ in Theorem 1.3.

In Section 6 we prove Lemma 1.3.

In Section 7 we prove that the optimal stopping topology is coarser than the topology induced by $\mathcal{A}\mathcal{W}_{p}$ and finer than Hellwig’s ( $\mathcal{W}_{p}$ -)information topology, i.e. $\ref{it:Hellwig}\leq\ref{it:optstop}\leq\ref{it:AW}$ in Theorem 1.3.

2.1. Notation

The nested structure of spaces like for example $\mathscr{P}\!_{p}\!\left(\mathcal{X}_{1:N}\right)$ introduced in Section 1.8 is (at least for the authors) not so easy to gain an intuition for. It seems rather challenging to picture probability measures on probability measures on probability measures… etc.

Therefore, much of the proofs in the following two sections will be about bookkeeping and not getting lost in these nested structures. In most other contexts we would regard such bookkeeping as abstract nonsense better swept under the rug, but in the context of the present paper we believe that it really constitutes an important and nontrivial ingredient in successfully carrying out the proofs.

To aid in this endeavour we make some notational preparations and introduce a few conventions.

2.1.1. Operations on Spaces

In the introduction we described the topologies listed in Theorems 1.2 and 1.3 as initial topologies w.r.t. maps into more complex spaces. These spaces are built up from just a few basic operations, and in most cases the maps can also be constructed using a few relatively simple ingredients.

For spaces, the operations in question are

•

product formation, i.e. for spaces $\mathcal{X}$ and $\mathcal{Y}$ we may form their product space $\mathcal{X}\times\mathcal{Y}$ ,

•

and passing from a space $\mathcal{X}$ to the space $\mathscr{P}\!\left(\mathcal{X}\right)$ of probability measures on $\mathcal{X}$ .

Here we run into some tension between the various existing definitions in the literature. While Hellwig and Aldous originally defined their topologies based on equipping the space $\mathscr{P}\!\left(\mathcal{X}\right)$ of probability measures on some space $\mathcal{X}$ with the weak topology, without any mention of metrics, $\mathcal{A}\mathcal{W}_{p}$ is a metric built on the $p$ -Wasserstein metric, and Theorem 1.4 exhibits this metric as the ‘initial metric’ w.r.t. an embedding of $\mathscr{P}\!_{p}\!\left(\Omega\right)$ (not $\mathscr{P}\!\left(\Omega\right)$ ) into $(\mathscr{P}\!_{p}\!\left(\mathcal{X}_{1:N}\right),\mathcal{W}_{\rho_{1:N},p})$ .

Luckily, when the base metric $\rho_{\mathcal{X}}$ on $\mathcal{X}$ is bounded and we decide that we only care about topologies and not the metrics that induce them, all of these distinctions vanish, and one may hope that these fine distinctions will not be so important in the end.

To give as uniform and as streamlined a treatment as possible of all the various ways in which these metric and topological spaces can be related to each other we employ the following strategy: A lot of our arguments are agnostic to the distinction between $\mathscr{P}$ and $\mathscr{P}_{p}$ , and to whether we are talking about metric or topological spaces etc. They only rely on properties of the operations of product formation and formation of spaces of probability measures and on properties of maps between various spaces built using these operations which hold in either case. For the rest of the paper we will therefore drop the $p$ in $\mathscr{P}_{p}$ and other explicit mentions of these distinctions. The reader may decide to read the paper using either of the following two sets of conventions, which are to be applied recursively:

Convention 1 (weak topologies)

•

$\mathcal{X}$ , $\mathcal{Y}$ , $\mathcal{Z}$ , $\mathcal{A}$ , $\mathcal{B}$ , $\mathcal{C}$ , etc. are Polish spaces.

•

$\mathcal{X}\times\mathcal{Y}$ is a topological space with the product topology (again Polish).

•

$\mathscr{P}\!\left(\mathcal{X}\right)$ is a topological space with the weak topology (also Polish).

•

‘space’ will mean Polish space.

Convention 2 ( $\mathcal{W}_{p}$ )

•

$p\geq 1$ is fixed throughout the paper

•

$\mathcal{X}$ , $\mathcal{Y}$ , $\mathcal{Z}$ , $\mathcal{A}$ , $\mathcal{B}$ , $\mathcal{C}$ , etc. are Polish (i.e. complete separable) metric spaces with metrics $\rho_{\mathcal{X}}$ , $\rho_{\mathcal{Y}}$ , $\rho_{\mathcal{Z}}$ , $\rho_{\mathcal{A}}$ , $\rho_{\mathcal{B}}$ , $\rho_{\mathcal{C}}$ , etc. respectively.

•

$\mathcal{X}\times\mathcal{Y}$ is a Polish metric space with the metric

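$$\rho_{\mathcal{X}\times\mathcal{Y}}\big((x_{1},y_{1}),(x_{2},y_{2})\big):=\big(\rho_{\mathcal{X}}(x_{1},x_{2})^{p}+\rho_{\mathcal{Y}}(y_{1},y_{2})^{p}\big)^{1/p}.$$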

•

$\mathscr{P}\!\left(\mathcal{X}\right)$ is a Polish metric space with the $p$ -Wasserstein metric

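$$\mathcal{W}_{p}(\mu,\nu):=\inf\Big\{\Big(\int\rho_{\mathcal{X}}(x_{1},x_{2})^{p}\,\mathrm{d}\gamma(x_{1},x_{2})\Big)^{1/p}:\gamma\in\operatorname{Cpl}(\mu,\nu)\Big\}.$$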

•

The subscript on the metric $\rho$ may be dropped when clear from the context.

•

‘space’ will mean Polish metric space.
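To make the $p$-Wasserstein metric of Convention 2 concrete, here is a minimal sketch for the special case of two empirical measures on the real line with the same number of equally weighted atoms. It relies on the standard fact that for the cost $|x-y|^{p}$ with $p\geq 1$ on $\mathbb{R}$ the monotone (sorted) coupling is optimal. The function name `wasserstein_1d` is ours and purely illustrative.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """W_p between the uniform measures on the atoms xs and ys (real numbers).

    Assumes len(xs) == len(ys) and p >= 1, so that pairing the sorted
    atoms is an optimal coupling.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(xs)
    # Average transport cost of the monotone coupling.
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)
```

For instance, `wasserstein_1d([0.0, 1.0], [1.0, 2.0])` shifts every atom by one unit. In general spaces no such shortcut is available and the infimum over couplings has to be computed, for example by linear programming.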

Unless specified otherwise everything said from here on will be true for either way of reading. Convention 1 will lead to a direct proof of Theorem 1.2, while Convention 2 will give a proof of the more general version, Theorem 1.3. Occasionally an argument will require us to talk directly about metrics to establish continuity of some map. When one only cares about Theorem 1.2 and not Theorem 1.3 these sections can be read while assuming that $p=1$ and that all metrics mentioned are bounded.

Another space we will need is

Definition 2.1.

$\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathcal{B}\right)\subseteq\mathscr{P}(\mathcal{A}\times\mathcal{B})$ is the space of probability measures on $\mathcal{A}\times\mathcal{B}$ which are concentrated on the graph of a measurable function, i.e.:

[TABLE]

The space $\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathcal{B}\right)$ carries the subspace topology / the restriction of the metric on $\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ .

2.1.2. Maps between spaces

Assuming Convention 1, when $f:\mathcal{X}\to\mathcal{Y}$ is a continuous map, the pushforward under $f$ , i.e. the map which sends $\mu\in\mathscr{P}\!\left(\mathcal{X}\right)$ to the measure $\nu\in\mathscr{P}\!\left(\mathcal{Y}\right)$ with $\nu(A)=\mu(f^{-1}[A])$ is also continuous.

Similarly, assuming Convention 2, when $f:\mathcal{X}\to\mathcal{Y}$ is a Lipschitz-continuous map between metric spaces, the pushforward under $f$ is also Lipschitz-continuous from $\mathscr{P}\!\left(\mathcal{X}\right)$ to $\mathscr{P}\!\left(\mathcal{Y}\right)$.

We will use $\mathscr{P}\!\left(f\right):\mathscr{P}\!\left(\mathcal{X}\right)\to\mathscr{P}\!\left(\mathcal{Y}\right)$ to denote the pushforward under $f$ , to emphasize the fact that $\mathscr{P}$ is a functor, i.e. that it sends a diagram with a ‘nice’ (read continuous/Lipschitz) map

$$\mathcal{X}\xrightarrow{\;f\;}\mathcal{Y}$$

to a similar diagram

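$$\mathscr{P}\!\left(\mathcal{X}\right)\xrightarrow{\;\mathscr{P}\left(f\right)\;}\mathscr{P}\!\left(\mathcal{Y}\right)$$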

where the map is also ‘nice’, and that $\mathscr{P}\!\left(f\circ g\right)=\mathscr{P}\!\left(f\right)\circ\mathscr{P}\!\left(g\right)$ and $\mathscr{P}\!\left(1_{\mathcal{X}}\right)=1_{\mathscr{P}\!\left(\mathcal{X}\right)}$ (where $1_{\mathcal{X}}$ is the identity function on $\mathcal{X}$ ).
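The functor laws can be checked by hand on finitely supported measures. The following sketch encodes a measure as a dictionary from points to masses; the encoding and the name `pushforward` are our own choices for illustration, not notation from the paper.

```python
from fractions import Fraction

def pushforward(f, mu):
    """Image measure P(f)(mu) of a finitely supported measure mu (dict point -> mass)."""
    nu = {}
    for x, mass in mu.items():
        y = f(x)
        nu[y] = nu.get(y, 0) + mass
    return nu

mu = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
g = lambda x: x % 2                        # g : {0, 1, 2} -> {0, 1}
f = lambda y: "even" if y == 0 else "odd"  # f : {0, 1} -> {"even", "odd"}

lhs = pushforward(lambda x: f(g(x)), mu)        # P(f o g)(mu)
rhs = pushforward(f, pushforward(g, mu))        # (P(f) o P(g))(mu)
identity_image = pushforward(lambda x: x, mu)   # P(1_X)(mu)
```

Here `lhs` and `rhs` agree, and `identity_image` coincides with `mu`, which is exactly the functoriality stated above restricted to this toy case.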

For a product of spaces $\mathcal{X}\times\mathcal{Y}$, the projection onto $\mathcal{X}$ will alternatively be denoted by either $\operatorname{proj}_{\mathcal{X}}$ or by the same letter that is used for the space, but in a non-calligraphic font, i.e. $X:\mathcal{X}\times\mathcal{Y}\to\mathcal{X}$.

If $\mu$ is defined on some product $\prod_{i}\mathcal{X}_{i}$ of spaces, we also introduce a shorthand notation for marginals of $\mu$ , i.e. for the pushforward of $\mu$ under projection onto the product of some subset of the original factors:

[TABLE]

If $f:\mathcal{A}\rightarrow\mathcal{B}$ and $g:\mathcal{A}\rightarrow\mathcal{C}$ are functions we write $(f\bm{,}g)$ for the function

$$(f\bm{,}g):\mathcal{A}\rightarrow\mathcal{B}\times\mathcal{C},\qquad a\mapsto\big(f(a),g(a)\big).$$

If we want to specify a map from, say $\mathcal{A}\times\mathcal{B}\times\mathcal{C}$ to $\mathcal{X}$ but we only really care about one of the variables we will use an underscore ‘ $\_$ ’ instead of naming the unused variables, as in $(a,\_,\_)\mapsto f(a)$ . Similarly, when integrating we may also use $\_$ to denote unused variables, i.e. for $\mu\in\mathscr{P}\!\left(\mathcal{X}\times\mathcal{Y}\right)$ we might write ${\textstyle\int}f(y)\,\mathrm{d}\mu(\_,y)$ .

Two important maps will be the disintegration map $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}$ and its left inverse $\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}$ .

The disintegration map

[TABLE]

sends a probability $\mu$ on $\mathcal{A}\times\mathcal{B}$ to the measure

[TABLE]

where $a\mapsto\mu_{a}$ is a classical disintegration of $\mu$ , i.e. if $\bar{\mu}=\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu)$ then

[TABLE]

The disintegration map is measurable (see for example [15, Proposition 7.27]) and injective. It is not continuous w.r.t. the weak topologies or the Wasserstein metrics.
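The lack of continuity can be seen on a two-point example. The sketch below is our own illustration, under the assumptions $p=1$ and the sum metric on products; all function names are hypothetical. It takes $\mu_{n}=\tfrac12\delta_{(1/n,1)}+\tfrac12\delta_{(-1/n,-1)}$, which converges to $\mu=\tfrac12\delta_{(0,1)}+\tfrac12\delta_{(0,-1)}$, and evaluates the distance between the disintegrations. Because $\operatorname{dis}(\mu)$ is a Dirac mass, the product coupling is the only coupling and no optimisation is needed.

```python
def w1_to_dirac(nu, z, dist):
    """W_1(nu, delta_z): the only coupling with a Dirac mass is the product."""
    return sum(mass * dist(x, z) for x, mass in nu.items())

def dis_first(mu):
    """Disintegrate mu on A x B w.r.t. A: dict (a, kernel) -> mass, where
    kernel is a sorted tuple of (b, conditional probability)."""
    marg = {}
    for (a, b), m in mu.items():
        marg[a] = marg.get(a, 0.0) + m
    out = {}
    for a, ma in marg.items():
        kernel = tuple(sorted((b, m / ma) for (a2, b), m in mu.items() if a2 == a))
        out[(a, kernel)] = ma
    return out

mu = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}

def mu_n(n):
    return {(1.0 / n, 1.0): 0.5, (-1.0 / n, -1.0): 0.5}

def weak_gap_upper_bound(n):
    """Cost of moving (+-1/n, +-1) to (0, +-1): an upper bound for W_1(mu_n, mu)."""
    return 0.5 * abs(1.0 / n) + 0.5 * abs(-1.0 / n)

def adapted_gap(n):
    """W_1(dis(mu_n), dis(mu)) on A x P(B) with metric |a - a'| + W_1(k, k')."""
    (a0, kappa), = dis_first(mu).keys()   # dis(mu) is a single Dirac mass
    def dist(point, target):
        a, kernel = point
        (b, _), = kernel                  # kernels of dis(mu_n) are Dirac masses
        spread = w1_to_dirac(dict(target[1]), b, lambda x, y: abs(x - y))
        return abs(a - target[0]) + spread
    return w1_to_dirac(dis_first(mu_n(n)), (a0, kappa), dist)
```

In this example `weak_gap_upper_bound(n)` tends to zero while `adapted_gap(n)` stays above one, so $\mu_{n}\to\mu$ in $\mathcal{W}_{1}$ although the disintegrations do not converge.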

When writing $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}$ we will not insist that $\mathcal{A}$ has to be the first factor in the domain of $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}$ – $\mathcal{A}$ and $\mathcal{B}$ may even be products themselves, whose factors are intermingled in the product that makes up the domain of $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}$ . Also, we may sometimes omit $\mathcal{B}$ , only specifying the variable(s) w.r.t. which we are disintegrating, not the ones which are left over, as in $\operatorname{dis}_{\mathcal{A}}$ .

The map

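$$\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}:\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)\rightarrow\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right),\qquad\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}(\nu)(f):=\iint f(a,b)\,\mathrm{d}\beta(b)\,\mathrm{d}\nu(a,\beta)$$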

is (Lipschitz-)continuous.

The pair $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}$ , $\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}$ enjoy the following properties:

(1)

$\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}$ is the left inverse of the disintegration map, i.e.

$$\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}\circ\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}=1_{\mathscr{P}\left(\mathcal{A}\times\mathcal{B}\right)}.$$

This is a direct consequence of the definition of the disintegration.

(2)

$\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}\!{}_{\restriction\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathscr{P}\!\left(\mathcal{B}\right)\right)}$ is injective. Therefore,

(3)

$\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}\circ\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}\!{}_{\restriction\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathscr{P}\!\left(\mathcal{B}\right)\right)}=1_{\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathscr{P}\!\left(\mathcal{B}\right)\right)}$, i.e. $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}$ and $\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}$ are inverse bijections between $\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ and $\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathscr{P}\!\left(\mathcal{B}\right)\right)$.

The last two properties are just a reformulation of the known fact that the disintegration of a measure is almost-surely uniquely defined.
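On finite spaces both maps can be written down explicitly, which gives a quick sanity check of the three properties. The sketch below is an illustration under our own encoding (measures as dictionaries, kernels as sorted tuples); the names `dis` and `integrate` merely mirror $\operatorname{dis}$ and $\operatorname{int}$.

```python
from fractions import Fraction

def dis(mu):
    """dis_A^B: measure on A x B (dict (a, b) -> mass) to a measure on A x P(B).
    A kernel in P(B) is encoded as a sorted tuple of (b, probability) pairs."""
    marg = {}
    for (a, b), m in mu.items():
        marg[a] = marg.get(a, 0) + m
    out = {}
    for a, ma in marg.items():
        kernel = tuple(sorted((b, m / ma) for (a2, b), m in mu.items() if a2 == a))
        out[(a, kernel)] = ma
    return out

def integrate(alpha):
    """int_A^B: measure on A x P(B) back to the measure on A x B it averages to."""
    mu = {}
    for (a, kernel), m in alpha.items():
        for b, q in kernel:
            mu[(a, b)] = mu.get((a, b), 0) + m * q
    return mu

mu = {("u", "x"): Fraction(1, 4), ("u", "y"): Fraction(1, 4), ("d", "x"): Fraction(1, 2)}
on_graph = dis(mu)  # concentrated on the graph a -> mu_a

# Two different kernels over the same point "u": not concentrated on a graph.
off_graph = {
    ("u", (("x", Fraction(1)),)): Fraction(1, 2),
    ("u", (("y", Fraction(1)),)): Fraction(1, 2),
}
```

In this example `integrate(dis(mu))` returns `mu`, and `dis(integrate(on_graph))` returns `on_graph`, whereas `dis(integrate(off_graph))` differs from `off_graph`: the composite $\operatorname{dis}\circ\operatorname{int}$ is the identity only on measures concentrated on a graph.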

2.1.3. Processes which take values in different spaces at different times

Already in the introduction, in Section 1.8.1, we found it convenient to extend the definition of $\mathcal{A}\mathcal{W}_{p}$ to products of not necessarily equal Polish spaces ‘in the obvious way’. To accommodate such reapplications of concepts, we make the minor generalization of letting all the processes we talk about take values in different spaces at different times: typically at time $t$ they will take values in a space $\mathcal{X}_{t}$.

Denote by $\overline{\mathcal{X}}_{j}^{k}:=\prod_{i=j}^{k}\mathcal{X}_{i}$ and define $\overline{\mathcal{X}}:=\overline{\mathcal{X}}_{1}^{N}$, $\overline{\mathcal{X}}^{k}:=\overline{\mathcal{X}}_{1}^{k}$, $\overline{\mathcal{X}}_{j}:=\overline{\mathcal{X}}_{j}^{N}$.

3. Hellwig’s $\mathcal{W}_{p}$ -information topology is equal to the topology induced by $\mathcal{A}\mathcal{W}_{p}$

In this section we show $\ref{it:Hellwig}=\ref{it:AW}$ in Theorem 1.3. We will do so by identifying both topologies as initial topologies w.r.t. a single map each, i.e. finding a space which is homeomorphic to $\mathscr{P}\!\left(\overline{\mathcal{X}}\right)$ with Hellwig’s ($\mathcal{W}_{p}$-)information topology and one which is homeomorphic to $\mathscr{P}\!\left(\overline{\mathcal{X}}\right)$ with the topology induced by $\mathcal{A}\mathcal{W}_{p}$, and then showing that these spaces are homeomorphic in the right way. As an auxiliary tool we will introduce another topology on $\mathscr{P}\!\left(\overline{\mathcal{X}}\right)$ which was not mentioned in the introduction, but which is very similar to Hellwig’s. The proof strategy can be summarized by saying that we want to show that the following diagram is commutative.

[TABLE]

Here $\mathcal{N}$ is the map which induces the same topology as $\mathcal{A}\mathcal{W}_{p}$ , $\mathcal{I}$ induces Hellwig’s topology and $\mathcal{I}^{\prime}$ induces what we will call the reduced information topology. We shortly restate their definitions below.

Since these mappings are injective, the definition of the initial topology implies that each of them is a homeomorphism onto its image. To be precise, $\mathcal{N}$ is a homeomorphism from $\mathscr{P}\!\left(\overline{\mathcal{X}}\right)$ with the topology induced by $\mathcal{A}\mathcal{W}_{p}$ onto $\mathcal{N}[\mathscr{P}\!\left(\overline{\mathcal{X}}\right)]$ (cf. Theorem 1.4), $\mathcal{I}$ is a homeomorphism from $\mathscr{P}\!\left(\overline{\mathcal{X}}\right)$ with the information topology onto $\mathcal{I}[\mathscr{P}\!\left(\overline{\mathcal{X}}\right)]$, and $\mathcal{I}^{\prime}$ is a homeomorphism from $\mathscr{P}\!\left(\overline{\mathcal{X}}\right)$ with the reduced information topology onto $\mathcal{I}^{\prime}[\mathscr{P}\!\left(\overline{\mathcal{X}}\right)]$.

The maps $\mathcal{K}$ , $\mathcal{M}$ , $\mathcal{H}$ are still to be found.

As introduced in Section 1.3 Hellwig’s ( $\mathcal{W}_{p}$ -)information topology is induced by a family of maps $\mathcal{I}_{t}$ , given by:

[TABLE]

Equivalently, the information topology is the initial topology w.r.t. the map

[TABLE]

We saw in Section 1.8 that $\mathcal{A}\mathcal{W}_{p}$ is induced by an embedding $\mathcal{N}:\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)\to\mathscr{P}\!\left(\mathcal{X}_{1:N}\right)$ . Rephrasing the definition there, $\mathcal{N}$ is obtained by defining recursively from $t=N-1$ to $t=1$ :

[TABLE]

In fact, because $\operatorname{dis}$ maps into the space of measures concentrated on the graph of a function, $\mathcal{N}$ also maps into a smaller space, which we call $\mathcal{F}_{1}$ , and which is again defined by recursion down from $N-1$ to $1$ :

[TABLE]

I.e. $\mathcal{F}_{1}$ is $\mathscr{P}\!\left(\mathcal{X}_{1:N}\right)$ with all occurrences of $\mathscr{P}\!\left(\cdot\times\cdot\right)$ replaced by $\mathscr{F}\left(\cdot\rightsquigarrow\cdot\right)$. Remember that we had

[TABLE]

For convenience, let us also define

[TABLE]

The fact that

[TABLE]

and that therefore $\mathcal{N}$ maps into $\mathcal{F}_{1}$ is a consequence of Lemma 3.1 below.

Finally, $\mathcal{I}^{\prime}$ is defined as follows

[TABLE]

I.e. the reduced information topology, like the information topology, renders continuous the predictions about the behaviour of the process after time $t$ given information about its behaviour up to time $t$; only now we predict just what the process will do in the next step, not for the rest of time.

$\mathcal{I}$, $\mathcal{I}^{\prime}$ and $\mathcal{N}$ are injective and therefore bijections onto their images. This means that the values of the maps $\mathcal{K}$, $\mathcal{M}$, $\mathcal{H}$ in diagram (14), as functions between sets, are already prescribed. The task consists in finding a representation for them which makes it clear that they are continuous.

Lemma 3.1.

$\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}\times\mathcal{Y}}$ restricted to $\mathscr{F}\left(\mathcal{A}\times\mathcal{B}\rightsquigarrow\mathcal{Y}\right)$ maps onto $\mathscr{F}\big(\mathcal{A}\rightsquigarrow\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{Y}\right)\big)$.

Proof.

We first show that it maps into $\mathscr{F}\big{(}\mathcal{A}\rightsquigarrow\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{Y}\right)\big{)}$ . Let $\nu\in\mathscr{F}\big{(}\mathcal{A}\times\mathcal{B}\rightsquigarrow\mathcal{Y}\big{)}$ and let $g:\mathcal{A}\times\mathcal{B}\rightarrow\mathcal{Y}$ be a function witnessing this fact, i.e. $\nu(f)=\int f(a,b,g(a,b))\,\mathrm{d}\nu(a,b,\_)$ .

Let $\alpha:=\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}\times\mathcal{Y}}(\nu)$ . Then

[TABLE]

This means that for $\alpha$ -a.a. $(a,\beta)$ we have $\int 1_{g(a,b)\neq y}\,\mathrm{d}\beta(b,y)=0$ , i.e. $\beta$ is concentrated on the graph of the function $b\mapsto g(a,b)$ .

To see that any $\alpha\in\mathscr{F}\big{(}\mathcal{A}\rightsquigarrow\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{Y}\right)\big{)}$ can be obtained as the image of some $\nu\in\mathscr{F}\left(\mathcal{A}\times\mathcal{B}\rightsquigarrow\mathcal{Y}\right)$ under $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}\times\mathcal{Y}}$ , note that for such $\alpha$ , by the existence of measurably dependent (classical) disintegrations (see for example [15, Proposition 7.27]), $\nu:=\operatorname{int}_{\mathcal{A}}^{\mathcal{B}\times\mathcal{Y}}(\alpha)\in\mathscr{F}\left(\mathcal{A}\times\mathcal{B}\rightsquigarrow\mathcal{Y}\right)$ , and $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}\times\mathcal{Y}}(\nu)=\alpha$ . ∎

3.1. Homeomorphisms

We give a plain language description of what follows in this section:

The continuity of $\mathcal{M}$ will be quite trivial, because we are just discarding information.

The components $\mathcal{K}_{k}:\mathcal{F}_{1}\rightarrow\mathscr{F}\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{k}\rightsquigarrow\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu_{\mkern-3.0muk+1}\right)\right)$ of the map $\mathcal{K}$ are obtained by ‘folding’ both the ‘head’ and the ‘tail’ of $\mathcal{F}_{1}$ using iterated application of the map $\operatorname{int}$ .

[TABLE]

By continuity of $\operatorname{int}$ , it’s easy to see that $\mathcal{K}_{k}$ is continuous. To show that the map $\mathcal{K}$ with the components $\mathcal{K}_{k}$ is the map we are looking for, we basically show that

[TABLE]

$\mathcal{N}^{-1}$ is again another way of ‘folding’ all of $\mathcal{F}_{1}$ using $\operatorname{int}$ to arrive at $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ . As $\mathcal{I}^{-1}$ is also $\operatorname{int}$ , showing (15) amounts to showing that these two different ways of ‘folding’ – first the head and tail and then in a last step the junction between $k$ and $k+1$ on the one hand, and from front to back on the other hand – do the same thing. This may be intuitively clear to the reader. The proof works by repeated application of Lemma 3.5, which represents one step of ‘folding order doesn’t matter’. Using Lemma 3.5 the proof is completely analogous to the proof that for an operation $\star$ satisfying $(a\star b)\star c=a\star(b\star c)$ , i.e. for an associative operation, one has

[TABLE]

As we know, for such an operation any way of parenthesizing the multiplication of $N$ elements gives the same result. An analogous statement holds for $\operatorname{int}$ , though we do not formally state or prove this.

Finally, in Lemma 3.8, using Lemma 3.7 as the main ingredient we prove the ‘hard direction’, i.e. that $\mathcal{H}$ is continuous. If the continuity of $\mathcal{M}$ and $\mathcal{K}$ as informally described here seem obvious to the reader they may wish to skip ahead to Lemma 3.7 and Lemma 3.8.

Remark 3.2.

The reader interested in working out the details and analogies between ‘folding’ using $\operatorname{int}$ and associative binary operations might be interested in reading about monads in the context of Category Theory first. (See for example Chapter VI in [47].) In fact, $(\mathscr{P},\bm{\eta},\bm{\mu})$ forms a monad, where

$$\bm{\eta}_{\mathcal{X}}:\mathcal{X}\rightarrow\mathscr{P}\!\left(\mathcal{X}\right)$$

sends an element $x$ of $\mathcal{X}$ to the Dirac measure at $x$ and

[TABLE]

This monad is studied in a little more detail in [29]. $\operatorname{int}$ can be obtained from $\bm{\mu}$ and a tensorial strength $t_{\mathcal{A},\mathcal{B}}:\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\rightarrow\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ in the sense described for example in [49].
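For readers who prefer a computational picture, the monad laws can be verified on finitely supported measures. The sketch below is only an illustration: measures are encoded as sorted tuples of (point, mass) pairs so that they can themselves serve as points, and the names `unit`, `join` and `lift` are our stand-ins for $\bm{\eta}$, $\bm{\mu}$ and $\mathscr{P}(f)$.

```python
from fractions import Fraction

def unit(x):
    """eta: the Dirac measure at x."""
    return ((x, Fraction(1)),)

def join(big):
    """mu: average a measure on measures into a single measure."""
    out = {}
    for inner, weight in big:
        for x, mass in inner:
            out[x] = out.get(x, 0) + weight * mass
    return tuple(sorted(out.items()))

def lift(f, measure):
    """P(f): pushforward of a measure under f."""
    out = {}
    for x, mass in measure:
        y = f(x)
        out[y] = out.get(y, 0) + mass
    return tuple(sorted(out.items()))

half = Fraction(1, 2)
nu = ((0, Fraction(1, 3)), (1, Fraction(2, 3)))   # an element of P(X)
m1 = ((nu, half), (unit(2), half))                # an element of P(P(X))
big = ((m1, half), (unit(unit(5)), half))         # an element of P(P(P(X)))

left_unit = join(unit(nu))          # mu o eta_{P(X)}
right_unit = join(lift(unit, nu))   # mu o P(eta)
assoc_a = join(join(big))           # mu o mu_{P(X)}
assoc_b = join(lift(join, big))     # mu o P(mu)
```

Both unit laws return `nu`, and the two ways of flattening `big` agree, which is the associativity that the ‘folding’ arguments of this section exploit.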

To show that $\mathcal{M}$ is continuous we will need the following lemma.

Lemma 3.3.

$\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}$ is natural in $\mathcal{B}$, i.e. for $f:\mathcal{B}\rightarrow\mathcal{B}^{\prime}$ the following diagram commutes.

[TABLE]

Proof.

This is just a straightforward calculation using the definitions. ∎

Applying Lemma 3.3 with $\mathcal{A}={}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{k}$ , $\mathcal{B}={}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu_{\mkern-3.0muk+1}$ , $\mathcal{B}^{\prime}=\mathcal{X}_{k+1}$ and $f=\operatorname{proj}_{\mathcal{X}_{k+1}}:{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu_{\mkern-3.0muk+1}\rightarrow\mathcal{X}_{k+1}$ we get that

[TABLE]

Setting $\mathcal{M}_{k}:=\mathscr{P}\!\left(1_{{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{k}}\times\mathscr{P}\!\left(\operatorname{proj}_{\mathcal{X}_{k+1}}\right)\right)$ we get $\mathcal{I}^{\prime}_{k}=\mathcal{M}_{k}\circ\mathcal{I}_{k}$ and then setting $\mathcal{M}((\mu_{k})_{k}):=(\mathcal{M}_{k}(\mu_{k}))_{k}$ gives $\mathcal{I}^{\prime}=\mathcal{M}\circ\mathcal{I}$ .
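The identity $\mathcal{I}^{\prime}_{k}=\mathcal{M}_{k}\circ\mathcal{I}_{k}$ can be checked directly on a small three-step path measure with $k=1$. The sketch below assumes, as the surrounding text suggests, that $\mathcal{I}_{1}$ disintegrates w.r.t. the first coordinate and that $\mathcal{I}^{\prime}_{1}$ disintegrates the marginal on the first two coordinates; the encoding and helper names are ours.

```python
from fractions import Fraction

def pushforward(f, mu):
    """P(f) for a finitely supported measure given as dict point -> mass."""
    out = {}
    for x, m in mu.items():
        y = f(x)
        out[y] = out.get(y, 0) + m
    return out

def dis(mu):
    """dis_A^B for mu on A x B given as dict (a, b) -> mass; kernels are sorted tuples."""
    marg = {}
    for (a, b), m in mu.items():
        marg[a] = marg.get(a, 0) + m
    return {
        (a, tuple(sorted((b, m / ma) for (a2, b), m in mu.items() if a2 == a))): ma
        for a, ma in marg.items()
    }

def push_kernel(f, kernel):
    """P(f) acting on a kernel encoded as a sorted tuple of (point, probability)."""
    return tuple(sorted(pushforward(f, dict(kernel)).items()))

# A three-step path measure, written on A x B with A = X_1 and B = X_2 x X_3.
quarter = Fraction(1, 4)
nu = {
    ("u", ("u", "u")): quarter, ("u", ("u", "d")): quarter,
    ("u", ("d", "d")): quarter, ("d", ("d", "d")): quarter,
}
first = lambda b: b[0]   # proj_{X_2} : X_2 x X_3 -> X_2

info = dis(nu)                                                    # I_1(nu)
reduced_direct = dis(pushforward(lambda ab: (ab[0], first(ab[1])), nu))
reduced_via_m = pushforward(lambda ak: (ak[0], push_kernel(first, ak[1])), info)
```

Here `reduced_direct` plays the role of $\mathcal{I}^{\prime}_{1}(\nu)$ and `reduced_via_m` that of $\mathcal{M}_{1}(\mathcal{I}_{1}(\nu))$; the two dictionaries coincide, as Lemma 3.3 predicts.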

There is an analogue of Lemma 3.3 which we list here for completeness.

Lemma 3.4.

$\operatorname{int}_{\mathcal{A}}^{\mathcal{B}}:\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)\rightarrow\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ is natural in $\mathcal{B}$, i.e. for $f:\mathcal{B}\rightarrow\mathcal{B}^{\prime}$ the following diagram commutes:

[TABLE]

In particular, if $\mathcal{B}\subseteq\mathcal{B}^{\prime}$ then

[TABLE]

if we regard $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)$ as a subset of $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}^{\prime}\right)\right)$ by recursively using the recipe: ‘if $\mathcal{B}$ is a subset of $\mathcal{B}^{\prime}$ , then we can view $\mathscr{P}\!\left(\mathcal{B}\right)$ as the subset of those $\mu\in\mathscr{P}\!\left(\mathcal{B}^{\prime}\right)$ which are concentrated on $\mathcal{B}$ ’.

Proof.

Again this is just calculation. ∎

We already implicitly used the ‘in particular’-part of Lemma 3.4 when we said that $\mathcal{N}$ can be regarded both as a map into $\mathscr{P}\!\left(\mathcal{X}_{1:N}\right)$ and into $\mathcal{F}_{1}$, but the use there seemed too trivial to warrant much mention. There will be more such tacit uses.

Now we show that $\mathcal{K}$ is continuous. We claim that it can be written as

[TABLE]

where

[TABLE]

or without the dots, letting $\mathop{{\circ}\mkern-14mu{\prod}}$ denote composition of functions, e.g. $\mathop{{\circ}\mkern-14mu{\prod}}_{i=3}^{1}f_{i}=f_{3}\circ f_{2}\circ f_{1}$:

[TABLE]

To prove this we will repeatedly apply the following lemma.

Lemma 3.5 ( $\operatorname{int}$ is ‘associative’).

$\operatorname{int}$ satisfies the following relation:

[TABLE]

These maps can be seen in the following commutative diagram.

[TABLE]

Proof.

This is just expanding the definition. Both maps send a measure $\alpha\in\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\times\mathscr{P}\!\left(\mathcal{C}\right)\right)\right)$ to the measure $\mu$ with

[TABLE]

∎
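The ‘associativity’ of $\operatorname{int}$ can be illustrated on a small finite example. The sketch below assumes that the two maps being compared are ‘fold the junction between $\mathcal{B}$ and $\mathcal{C}$ first, then the one between $\mathcal{A}$ and $\mathcal{B}\times\mathcal{C}$’ and ‘fold the junction between $\mathcal{A}$ and $\mathcal{B}\times\mathscr{P}(\mathcal{C})$ first, then the one between $\mathcal{A}\times\mathcal{B}$ and $\mathcal{C}$’. Measures are encoded as dictionaries or, when they occur as points, as tuples of (point, mass) pairs; all names are ours.

```python
from fractions import Fraction

def integrate(alpha):
    """int: dict (a, kernel) -> mass, kernel a tuple of (point, prob);
    returns dict (a, point) -> mass."""
    out = {}
    for (a, kernel), m in alpha.items():
        for b, q in kernel:
            out[(a, b)] = out.get((a, b), 0) + m * q
    return out

def as_kernel(mu):
    """Freeze a dict measure into a hashable sorted tuple of (point, mass)."""
    return tuple(sorted(mu.items()))

half, third = Fraction(1, 2), Fraction(1, 3)
gamma1 = (("c1", half), ("c2", half))                           # in P(C)
gamma2 = (("c1", Fraction(1)),)
beta1 = ((("b1", gamma1), third), (("b2", gamma2), 1 - third))  # in P(B x P(C))
beta2 = ((("b1", gamma2), Fraction(1)),)
alpha = {("a1", beta1): half, ("a2", beta2): half}              # in P(A x P(B x P(C)))

# Route 1: fold the inner junction (B, C) first, then the junction (A, B x C).
inner_folded = {}
for (a, beta), m in alpha.items():
    flat = as_kernel(integrate({bg: q for bg, q in beta}))
    inner_folded[(a, flat)] = inner_folded.get((a, flat), 0) + m
route1 = {(a, b, c): m for (a, (b, c)), m in integrate(inner_folded).items()}

# Route 2: fold the junction (A, B x P(C)) first, then the junction (A x B, C).
outer_folded = {}
for (a, (b, gamma)), m in integrate(alpha).items():
    outer_folded[((a, b), gamma)] = outer_folded.get(((a, b), gamma), 0) + m
route2 = {(a, b, c): m for ((a, b), c), m in integrate(outer_folded).items()}
```

Both routes produce the same measure on $\mathcal{A}\times\mathcal{B}\times\mathcal{C}$, in line with the triple-integral description in the proof.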

Lemma 3.6.

The following relation holds.

[TABLE]

Proof.

Again, this is just repeated application of Lemma 3.5. Below we define $\mathcal{T}_{l}$ for $N\geq l\geq k$ and show that

[TABLE]

for all $N\geq l\geq k$ by showing $\mathcal{T}_{l}=\mathcal{T}_{l-1}$ for all $N\geq l>k$. The left hand side of (17) is the left hand side of (16) with the common tail $\mathop{{\circ}\mkern-14mu{\prod}}_{i=k-1}^{1}\operatorname{int}_{\overline{\mathcal{X}}^{i}}^{\mathcal{X}_{i+1:N}}$ of the left and right side in (16) dropped. $\mathcal{T}_{k}$ will be the right hand side of (16) with the common part dropped.

[TABLE]

Here we regard $\mathop{{\circ}\mkern-14mu{\prod}}_{r}^{s}\dots$ with $r<s$ (an empty product in our context) as the identity function. For $l=N$ the first factor is an empty product and therefore clearly (17) is true for $l=N$. To get from $\mathcal{T}_{l}$ to $\mathcal{T}_{l-1}$ we leave the first factor alone and apply Lemma 3.5 with $\mathcal{A}=\overline{\mathcal{X}}^{k}$, $\mathcal{B}=\overline{\mathcal{X}}_{k+1}^{l-1}$ and $\mathcal{C}=\mathcal{X}_{l:N}$. This transforms

[TABLE]

into

[TABLE]

and therefore $\mathcal{T}_{l}$ into $\mathcal{T}_{l-1}$ . ∎

Lemma 3.7.

The right hand triangle in (14) commutes, i.e.

[TABLE]

Proof.

Prepending $\mathcal{N}$ to (16) gives

[TABLE]

and appending $\mathcal{I}_{k}$ gives

[TABLE]

∎

Now we will show that $\mathcal{H}$ is continuous. We postpone the proof of Lemma 3.7 below, which is the crucial non-bookkeeping ingredient in the proof of Lemma 3.8 below, until the end of this section. The methods used in the proof of Lemma 3.7 differ significantly from those in the rest of this section: they make use of the concept of the modulus of continuity for measures, and of results relating to it, introduced in the companion paper [24].

Lemma 3.7.

Let

[TABLE]

be the set of all $(\mu^{\prime},\mu)$ s.t.

[TABLE]

The function

[TABLE]

is continuous.

Clearly, as a function between sets, $\mathcal{J}_{\mathcal{A},\mathcal{B}}^{\mathcal{Y}}(\mu^{\prime},\mu)$ only depends on $\mu$ . But, as we know, $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}\times\mathcal{Y}}$ is not continuous. Only when we refine the topology on the source space, which we encode by regarding $\mathcal{J}_{\mathcal{A},\mathcal{B}}^{\mathcal{Y}}$ as a map from the above subset of a product space, does it become continuous.

Lemma 3.8.

$\mathcal{H}$ is continuous.

Proof.

We will inductively define

[TABLE]

(again down from $N-1$ to $1$ ) so that they will be continuous by construction (and by virtue of Lemma 3.7). Also by construction, we will have $\mathcal{H}^{k}\circ\mathcal{I}^{\prime}=\mathcal{N}^{k}$ . $\mathcal{H}$ will be $\mathcal{H}^{1}$ so that $\mathcal{H}\circ\mathcal{I}^{\prime}=\mathcal{N}$ .

Set $\mathcal{H}^{N-1}:=\operatorname{proj}_{N-1}$ , the projection from $\prod_{k=1}^{N-1}\mathscr{F}\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{k}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{k+1}\right)\right)$ onto the last factor. $\mathcal{H}^{N-1}\circ\mathcal{I}^{\prime}=\mathcal{I}^{\prime}_{N-1}=\operatorname{dis}_{{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{N-1}}^{\mathcal{X}_{N}}=\mathcal{N}^{N-1}$ by definition. Given $\mathcal{H}^{k+1}$ define

[TABLE]

where $\operatorname{proj}_{k}$ is the projection from $\prod_{k=1}^{N-1}\mathscr{F}\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{k}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{k+1}\right)\right)$ onto the $k$ -th factor.

For this to be well-defined we need to check that for $\mu\in\mathcal{I}^{\prime}\left[\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)\right]$ we have

[TABLE]

I.e. for $\nu\in\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ we want

[TABLE]

The composite of the maps on the left-hand side is equal to

[TABLE]

On the right-hand side we get by induction hypothesis

[TABLE]

Using that $\mathscr{P}\!\left(\operatorname{proj}_{\mathcal{A}}\right)\circ\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}=\mathscr{P}\!\left(\operatorname{proj}_{\mathcal{A}}\right)$ we see for $l\geq k+1$

[TABLE]

i.e. by induction (19) is also equal to $\mathscr{P}\!\left(\operatorname{proj}_{{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{k+1}}\right)$ .

As a composite of continuous maps, $\mathcal{H}^{k}$ is continuous. (This is where we use Lemma 3.7.) As a map between sets, $\mathcal{H}^{k}$ is just

[TABLE]

by induction hypothesis and definition of $\mathcal{N}^{k}$ . ∎

3.2. Proof of Lemma 3.7

In this part we prove Lemma 3.7. Here we use several of the ideas developed in the companion paper [24]. In particular we will need [24, Lemma 4.2] which we reproduce below.

Lemma 3.9 ([24, Lemma 4.2]).

Let $\mu\in\mathscr{F}\left(\mathcal{X}\rightsquigarrow\mathcal{Y}\right)$ . For any $\varepsilon>0$ there is a $\delta>0$ s.t. if

[TABLE]

then

[TABLE]

For easy reference we also restate Lemma 3.7.

Proof of Lemma 3.7.

Let $(\mu^{\prime},\mu)\in\operatorname{dom}\!(\mathcal{J}_{\mathcal{A},\mathcal{B}}^{\mathcal{Y}})$ . Let $\varepsilon>0$ .

Choose $\delta>0$ according to Lemma 3.9 with $\mathcal{X}=\mathcal{A}\times\mathcal{B}$ , i.e. s.t. for any $\nu\in\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\times\mathcal{Y}\right)$ with $\mathcal{W}_{p}\left(\mu,\nu\right)<\delta$ and any $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ with ${\textstyle\int}\rho(a_{1},a_{2})^{p}+\rho(b_{1},b_{2})^{p}\,\mathrm{d}\gamma(a_{1},b_{1},\_,a_{2},b_{2},\_)<\delta^{p}$ we have ${\textstyle\int}\rho(y_{1},y_{2})^{p}\,\mathrm{d}\gamma(\_,y_{1},\_,y_{2})<\varepsilon^{p}$ .

Let $(\nu^{\prime},\nu)\in\operatorname{dom}\!(\mathcal{J}_{\mathcal{A},\mathcal{B}}^{\mathcal{Y}})$ with $\max(\rho(\mu,\nu),\rho(\mu^{\prime},\nu^{\prime}))<\min(\delta,\varepsilon)$ .

This means we can find $\gamma^{\prime}\in\operatorname{Cpl}\left(\mu^{\prime},\nu^{\prime}\right)$ with

[TABLE]

Let $(a,b)\mapsto f_{a}(b)$ and $(a,b)\mapsto g_{a}(b):\mathcal{A}\times\mathcal{B}\rightarrow\mathcal{Y}$ be measurable functions on whose graphs $\mu$ and $\nu$ , respectively, are concentrated. Let $\bar{\mu}:=\mathcal{J}_{\mathcal{A},\mathcal{B}}^{\mathcal{Y}}(\mu^{\prime},\mu)$ , $\bar{\nu}:=\mathcal{J}_{\mathcal{A},\mathcal{B}}^{\mathcal{Y}}(\nu^{\prime},\nu)$ .

As noted in the proof of Lemma 3.1 we know that for $\bar{\mu}$ -a.a. $(a,\dot{\mu})$ the measure $\dot{\mu}$ is concentrated on the graph of the function $f_{a}$ (and similarly for $\bar{\nu}$ ). This together with $\mathscr{P}\!\left(1_{\mathcal{A}}\times\mathscr{P}\!\left(\operatorname{proj}_{\mathcal{B}}\right)\right)(\bar{\mu})=\mu^{\prime}$ (which is a consequence of (18)) implies that

[TABLE]

(again similarly for $\bar{\nu}$ ).

From this we see that the measure $\bar{\gamma}\in\mathscr{P}\!\left(\mathcal{A}\times\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{Y}\right)\times\mathcal{A}\times\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{Y}\right)\right)$ defined as

[TABLE]

is in $\operatorname{Cpl}\left(\bar{\mu},\bar{\nu}\right)$ .

We may measurably select almost-witnesses $\hat{\gamma}_{\hat{b}_{1},\hat{b}_{2}}\in\operatorname{Cpl}(\hat{b}_{1},\hat{b}_{2})$ for the distances $\mathcal{W}_{p}(\hat{b}_{1},\hat{b}_{2})$ s.t. building on (20) we get

[TABLE]

Now

[TABLE]

where $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ is defined as

[TABLE]

The integral over the first two summands in (22) is less than $\min(\delta^{p},\varepsilon^{p})$ by (21). By our choice of $\delta$ in the beginning this implies that the integral over the last summand is also less than $\varepsilon^{p}$ , so that overall

[TABLE]

As $\varepsilon$ was arbitrary, this concludes the proof. ∎

4. The symmetrized causal Wasserstein distance $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$

In this section we prove that the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ is sandwiched between Hellwig’s $\mathcal{W}_{p}$ -information topology and the topology induced by $\mathcal{A}\mathcal{W}_{p}$ , and is therefore, by what we have already seen in the previous section, equal to both of them. Our arguments in this section make explicit use of metrics. The reader who is only interested in the simpler version of our main theorem, Theorem 1.2, may assume that $p=1$ and that all metrics are bounded.

Remember that for $\mu,\nu\in\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ we have

[TABLE]

In proving this we will take a slightly roundabout route. First we will focus on the case where ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu=\mathcal{X}_{1}\times\mathcal{X}_{2}$ is the product of just two spaces, i.e. where we have only two time points. Moreover, for expositional purposes, let us for the moment assume that $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$ are both compact. Generalizing from this setting will not be very hard.

In the compact, two-time-point case we will show equality of the two topologies in question by extending both to a larger (compact) space and showing equality of the topologies on that larger space.

In more detail:

When there are only two time points, Hellwig’s $\mathcal{W}_{p}$ -information topology and the topology induced by $\mathcal{A}\mathcal{W}_{p}$ trivially coincide. Both are induced by embedding $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathcal{X}_{2}\right)$ into $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ via $\operatorname{dis}_{\mathcal{X}_{1}}^{\mathcal{X}_{2}}$ . The latter space carries its standard metric $\rho_{\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)}$ , which – as was already established in Theorem 1.4 in Section 1.8 of the introduction – is an extension of $\mathcal{A}\mathcal{W}_{p}$ . To highlight this connection, in this section we will also refer to that metric as ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ . As a reminder,

[TABLE]

where $\mathcal{W}_{p}$ is the normal Wasserstein distance (on $\mathscr{P}\!\left(\mathcal{X}_{2}\right)$ in this case). We will find an extension ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ of $\mathcal{C}\mathcal{W}_{p}$ to $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ , which still satisfies all properties of a metric except for symmetry and which is dominated by ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ . Symmetrizing this extension gives a metric (which we will call ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ ). The identity function from $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ topologized with ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ to $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ topologized with ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ will then be a continuous bijection from a compact space (this is where we use compactness of $\mathcal{X}_{1}$ , $\mathcal{X}_{2}$ ) to a Hausdorff space, i.e. a homeomorphism.
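The last step rests on a standard fact from general topology, which we recall as a brief sketch. The names $K$, $H$, $\iota$ and $C$ are introduced here only for illustration and are not notation from the paper: $K$ stands for the compact source space, $H$ for the Hausdorff target space, and $\iota:K\rightarrow H$ for a continuous bijection.

$$
C\subseteq K\text{ closed}\;\Longrightarrow\;C\text{ compact}\;\Longrightarrow\;\iota(C)\text{ compact}\;\Longrightarrow\;\iota(C)\subseteq H\text{ closed}.
$$

The first implication uses compactness of $K$ , the second uses continuity of $\iota$ , and the third uses that $H$ is Hausdorff. Hence $\iota$ maps closed sets to closed sets, which for a bijection is the same as saying that $\iota^{-1}$ is continuous.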

The next subsection will be devoted to finding an expression for the extension of $\mathcal{C}\mathcal{W}_{p}$ to $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ and proving that it satisfies all the properties mentioned above.

Remark 4.1.

When $\mathcal{X}_{1}$ contains no isolated points, because $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ is the metric completion of $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathcal{X}_{2}\right)$ w.r.t. $\mathcal{A}\mathcal{W}_{p}$ and because the above properties imply that $\mathcal{C}\mathcal{W}_{p}$ is (uniformly) continuous w.r.t. $\mathcal{A}\mathcal{W}_{p}$ , we have already uniquely identified ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ . Still, we want to find an expression that allows us to work with ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ and in particular that allows us to prove that ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ is a metric and not just a pseudometric, i.e. that the induced topology is in fact Hausdorff. This is exactly what we gain from assuming compact base spaces and passing to the completion: instead of having to find a lower bound for $\mathcal{S}\mathcal{C}\mathcal{W}_{p}\left(\mu,\nu\right)$ in terms of $\mathcal{A}\mathcal{W}_{p}\left(\mu,\nu\right)$ (and possibly $\mu$ ) we now just have to prove that if $\mu\neq\nu$ then ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu\left(\mu,\nu\right)>0$ .

For definiteness we note that we do not assume compactness of any space in the following.

4.1. Extending the causal ‘distance’

So now we are working with two Polish metric spaces $\mathcal{X}_{1}$ , $\mathcal{X}_{2}$ . Remember that we denote the ‘canonical process’ on ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu:=\mathcal{X}_{1}\times\mathcal{X}_{2}$ by $(X_{i})_{i=1,2}$ , i.e. $X_{i}:{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\rightarrow\mathcal{X}_{i}$ is the projection onto the $i$ -th coordinate.

To differentiate between the different roles that ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu$ may play (i.e. whether it is the space for the left measure $\mu$ or the right measure $\nu$ when measuring the ‘distance’ $\mathcal{C}\mathcal{W}_{p}\left(\mu,\nu\right)$ ), we will also refer to ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu$ , $\mathcal{X}_{i}$ by the aliases ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{Y}\mkern-0.5mu}\mkern 0.5mu$ , $\mathcal{Y}_{i}$ respectively. (And later ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{Z}\mkern-0.5mu}\mkern 0.5mu$ , $\mathcal{Z}_{i}$ as well.) Analogously, we have $Y_{i}:{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{Y}\mkern-0.5mu}\mkern 0.5mu\rightarrow\mathcal{Y}_{i}$ . (And $Z_{i}:{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{Z}\mkern-0.5mu}\mkern 0.5mu\rightarrow\mathcal{Z}_{i}$ .)

In this section we will repeatedly make use of the following construction:

Definition 4.2.

Let $\mathcal{A}$ , $\mathcal{B}$ , $\mathcal{C}$ be Polish metric spaces. Let $\mu\in\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ and $\nu\in\mathscr{P}\!\left(\mathcal{B}\times\mathcal{C}\right)$ with $\mu_{\restriction\mathcal{B}}=\nu_{\restriction\mathcal{B}}$ . We define

[TABLE]

as the measure given by

[TABLE]

where $b\mapsto\nu_{b}$ is a disintegration of $\nu$ w.r.t. $\mathcal{B}$ and similarly for $\mu$ .

We further define

[TABLE]

Remark 4.3.

If $\mu$ is a probability on $\mathcal{A}\times\mathcal{B}$ and $\nu$ is a probability on $\mathcal{B}\times\mathcal{C}$ , another way of saying what $\mu\mathbin{\underset{\mathcal{B}}{\otimes}}\nu$ is, is to state that it is a probability on $\mathcal{A}\times\mathcal{B}\times\mathcal{C}$ s.t. the law of $(A,B)$ is equal to $\mu$ , the law of $(B,C)$ is equal to $\nu$ (where per our convention $A$ is the projection onto $\mathcal{A}$ , etc.), and $A$ is conditionally independent from $C$ given $B$ . (For the notion of conditional independence see for example [22, Definition II.43].)

Another helpful intuition comes from looking at the case where $\mu\in\mathscr{F}\left(\mathcal{A}\rightsquigarrow\mathcal{B}\right)$ is concentrated on the graph of some measurable function $f:\mathcal{A}\rightarrow\mathcal{B}$ and $\nu\in\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{C}\right)$ is concentrated on the graph of a measurable function $g:\mathcal{B}\rightarrow\mathcal{C}$ . $\mu\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$ \scriptscriptstyle\mathrm{o} $\hfil\cr\hfil$ \scriptscriptstyle\mathrm{9} $\hfil}}}_{\mathcal{B}}\nu$ is then concentrated on the graph of $g\circ f:\mathcal{A}\rightarrow\mathcal{C}$ . In some contexts $g\circ f$ is also written as $f\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$ \scriptscriptstyle\mathrm{o} $\hfil\cr\hfil$ \scriptscriptstyle\mathrm{9} $\hfil}}}g$ , which is where we borrowed the symbol from.
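For finitely supported measures both descriptions in this remark can be checked by direct computation. The following is a minimal sketch under stated assumptions, not the paper's construction: the helper names `marginal`, `join` and `compose` are hypothetical, measures are dictionaries with strictly positive masses, the join is computed as $\eta(a,b,c)=\mu(a,b)\,\nu(b,c)/m(b)$ with $m$ the common $\mathcal{B}$ -marginal (the discrete form of conditional independence of $A$ and $C$ given $B$ ), and the composition is assumed to be the $\mathcal{A}\times\mathcal{C}$ -marginal of the join.

```python
from collections import defaultdict
from fractions import Fraction as F


def marginal(measure, coords):
    """Marginal of a finitely supported measure on the listed coordinates."""
    out = defaultdict(F)
    for point, mass in measure.items():
        out[tuple(point[i] for i in coords)] += mass
    return dict(out)


def join(mu, nu):
    """Join mu on A x B and nu on B x C along B (discrete case)."""
    m = marginal(mu, [1])
    assert m == marginal(nu, [0]), "B-marginals of mu and nu must agree"
    return {
        (a, b, c): p * q / m[(b,)]
        for (a, b), p in mu.items()
        for (b2, c), q in nu.items()
        if b2 == b
    }


def compose(mu, nu):
    """Assumed composition: the A x C marginal of the join."""
    return marginal(join(mu, nu), [0, 2])


# Generic example: the B-marginal of both measures is b0 -> 3/4, b1 -> 1/4.
mu = {("a0", "b0"): F(1, 4), ("a0", "b1"): F(1, 4), ("a1", "b0"): F(1, 2)}
nu = {("b0", "c0"): F(3, 8), ("b0", "c1"): F(3, 8), ("b1", "c0"): F(1, 4)}
eta = join(mu, nu)

# Graph example: mu_f sits on the graph of f, nu_g on the graph of g.
f = {"a0": "b0", "a1": "b0", "a2": "b1"}
g = {"b0": "c1", "b1": "c0"}
mu_f = {(a, f[a]): F(1, 3) for a in f}
nu_g = {("b0", "c1"): F(2, 3), ("b1", "c0"): F(1, 3)}
comp = compose(mu_f, nu_g)
```

In this sketch `eta` has marginals `mu` and `nu`, and `comp` is supported on the pairs `(a, g[f[a]])`, i.e. on the graph of $g\circ f$ . Exact rational arithmetic is used so that the marginal comparisons are not affected by rounding.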

Remark 4.4.

We will often encounter the situation that one of the factors $\mathcal{A}$ , $\mathcal{B}$ or $\mathcal{C}$ in Definition 4.2 is itself a product of spaces and the individual factors may not always be so nicely sorted. We will rely on naming in the subscript the space(s) along which to join the measures $\mu$ and $\nu$ . For example if $\mu\in\mathscr{P}\!\left(\mathcal{A}_{1}\times\mathcal{B}_{1}\times\mathcal{A}_{1}\times\mathcal{B}_{2}\right)$ and $\nu\in\mathscr{P}\!\left(\mathcal{B}_{2}\times\mathcal{C}_{1}\times\mathcal{B}_{1}\times\mathcal{C}_{2}\right)$ we might write

[TABLE]

to refer to the measure that we get when in (26) we use $(b_{1},b_{2})\in\mathcal{B}_{1}\times\mathcal{B}_{2}$ as the middle variable $b$ . We will not be systematic about the order of the factors in the resulting product space on which e.g. $\mu\mathbin{\underset{\mathcal{B}_{1},\mathcal{B}_{2}}{\otimes}}\nu$ is a measure, again relying on naming our spaces for disambiguation.

For future reference we paraphrase the definition of a causal transport plan given in (3) in the introduction.

Lemma 4.5.

Let $\mu$ be a measure on ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu=\mathcal{X}_{1}\times\mathcal{X}_{2}$ and $\nu$ be a measure on ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{Y}\mkern-0.5mu}\mkern 0.5mu=\mathcal{Y}_{1}\times\mathcal{Y}_{2}$ . $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ is a causal transference plan from $\mu$ to $\nu$ iff under $\gamma$

[TABLE]

Proof.

One way of formulating conditional independence is as in (3), see for example [22, Definition II.43, Theorem II.45]. ∎

In other words, $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ is a causal transference plan iff $\gamma_{\restriction\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{Y}_{1}}=\mu\mathbin{\underset{\mathcal{X}_{1}}{\otimes}}\gamma_{\restriction\mathcal{X}_{1},\mathcal{Y}_{1}}$ .
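On finite spaces this reformulation of causality becomes a finite list of equalities: writing $\mu_{1}$ for the $\mathcal{X}_{1}$ -marginal of $\mu$ , the plan is causal iff $\gamma_{\restriction\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{Y}_{1}}(x_{1},x_{2},y_{1})\,\mu_{1}(x_{1})=\mu(x_{1},x_{2})\,\gamma_{\restriction\mathcal{X}_{1},\mathcal{Y}_{1}}(x_{1},y_{1})$ for all $(x_{1},x_{2},y_{1})$ . The sketch below tests this on two couplings of two independent fair coin flips with themselves. The function names `marginal` and `is_causal` and both example couplings are illustrative assumptions, not objects from the paper.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product


def marginal(measure, coords):
    """Marginal of a finitely supported measure on the listed coordinates."""
    out = defaultdict(F)
    for point, mass in measure.items():
        out[tuple(point[i] for i in coords)] += mass
    return dict(out)


def is_causal(gamma):
    """Check causality of a plan gamma on X1 x X2 x Y1 x Y2 (discrete case)."""
    g123 = marginal(gamma, [0, 1, 2])  # law of (X1, X2, Y1)
    g13 = marginal(gamma, [0, 2])      # law of (X1, Y1)
    mu = marginal(gamma, [0, 1])       # law of (X1, X2), the first marginal
    mu1 = marginal(gamma, [0])         # law of X1
    xs1 = {p[0] for p in gamma}
    xs2 = {p[1] for p in gamma}
    ys1 = {p[2] for p in gamma}
    # Y1 must be conditionally independent of X2 given X1.
    return all(
        g123.get((x1, x2, y1), 0) * mu1[(x1,)]
        == mu.get((x1, x2), 0) * g13.get((x1, y1), 0)
        for x1, x2, y1 in product(xs1, xs2, ys1)
    )


bits = (0, 1)
# Identity coupling: (Y1, Y2) = (X1, X2).
identity = {(x1, x2, x1, x2): F(1, 4) for x1 in bits for x2 in bits}
# Anticipating coupling: (Y1, Y2) = (X2, X1), so Y1 reveals X2 at time 1.
swapped = {(x1, x2, x2, x1): F(1, 4) for x1 in bits for x2 in bits}

causal_identity = is_causal(identity)
causal_swapped = is_causal(swapped)
```

Both dictionaries are couplings of the uniform measure on $\{0,1\}^{2}$ with itself, but only the identity coupling passes the check. In the swapped coupling $Y_{1}=X_{2}$ , so $Y_{1}$ is not conditionally independent of $X_{2}$ given $X_{1}$ .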

We start by reexpressing $\mathcal{C}\mathcal{W}_{p}$ in different ways until we find one which also makes sense in $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ .

Let $\mu\in\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ and $\nu\in\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{Y}\mkern-0.5mu}\mkern 0.5mu\right)$ . Then

[TABLE]

This is true because, on the one hand, a $\gamma\in C_{1}$ is clearly causal by Lemma 4.5 and the alternative characterization of $\mathbin{\underset{\mathcal{X}_{1}}{\otimes}}$ . On the other hand, given any causal $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ , again by Lemma 4.5, $\gamma_{\restriction\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{Y}_{1}}=\mu\mathbin{\underset{\mathcal{X}_{1}}{\otimes}}\gamma_{\restriction\mathcal{X}_{1},\mathcal{Y}_{1}}$ , and we may define $\gamma^{\prime}:=\left(\mu\mathbin{\underset{\mathcal{X}_{1}}{\otimes}}\gamma_{\restriction\mathcal{X}_{1},\mathcal{Y}_{1}}\right)\mathbin{\underset{\mathcal{X}_{2},\mathcal{Y}_{1}}{\otimes}}\gamma_{\restriction\mathcal{X}_{2},\mathcal{Y}_{1},\mathcal{Y}_{2}}\enskip\in\operatorname{Cpl}\left(\mu,\nu\right)$ . Now $\gamma_{\restriction\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{Y}_{1}}=\gamma^{\prime}_{\restriction\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{Y}_{1}}$ and $\gamma_{\restriction\mathcal{X}_{2},\mathcal{Y}_{1},\mathcal{Y}_{2}}=\gamma^{\prime}_{\restriction\mathcal{X}_{2},\mathcal{Y}_{1},\mathcal{Y}_{2}}$ , so in particular

[TABLE]

We may name the different building blocks of $\gamma\in C_{1}$ to get

[TABLE]

with

[TABLE]

i.e. there is a bijection between $C_{1}$ and $C_{2}$ given by sending $\gamma^{\prime}\in C_{1}$ to $(\gamma,\beta)\in C_{2}$ where $\gamma:=\gamma^{\prime}_{\restriction\mathcal{X}_{1},\mathcal{Y}_{1}}$ , $\beta:=\gamma^{\prime}_{\restriction\mathcal{X}_{2},\mathcal{Y}_{1},\mathcal{Y}_{2}}$ , and, in the other direction, by sending $(\gamma,\beta)\in C_{2}$ to $\gamma^{\prime}:=\left(\mu\mathbin{\underset{\mathcal{X}_{1}}{\otimes}}\gamma\right)\mathbin{\underset{\mathcal{X}_{2},\mathcal{Y}_{1}}{\otimes}}\beta$ .

We can apply the bijection $\operatorname{dis}_{\mathcal{Y}_{1}}:\mathscr{P}\!\left(\mathcal{Y}_{1}\times\mathcal{X}_{2}\times\mathcal{Y}_{2}\right)\rightarrow\mathscr{F}\left(\mathcal{Y}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\times\mathcal{Y}_{2}\right)\right)$ to $\beta$ . Translating the conditions on $(\gamma,\beta)\in C_{2}$ to conditions on $(\gamma,\operatorname{dis}_{\mathcal{Y}_{1}}(\beta))$ we arrive at

[TABLE]

where

[TABLE]

Let $(\gamma,\beta)\in C_{3}$ and let $(y_{1},\beta^{\prime})\mapsto\tilde{\beta}^{\prime}_{y_{1},\beta^{\prime}}$ be a measurable mapping with $\tilde{\beta}^{\prime}_{y_{1},\beta^{\prime}}\in\operatorname{Cpl}\left(\beta^{\prime}_{\restriction\mathcal{X}_{2}},\beta^{\prime}_{\restriction\mathcal{Y}_{2}}\right)$ for $\beta$ -a.a. $(y_{1},\beta^{\prime})$ . Then we have that also $(\gamma,\tilde{\beta})\in C_{3}$ , where $\tilde{\beta}\in\mathscr{F}\left(\mathcal{Y}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\times\mathcal{Y}_{2}\right)\right)$ is defined by

[TABLE]

By employing a $\beta$ -a.e. measurable selector this implies that

[TABLE]

We need the following lemma.

Lemma 4.6.

If $\kappa\in\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ and $\lambda\in\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{C}\right)$ then the only measure $\eta\in\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\times\mathcal{C}\right)$ with $\eta_{\restriction\mathcal{A}\times\mathcal{B}}=\kappa$ and $\eta_{\restriction\mathcal{B}\times\mathcal{C}}=\lambda$ is $\kappa\mathbin{\otimes_{\mathcal{B}}}\lambda$ .

Proof.

If $\eta$ satisfies the properties above and $b\mapsto\kappa_{b}$ , $b\mapsto\lambda_{b}$ are (classical) disintegrations of $\kappa$ , $\lambda$ w.r.t. $\mathcal{B}$ , then a (classical) disintegration $b\mapsto\eta_{b}$ of $\eta$ w.r.t. $\mathcal{B}$ has to satisfy ${\eta_{b}}_{\restriction\mathcal{A}}=\kappa_{b}$ and ${\eta_{b}}_{\restriction\mathcal{C}}=\lambda_{b}$ a.s. As $\lambda_{b}$ is a Dirac measure a.s. this forces $\eta_{b}$ to be $\kappa_{b}\otimes\lambda_{b}$ almost surely. ∎
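The last step of this proof can be spelled out on measurable rectangles. The following is a sketch, assuming (as the proof suggests) that $\lambda\in\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{C}\right)$ means $\lambda_{b}=\delta_{c(b)}$ for a.a. $b$ with some measurable map $c:\mathcal{B}\rightarrow\mathcal{C}$ ; the map $c$ and the test sets $A^{\prime}$ , $C^{\prime}$ are notation introduced only for this illustration.

```latex
% Fix b with \lambda_b = \delta_{c(b)} and measurable sets
% A' \subseteq \mathcal{A}, C' \subseteq \mathcal{C}.
\begin{align*}
c(b)\notin C' &:\quad
  \eta_b(A'\times C') \le \eta_b(\mathcal{A}\times C') = \lambda_b(C') = 0, \\
c(b)\in C' &:\quad
  \eta_b\bigl(A'\times(\mathcal{C}\setminus C')\bigr)
  \le \lambda_b(\mathcal{C}\setminus C') = 0
  \;\Longrightarrow\;
  \eta_b(A'\times C') = \eta_b(A'\times\mathcal{C}) = \kappa_b(A').
\end{align*}
% In both cases the value agrees with the product measure:
\[
\eta_b(A'\times C') = \kappa_b(A')\,\delta_{c(b)}(C')
  = (\kappa_b\otimes\lambda_b)(A'\times C').
\]
```

Since measurable rectangles generate the product sigma-algebra and are stable under intersection, this identifies $\eta_{b}$ with $\kappa_{b}\otimes\lambda_{b}$ for a.a. $b$ .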

This implies that for $(\gamma,\beta)\in C_{3}$ the distribution of

[TABLE]

under $\beta$ is already determined by $\gamma$ . Indeed, since the distribution of $(y_{1},\beta^{\prime})\mapsto(y_{1},{\beta^{\prime}_{\restriction\mathcal{X}_{2}}})$ is $\operatorname{dis}_{\mathcal{Y}_{1}}\!\!\left(\gamma\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$ \scriptscriptstyle\mathrm{o} $\hfil\cr\hfil$ \scriptscriptstyle\mathrm{9} $\hfil}}}_{\mathcal{X}_{1}}\mu\right)$ and the distribution of $(y_{1},\beta^{\prime})\mapsto(y_{1},{\beta^{\prime}_{\restriction\mathcal{Y}_{2}}})$ is $\operatorname{dis}_{\mathcal{Y}_{1}}\!(\nu)$ , the distribution of (27) under $\beta$ must be equal to

[TABLE]

This means that we may get rid of $\beta$ :

[TABLE]

For the final step we need another lemma:

Lemma 4.7.

Let $\lambda\in\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ and $\beta\in\mathscr{P}\!\left(\mathcal{B}\times\mathcal{C}\right)$ . Let $\smash{\hat{C}}$ denote the projection onto $\mathscr{P}\!\left(\mathcal{C}\right)$ . Then

[TABLE]

is equal to the distribution of

[TABLE]

Proof.

Let $a\mapsto\lambda_{a}$ be a version of the (classical) disintegration of $\lambda$ w.r.t. $\mathcal{A}$ and let $b\mapsto\beta_{b}$ be a disintegration of $\beta$ w.r.t. $\mathcal{B}$ .

As one easily checks, a version of the (classical) disintegration of $\lambda\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$ \scriptscriptstyle\mathrm{o} $\hfil\cr\hfil$ \scriptscriptstyle\mathrm{9} $\hfil}}}_{\mathcal{B}}\beta$ w.r.t. $\mathcal{A}$ is given by $a\mapsto\int\beta_{b}\,\mathrm{d}\lambda_{a}(b)$ , so that $\operatorname{dis}_{\mathcal{A}}\left(\lambda\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$ \scriptscriptstyle\mathrm{o} $\hfil\cr\hfil$ \scriptscriptstyle\mathrm{9} $\hfil}}}_{\mathcal{B}}\beta\right)$ is equal to

[TABLE]

By the same argument a version of the disintegration of $\lambda\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$ \scriptscriptstyle\mathrm{o} $\hfil\cr\hfil$ \scriptscriptstyle\mathrm{9} $\hfil}}}_{\mathcal{B}}\operatorname{dis}_{\mathcal{B}}(\beta)$ w.r.t. $\mathcal{A}$ is given by $h:=a\mapsto\int\operatorname{dis}_{\mathcal{B}}(\beta)_{b}\,\mathrm{d}\lambda_{a}(b)$ , where $b\mapsto\operatorname{dis}_{\mathcal{B}}(\beta)_{b}$ is a disintegration of $\operatorname{dis}_{\mathcal{B}}(\beta)$ w.r.t. $\mathcal{B}$ . But such a disintegration is given by $b\mapsto\delta_{\beta_{b}}$ (where $\delta_{\beta_{b}}$ is the Dirac measure at $\beta_{b}$ ). So $h=a\mapsto\int\delta_{\beta_{b}}\,\mathrm{d}\lambda_{a}(b)$ . This means that (a version of) $\mathbb{E}^{\eta}\left(\smash{\hat{C}}\middle|A\right)$ is given by

[TABLE]

so that the distribution of $(A,\mathbb{E}^{\eta}\left(\smash{\hat{C}}\middle|A\right))$ under $\eta$ is also given by

[TABLE]

∎

Using this lemma with $\mathcal{A}=\mathcal{Y}_{1}$ , $\mathcal{B}=\mathcal{X}_{1}$ , $\mathcal{C}=\mathcal{X}_{2}$ , $\lambda=\gamma$ , $\beta=\mu$ and writing ${\smash{\hat{X}_{2}}}$ , ${\smash{\hat{Y}_{2}}}$ for the projections onto $\mathscr{P}\!\left(\mathcal{X}_{2}\right)$ , $\mathscr{P}\!\left(\mathcal{Y}_{2}\right)$ respectively, we find:

[TABLE]

where $\eta(\gamma):=\operatorname{dis}_{\mathcal{X}_{1}}(\mu)\mathbin{\otimes_{\mathcal{X}_{1}}}\gamma\mathbin{\otimes_{\mathcal{Y}_{1}}}\operatorname{dis}_{\mathcal{Y}_{1}}(\nu)$ .

By Lemma 4.6 the function $\eta:\operatorname{Cpl}\left(\mu_{\restriction\mathcal{X}_{1}},\nu_{\restriction\mathcal{Y}_{1}}\right)\rightarrow\operatorname{Cpl}\left(\operatorname{dis}_{\mathcal{X}_{1}}(\mu),\operatorname{dis}_{\mathcal{Y}_{1}}(\nu)\right)$ is a bijection, so we may as well write

[TABLE]

Finally, under any $\gamma\in\operatorname{Cpl}\left(\operatorname{dis}_{\mathcal{X}_{1}}(\mu),\operatorname{dis}_{\mathcal{Y}_{1}}(\nu)\right)$ we know that ${\smash{\hat{Y}_{2}}}$ is almost surely equal to a function of $Y_{1}$ , so that the completions of the sigma-algebras generated by $Y_{1}$ and $\smash{\vec{Y}}:=(Y_{1},{\smash{\hat{Y}_{2}}})$ respectively are equal. This means that $\mathbb{E}^{\gamma}\left({\smash{\hat{X}_{2}}}\middle|Y_{1}\right)=\mathbb{E}^{\gamma}\left({\smash{\hat{X}_{2}}}\middle|\smash{\vec{Y}}\right)$ a.s. and we arrive at our final expression for $\mathcal{C}\mathcal{W}_{p}\left(\mu,\nu\right)$ :

[TABLE]

Now this expression is trivial to generalize to $\mu\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ and $\nu\in\mathscr{P}\!\left(\mathcal{Y}_{1}\times\mathscr{P}\!\left(\mathcal{Y}_{2}\right)\right)$ , i.e. for such $\mu$ , $\nu$ we set

[TABLE]

To summarize our discussion up to this point:

Lemma 4.8.

The function

[TABLE]

as defined in (28) is really an extension of

[TABLE]

as defined in (23) (when $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathcal{X}_{2}\right)$ is embedded into $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ via $\operatorname{dis}_{\mathcal{X}_{1}}$ ).

Next, as promised, we show

Lemma 4.9.

${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ is bounded by ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ , i.e.

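$${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu\left(\mu,\nu\right)\leq{}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu\left(\mu,\nu\right).$$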

Proof.

By the conditional version of Jensen’s inequality applied to the convex function $(\hat{x},\hat{y})\mapsto\mathcal{W}_{p}\left(\hat{x},\hat{y}\right)^{p}$ we have

[TABLE]

∎

Remark 4.10.

For the reader who may be sceptical of whether Jensen’s inequality holds in this rather unusual setting, where we have a convex function

[TABLE]

and conditional expectations on spaces of measures we remark that for the Wasserstein distance in particular this is very easy to check. The proof is just integrating transport plans between ${\smash{\hat{X}_{2}}}$ and ${\smash{\hat{Y}_{2}}}$ w.r.t. the distribution of these conditioned on $\smash{\vec{Y}}$ (in this case) to get transport plans between $\mathbb{E}^{\gamma}\left({\smash{\hat{X}_{2}}}\middle|\smash{\vec{Y}}\right)$ and $\mathbb{E}^{\gamma}\left({\smash{\hat{Y}_{2}}}\middle|\smash{\vec{Y}}\right)$ .
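The check described in this remark can be written out as follows. This is a sketch: $\pi$ is hypothetical notation for a measurably chosen $\mathcal{W}_{p}$ -optimal transport plan between ${\smash{\hat{X}_{2}}}$ and ${\smash{\hat{Y}_{2}}}$ , and $\rho$ is the metric on the second-timepoint space.

```latex
% \pi \in \operatorname{Cpl}(\hat X_2, \hat Y_2): optimal plan, chosen measurably (assumed notation).
% Averaging plans averages their marginals, so the conditional mean is again a plan:
\[
\bar\pi := \mathbb{E}^{\gamma}\bigl(\pi \,\big|\, \vec Y\bigr)
\in \operatorname{Cpl}\Bigl(
  \mathbb{E}^{\gamma}\bigl(\hat X_2 \,\big|\, \vec Y\bigr),\,
  \mathbb{E}^{\gamma}\bigl(\hat Y_2 \,\big|\, \vec Y\bigr)\Bigr).
\]
% Testing the Wasserstein infimum with \bar\pi gives the conditional Jensen inequality:
\[
\mathcal{W}_p\Bigl(
  \mathbb{E}^{\gamma}\bigl(\hat X_2 \,\big|\, \vec Y\bigr),\,
  \mathbb{E}^{\gamma}\bigl(\hat Y_2 \,\big|\, \vec Y\bigr)\Bigr)^{p}
\le \int \rho^{p}\,\mathrm{d}\bar\pi
= \mathbb{E}^{\gamma}\Bigl(\int \rho^{p}\,\mathrm{d}\pi \,\Big|\, \vec Y\Bigr)
= \mathbb{E}^{\gamma}\Bigl(\mathcal{W}_p\bigl(\hat X_2, \hat Y_2\bigr)^{p} \,\Big|\, \vec Y\Bigr).
\]
```

In the setting of Lemma 4.9 the variable ${\smash{\hat{Y}_{2}}}$ is a component of $\smash{\vec{Y}}$ , so $\mathbb{E}^{\gamma}\left({\smash{\hat{Y}_{2}}}\middle|\smash{\vec{Y}}\right)={\smash{\hat{Y}_{2}}}$ a.s. and the left-hand side reduces to the integrand of the causal cost.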

Lemma 4.11.

Let $\mu,\nu,\lambda\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ . Then

[TABLE]

Proof.

Using our naming convention we have

[TABLE]

We denote the projections onto $\mathscr{P}\!\left(\mathcal{X}_{2}\right)$ , $\mathscr{P}\!\left(\mathcal{Y}_{2}\right)$ , $\mathscr{P}\!\left(\mathcal{Z}_{2}\right)$ by ${\smash{\hat{X}_{2}}}$ , ${\smash{\hat{Y}_{2}}}$ , ${\smash{\hat{Z}_{2}}}$ respectively, and we write $\smash{\vec{Y}}:=(Y_{1},{\smash{\hat{Y}_{2}}})$ and $\smash{\vec{Z}}:=(Z_{1},{\smash{\hat{Z}_{2}}})$ .

Let $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ and $\eta\in\operatorname{Cpl}\left(\nu,\lambda\right)$ . In the following let $\mathbb{E}$ refer to (conditional) expectation w.r.t. $\kappa:=\gamma\mathbin{\otimes_{\mathcal{Y}_{1},\mathscr{P}\!\left(\mathcal{Y}_{2}\right)}}\eta$ , and let $\left\lVert\cdot\right\rVert_{L_{p}}$ refer to the $L_{p}$ -norm w.r.t. $\kappa$ .

Combining the triangle inequalities for $\rho$ , $\mathcal{W}_{p}$ and $\left\lVert\cdot\right\rVert_{L_{p}}$ we get

[TABLE]

By the conditional Jensen inequality

[TABLE]

and therefore

[TABLE]

By construction, $({\smash{\hat{X}_{2}}},{\smash{\hat{Y}_{2}}})$ is conditionally independent from $\smash{\vec{Z}}$ given $\smash{\vec{Y}}$ , so that $\mathbb{E}\left(({\smash{\hat{X}_{2}}},{\smash{\hat{Y}_{2}}})\middle|\smash{\vec{Y}},\smash{\vec{Z}}\right)=\mathbb{E}\left(({\smash{\hat{X}_{2}}},{\smash{\hat{Y}_{2}}})\middle|\smash{\vec{Y}}\right)$ (this basic fact about conditional independence can be found for example as Theorem 45 in [22]). Combining this with (30) gives

[TABLE]

Putting together (29) and (31) with the triangle inequality for $\ell_{p}$ we get

[TABLE]

∎

Lemma 4.12.

${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ is uniformly continuous w.r.t. ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ on $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)^{2}$ .

Proof.

Let $\mu,\nu,\mu^{\prime},\nu^{\prime}\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ . We repeatedly use Lemma 4.11:

[TABLE]

therefore

[TABLE]

Switching the roles of $(\mu,\nu)$ and $(\mu^{\prime},\nu^{\prime})$ implies

[TABLE]

∎
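The chain of estimates behind this proof can be sketched as follows. The sketch assumes that Lemma 4.11 is the triangle inequality for $\overline{\mathcal{C}\mathcal{W}_{p}}$ , uses the bound of Lemma 4.9 and the symmetry of the metric $\overline{\mathcal{A}\mathcal{W}_{p}}$ , and takes all quantities to be finite; it is an illustration, not a transcription of the displays in the proof.

```latex
% Two applications of the triangle inequality, then Lemma 4.9 on the outer terms:
\begin{align*}
\overline{\mathcal{C}\mathcal{W}_{p}}(\mu,\nu)
 &\le \overline{\mathcal{C}\mathcal{W}_{p}}(\mu,\mu')
    + \overline{\mathcal{C}\mathcal{W}_{p}}(\mu',\nu')
    + \overline{\mathcal{C}\mathcal{W}_{p}}(\nu',\nu) \\
 &\le \overline{\mathcal{A}\mathcal{W}_{p}}(\mu,\mu')
    + \overline{\mathcal{C}\mathcal{W}_{p}}(\mu',\nu')
    + \overline{\mathcal{A}\mathcal{W}_{p}}(\nu',\nu).
\end{align*}
% Exchanging (\mu,\nu) with (\mu',\nu') and using the symmetry of \overline{\mathcal{A}\mathcal{W}_{p}}:
\[
\bigl|\overline{\mathcal{C}\mathcal{W}_{p}}(\mu,\nu)
     - \overline{\mathcal{C}\mathcal{W}_{p}}(\mu',\nu')\bigr|
 \le \overline{\mathcal{A}\mathcal{W}_{p}}(\mu,\mu')
   + \overline{\mathcal{A}\mathcal{W}_{p}}(\nu,\nu').
\]
```

The right-hand side is a product metric built from $\overline{\mathcal{A}\mathcal{W}_{p}}$ , so the estimate is a Lipschitz bound with constant $1$ , which in particular gives uniform continuity.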

Lemma 4.13.

The infimum in (28) is attained.

Proof.

This is an application of [9, Theorem 1.2].

For self-containedness, and because it is a nice application of the nested distance, we also sketch the argument here. We know that $\operatorname{Cpl}\left(\mu,\nu\right)$ is compact. The problem is that $\gamma\mapsto\mathbb{E}^{\gamma}\left(\mathcal{W}_{p}\left(\mathbb{E}^{\gamma}\left({\smash{\hat{X}_{2}}}\middle|\smash{\vec{Y}}\right),{\smash{\hat{Y}_{2}}}\right)^{p}\right)$ is not (lower semi-)continuous. But we may switch to a topology which is better adapted to the problem at hand, namely the two-timepoint $\mathcal{A}\mathcal{W}_{p}$ -topology. In this case the space for the first timepoint is $\mathcal{Y}_{1}\times\mathscr{P}\!\left(\mathcal{Y}_{2}\right)$ and that for the second is $\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)$ . In effect this means that instead of $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ we are now looking at $\gamma^{\prime}\in\mathscr{F}\left(\mathcal{Y}_{1}\times\mathscr{P}\!\left(\mathcal{Y}_{2}\right)\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)\right)$ . The function that we are optimizing over can be written as

[TABLE]

$C$ is a continuous function and so is $\hat{C}$ . Now $\operatorname{dis}_{\mathcal{Y}_{1}\times\mathscr{P}\!\left(\mathcal{Y}_{2}\right)}\left(\operatorname{Cpl}\left(\mu,\nu\right)\right)$ is not compact, but

[TABLE]

is. So we can find a minimizer $\gamma^{\prime}$ of $\hat{C}$ in this set. To return to $\operatorname{Cpl}\left(\mu,\nu\right)$ , or more precisely $\operatorname{dis}_{\mathcal{Y}_{1}\times\mathscr{P}\!\left(\mathcal{Y}_{2}\right)}\left(\operatorname{Cpl}\left(\mu,\nu\right)\right)$ , we can send $\gamma^{\prime}$ to the distribution $\gamma^{\prime\prime}$ of $(Y_{1},{\smash{\hat{Y}_{2}}},\mathbb{E}^{\gamma^{\prime}}\left({\smash{\hat{\vec{X}}}}\middle|\smash{\vec{Y}}\right))$ . Because $C$ is continuous and convex in its last argument, (the conditional version of) Jensen’s inequality (which could again be proved ‘by hand’ here) gives $\hat{C}(\gamma^{\prime\prime})\leq\hat{C}(\gamma^{\prime})$ . Hence $\operatorname{int}_{\mathcal{Y}_{1}\times\mathscr{P}\!\left(\mathcal{Y}_{2}\right)}(\gamma^{\prime\prime})$ is the sought-after minimizer of (28). ∎

Lemma 4.14.

Let $\mu,\nu\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ . Then ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu\left(\mu,\nu\right)={}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu\left(\nu,\mu\right)=0$ implies $\mu=\nu$ .

Proof.

Call

[TABLE]

To have labels for our spaces, see $\mu,\nu$ as

[TABLE]

Let $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)\subseteq\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\times\smash{\vec{\mathcal{Y}}}\right)$ s.t. $\mathbb{E}^{\gamma}\left(\rho(X_{1},Y_{1})^{p}\right)+\mathbb{E}^{\gamma}\left(\mathcal{W}_{p}\left(\mathbb{E}^{\gamma}\left({\smash{\hat{X}_{2}}}\middle|\smash{\vec{Y}}\right),{\smash{\hat{Y}_{2}}}\right)^{p}\right)=0$ .

Let $\eta\in\operatorname{Cpl}\left(\nu,\mu\right)\subseteq\mathscr{P}\!\left(\smash{\vec{\mathcal{Y}}}\times\smash{\vec{\mathcal{Z}}}\right)$ s.t. $\mathbb{E}^{\eta}\left(\rho(Y_{1},Z_{1})^{p}\right)+\mathbb{E}^{\eta}\left(\mathcal{W}_{p}\left(\mathbb{E}^{\eta}\left({\smash{\hat{Y}_{2}}}\middle|\smash{\vec{Z}}\right),{\smash{\hat{Z}_{2}}}\right)^{p}\right)=0$ .

All the following considerations happen under $\displaystyle\gamma\mathbin{\otimes_{\vec{\mathcal{Y}}}}\eta$ . Clearly, $Z_{1}=Y_{1}=X_{1}$ a.s.

Moreover, because $\mathbb{E}\left({\smash{\hat{X}_{2}}}\middle|\smash{\vec{Y}},\smash{\vec{Z}}\right)=\mathbb{E}\left({\smash{\hat{X}_{2}}}\middle|\smash{\vec{Y}}\right)$ , the random variables ${\smash{\hat{Z}_{2}}},{\smash{\hat{Y}_{2}}},{\smash{\hat{X}_{2}}}$ form a martingale w.r.t. the filtration generated by $\smash{\vec{Z}},\smash{\vec{Y}},\vec{X}$ . The distribution of ${\smash{\hat{Z}_{2}}}$ is equal to the distribution of ${\smash{\hat{X}_{2}}}$ . Both of these statements are also true if we integrate some bounded measurable function w.r.t. our random variables, i.e. for any bounded measurable $f:\mathcal{X}_{2}\rightarrow{\mathbb{R}}$ we have that $\int f\,\mathrm{d}{\smash{\hat{Z}_{2}}},\int f\,\mathrm{d}{\smash{\hat{Y}_{2}}},\int f\,\mathrm{d}{\smash{\hat{X}_{2}}}$ is a martingale and that the distribution of $\int f\,\mathrm{d}{\smash{\hat{Z}_{2}}}$ is equal to the distribution of $\int f\,\mathrm{d}{\smash{\hat{X}_{2}}}$ . But this means that we must have $\int f\,\mathrm{d}{\smash{\hat{Z}_{2}}}=\int f\,\mathrm{d}{\smash{\hat{Y}_{2}}}=\int f\,\mathrm{d}{\smash{\hat{X}_{2}}}$ a.s. (Lemma 4.15 below). As this is true for all $f$ from a countable generator of the sigma-algebra on $\mathcal{X}_{2}$ , we have ${\smash{\hat{Z}_{2}}}={\smash{\hat{Y}_{2}}}={\smash{\hat{X}_{2}}}$ a.s. ∎

Lemma 4.15.

Let $X_{1},X_{2},X_{3}$ be a bounded martingale over ${\mathbb{R}}$ . If the distribution of $X_{1}$ is equal to the distribution of $X_{3}$ then $X_{1}=X_{2}=X_{3}$ a.s.

Proof.

This is a consequence of the strict version of Jensen’s inequality applied to any everywhere strictly convex function. (Take for example $x\mapsto x^{2}$ .) ∎
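For the suggested choice $x\mapsto x^{2}$ the argument reduces to the orthogonality of martingale increments. A short worked version, given here as an illustration of the proof:

```latex
% Boundedness guarantees square integrability, and martingale increments
% are orthogonal in L^2, so second moments add up:
\[
\mathbb{E}\bigl(X_3^{2}\bigr)
 = \mathbb{E}\bigl(X_1^{2}\bigr)
 + \mathbb{E}\bigl((X_2-X_1)^{2}\bigr)
 + \mathbb{E}\bigl((X_3-X_2)^{2}\bigr).
\]
% X_1 and X_3 have the same distribution, so E(X_3^2) = E(X_1^2), and therefore
\[
\mathbb{E}\bigl((X_2-X_1)^{2}\bigr) = \mathbb{E}\bigl((X_3-X_2)^{2}\bigr) = 0
\quad\Longrightarrow\quad
X_1 = X_2 = X_3 \ \text{a.s.}
\]
```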

Remark 4.16.

The reason we took the detour of turning our probability-measure-valued martingale into a family of martingales on ${\mathbb{R}}$ and arguing on these is that this way we avoid having to exhibit a continuous, everywhere strictly convex function on $\mathscr{P}\!\left(\mathcal{X}_{2}\right)$ .

As a reminder:

Definition 4.17.

For $\mu,\nu\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ ,

[TABLE]

Theorem 4.18.

${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ is a metric on $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ satisfying

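$${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu\leq{}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu.$$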

Proof.

This follows from Lemma 4.11, Lemma 4.14 and Lemma 4.9. ∎

Remark 4.19.

As outlined at the beginning of this section, and thanks to Theorem 4.18, we now know enough to conclude that the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ is equal to the topology induced by $\mathcal{A}\mathcal{W}_{p}$ in the case where both $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$ are compact. The non-compact case is not much harder, and we now proceed to settle it. For this we need the following lemma.

Lemma 4.20.

The map

[TABLE]

is a contraction when we equip the source space with ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ and the target space with $\mathcal{W}_{p}$ . More specifically for $\mu,\nu\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$

[TABLE]

Proof.

We prove the second statement. Let $\mu\in\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)$ , $\nu\in\mathscr{P}\!\left(\smash{\vec{\mathcal{Y}}}\right)$ . Given $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ and $\varepsilon>0$ the task is to find $\gamma^{\prime}\in\operatorname{Cpl}\left(\operatorname{int}_{\mathcal{X}_{1}}\mu,\operatorname{int}_{\mathcal{Y}_{1}}\nu\right)$ s.t.

[TABLE]

We take inspiration from the discussion at the beginning of this section. Let $\Xi:\smash{\vec{\mathcal{X}}}\times\smash{\vec{\mathcal{Y}}}\rightarrow\mathscr{P}\!\left(\mathcal{X}_{2}\times\mathcal{Y}_{2}\right)$ be a measurable selector satisfying

[TABLE]

The obvious choice for $\gamma^{\prime}$ , namely $f\mapsto\mathbb{E}^{\gamma}\left(\mathbb{E}^{\Xi}\left(f(X_{1},X_{2},Y_{1},Y_{2})\right)\right)$ , will not work because in general it gets the relationship between $X_{1}$ and $X_{2}$ wrong, i.e. its first marginal may not be $\operatorname{int}_{\mathcal{X}_{1}}(\mu)$ . Instead we again define $\gamma_{L}\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathcal{X}_{2}\times\mathcal{Y}_{1}\right)$ and $\gamma_{R}\in\mathscr{P}\!\left(\mathcal{X}_{2}\times\mathcal{Y}_{1}\times\mathcal{Y}_{2}\right)$ and set $\gamma^{\prime}:=\gamma_{L}\mathbin{\otimes_{\mathcal{X}_{2},\mathcal{Y}_{1}}}\gamma_{R}$ .

[TABLE]

Clearly, if we can actually define $\gamma^{\prime}$ as announced, then (33) will hold, because then

[TABLE]

It remains to check that $\gamma_{L}$ and $\gamma_{R}$ can actually be composed, i.e. that $(X_{2},Y_{1})$ has the same distribution under $\gamma_{L}$ and $\gamma_{R}$ .

[TABLE]

The step in the middle has its own Lemma 4.21 below. ∎

Lemma 4.21.

Let ${\mathbb{P}}$ be a probability on $\mathscr{P}\!\left(\mathcal{X}\right)\times\mathcal{Y}$ , for Polish spaces $\mathcal{X},\mathcal{Y}$ . Let $h:\mathcal{X}\times\mathcal{Y}\rightarrow{\mathbb{R}}$ be a measurable function. Then

[TABLE]

where $\mathbb{E}$ without superscript is the (conditional) expectation w.r.t. ${\mathbb{P}}$ and $\hat{X}$ is the projection onto $\mathscr{P}\!\left(\mathcal{X}\right)$ .

Note that $X$ is on both sides introduced by the expectation operator which carries a superscript, while $Y$ may on both sides be interpreted as coming from the outermost context. On the right hand side $Y$ may also be seen as having been introduced by the outermost conditional expectation operator. (As this operator conditions on $Y$ this is the same thing.)

Proof.

Both sides are clearly $Y$ -measurable. We prove that for $h(x,y)=f(x)g_{1}(y)$ , multiplying by $g_{2}(Y)$ and taking expectation gives the same result. By definition of the conditional expectation

[TABLE]

Applying the continuous linear function $\gamma\mapsto\mathbb{E}^{\gamma}\left(f(X)\right)$ this gives

[TABLE]

Again by the definition of the conditional expectation:

[TABLE]

where for the third equality we plugged in the previous equation. ∎

Alternative proof of Lemma 4.20 when $\mathcal{X}_{1}$ has no isolated points.

When the space $\mathcal{X}_{1}$ has no isolated points one can show that the space $\mathscr{F}\left(\mathcal{X}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ is dense in $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ . This allows for a shorter proof of Lemma 4.20:

By the original definition (23) of $\mathcal{C}\mathcal{W}_{p}$ on the space $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathcal{X}_{2}\right)$ the inequality (32) holds on $\mathscr{F}\left(\mathcal{X}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)\times\mathscr{F}\left(\mathcal{X}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ . Both $\mathcal{C}\mathcal{W}_{p}$ and $(\mu,\nu)\mapsto\mathcal{W}_{p}\left(\operatorname{int}_{\mathcal{X}_{1}}(\mu),\operatorname{int}_{\mathcal{X}_{1}}(\nu)\right)$ are uniformly continuous on $\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)\times\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)$ w.r.t. some product metric of ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ with itself. $\mathscr{F}\left(\mathcal{X}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ is dense in $\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)$ , and therefore $\mathscr{F}\left(\mathcal{X}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)\times\mathscr{F}\left(\mathcal{X}_{1}\rightsquigarrow\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ is dense in $\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)\times\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)$ . This implies that (32) holds on all of $\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)\times\mathscr{P}\!\left(\smash{\vec{\mathcal{X}}}\right)$ . ∎

Theorem 4.22.

The topology induced by ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ on $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ is equal to the topology induced by ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ on that space.

Proof.

As both topologies are metric and therefore first-countable we may argue on sequences. Let $(\mu_{n})_{n}$ be a sequence in $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ . As ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu\left(\mu_{n},\mu\right)\leq{}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu\left(\mu_{n},\mu\right)$ , if $(\mu_{n})_{n}$ converges to $\mu$ w.r.t. ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ it also converges to $\mu$ w.r.t. ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ .

Now assume that a sequence $(\mu_{n})_{n}$ in $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ converges to $\mu$ w.r.t. ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ . We will show that every subsequence of $(\mu_{n})_{n}$ has a subsequence which converges to $\mu$ w.r.t. ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ .

Note that convergence of $(\mu_{n})_{n}$ w.r.t. ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ implies that the set $K:=\left\{\mu_{n}\,\middle|\,n\in\mathbb{N}\right\}$ is relatively compact w.r.t. the topology induced by ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ . As $\operatorname{int}_{\mathcal{X}_{1}}$ is continuous as a map from $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ with the topology induced by ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ to $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathcal{X}_{2}\right)$ with the topology induced by $\mathcal{W}_{p}$ (Lemma 4.20), we have that $\operatorname{int}_{\mathcal{X}_{1}}[K]=\left\{\operatorname{int}_{\mathcal{X}_{1}}(\mu_{n})\,\middle|\,n\in\mathbb{N}\right\}$ is also relatively compact. By Lemma 1.6/[24, Lemma 3.3] this implies that $K$ is relatively compact in $\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ with the topology induced by ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ . Now let $(\mu_{n_{k}})_{k}$ be some subsequence of $(\mu_{n})_{n}$ . As $K$ is relatively compact we can find a subsequence $(\mu_{n_{k_{j}}})_{j}$ of $(\mu_{n_{k}})_{k}$ which converges w.r.t. ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ to some $\mu^{\prime}\in\mathscr{P}\!\left(\mathcal{X}_{1}\times\mathscr{P}\!\left(\mathcal{X}_{2}\right)\right)$ . As ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu\left(\mu_{n_{k_{j}}},\mu^{\prime}\right)\leq{}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu\left(\mu_{n_{k_{j}}},\mu^{\prime}\right)$ this sequence also converges to $\mu^{\prime}$ w.r.t. ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ . But $(\mu_{n_{k_{j}}})_{j}$ also converges to $\mu$ w.r.t. ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ . Because the topology induced by ${}\mkern 5.0mu\overline{\mkern-5.0mu\mathcal{S}\mathcal{C}\mathcal{W}_{p}\mkern-6.0mu}\mkern 6.0mu$ is Hausdorff (Lemma 4.14), we must have $\mu^{\prime}=\mu$ , i.e. $(\mu_{n_{k_{j}}})_{j}$ converges to $\mu$ w.r.t. ${}\mkern 8.0mu\overline{\mkern-8.0mu\mathcal{A}\mathcal{W}_{p}\mkern-7.0mu}\mkern 7.0mu$ . ∎

Now we return to the general case of $N$ time-points.

Theorem 4.23.

The topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ on $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ is equal to Hellwig’s $\mathcal{W}_{p}$ -information topology and to the topology induced by $\mathcal{A}\mathcal{W}_{p}$ .

Proof.

As every bicausal transport plan between $\mu$ and $\nu$ can be interpreted as a causal transport plan from $\mu$ to $\nu$ and also as a causal transport plan from $\nu$ to $\mu$ we have that $\mathcal{S}\mathcal{C}\mathcal{W}_{p}\left(\mu,\nu\right)\leq\mathcal{A}\mathcal{W}_{p}(\mu,\nu)$ . This means that the identity from $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ with the topology induced by $\mathcal{A}\mathcal{W}_{p}$ to $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ with the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ is continuous. For the other direction we show that the identity from $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ with the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ to $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ with the $\mathcal{W}_{p}$ -information topology is continuous, i.e. we show that each of the maps

[TABLE]

is continuous when $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ gets the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ .

If $\mu,\nu\in\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ and $\gamma\in\operatorname{Cpl}\left(\mu,\nu\right)$ is causal, then, in particular, $\gamma$ is ‘causal at the timestep from $t$ to $t+1$ ’, i.e. $\gamma$ is causal when regarded as a coupling between $\mu,\nu\in\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{t}\times{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu_{\mkern-3.0mut+1}\right)$ . This means that if we define $\mathcal{S}\mathcal{C}\mathcal{W}_{p}^{\prime}$ like $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ , but only require causality based on the decomposition of ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu$ as ${}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{t}\times{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu_{\mkern-3.0mut+1}$ , then $\mathcal{S}\mathcal{C}\mathcal{W}_{p}^{\prime}(\mu,\nu)\leq\mathcal{S}\mathcal{C}\mathcal{W}_{p}\left(\mu,\nu\right)$ , i.e. the identity from $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ with the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}$ to $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ with the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}^{\prime}$ is continuous. By Theorem 4.22 the map

[TABLE]

is continuous when we equip $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{t}\times{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu_{\mkern-3.0mut+1}\right)$ with the topology induced by $\mathcal{S}\mathcal{C}\mathcal{W}_{p}^{\prime}$ . Now $\mathcal{I}_{t}$ is continuous as a composite of continuous maps. ∎

5. Aldous’ extended weak convergence

In this section we show that Aldous’ extended $\mathcal{W}_{p}$ -/weak topology is equal to Hellwig’s ( $\mathcal{W}_{p}$ -)information topology.

We recall and paraphrase here the definition, already given in the introduction, of Aldous’ topology.

Definition 5.1.

Given $\mu\in\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ let $\mu_{(x_{i})_{i=1}^{j}}$ be the value of a (classical) disintegration of $\mu$ w.r.t. the first $j$ coordinates at $(x_{i})_{i=1}^{j}$ . (By convention $\mu_{(x_{i})_{i=1}^{0}}=\mu$ ). Define

[TABLE]

The extended $\mathcal{W}_{p}$ -/weak topology on $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ is the initial topology w.r.t. $\mathcal{E}$ .
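For a finitely supported law the disintegration used in Definition 5.1 can be computed by plain conditioning. The following is a minimal sketch, not taken from the paper: the function name `disintegrate` and the dictionary representation of a law (path tuple mapped to probability) are assumptions made for illustration, and all listed probabilities are assumed to be strictly positive.

```python
from collections import defaultdict


def disintegrate(law, j):
    """Condition a finitely supported law on its first j coordinates.

    law : dict mapping a path (x_1, ..., x_N) to its probability (> 0)
    Returns a dict mapping each prefix (x_1, ..., x_j) to a pair
    (mass of the prefix, conditional law of the remaining coordinates).
    """
    mass = defaultdict(float)
    joint = defaultdict(lambda: defaultdict(float))
    for path, prob in law.items():
        mass[path[:j]] += prob
        joint[path[:j]][path[j:]] += prob
    return {
        prefix: (mass[prefix], {tail: q / mass[prefix] for tail, q in joint[prefix].items()})
        for prefix in mass
    }


eps = 0.01
# Hypothetical two-step laws: under mu the first coordinate carries no
# information, under mu_eps it already determines the second coordinate.
mu = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}
mu_eps = {(eps, 1.0): 0.5, (-eps, -1.0): 0.5}

print(disintegrate(mu, 1))
print(disintegrate(mu_eps, 1))
```

In this toy example the two laws are close as measures on paths, but the conditional laws given the first coordinate are a fair coin flip in one case and Dirac measures in the other, which is the kind of difference the extended weak topology is designed to detect.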

Remark 5.2.

Reasonable people may disagree about whether the most faithful / useful transcription of Aldous’ definition should include the factors $j=0$ and $j=N$ in the above product of spaces. When including $j=N$ , as we did, one has to interpret $\delta_{(x_{i})_{i=1}^{N}}\otimes\mu_{(x_{i})_{i=1}^{N}}$ simply as $\delta_{(x_{i})_{i=1}^{N}}$ . We leave it as an exercise to the reader to check that either or both may be dropped in the definition of $\mathcal{E}$ without affecting the resulting topology on $\mathscr{P}\!\left(X\right)$ .

Theorem 5.3.

The ( $\mathcal{W}_{p}$ -)extended weak topology is equal to the ( $\mathcal{W}_{p}$ -)information topology.

Proof.

We construct continuous maps

[TABLE]

such that

[TABLE]

The first equality above implies that the identity on $\mathscr{P}\!\left({}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu\right)$ is continuous from the extended weak topology to the information topology, the second implies that it is continuous in the other direction.

$\mathcal{A}^{\prime}_{k}$ is very simple. We just need to select the right factors and then discard the unnecessary $\delta_{(x_{i})_{i=1}^{k}}$ part of the measure component. Formally

[TABLE]

which is clearly continuous.

We construct $\mathcal{A}$ recursively, by constructing, as composites of continuous maps,

[TABLE]

satisfying

[TABLE]

For the base case we set $\mathcal{A}^{0}\left((\nu_{k})_{k=1}^{N-1}\right):=\delta_{\operatorname{int}_{\mathcal{X}_{1}}(\nu_{1})}$ . We need the helper functions

[TABLE]

Given $\mathcal{A}^{m}$ satisfying the induction hypothesis we set

[TABLE]

where $s_{m+1}$ is the obvious permutation of the coordinates to get the factors into the right order. $\mathcal{A}^{m+1}$ is continuous because by [24, Lemma 4.1] $\underset{{}\mkern 3.5mu\overline{\mkern-3.5mu\mathcal{X}\mkern-0.5mu}\mkern 0.5mu^{m}}{\otimes}$ is continuous when one of the arguments is an element of some $\mathscr{F}\left(\mathcal{B}\rightsquigarrow\mathcal{C}\right)$ . That (34) still holds for $m+1$ is a straightforward calculation. This way we get to $\mathcal{A}^{N-1}$ . Finally, set

[TABLE]

∎

6. Bounded vs unbounded metrics

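Because we will need it in the next section, we interject here a proof of Lemma 1.3, which states that convergence in each of the $\mathcal{W}_{p}$ -topologies is equivalent to convergence in the corresponding weak version of the topology together with convergence of the $p$ -th moments.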

Proof of Lemma 1.3.

We provide the proof only for Hellwig’s topology, i.e. (3) of Theorem 1.3 and Theorem 1.2, respectively. As we have already seen in the previous sections, the topologies (2)–(4) are equivalent topologies, and the result therefore carries over to them. The ( $\mathcal{W}_{p}$ -)optimal stopping topology, (5), is treated below. It is clear that convergence w.r.t. the $\mathcal{W}_{p}$ -information topology implies convergence in Hellwig’s information topology plus convergence of $p$ -th moments. For the reverse implication, let $1\leq t\leq N-1$ , and denote by $\mathcal{A}:=\overline{\mathcal{X}}^{t}$ the first $t$ and by $\mathcal{B}:=\overline{\mathcal{X}}_{t+1}$ the last $N-t$ coordinates. Now assume that $(\mu_{n})_{n}$ converges to $\mu$ in Hellwig’s information topology and that the $p$ -th moments converge. The classical (not adapted) version of the very lemma we prove here implies that $\mu_{n}\to\mu$ in $\mathcal{W}_{p}$ ; in particular $K:=\{\mu_{n}:n\}\subset\mathscr{P}\!_{p}\!\left(\mathcal{A}\times\mathcal{B}\right)$ is relatively compact. Lemma 1.6 (or really [24, Lemma 3.3]/[9, Lemma 2.6]) therefore guarantees that $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}[K]\subset\mathscr{P}\!_{p}\!\left(\mathcal{A}\times\mathscr{P}\!_{p}\!\left(\mathcal{B}\right)\right)$ is relatively compact.

Every subsequence of $(\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu_{n}))_{n}$ therefore has a subsequence $(\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu_{n_{k}}))_{k}$ which converges w.r.t. the topology on $\mathscr{P}\!_{p}\!\left(\mathcal{A}\times\mathscr{P}\!_{p}\!\left(\mathcal{B}\right)\right)$ (i.e. the one coming from nested Wasserstein metrics) to some $\mu^{\prime}\in\mathscr{P}\!_{p}\!\left(\mathcal{A}\times\mathscr{P}\!_{p}\!\left(\mathcal{B}\right)\right)$ . Because convergence in $\mathscr{P}\!_{p}\!\left(\mathcal{A}\times\mathscr{P}\!_{p}\!\left(\mathcal{B}\right)\right)$ is stronger than convergence in $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)$ (i.e. in the nested weak sense) we must also have $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu_{n_{k}})\overset{k}{\to}\mu^{\prime}$ in $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)$ . But also, by assumption, $\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu_{n_{k}})\overset{k}{\to}\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu)$ in $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)$ and therefore $\mu^{\prime}=\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu)$ . ∎
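The classical (not adapted) fact invoked in this proof, that $\mathcal{W}_{p}$ -convergence amounts to weak convergence plus convergence of $p$ -th moments, can be illustrated on the real line. The sketch below is our own toy example, not part of the paper: `escaping_mass`, `integrate` and `distance_to_dirac` are hypothetical helper names. It uses the fact that the only coupling of a measure with a Dirac mass is the product coupling, so $\mathcal{W}_{p}(\mu,\delta_{a})^{p}=\int|x-a|^{p}\,\mathrm{d}\mu$ .

```python
def escaping_mass(n):
    """Atoms and weights of mu_n = (1 - 1/n) * delta_0 + (1/n) * delta_n."""
    return [0.0, float(n)], [1.0 - 1.0 / n, 1.0 / n]


def integrate(f, atoms, weights):
    """Integral of f against the discrete measure sum_i weights[i] * delta_{atoms[i]}."""
    return sum(w * f(x) for x, w in zip(atoms, weights))


def distance_to_dirac(atoms, weights, a=0.0, p=1):
    """W_p distance between a discrete measure on the real line and delta_a."""
    return integrate(lambda x: abs(x - a) ** p, atoms, weights) ** (1.0 / p)


def bounded_test(x):
    """One bounded continuous test function; it vanishes at 0."""
    return min(abs(x), 1.0)


for n in (10, 100, 1000):
    atoms, weights = escaping_mass(n)
    print(n, integrate(bounded_test, atoms, weights), distance_to_dirac(atoms, weights))
```

Here $\mu_{n}\to\delta_{0}$ weakly (the printed integrals of the bounded test function tend to its value at $0$ ), while the first moment of $\mu_{n}$ stays equal to $1$ , so the $\mathcal{W}_{1}$ -distance to $\delta_{0}$ does not tend to zero. This is the situation that the additional moment condition rules out.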

7. Optimal Stopping

In this section we investigate the relation between the ( $\mathcal{W}_{p}$ -)optimal stopping topology and the adapted Wasserstein topology. Lemma 7.1 states that the topology induced by $\mathcal{A}\mathcal{W}_{p}$ ((1) of Theorem 1.3) is finer than the $\mathcal{W}_{p}$ -optimal stopping topology. Lemma 7.5 states that the $\mathcal{W}_{p}$ -optimal stopping topology is finer than the $\mathcal{W}_{p}$ -information topology ((3) of Theorem 1.3). This will finish the proof of Theorem 1.3.

Recall that

[TABLE]

for $L=(L_{t})_{t=0}^{N}\in AC_{p}(\Omega)$ .
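For a finitely supported law the value of an optimal stopping problem can be computed by backward induction. The following is a minimal sketch under several assumptions of ours: time is indexed $1,\dots,N$ (rather than starting at $0$ ), stopping times are taken w.r.t. the filtration generated by the process, the value is an infimum of expected costs (as suggested by the proof of Lemma 7.1), and a law is stored as a dictionary from paths to strictly positive probabilities. The name `optimal_stopping_value` and the example laws are hypothetical.

```python
from collections import defaultdict


def optimal_stopping_value(law, costs):
    """Value inf_tau E[L_tau] for a finitely supported law, by backward induction.

    law   : dict mapping a path (x_1, ..., x_N) to its probability (> 0)
    costs : list of N functions; costs[t - 1](prefix) is the cost of stopping
            at time t after observing prefix = (x_1, ..., x_t)
    """
    horizon = len(costs)
    mass = dict(law)
    # At the final time one is forced to stop.
    value = {path: costs[horizon - 1](path) for path in law}
    for t in range(horizon - 1, 0, -1):
        prefix_mass = defaultdict(float)
        prefix_sum = defaultdict(float)
        for node, m in mass.items():
            prefix_mass[node[:t]] += m
            prefix_sum[node[:t]] += m * value[node]
        # Stop now or continue, whichever has the smaller expected cost.
        value = {
            prefix: min(costs[t - 1](prefix), prefix_sum[prefix] / prefix_mass[prefix])
            for prefix in prefix_mass
        }
        mass = dict(prefix_mass)
    return sum(mass[node] * value[node] for node in mass)


eps = 0.01
mu = {(0.0, 1.0): 0.5, (0.0, -1.0): 0.5}        # X_1 reveals nothing about X_2
mu_eps = {(eps, 1.0): 0.5, (-eps, -1.0): 0.5}   # X_1 already reveals X_2
costs = [lambda prefix: 0.0, lambda prefix: prefix[1]]

print(optimal_stopping_value(mu, costs))      # 0.0
print(optimal_stopping_value(mu_eps, costs))  # -0.5
```

With the sum metric on paths, the two laws in this toy example are at usual Wasserstein distance at most `eps`, yet their optimal stopping values differ by $1/2$ . Values of optimal stopping problems are therefore not continuous for the usual weak topology, which is why a finer, adapted topology is needed.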

Lemma 7.1.

Let $L\in AC_{p}(\Omega)$ . Then $\mu\mapsto v^{L}(\mu)$ is continuous w.r.t. $\mathcal{A}\mathcal{W}_{p}$ . In fact, one has

[TABLE]

for every $\mu,\nu\in\mathscr{P}\!_{p}\!\left(\Omega\right)$ .

Proof.

Let $\mu,\nu\in\mathscr{P}\!_{p}\!\left(\Omega\right)$ and assume that $v^{L}(\mu)\leq v^{L}(\nu)$ . Moreover, let $\pi\in\operatorname{Cpl}_{bc}(\mu,\nu)$ and $\varepsilon>0$ be arbitrary, and fix a stopping time $\tau$ satisfying $\mathbb{E}^{\nu}\left(L_{\tau}(Y)\right)\leq v^{L}(\nu)+\varepsilon$ . For $u\in[0,1]$ define

[TABLE]

where the equality holds by the properties of stopping times and since $\pi$ is causal. We then have that

[TABLE]

As further $\sigma(\cdot,u)$ is a stopping time for every fixed $u\in[0,1]$ one has $v^{L}(\mu)\leq\int_{[0,1]}\mathbb{E}^{\pi}\left(L_{\sigma(X,u)}(X)\right)\,\mathrm{d}u$ and therefore

[TABLE]

Changing the role of $\mu$ and $\nu$ and using that $\varepsilon>0$ and $\pi\in\operatorname{Cpl}_{bc}(\mu,\nu)$ were arbitrary yields (35).

Now assume that $\mathcal{A}\mathcal{W}_{p}(\mu_{n},\mu)\to 0$ and that $\pi_{n}\in\operatorname{Cpl}\left(\mu_{n},\mu\right)$ is less than $1/n$ away from attaining the infimum $\mathcal{A}\mathcal{W}_{p}(\mu_{n},\mu)$ . Then $\mathcal{W}_{p}(\pi_{n},\pi)\to 0$ , where $\pi\in\operatorname{Cpl}\left(\mu,\mu\right)$ is the identity coupling $\mathscr{P}\!\left(1_{\Omega}\bm{,}1_{\Omega}\right)(\mu)$ of $\mu$ . (A coupling between $\pi_{n}$ and $\pi$ is given by $\mathscr{P}\!\left((x,y)\mapsto(x,y,y,y)\right)(\pi_{n})$ .) Because $(x,y)\mapsto\max_{0\leq t\leq N}|L_{t}(x)-L_{t}(y)|$ is a continuous function of growth of at most order $p$ , we get that

[TABLE]

Together with (35) this implies that $v^{L}$ is continuous w.r.t. $\mathcal{A}\mathcal{W}_{p}$ . ∎

Remark 7.2.

The above proof reveals that if $L_{t}$ is Lipschitz with constant $c>0$ for every $t$ , then $|v^{L}(\mu)-v^{L}(\nu)|\leq c\,\mathcal{SCW}_{1}(\mu,\nu)$ .

In order to show that the optimal stopping topology is finer than the $\mathcal{W}_{p}$ -information topology, we need to make a few preparations.

Lemma 7.3.

Let $\mathcal{A}$ be a Polish space. Then the family

[TABLE]

is convergence determining for the weak topology on $\mathscr{P}\!\left(\mathscr{P}\!\left(\mathcal{A}\right)\right)$ , that is, a sequence of probability measures $(\mu_{n})_{n}$ in $\mathscr{P}\!\left(\mathscr{P}\!\left(\mathcal{A}\right)\right)$ converges weakly to a probability measure $\mu\in\mathscr{P}\!\left(\mathscr{P}\!\left(\mathcal{A}\right)\right)$ if and only if $\int F\,\mathrm{d}\mu_{n}\to\int F\,\mathrm{d}\mu$ for all $F$ in (38).

This follows from the Stone-Weierstrass theorem in case of compact $\mathcal{A}$ and readily extends to general Polish spaces e.g. via Stone-Čech compactification.

Lemma 7.4.

Let $\mathcal{A}$ be a Polish space. The family of functions

[TABLE]

is convergence determining for the weak topology on $\mathscr{P}\!\left(\mathscr{P}\!\left(\mathcal{A}\right)\right)$ .

Proof.

Let $L$ , $G$ , and $(h_{i})_{i\leq L}$ be as in (38). Moreover, let $m\in\mathbb{R}$ be such that $|h_{i}|\leq m$ for all $1\leq i\leq L$ and define $I:=[-m,m]^{L}$ . Then $I\subset\mathbb{R}^{L}$ is compact and satisfies

[TABLE]

Let $\sigma\colon\mathbb{R}\to\mathbb{R}$ be some fixed bounded continuous sigmoid function such as $\sigma(r)=(1+e^{-r})^{-1}$ or $\sigma(r)=\max(0,\min(r,1))$ .

By the universal approximation result of Cybenko [21, Theorem 2], the set

[TABLE]

is dense in $C(I,\mathbb{R})$ w.r.t. the supremum norm. As a result, it is enough to replace $G$ in (38) by functions of the form $x\mapsto\sum_{i=1}^{m}u_{i}\sigma(v_{i}\cdot x+w_{i})$ . Evaluating the latter function on the vector $x=(\int h_{1}\,\mathrm{d}\mu,\dots,\int h_{L}\,\mathrm{d}\mu)$ yields

[TABLE]

upon defining $v_{i}^{L+1}:=w_{i}$ , $h_{L+1}:=1$ , and finally $\bar{h}_{i}:=\sum_{k=1}^{L+1}v_{i}^{k}h_{k}$ for every $i$ . The result follows from Lemma 7.3. ∎
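The absorption of the bias term in the last step of this proof can be spelled out as follows. This is a sketch under our reading of the notation, not a transcription of the paper's displayed formula: $\mu\in\mathscr{P}\!\left(\mathcal{A}\right)$ is a probability measure, $v_{i}^{k}$ denotes the $k$ -th component of $v_{i}$ , $x=(\int h_{1}\,\mathrm{d}\mu,\dots,\int h_{L}\,\mathrm{d}\mu)$ , and the bias $w_{i}$ is treated as the coefficient of the constant function $1$ , so that $\bar{h}_{i}=\sum_{k=1}^{L}v_{i}^{k}h_{k}+w_{i}$ .

```latex
v_i\cdot x + w_i
  = \sum_{k=1}^{L} v_i^{k}\int h_k\,\mathrm{d}\mu + w_i\int 1\,\mathrm{d}\mu
  = \int\Big(\sum_{k=1}^{L} v_i^{k}h_k + w_i\Big)\,\mathrm{d}\mu
  = \int \bar{h}_i\,\mathrm{d}\mu,
\qquad\text{hence}\qquad
\sum_{i=1}^{m} u_i\,\sigma(v_i\cdot x + w_i)
  = \sum_{i=1}^{m} u_i\,\sigma\Big(\int \bar{h}_i\,\mathrm{d}\mu\Big).
```

Each $\bar{h}_{i}$ is again bounded and continuous, so every such function is a finite linear combination of functions of the form $\mu\mapsto\sigma(\int\bar{h}\,\mathrm{d}\mu)$ with a single bounded continuous integrand.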

Lemma 7.5.

The $\mathcal{W}_{p}$ -optimal stopping topology is finer than the $\mathcal{W}_{p}$ -information topology.

Proof.

The choice $L_{T}:=-\rho(x,x_{0})^{p}-1$ and $L_{t}:=0$ for $t\neq T$ shows that convergence in the $\mathcal{W}_{p}$ -optimal stopping topology implies convergence of the $p$ -th moments. Thus, we are left to show that convergence in the optimal stopping topology implies convergence in Hellwig’s information topology. Then, by the part of Lemma 1.3 which has already been established, we obtain convergence in the $\mathcal{W}_{p}$ -information topology.

Fix $1\leq t\leq N-1$ and denote by $\mathcal{A}:=\overline{\mathcal{X}}^{t}$ the first $t$ and by $\mathcal{B}:=\overline{\mathcal{X}}_{t+1}$ the last $N-t$ coordinates. As $C_{b}(\mathcal{A})$ is convergence determining for $\mathscr{P}\!\left(\mathcal{A}\right)$ , and $\{\nu\mapsto G(\int_{\mathcal{B}}h\,\mathrm{d}\nu):h\in C_{b}(\mathcal{B}),G\in C_{b}({\mathbb{R}})\}$ is, by Lemma 7.4, convergence determining for $\mathscr{P}\!\left(\mathscr{P}\!\left(\mathcal{B}\right)\right)$ , it follows e.g. from [26, Proposition 4.6 (p.115)] that

[TABLE]

is convergence determining for the weak topology on $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)$ . Since $h$ in (40) is bounded, one can actually take $g$ in (40) to be compactly supported. But a continuous compactly supported function can be approximated uniformly by piecewise linear functions. The latter are linear combinations of functions of the form $z\mapsto\min(c,dz)$ where $c,d\in\mathbb{R}$ . It therefore follows that

[TABLE]

is also convergence determining for the weak topology on $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)$ . Let $F$ be a function in (41), defined via $f\in C_{b}(\mathcal{A})$ and $h\in C_{b}(\mathcal{B})$ , and let $m\in\mathbb{R}$ be a bound for $|f|$ and $|h|$ . Define $L\in AC_{p}(\Omega)$ via

[TABLE]

(Here $\overline{X}^{t}$ is the projection onto the first $t$ coordinates and $\overline{X}_{t+1}$ is the projection onto the remaining $N-t$ coordinates.)

By dynamic programming (the Snell-envelope theorem) one has

[TABLE]

for every $\mu\in\mathscr{P}\!\left(\mathcal{A}\times\mathcal{B}\right)$ . This implies that the optimal stopping topology is finer than the initial topology of $\mu\mapsto\int F\,\mathrm{d}(\operatorname{dis}_{\mathcal{A}}^{\mathcal{B}}(\mu))$ over $F$ in (41). As (41) is convergence determining for the weak topology on $\mathscr{P}\!\left(\mathcal{A}\times\mathscr{P}\!\left(\mathcal{B}\right)\right)$ , the optimal stopping topology is indeed finer than the information topology. As observed at the beginning of this proof, it follows that the $\mathcal{W}_{p}$ -optimal stopping topology is finer than the $\mathcal{W}_{p}$ -information topology. ∎
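The step in the proof of Lemma 7.5 that passes from piecewise linear functions to functions of the form $z\mapsto\min(c,dz)$ can be checked numerically. The sketch below is our own illustration with hypothetical names (`min_decomposition`, `evaluate`, `interpolate`). It assumes a continuous piecewise linear function given by strictly increasing knots and values that vanish at the first and last knot (compact support), and it only uses terms with $d=1$ . Writing $a_{i}$ for the change of slope at the knot $z_{i}$ , one has $g(z)=\sum_{i}a_{i}(z-z_{i})^{+}$ with $\sum_{i}a_{i}=0$ , and therefore $g(z)=-\sum_{i}a_{i}\min(z_{i},z)$ .

```python
def min_decomposition(knots, values):
    """Write a compactly supported piecewise linear g as sum_i w_i * min(c_i, z).

    knots  : strictly increasing list z_0 < ... < z_n
    values : g(z_0), ..., g(z_n) with g(z_0) = g(z_n) = 0
    Returns a list of pairs (w_i, c_i).
    """
    slopes = [
        (values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
        for i in range(len(knots) - 1)
    ]
    # The slope is 0 left of the first knot and right of the last knot.
    padded = [0.0] + slopes + [0.0]
    kinks = [padded[i + 1] - padded[i] for i in range(len(knots))]
    return [(-a, c) for a, c in zip(kinks, knots)]


def evaluate(terms, z):
    """Evaluate sum_i w_i * min(c_i, z)."""
    return sum(w * min(c, z) for w, c in terms)


def interpolate(knots, values, z):
    """Reference evaluation of the piecewise linear function (0 outside the knots)."""
    if z <= knots[0] or z >= knots[-1]:
        return 0.0
    for i in range(len(knots) - 1):
        if knots[i] <= z <= knots[i + 1]:
            s = (z - knots[i]) / (knots[i + 1] - knots[i])
            return (1.0 - s) * values[i] + s * values[i + 1]


knots, values = [0.0, 1.0, 3.0], [0.0, 2.0, 0.0]
terms = min_decomposition(knots, values)
grid = [-2.0 + 0.25 * k for k in range(29)]  # points from -2 to 5
max_error = max(abs(evaluate(terms, z) - interpolate(knots, values, z)) for z in grid)
print(terms, max_error)
```

The printed error is zero up to rounding, which confirms on this example that a compactly supported piecewise linear function is a linear combination of the functions $z\mapsto\min(c,z)$ .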

Bibliography

[1] B. Acciaio, J. Backhoff-Veraguas, and R. Carmona. Extended mean field control problems: stochastic maximum principle and transport perspective. arXiv preprint arXiv:1802.05754, 2018.
[2] B. Acciaio, J. Backhoff-Veraguas, and A. Zalashko. Causal optimal transport and its links to enlargement of filtrations and continuous-time stochastic optimization. ArXiv e-prints, 2016.
[3] D. J. Aldous. Weak convergence and general theory of processes. Unpublished draft of monograph; Department of Statistics, University of California, Berkeley, CA 94720, July 1981.
[4] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, second edition, 2008.
[5] J. Backhoff-Veraguas, D. Bartl, M. Beiglböck, and M. Eder. Adapted Wasserstein Distances and Stability in Mathematical Finance. arXiv e-prints, page arXiv:1901.07450, Jan 2019.
[6] J. Backhoff-Veraguas, M. Beiglböck, M. Eder, and A. Pichler. Fundamental properties of process distances. ArXiv e-prints, 2017.
[7] J. Backhoff-Veraguas, M. Beiglböck, M. Huesmann, and S. Källblad. Martingale Benamou–Brenier: a probabilistic perspective. ArXiv e-prints, Aug. 2017.
[8] J. Backhoff-Veraguas, M. Beiglböck, Y. Lin, and A. Zalashko. Causal transport in discrete time and applications. SIAM Journal on Optimization, 27(4):2528–2562, 2017.
