Asymptotic distribution and convergence rates of stochastic algorithms for entropic optimal transportation between probability measures

Bernard Bercu; Jérémie Bigot

arXiv:1812.09150·math.ST·December 10, 2024


TL;DR

This paper analyzes the convergence and distribution of stochastic algorithms for estimating entropic optimal transportation costs, specifically Sinkhorn divergences, using a Robbins-Monro approach with theoretical guarantees and practical experiments.

Contribution

It establishes almost sure convergence, asymptotic normality, and convergence rates for a new recursive estimator of Sinkhorn divergence in semi-discrete and discrete settings.

Findings

1. Proves almost sure convergence of the estimator.
2. Derives asymptotic normality results.
3. Provides numerical experiments demonstrating effectiveness.



Equations

$$W_{\varepsilon}(\mu,\nu)=\max_{v\in\mathbb{R}^{J}}\mathbb{E}[h_{\varepsilon}(X,v)]$$

$$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\nabla_{v}h_{\varepsilon}(X_{n+1},\widehat{V}_{n})$$

$$\widehat{W}_{n}=\frac{1}{n}\sum_{k=1}^{n}h_{\varepsilon}(X_{k},\widehat{V}_{k-1})$$

$$\lim_{n\to\infty}\widehat{W}_{n}=W_{\varepsilon}(\mu,\nu)\quad\text{a.s.}$$

$$\sqrt{n}\bigl(\widehat{W}_{n}-W_{\varepsilon}(\mu,\nu)\bigr)\overset{\mathcal{L}}{\longrightarrow}\mathcal{N}\bigl(0,\sigma^{2}_{\varepsilon}(\mu,\nu)\bigr)$$

$$W_{\varepsilon}(\mu,\nu)=\min_{\pi\in\Pi(\mu,\nu)}G_{\varepsilon}(\pi)$$

$$G_{\varepsilon}(\pi)=\int_{\mathcal{X}\times\mathcal{Y}}c(x,y)\,d\pi(x,y)+\varepsilon\,\mathrm{KL}(\pi|\mu\otimes\nu)$$

$$\mathrm{KL}(\pi|\xi)=\int_{\mathcal{X}\times\mathcal{Y}}\Bigl(\log\Bigl(\frac{d\pi}{d\xi}(x,y)\Bigr)-1\Bigr)d\pi(x,y)$$

$$0\leq c(x,y)\leq c_{\mathcal{X}}(x)+c_{\mathcal{Y}}(y)$$

$$W_{\varepsilon}(\mu,\nu)=\sup_{(u,v)\in\mathcal{C}_{b}(\mathcal{X})\times\mathcal{C}_{b}(\mathcal{Y})}\int_{\mathcal{X}\times\mathcal{Y}}\Bigl(u(x)+v(y)-w_{c}(x,y)\Bigr)d\mu(x)d\nu(y)$$

$$w_{c}(x,y)=\varepsilon\exp\Bigl(\frac{u(x)+v(y)-c(x,y)}{\varepsilon}\Bigr)$$

$$W_{\varepsilon}(\mu,\nu)=\varepsilon\min_{\pi\in\Pi(\mu,\nu)}\mathrm{KL}(\pi|\gamma)$$

$$d\gamma(x,y)=\exp\Bigl(\frac{-c(x,y)}{\varepsilon}\Bigr)d\mu(x)d\nu(y)$$

$$W_{\varepsilon}(\mu,\nu)=\sup_{v\in\mathcal{C}_{b}(\mathcal{Y})}H_{\varepsilon}(v)$$

$$H_{\varepsilon}(v)=\int_{\mathcal{X}}v_{c,\varepsilon}(x)\,d\mu(x)+\int_{\mathcal{Y}}v(y)\,d\nu(y)-\varepsilon$$

$$v_{c,\varepsilon}(x)=\begin{cases}\displaystyle\inf_{y\in\mathcal{Y}}\bigl\{c(x,y)-v(y)\bigr\}&\text{if }\varepsilon=0,\\[1ex]\displaystyle-\varepsilon\log\Bigl(\int_{\mathcal{Y}}\exp\Bigl(\frac{v(y)-c(x,y)}{\varepsilon}\Bigr)d\nu(y)\Bigr)&\text{if }\varepsilon>0.\end{cases}$$

$$W_{\varepsilon}(\mu,\nu)=\max_{v\in\mathcal{C}_{b}(\mathcal{Y})}\mathbb{E}[h_{\varepsilon}(X,v)]$$

$$\nu=\sum_{j=1}^{J}\nu_{j}\delta_{y_{j}}$$

$$W_{\varepsilon}(\mu,\nu)=\max_{v\in\mathbb{R}^{J}}H_{\varepsilon}(v)$$

$$H_{\varepsilon}(v)=\mathbb{E}[h_{\varepsilon}(X,v)]$$

$$h_{\varepsilon}(x,v)=\begin{cases}\displaystyle\sum_{j=1}^{J}v_{j}\nu_{j}+\min_{1\leq j\leq J}\bigl\{c(x,y_{j})-v_{j}\bigr\}&\text{if }\varepsilon=0,\\[1ex]\displaystyle\sum_{j=1}^{J}v_{j}\nu_{j}-\varepsilon\log\Bigl(\sum_{j=1}^{J}\exp\Bigl(\frac{v_{j}-c(x,y_{j})}{\varepsilon}\Bigr)\nu_{j}\Bigr)-\varepsilon&\text{if }\varepsilon>0.\end{cases}$$

$$\nabla_{v}h_{\varepsilon}(x,v)=\nu-\pi(x,v)$$

$$\nabla^{2}_{v}h_{\varepsilon}(x,v)=\frac{1}{\varepsilon}\Bigl(\pi(x,v)\pi(x,v)^{T}-\operatorname{diag}(\pi(x,v))\Bigr)$$

$$\pi_{j}(x,v)=\Bigl(\sum_{k=1}^{J}\nu_{k}\exp\Bigl(\frac{v_{k}-c(x,y_{k})}{\varepsilon}\Bigr)\Bigr)^{-1}\nu_{j}\exp\Bigl(\frac{v_{j}-c(x,y_{j})}{\varepsilon}\Bigr)$$

$$\nabla H_{\varepsilon}(v)=\mathbb{E}[\nabla_{v}h_{\varepsilon}(X,v)]=\nu-\mathbb{E}[\pi(X,v)]$$

$$\nabla^{2}H_{\varepsilon}(v)=\mathbb{E}[\nabla_{v}^{2}h_{\varepsilon}(X,v)]=\frac{1}{\varepsilon}\mathbb{E}\bigl[\pi(X,v)\pi(X,v)^{T}-\operatorname{diag}(\pi(X,v))\bigr]$$

$$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\nabla_{v}h_{\varepsilon}(X_{n+1},\widehat{V}_{n})$$

$$\sum_{n=1}^{\infty}\gamma_{n}=+\infty\quad\text{and}\quad\sum_{n=1}^{\infty}\gamma_{n}^{2}<+\infty$$

$$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\bigl(\nabla_{v}h_{\varepsilon}(X_{n+1},\widehat{V}_{n})-\alpha\langle\widehat{V}_{n},\boldsymbol{v}_{J}\rangle\boldsymbol{v}_{J}\bigr)$$

$$H_{\varepsilon,\alpha}(v)=\mathbb{E}[h_{\varepsilon,\alpha}(X,v)]$$


Full text

Asymptotic distribution and convergence rates of stochastic algorithms for entropic optimal transportation between probability measures

Bernard Bercu and Jérémie Bigot

Université de Bordeaux

Institut de Mathématiques de Bordeaux et CNRS (UMR 5251)

Abstract.

This paper is devoted to the stochastic approximation of entropically regularized Wasserstein distances between two probability measures, also known as Sinkhorn divergences. The semi-dual formulation of such regularized optimal transportation problems can be rewritten as a non-strongly concave optimisation problem. This makes it possible to implement a Robbins-Monro stochastic algorithm that estimates the Sinkhorn divergence using a sequence of data sampled from one of the two distributions. Our main contribution is to establish the almost sure convergence and the asymptotic normality of a new recursive estimator of the Sinkhorn divergence between two probability measures in the discrete and semi-discrete settings. We also study the rate of convergence of the expected excess risk of this estimator in the absence of strong concavity of the objective function. Numerical experiments on synthetic and real datasets are also provided to illustrate the usefulness of our approach for data analysis.

Jérémie Bigot is a member of Institut Universitaire de France (IUF), and this work has been carried out with financial support from the IUF.

Keywords: Stochastic optimisation; Convergence of random variables; Optimal transport; Entropic regularization; Sinkhorn divergence; Wasserstein distance.

AMS classifications: Primary 62G05; secondary 62G20.

1. Introduction

1.1. Optimal transport and regularized Wasserstein distances for data analysis

The statistical analysis of high-dimensional data using tools from the theory of optimal transport [48] and the notion of Wasserstein distance between probability measures has recently gained increasing popularity. When elements in a dataset may be represented as probability distributions, the use of the Wasserstein distance leads to relevant statistics in various fields such as fingerprints comparison [45], clinical trials [35], metagenomics [22], clustering of discrete distributions [50], learning of generative models [4], or geodesic principal component analysis [9, 44]. Wasserstein distances are also of interest in the setting of statistical inference from empirical measures for hypothesis testing on discrepancies between multivariate distributions [45].

However, the cost of evaluating a Wasserstein distance between two discrete probability distributions with supports of equal size $K$ is generally of order $K^{3}\log K$. Consequently, this evaluation represents a serious limitation for high-dimensional data analysis. To overcome this issue, Cuturi [15] proposed adding an entropic regularization term to the linear program corresponding to a standard optimal transport problem. This leads to the notion of Sinkhorn divergence between two probability distributions, which may be computed through an iterative algorithm where the cost of each iteration is of order $K^{2}$. This proposal makes the use of regularized optimal transportation distances feasible for data analysis, and it has found various applications in generative models [27], multi-label learning [24], dictionary learning [43], and image processing [16, 29, 40], to name but a few. For an overview of regularized optimal transport applied to machine learning, we refer the reader to the recent book of Cuturi and Peyré [17] as well as to [2, 15] for deterministic algorithms on the computation of Sinkhorn divergences.

1.2. Main contributions and related works

This paper is inspired by the recent work of Genevay, Cuturi, Peyré and Bach [26] on a very efficient statistical procedure to evaluate the possibly regularized Wasserstein distance $W_{\varepsilon}(\mu,\nu)$ between an arbitrary probability measure $\mu$ and a discrete distribution $\nu$ with finite support of size $J$ . This statistical procedure is based on the well-known Robbins-Monro algorithm for stochastic optimisation [42] and its averaged version [38]. The keystone of their approach [26] is that $W_{\varepsilon}(\mu,\nu)$ can be rewritten in expectation form as

$$W_{\varepsilon}(\mu,\nu)=\max_{v\in\mathbb{R}^{J}}\mathbb{E}[h_{\varepsilon}(X,v)]$$

where $X$ is a random vector drawn from the unknown distribution $\mu$ and $h_{\varepsilon}(x,v)$ is a suitable function depending on the regularization parameter $\varepsilon\geq 0$. As shown in [26] on clouds of word embeddings, this statistical procedure is easy to implement at a low computational cost. For a sequence $(X_{n})$ of independent and identically distributed random variables sharing the same distribution as $X$, we shall extend the statistical analysis of [26] by proving that, for $\varepsilon>0$, the Robbins-Monro algorithm

$$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\nabla_{v}h_{\varepsilon}(X_{n+1},\widehat{V}_{n})$$

converges almost surely to a maximizer $v^{\ast}$ of $H_{\varepsilon}(v)={\mathbb{E}}[h_{\varepsilon}(X,v)]$. We also investigate the asymptotic normality of $\widehat{V}_{n}$. This leads us to estimate the Sinkhorn divergence $W_{\varepsilon}(\mu,\nu)$ by the new recursive estimator

$$\widehat{W}_{n}=\frac{1}{n}\sum_{k=1}^{n}h_{\varepsilon}(X_{k},\widehat{V}_{k-1}).$$

Under standard assumptions in stochastic optimisation, we shall prove that

$$\lim_{n\to\infty}\widehat{W}_{n}=W_{\varepsilon}(\mu,\nu)\quad\text{a.s.}$$

as well as the asymptotic normality

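$$\sqrt{n}\bigl(\widehat{W}_{n}-W_{\varepsilon}(\mu,\nu)\bigr)\overset{\mathcal{L}}{\longrightarrow}\mathcal{N}\bigl(0,\sigma^{2}_{\varepsilon}(\mu,\nu)\bigr)$$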

where the asymptotic variance $\sigma^{2}_{\varepsilon}(\mu,\nu)={\mathbb{E}}\bigl[h_{\varepsilon}^{2}(X,v^{\ast})\bigr]-W_{\varepsilon}^{2}(\mu,\nu)$ can also be estimated in a recursive manner. Finally, we analyze the rate of convergence of the expected excess risk $H_{\varepsilon}(v^{\ast})-{\mathbb{E}}[\widehat{W}_{n}]$. We shall prove that the expected excess risk goes to zero faster than $1/\sqrt{n}$ in the absence of strong concavity of the objective function $H_{\varepsilon}$. These asymptotic results help to better understand the convergence of stochastic algorithms for regularized optimal transport problems and to analyse the influence of their asymptotic variance on the quality of estimation. We shall also establish further results in the unregularized case where the regularization parameter $\varepsilon=0$.
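Both the estimator and a variance estimate can be maintained online, without storing past samples. The following Python sketch is only an illustration: it assumes the plug-in variance rule $\frac{1}{n}\sum_{k}h_{k}^{2}-\widehat{W}_{n}^{2}$, which mirrors the formula for $\sigma^{2}_{\varepsilon}(\mu,\nu)$ but is not claimed to be the recursive variance estimator analysed in the paper. The class name `RunningSinkhornEstimate` and its methods are hypothetical.

```python
import math


class RunningSinkhornEstimate:
    """Online mean and plug-in variance of the values h_eps(X_k, V_{k-1}).

    Hypothetical helper: the variance rule (1/n) * sum h_k^2 - W_n^2 is an
    assumption that mirrors sigma^2 = E[h^2] - W^2.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0     # running value of W_n
        self.second = 0.0   # running value of (1/n) * sum h_k^2

    def update(self, h_value):
        self.n += 1
        self.mean += (h_value - self.mean) / self.n
        self.second += (h_value ** 2 - self.second) / self.n

    def variance(self):
        # Clamp at zero to guard against tiny negative rounding errors.
        return max(self.second - self.mean ** 2, 0.0)

    def confidence_interval(self, z=1.96):
        """Asymptotic interval W_n +/- z * sigma_n / sqrt(n)."""
        half = z * math.sqrt(self.variance() / self.n)
        return self.mean - half, self.mean + half


est = RunningSinkhornEstimate()
for value in [1.0, 2.0, 3.0, 4.0]:   # stand-ins for h_eps(X_k, V_{k-1})
    est.update(value)
```

The interval returned by `confidence_interval` relies on the asymptotic normality stated above, so it should be read as an approximation for large $n$.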

The asymptotic behavior of empirical Wasserstein distances when both $\mu$ and $\nu$ are absolutely continuous measures has been extensively studied over the last years [18, 19, 20, 23, 41]. For probability measures supported on finite spaces, limiting distributions for empirical Wasserstein distance have been obtained in [45], while the asymptotic distribution of empirical Sinkhorn divergence has been recently considered in [8, 32].

1.3. Organisation of the paper

Notation and formulation of the possibly regularized optimal transportation problem are presented in Section 2. Asymptotic properties of our stochastic algorithms for regularized optimal transport are stated in Section 3, and further results for unregularized optimal transport are provided in Section 4. Numerical experiments illustrating our theoretical results on simulated and real data are given in Section 5, where we consider the problem of estimating an optimal mapping between the distribution of spatial locations of reported incidents of crime in Chicago and the locations of police stations. All the proofs are postponed to Appendix A (Three keystone lemmas) and Appendix B (Proofs of the main results).

2. Formulation of the optimal transportation problem

Let $\mathcal{X}$ and $\mathcal{Y}$ be two metric spaces. Denote by $\mathcal{M}_{+}^{1}(\mathcal{X})$ and $\mathcal{M}_{+}^{1}(\mathcal{Y})$ the sets of probability measures on $\mathcal{X}$ and $\mathcal{Y}$, and by $\mathcal{C}_{b}(\mathcal{X})$ and $\mathcal{C}_{b}(\mathcal{Y})$ the spaces of bounded and continuous functions on $\mathcal{X}$ and $\mathcal{Y}$, respectively. When $\mathcal{X}=\{x_{1},\ldots,x_{I}\}$ and $\mathcal{Y}=\{y_{1},\ldots,y_{J}\}$ are finite sets, we identify the spaces $\mathcal{C}_{b}(\mathcal{X})$ and $\mathcal{C}_{b}(\mathcal{Y})$ with the Euclidean spaces ${\mathbb{R}}^{I}$ and ${\mathbb{R}}^{J}$. For $\mu\in\mathcal{M}_{+}^{1}(\mathcal{X})$ and $\nu\in\mathcal{M}_{+}^{1}(\mathcal{Y})$, let $\Pi(\mu,\nu)$ be the set of probability measures on $\mathcal{X}\times\mathcal{Y}$ with marginals $\mu$ and $\nu$. As formulated in Section 2 of [26], the problem of entropic optimal transportation between $\mu\in\mathcal{M}_{+}^{1}(\mathcal{X})$ and $\nu\in\mathcal{M}_{+}^{1}(\mathcal{Y})$ is as follows.

Definition 2.1.

For any $(\mu,\nu)\in\mathcal{M}_{+}^{1}(\mathcal{X})\times\mathcal{M}_{+}^{1}(\mathcal{Y})$ , the Kantorovich formulation of the possibly regularized optimal transport between $\mu$ and $\nu$ is the following convex minimization problem

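$$W_{\varepsilon}(\mu,\nu)=\min_{\pi\in\Pi(\mu,\nu)}G_{\varepsilon}(\pi),\tag{2.1}$$

with

$$G_{\varepsilon}(\pi)=\int_{\mathcal{X}\times\mathcal{Y}}c(x,y)\,d\pi(x,y)+\varepsilon\,\mathrm{KL}(\pi|\mu\otimes\nu),$$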

where $c:\mathcal{X}\times\mathcal{Y}\to{\mathbb{R}}$ is a lower semi-continuous function referred to as the cost function of moving mass from location $x$ to $y$ , $\varepsilon\geq 0$ is a regularization parameter, and KL stands for the Kullback-Leibler divergence, between $\pi$ and a positive measure $\xi$ on $\mathcal{X}\times\mathcal{Y}$ , up to the additive term $\int_{\mathcal{X}\times\mathcal{Y}}d\xi(x,y)$ , namely

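$$\mathrm{KL}(\pi|\xi)=\int_{\mathcal{X}\times\mathcal{Y}}\Bigl(\log\Bigl(\frac{d\pi}{d\xi}(x,y)\Bigr)-1\Bigr)d\pi(x,y).$$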

For $\varepsilon=0$ , the quantity $W_{0}(\mu,\nu)$ is the standard optimal transportation cost, while for $\varepsilon>0$ , we shall refer to $W_{\varepsilon}(\mu,\nu)$ as the Sinkhorn divergence between the two probability measures $\mu$ and $\nu$ . The choice of the cost function $c$ depends on the application, and it usually reflects some prior knowledge on the data or the problem at hand. Throughout the paper, following [49, Part I-4], we consider cost functions satisfying the following assumption which holds for many standard choices. The cost $c$ is a lower semi-continuous function satisfying, for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$ ,

$$0\leq c(x,y)\leq c_{\mathcal{X}}(x)+c_{\mathcal{Y}}(y)\tag{2.3}$$

where $c_{\mathcal{X}}$ and $c_{\mathcal{Y}}$ are real-valued functions such that $\int_{\mathcal{X}}c_{\mathcal{X}}(x)d\mu(x)<+\infty$ and $\int_{\mathcal{Y}}c_{\mathcal{Y}}(y)d\nu(y)<+\infty$. Under condition (2.3), $W_{\varepsilon}(\mu,\nu)$ is finite for any value of the regularization parameter $\varepsilon\geq 0$. Moreover, we always have $W_{0}(\mu,\nu)\geq 0$, while $W_{\varepsilon}(\mu,\nu)$ can be negative for $\varepsilon>0$, because the KL divergence in Definition 2.1 is only defined up to a constant additive term. We shall now define the dual and semi-dual formulations of the minimization problem (2.1) as introduced in [26]. For $\varepsilon=0$, these formulations are classical results in optimal transport known as Kantorovich's duality. If the regularization parameter $\varepsilon>0$, it follows from [26, Proposition 2.1] that the dual expression of the minimization problem (2.1) is given by

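$$W_{\varepsilon}(\mu,\nu)=\sup_{(u,v)\in\mathcal{C}_{b}(\mathcal{X})\times\mathcal{C}_{b}(\mathcal{Y})}\int_{\mathcal{X}\times\mathcal{Y}}\Bigl(u(x)+v(y)-w_{c}(x,y)\Bigr)d\mu(x)d\nu(y)\tag{2.4}$$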

where

$$w_{c}(x,y)=\varepsilon\exp\Bigl(\frac{u(x)+v(y)-c(x,y)}{\varepsilon}\Bigr).$$

The proof in [26] of the strong duality statement (2.4) relies on a formal application of Fenchel-Rockafellar's theorem. Nevertheless, another proof can be found in the earlier work [11], which is concerned with variational characterizations for the existence of a probability measure with given marginals, a problem closely related to regularized optimal transport; see also [33] for the connection between the Monge-Kantorovich and the Schrödinger problems. Indeed, it is known [17, Proposition 4.2] that the primal problem (2.1) can be rewritten as the I-projection [13] problem

$$W_{\varepsilon}(\mu,\nu)=\varepsilon\min_{\pi\in\Pi(\mu,\nu)}\mathrm{KL}(\pi|\gamma)\tag{2.5}$$

with

$$d\gamma(x,y)=\exp\Bigl(\frac{-c(x,y)}{\varepsilon}\Bigr)d\mu(x)d\nu(y).$$

Hence, under condition (2.3), the duality result (2.4) can be derived from [11, Theorem 3.9]. The dual formulation (2.4) also follows as a particular instance of [12, Theorem 3.2] which considers unbalanced transport for marginals with different mass, but it is expressed as a maximization problem over the set of bounded functions $L^{\infty}(\mathcal{X})\times L^{\infty}(\mathcal{Y})$ rather than $\mathcal{C}_{b}(\mathcal{X})\times\mathcal{C}_{b}(\mathcal{Y})$ .
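The equivalence between the primal problem (2.1) and the I-projection form (2.5) can be checked by a short computation, sketched below. It assumes that $\pi$ is absolutely continuous with respect to $\mu\otimes\nu$ and that $c$ takes finite values, so that $\gamma$ and $\mu\otimes\nu$ are equivalent measures; these assumptions are made for the sketch and are not quoted from the paper.

```latex
\begin{align*}
\frac{d\pi}{d\gamma}(x,y)
  &= \exp\Bigl(\frac{c(x,y)}{\varepsilon}\Bigr)\,
     \frac{d\pi}{d(\mu\otimes\nu)}(x,y), \\
\varepsilon\,\mathrm{KL}(\pi|\gamma)
  &= \varepsilon\int_{\mathcal{X}\times\mathcal{Y}}
     \Bigl(\frac{c(x,y)}{\varepsilon}
     + \log\Bigl(\frac{d\pi}{d(\mu\otimes\nu)}(x,y)\Bigr) - 1\Bigr)\,d\pi(x,y) \\
  &= \int_{\mathcal{X}\times\mathcal{Y}} c(x,y)\,d\pi(x,y)
     + \varepsilon\,\mathrm{KL}(\pi|\mu\otimes\nu)
   = G_{\varepsilon}(\pi).
\end{align*}
```

Minimizing both sides over $\Pi(\mu,\nu)$ then gives $W_{\varepsilon}(\mu,\nu)=\varepsilon\min_{\pi}\mathrm{KL}(\pi|\gamma)$.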

Then, arguing as in [26], the semi-dual of the convex minimization problem (2.1) is as follows. For any $\varepsilon\geq 0$ , the optimal transportation between $\mu$ and $\nu$ is obtained by solving the concave maximization problem

$$W_{\varepsilon}(\mu,\nu)=\sup_{v\in\mathcal{C}_{b}(\mathcal{Y})}H_{\varepsilon}(v)\tag{2.6}$$

where

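$$H_{\varepsilon}(v)=\int_{\mathcal{X}}v_{c,\varepsilon}(x)\,d\mu(x)+\int_{\mathcal{Y}}v(y)\,d\nu(y)-\varepsilon,$$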

and for a given cost function satisfying (2.3), we define the $c$ -transform of $v\in\mathcal{C}_{b}(\mathcal{Y})$ as

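$$v_{c,\varepsilon}(x)=\begin{cases}\displaystyle\inf_{y\in\mathcal{Y}}\bigl\{c(x,y)-v(y)\bigr\}&\text{if }\varepsilon=0,\\[1ex]\displaystyle-\varepsilon\log\Bigl(\int_{\mathcal{Y}}\exp\Bigl(\frac{v(y)-c(x,y)}{\varepsilon}\Bigr)d\nu(y)\Bigr)&\text{if }\varepsilon>0.\end{cases}$$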

Thanks to condition (2.3), the sup in (2.6) is a max, meaning that there exists a dual variable $v^{\ast}$ such that $W_{0}(\mu,\nu)=H_{0}(v^{\ast})$, see e.g. [49, Theorem 5.9]. In the regularized case $\varepsilon>0$, the existence of maximizers of (2.6) appears to be a more delicate issue. From (2.3), the cost $c$ belongs to the set $L^{1}(\mu\otimes\nu)$ of integrable functions with respect to $\mu\otimes\nu$. Hence, combining [13, Corollary 3.2] with the characterization (2.5) of regularized optimal transport, there exists a pair of functions $(u^{\ast},v^{\ast})$ belonging to $L^{1}(\mu)\times L^{1}(\nu)$ and maximizing (2.4). However, since an integrable function is not necessarily bounded and continuous, this result cannot be used to prove that $(u^{\ast},v^{\ast})$ is a pair of dual variables for the dual problem (2.4) when it is formulated as an optimisation over $\mathcal{C}_{b}(\mathcal{X})\times\mathcal{C}_{b}(\mathcal{Y})$. The main difficulty seems to arise for unbounded costs. As a matter of fact, when the regularization parameter $\varepsilon>0$, the existence of a function $v^{\ast}\in L^{\infty}(\nu)$ maximizing (2.6) is proved in [25, Theorem 7] under the additional assumption that the cost function $c$ is bounded.

In the rest of the paper, we shall now suppose that the cost $c$ is not necessarily bounded, but that, for any $\varepsilon\geq 0$ , there exists $v^{\ast}$ such that $W_{\varepsilon}(\mu,\nu)=H_{\varepsilon}(v^{\ast})$ . This identity is the keystone result which allows us to formulate the problem of estimating $W_{\varepsilon}(\mu,\nu)$ in the setting of stochastic optimization. Indeed, the semi-dual problem (2.6) can be rewritten in expectation form as

$$W_{\varepsilon}(\mu,\nu)=\max_{v\in\mathcal{C}_{b}(\mathcal{Y})}\mathbb{E}[h_{\varepsilon}(X,v)]$$

where $X$ is a random vector drawn from the unknown distribution $\mu$, and for $x\in\mathcal{X}$ and $v\in\mathcal{C}_{b}(\mathcal{Y})$, $h_{\varepsilon}(x,v)=\int_{\mathcal{Y}}v(y)d\nu(y)+v_{c,\varepsilon}(x)-\varepsilon$. In the sequel, we further assume that $\nu$ is a discrete probability measure with finite support $\mathcal{Y}=\{y_{1},\ldots,y_{J}\}$ in the sense that

$$\nu=\sum_{j=1}^{J}\nu_{j}\delta_{y_{j}}\tag{2.8}$$

where the locations $\{y_{1},\ldots,y_{J}\}$ are a known sequence and $\delta$ stands for the standard Dirac measure. The weights $\{\nu_{1},\ldots,\nu_{J}\}$ are a known positive sequence summing to one. By a slight abuse of notation, we identify $\nu$ with the vector of ${\mathbb{R}}^{J}$ with positive entries $(\nu_{1},\ldots,\nu_{J})$. We also denote by ${\mathbf{0}}_{J}$ and ${\mathbf{1}}_{J}$ the column vectors of ${\mathbb{R}}^{J}$ with all coordinates equal to zero and one respectively, and by $\langle\ ,\ \rangle$ the standard inner product in ${\mathbb{R}}^{J}$. Following the terminology of [26], the discrete setting corresponds to the additional assumption that $\mu$ is also a discrete probability measure with finite support, while the semi-discrete setting is the general case where $\mu\in\mathcal{M}_{+}^{1}(\mathcal{X})$ is an arbitrary probability measure that is absolutely continuous with respect to the Lebesgue measure and $\nu$ is a discrete measure with finite support (see e.g. Chapter 5 in [17] for an introduction to semi-discrete optimal transport problems and related references). When the size of the support of the measure $\nu$ is relatively small compared to the size of the support of $\mu$, we recommend using the stochastic approach proposed in this paper. This suggestion is supported by the numerical results reported in Section 5. Now, if $\nu$ is the discrete measure (2.8), the semi-dual problem (2.6) can be reformulated as

$$W_{\varepsilon}(\mu,\nu)=\max_{v\in\mathbb{R}^{J}}H_{\varepsilon}(v),\tag{2.9}$$

where

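$$H_{\varepsilon}(v)=\mathbb{E}[h_{\varepsilon}(X,v)]\tag{2.10}$$

and

$$h_{\varepsilon}(x,v)=\begin{cases}\displaystyle\sum_{j=1}^{J}v_{j}\nu_{j}+\min_{1\leq j\leq J}\bigl\{c(x,y_{j})-v_{j}\bigr\}&\text{if }\varepsilon=0,\\[1ex]\displaystyle\sum_{j=1}^{J}v_{j}\nu_{j}-\varepsilon\log\Bigl(\sum_{j=1}^{J}\exp\Bigl(\frac{v_{j}-c(x,y_{j})}{\varepsilon}\Bigr)\nu_{j}\Bigr)-\varepsilon&\text{if }\varepsilon>0.\end{cases}\tag{2.11}$$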
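For a discrete target measure, the function $h_{\varepsilon}(x,v)$ is cheap to evaluate for a single sample $x$. The Python sketch below is one possible way to compute it, assuming a user-supplied cost function; the names `h_eps` and `sq_cost` and the toy data are hypothetical. The regularized branch uses the usual log-sum-exp shift, which is a numerical precaution added here rather than something prescribed by the text.

```python
import math


def h_eps(x, v, y, nu, eps, cost):
    """Evaluate h_eps(x, v) when nu = sum_j nu[j] * delta_{y[j]}."""
    linear = sum(vj * nj for vj, nj in zip(v, nu))   # sum_j v_j nu_j
    if eps == 0:
        # Unregularized case: add min_j { c(x, y_j) - v_j }.
        return linear + min(cost(x, yj) - vj for yj, vj in zip(y, v))
    # Regularized case: subtract eps * log(sum_j nu_j exp((v_j - c_j) / eps)) + eps.
    z = [(vj - cost(x, yj)) / eps for yj, vj in zip(y, v)]
    m = max(z)   # shift so that every exponent is <= 0
    lse = m + math.log(sum(nj * math.exp(zj - m) for zj, nj in zip(z, nu)))
    return linear - eps * lse - eps


def sq_cost(x, y):
    """Hypothetical one-dimensional quadratic cost."""
    return (x - y) ** 2


y = [0.0, 1.0]
nu = [0.5, 0.5]
v = [0.0, 0.0]
shifted = [vj + 3.0 for vj in v]   # same potential translated by a constant
```

For $\varepsilon>0$, evaluating `h_eps` at `v` and at `shifted` gives the same value up to rounding, which reflects the invariance of $h_{\varepsilon}(x,\cdot)$ under translations along ${\mathbf{1}}_{J}$.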

3. Asymptotic properties of stochastic algorithms for regularized optimal transport

Throughout this section, we assume that $\varepsilon>0$ .

3.1. The stochastic Robbins-Monro algorithms

Our goal is to estimate the Sinkhorn divergence $W_{\varepsilon}(\mu,\nu)$ using a stochastic Robbins-Monro algorithm [42]. For any $v\in{\mathbb{R}}^{J}$ , the function $H_{\varepsilon}(v)$ , given by (2.10), is the mean value of $h_{\varepsilon}(X,v)$ where $X$ is a random vector drawn from the unknown distribution $\mu$ . For $\varepsilon>0$ , the function $h_{\varepsilon}$ , defined by (2.11), is twice differentiable in the second variable. The gradient vector as well as the Hessian matrix of $h_{\varepsilon}$ can be easily calculated. More precisely, we have for any $x\in\mathcal{X}$ ,

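$$\nabla_{v}h_{\varepsilon}(x,v)=\nu-\pi(x,v),\tag{3.1}$$

$$\nabla^{2}_{v}h_{\varepsilon}(x,v)=\frac{1}{\varepsilon}\Bigl(\pi(x,v)\pi(x,v)^{T}-\operatorname{diag}(\pi(x,v))\Bigr),\tag{3.2}$$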

where the $j^{th}$ component of the vector $\pi(x,v)\in{\mathbb{R}}^{J}$ is such that

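$$\pi_{j}(x,v)=\Bigl(\sum_{k=1}^{J}\nu_{k}\exp\Bigl(\frac{v_{k}-c(x,y_{k})}{\varepsilon}\Bigr)\Bigr)^{-1}\nu_{j}\exp\Bigl(\frac{v_{j}-c(x,y_{j})}{\varepsilon}\Bigr).\tag{3.3}$$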

Consequently, it follows from (2.10), (3.1) and (3.2) that the gradient vector and the Hessian matrix of $H_{\varepsilon}$ are given by

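$$\nabla H_{\varepsilon}(v)=\mathbb{E}[\nabla_{v}h_{\varepsilon}(X,v)]=\nu-\mathbb{E}[\pi(X,v)],\tag{3.4}$$

$$\nabla^{2}H_{\varepsilon}(v)=\mathbb{E}[\nabla_{v}^{2}h_{\varepsilon}(X,v)]=\frac{1}{\varepsilon}\mathbb{E}\bigl[\pi(X,v)\pi(X,v)^{T}-\operatorname{diag}(\pi(X,v))\bigr].\tag{3.5}$$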

One can observe that for any $v\in{\mathbb{R}}^{J}$ , $\nabla^{2}_{v}H_{\varepsilon}(v)$ is a negative semi-definite matrix. Therefore, if $v^{\ast}$ is a maximizer of (2.9), we have $\nabla H_{\varepsilon}(v^{\ast})=0$ and for all $v\in{\mathbb{R}}^{J}$ , $\langle v-v^{\ast},\nabla H_{\varepsilon}(v)\rangle\leq 0$ . It leads us to estimate the vector $v^{\ast}$ by the Robbins-Monro algorithm [42] given, for all $n\geq 0$ , by

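$$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\nabla_{v}h_{\varepsilon}(X_{n+1},\widehat{V}_{n})\tag{3.6}$$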

where the initial value $\widehat{V}_{0}$ is a square integrable random vector which can be arbitrarily chosen and $(\gamma_{n})$ is a positive sequence of real numbers decreasing towards zero satisfying

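$$\sum_{n=1}^{\infty}\gamma_{n}=+\infty\quad\text{and}\quad\sum_{n=1}^{\infty}\gamma_{n}^{2}<+\infty.\tag{3.7}$$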

Two main issues arise with this Robbins-Monro algorithm. First of all, we clearly have from (3.5) that for any $v\in{\mathbb{R}}^{J}$ , $\nabla^{2}_{v}H_{\varepsilon}(v){\mathbf{1}}_{J}={\mathbf{0}}_{J}$ which implies that zero is an eigenvalue of the Hessian matrix $\nabla^{2}_{v}H_{\varepsilon}(v)$ associated with the eigenvector $\boldsymbol{v}_{J}=\frac{1}{\sqrt{J}}\mathbf{1}_{J}.$ Next, it follows from [15] that the maximizer $v^{\ast}$ of (2.9) is unique up to a scalar translation of the form $v^{\ast}-t\boldsymbol{v}_{J}$ for any $t\in{\mathbb{R}}$ . Throughout the paper, we shall denote by $v^{\ast}$ the maximizer of (2.9) satisfying $\langle v^{\ast},\boldsymbol{v}_{J}\rangle=0$ which means that $v^{\ast}$ belongs to $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ where $\langle\boldsymbol{v}_{J}\rangle$ is the one-dimensional subspace of ${\mathbb{R}}^{J}$ spanned by $\boldsymbol{v}_{J}$ . Therefore, to obtain a consistent estimator of $v^{\ast}$ it is necessary to slightly modify the Robbins-Monro algorithm (3.6).
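The degeneracy along ${\mathbf{1}}_{J}$ is easy to observe numerically. The following Python sketch computes $\pi(x,v)$, the gradient $\nu-\pi(x,v)$ and the Hessian $\frac{1}{\varepsilon}(\pi\pi^{T}-\operatorname{diag}(\pi))$ for a single sample; the function names and the toy values of `y`, `nu`, `v`, `x` and `eps` are hypothetical choices made for illustration.

```python
import math


def pi_vec(x, v, y, nu, eps, cost):
    """Vector pi(x, v): nu_j * exp((v_j - c(x, y_j)) / eps), normalized to sum to one."""
    z = [(vj - cost(x, yj)) / eps for yj, vj in zip(y, v)]
    m = max(z)   # shift for numerical stability
    w = [nj * math.exp(zj - m) for zj, nj in zip(z, nu)]
    s = sum(w)
    return [wj / s for wj in w]


def grad_h(x, v, y, nu, eps, cost):
    """Gradient of h_eps(x, .) at v, namely nu - pi(x, v)."""
    p = pi_vec(x, v, y, nu, eps, cost)
    return [nj - pj for nj, pj in zip(nu, p)]


def hess_h(x, v, y, nu, eps, cost):
    """Hessian of h_eps(x, .) at v, namely (pi pi^T - diag(pi)) / eps."""
    p = pi_vec(x, v, y, nu, eps, cost)
    size = len(p)
    return [[(p[i] * p[j] - (p[i] if i == j else 0.0)) / eps
             for j in range(size)] for i in range(size)]


def sq_cost(x, y):
    """Hypothetical one-dimensional quadratic cost."""
    return (x - y) ** 2


y, nu, v, x, eps = [0.0, 1.0, 2.0], [0.2, 0.3, 0.5], [0.1, -0.2, 0.1], 0.7, 0.5
p = pi_vec(x, v, y, nu, eps, sq_cost)
g = grad_h(x, v, y, nu, eps, sq_cost)
hess = hess_h(x, v, y, nu, eps, sq_cost)
row_sums = [sum(row) for row in hess]   # entries of hess @ 1_J
```

Since $\pi(x,v)$ and $\nu$ both sum to one, the entries of `g` sum to zero and every entry of `row_sums` vanishes up to rounding, which is the pointwise version of $\nabla^{2}H_{\varepsilon}(v){\mathbf{1}}_{J}={\mathbf{0}}_{J}$.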

Algorithm 1

A first strategy is as follows. It is easy to see from the expression (3.1) that the gradient $\nabla_{v}h_{\varepsilon}(x,v)$ is a vector of ${\mathbb{R}}^{J}$ belonging to the linear space $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ for any vectors $x\in\mathcal{X}$ and $v\in{\mathbb{R}}^{J}$ . Hence, if the initial value $\widehat{V}_{0}$ belongs to $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ , one has immediately that $(\widehat{V}_{n})$ is a stochastic sequence with values in the subspace $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ . The analysis of its convergence to $v^{\ast}$ can thus be done by considering the restriction of the function $v\mapsto h_{\varepsilon}(x,v)$ to the linear subspace $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ .
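A minimal Python sketch of Algorithm 1, combined with the recursive estimator $\widehat{W}_{n}$, is given below. It is an illustration under stated assumptions rather than the authors' implementation: the step is taken as $\gamma_{n}=\gamma/n^{c}$ with $c=0.75$, the source measure is uniform on $[0,1]$, the target has two atoms, and the names `softmax_weights` and `robbins_monro` are hypothetical.

```python
import math
import random


def softmax_weights(x, v, y, nu, eps, cost):
    """Return pi(x, v) and log(sum_j nu_j * exp((v_j - c(x, y_j)) / eps))."""
    z = [(vj - cost(x, yj)) / eps for yj, vj in zip(y, v)]
    m = max(z)
    w = [nj * math.exp(zj - m) for zj, nj in zip(z, nu)]
    s = sum(w)
    return [wj / s for wj in w], m + math.log(s)


def robbins_monro(sample, y, nu, eps, cost, n_iter, gamma=1.0, c=0.75):
    """Run V_n = V_{n-1} + gamma_n * (nu - pi(X_n, V_{n-1})) and average h_eps."""
    v = [0.0] * len(nu)   # V_0 = 0 lies in the hyperplane orthogonal to 1_J
    w_hat = 0.0
    for n in range(1, n_iter + 1):
        x = sample()
        p, lse = softmax_weights(x, v, y, nu, eps, cost)
        # h_eps(X_n, V_{n-1}), evaluated before the update of V.
        h = sum(vj * nj for vj, nj in zip(v, nu)) - eps * lse - eps
        w_hat += (h - w_hat) / n   # running mean, equal to W_n
        step = gamma / n ** c
        v = [vj + step * (nj - pj) for vj, nj, pj in zip(v, nu, p)]
    return v, w_hat


def sq_cost(x, y):
    """Hypothetical one-dimensional quadratic cost."""
    return (x - y) ** 2


rng = random.Random(0)   # fixed seed so the run is reproducible
v_hat, w_hat = robbins_monro(rng.random, [0.25, 0.75], [0.5, 0.5],
                             eps=0.1, cost=sq_cost, n_iter=5000)
```

Because every stochastic gradient sums to zero, the coordinates of `v_hat` keep summing to zero up to rounding, so the iterates never leave $\langle\boldsymbol{v}_{J}\rangle^{\perp}$. In this symmetric toy problem the maximizer in that subspace is the zero vector, and `v_hat` should end up close to it.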

Algorithm 2

A second strategy is to estimate $v^{\ast}$ by the modified stochastic Robbins-Monro algorithm given, for all $n\geq 0$ , by

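$$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\bigl(\nabla_{v}h_{\varepsilon}(X_{n+1},\widehat{V}_{n})-\alpha\langle\widehat{V}_{n},\boldsymbol{v}_{J}\rangle\boldsymbol{v}_{J}\bigr)$$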

where $\widehat{V}_{0}$ is a square integrable random vector which can be arbitrarily chosen, the sequence $(\gamma_{n})$ satisfies (3.7), and where $\alpha$ is a typically small positive parameter. The role played by $\alpha$ is to overcome the fact that the Hessian matrix $\nabla^{2}_{v}H_{\varepsilon}(v)$ is singular. One can observe that if $\widehat{V}_{0}\in\langle\boldsymbol{v}_{J}\rangle^{\perp}$ then $\langle\widehat{V}_{n},\boldsymbol{v}_{J}\rangle=0$ for all $n\geq 0$ , and thus Algorithm 2 is equivalent to Algorithm 1. By a slight abuse of notation, we shall also refer to Algorithm 1 as the case where $\alpha=0$ and we refer to Algorithm 2 as the case where $\alpha>0$ , although it is clear that $\widehat{V}_{n}$ depends on $\alpha$ for Algorithm 2 when $\widehat{V}_{0}\notin\langle\boldsymbol{v}_{J}\rangle^{\perp}$ . One may also remark that Algorithm 2 corresponds to a stochastic ascent algorithm to compute a maximizer over ${\mathbb{R}}^{J}$ of the strictly concave function

$$H_{\varepsilon,\alpha}(v)=\mathbb{E}[h_{\varepsilon,\alpha}(X,v)]$$

where $h_{\varepsilon,\alpha}(x,v)=h_{\varepsilon}(x,v)-\frac{\alpha}{2}\bigl(\langle v,\boldsymbol{v}_{J}\rangle\bigr)^{2}$. An important role in the choice of $\alpha$ will be the control of the smallest eigenvalue of the Hessian matrix $\nabla^{2}_{v}H_{\varepsilon,\alpha}(v^{\ast})$. In the case $\alpha=0$, the objective function $H_{\varepsilon}(v)$ has a bounded gradient. As a matter of fact, since $\|\nu\|\leq 1$ and $\|\pi(x,v)\|\leq 1$, it follows that for all $x\in\mathcal{X}$ and $v\in{\mathbb{R}}^{J}$,

[TABLE]

which ensures that $\|\nabla H_{\varepsilon}(v)\|\leq 2$ . In the case $\alpha>0$ , we also have

[TABLE]

which implies that $\|\nabla H_{\varepsilon,\alpha}(v)\|\leq 2+\alpha\|v\|$ . The gradient of the objective function $H_{\varepsilon,\alpha}(v)$ is clearly not bounded.
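The effect of the penalty term in Algorithm 2 can be illustrated with a short Python sketch. Since the stochastic gradient is orthogonal to $\boldsymbol{v}_{J}$, the component $\langle\widehat{V}_{n},\boldsymbol{v}_{J}\rangle$ is multiplied by $1-\alpha\gamma_{n+1}$ at each step, whatever the samples are. The function names, the Gaussian source, and the values of `alpha`, `gamma` and `eps` below are hypothetical choices for illustration.

```python
import math
import random


def pi_vec(x, v, y, nu, eps, cost):
    """Vector pi(x, v): nu_j * exp((v_j - c(x, y_j)) / eps), normalized to sum to one."""
    z = [(vj - cost(x, yj)) / eps for yj, vj in zip(y, v)]
    m = max(z)
    w = [nj * math.exp(zj - m) for zj, nj in zip(z, nu)]
    s = sum(w)
    return [wj / s for wj in w]


def algorithm2_step(v, x, step, alpha, y, nu, eps, cost):
    """One update V + step * (nu - pi(x, V) - alpha * <V, v_J> v_J)."""
    p = pi_vec(x, v, y, nu, eps, cost)
    # With v_J = 1_J / sqrt(J), the vector <v, v_J> v_J has all entries equal to mean(v).
    mean_v = sum(v) / len(v)
    return [vj + step * (nj - pj - alpha * mean_v) for vj, nj, pj in zip(v, nu, p)]


def sq_cost(x, y):
    """Hypothetical one-dimensional quadratic cost."""
    return (x - y) ** 2


rng = random.Random(1)
y, nu, eps, alpha, gamma = [0.0, 1.0, 2.0], [0.2, 0.3, 0.5], 0.5, 0.5, 1.0
v = [1.0, 1.0, 1.0]                    # deliberately outside the hyperplane
proj0 = sum(v) / math.sqrt(len(v))     # <V_0, v_J>
expected = proj0
for n in range(1, 201):
    step = gamma / n
    v = algorithm2_step(v, rng.gauss(1.0, 1.0), step, alpha, y, nu, eps, sq_cost)
    expected *= 1.0 - alpha * step     # deterministic contraction along v_J
proj = sum(v) / math.sqrt(len(v))      # <V_200, v_J>
```

Here `proj` agrees with `expected` up to rounding and is smaller than `proj0` in absolute value: the penalty drives the iterates back towards $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ even when $\widehat{V}_{0}$ starts outside it.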

3.2. Almost sure convergence and asymptotic normality

Our first result concerns the almost sure convergence of the Robbins-Monro algorithms.

Theorem 3.1.

For both algorithms, we have the almost sure convergence

[TABLE]

We now focus our attention on the asymptotic normality of the Robbins-Monro algorithms. For any $v\in{\mathbb{R}}^{J}$ , let $\Gamma_{\varepsilon}(v)$ be the positive semidefinite matrix given by

[TABLE]

One can observe that for $v=v^{\ast}$ , $\Gamma_{\varepsilon}(v^{\ast})={\mathbb{E}}\bigl{[}\pi(X,v^{\ast})\pi(X,v^{\ast})^{T}\bigr{]}-\nu\nu^{T},$ since $\nabla H_{\varepsilon}(v^{\ast})=0$ implies that $\nu={\mathbb{E}}[\pi(X,v^{\ast})]$ . For any $v\in{\mathbb{R}}^{J}$ , denote

[TABLE]

We shall see in Lemma A.1 that for any $v\in{\mathbb{R}}^{J}$ , the matrix $A_{\varepsilon}(v)$ is negative semi-definite with $\operatorname{rank}(A_{\varepsilon}(v))=J-1$ . It means that the second smallest eigenvalue of the matrix $-A_{\varepsilon}(v)$ is always positive. By a slight abuse of notation, we shall denote by $\rho_{A_{\varepsilon}}(v)$ the second smallest eigenvalue of the matrix $-A_{\varepsilon}(v)$ . Moreover, let

[TABLE]

It follows from Remark A.1 that for any $v\in{\mathbb{R}}^{J}$, the matrix $A_{\varepsilon,\alpha}(v)$ is negative definite. We shall also denote $\rho_{A_{\varepsilon,\alpha}}(v)=-\lambda_{\max}(A_{\varepsilon,\alpha}(v))$ where $\lambda_{\max}(A_{\varepsilon,\alpha}(v))$ stands for the maximum eigenvalue of the matrix $A_{\varepsilon,\alpha}(v)$. It is not hard to see that $\rho_{A_{\varepsilon,\alpha}}(v)=\min(\rho_{A_{\varepsilon}}(v),\alpha)$. Hereafter, in order to unify the notation, we put

[TABLE]

Theorem 3.2.

Assume that the step $\gamma_{n}=\gamma/n$ where

[TABLE]

Then, for both algorithms, we have the asymptotic normality

[TABLE]

where the asymptotic covariance matrix $\Sigma^{\ast}$ is the unique solution of Lyapunov's equation with $\zeta=1/(2\gamma)$

[TABLE]

Remark 3.1.

Theorem 3.2 is also true if $\gamma_{n}=\gamma/n^{c}$ with $\gamma>0$ and $1/2<c<1$ , see Pelletier [37], Theorem 1. To be more precise, we have the asymptotic normality

[TABLE]

The convergence rate $n^{c}$ is clearly always slower than $n$, which means that the choice $\gamma_{n}=\gamma/n$ outperforms the choice $\gamma_{n}=\gamma/n^{c}$ in terms of convergence rate. However, in the special case $\gamma_{n}=\gamma/n^{c}$, the restriction (3.16) involving the knowledge of $\rho^{\ast}$ is no longer needed.
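Both step choices satisfy the summability conditions (3.7), which can be illustrated numerically. The Python sketch below computes partial sums for $\gamma_{n}=\gamma/n^{c}$ with $c=0.75$; the helper `step_size` and the horizon `N` are hypothetical.

```python
def step_size(n, gamma=1.0, c=1.0):
    """Step gamma_n = gamma / n**c: c = 1 gives gamma/n, 1/2 < c < 1 a slower decay."""
    return gamma / n ** c


N = 10_000
# Partial sums for c = 0.75: the first grows without bound, the second stays bounded.
sum_steps = sum(step_size(n, c=0.75) for n in range(1, N + 1))
sum_squares = sum(step_size(n, c=0.75) ** 2 for n in range(1, N + 1))
```

By comparison with an integral, `sum_steps` grows like $4N^{1/4}$, while `sum_squares` remains below $\sum_{n\geq 1}n^{-3/2}\approx 2.612$ for every horizon.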

Some refinements on the asymptotic behavior of the Robbins-Monro algorithms are as follows.

Theorem 3.3.

Assume that the step $\gamma_{n}=\gamma/n$ where $\gamma>0$ satisfies (3.16). Then, for both algorithms, we have the quadratic strong law

[TABLE]

where $\Sigma^{\ast}$ is given by (3.18). Moreover, for any eigenvector $v\in{\mathbb{R}}^{J}$ of the Hessian matrix $A_{\varepsilon,\alpha}(v^{\ast})$ , we have the law of the iterated logarithm

[TABLE]

In particular,

[TABLE]

where $P$ is the matrix whose columns are the eigenvectors of $A_{\varepsilon,\alpha}(v^{\ast})$ .

Remark 3.2.

Theorem 3.3 also holds in the special case where $\gamma_{n}=\gamma/n^{c}$ with $\gamma>0$ and $1/2<c<1$ , see Pelletier [36] Theorems 1 and 3. For example, the quadratic strong law (3.19) has to be replaced by

[TABLE]

In the special case $\gamma_{n}=\gamma/n^{c}$ , Theorem 3.3 is true without condition (3.16).

The explicit calculation of the asymptotic covariance matrix $\Sigma^{\ast}$ in (3.17) is far from being simple since there is no closed-form solution of equation (3.18). To overcome this issue, one may use the averaged Robbins-Monro algorithm given by $\overline{V}_{n}=\frac{1}{n}\sum_{k=1}^{n}\widehat{V}_{k}$ , which satisfies the second-order recurrence equation

[TABLE]

where the random vector $Y_{n+1}$ is given by

[TABLE]
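The definition of $\overline{V}_{n}$ as a running mean can be updated online, without storing the past iterates. The following Python sketch illustrates only the elementary identity $\overline{V}_{n}=\overline{V}_{n-1}+\frac{1}{n}(\widehat{V}_{n}-\overline{V}_{n-1})$ ; it does not reproduce the second-order recurrence stated above, and the function name `running_average` is a hypothetical choice made for illustration.

```python
def running_average(iterates):
    """Yield the averaged iterates V_bar_n = (1/n) * sum_{k<=n} V_k, one per input."""
    v_bar = None
    for n, v in enumerate(iterates, start=1):
        if v_bar is None:
            # First iterate: the average is the iterate itself.
            v_bar = list(v)
        else:
            # Online update: V_bar_n = V_bar_{n-1} + (V_n - V_bar_{n-1}) / n.
            v_bar = [b + (x - b) / n for b, x in zip(v_bar, v)]
        yield list(v_bar)
```

Each update costs a number of operations proportional to $J$ , so averaging does not change the order of the per-iteration cost of the underlying Robbins-Monro recursion.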

Our next result is devoted to the asymptotic behavior of the averaged Robbins-Monro algorithms.

Theorem 3.4.

For both algorithms, we have the almost sure convergence

[TABLE]

Moreover, assume that the step $\gamma_{n}=\gamma/n^{c}$ where $\gamma>0$ and $1/2<c<1$ . Then, we have the asymptotic normality

[TABLE]

In particular, if the sequence $(\overline{V}_{n})$ is associated with Algorithm 1, convergence (3.25) can be rewritten as

[TABLE]

where $A_{\varepsilon}^{{\dagger}}(v^{\ast})$ stands for the Moore-Penrose inverse of $A_{\varepsilon}(v^{\ast})$ .

Remark 3.3.

We already saw that the Hessian matrix $A_{\varepsilon}(v^{\ast})$ is negative semi-definite with $\operatorname{rank}(A_{\varepsilon}(v^{\ast}))=J-1$ , which implies that its Moore-Penrose inverse is given by $A_{\varepsilon}^{{\dagger}}(v^{\ast})=\sum_{j=1}^{J-1}\frac{1}{\lambda_{j}}v_{j}v_{j}^{T}$ where $\lambda_{1},\ldots,\lambda_{J-1}$ are the negative eigenvalues of $A_{\varepsilon}(v^{\ast})$ and $v_{1},\ldots,v_{J-1}$ are the associated orthonormal eigenvectors. Moreover,

[TABLE]

3.3. Estimation of the Sinkhorn divergence

Hereafter, we focus our attention on the estimation of the Sinkhorn divergence $W_{\varepsilon}(\mu,\nu)$ . For that purpose, a natural recursive estimator of $W_{\varepsilon}(\mu,\nu)$ is given by

[TABLE]

Our first main result concerns the asymptotic behavior of the Sinkhorn divergence estimator $\widehat{W}_{n}$ .

Theorem 3.5.

Assume that the cost function $c$ satisfies for any $y\in\mathcal{Y}$ ,

[TABLE]

Then, for both algorithms, we have the almost sure convergence

[TABLE]

Moreover, assume that the step $\gamma_{n}=\gamma/n$ where $\gamma>0$ satisfies (3.16), or that $\gamma_{n}=\gamma/n^{c}$ where $\gamma>0$ and $1/2<c<1$ . Then, for both algorithms, we have the asymptotic normality

[TABLE]

where the asymptotic variance $\sigma^{2}_{\varepsilon}(\mu,\nu)={\mathbb{E}}\bigl{[}h_{\varepsilon}^{2}(X,v^{\ast})\bigr{]}-W_{\varepsilon}^{2}(\mu,\nu)$ .

Remark 3.4.

The asymptotic variance $\sigma^{2}_{\varepsilon}(\mu,\nu)$ can be estimated by

[TABLE]

Along the same lines as in the proof of (3.30), we can show that $\widehat{\sigma}^{\,2}_{n}\rightarrow\sigma^{2}_{\varepsilon}(\mu,\nu)$ a.s. Therefore, using Slutsky’s Theorem, it follows from (3.31) that

[TABLE]

Convergence (3.33) allows us to construct confidence intervals for the Sinkhorn divergence $W_{\epsilon}(\mu,\nu)$ as illustrated in the numerical experiments of Section 5.
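To make the construction of these confidence intervals concrete, here is a minimal Python sketch on a one-dimensional toy problem with cost $c(x,y)=|x-y|$ . It rests on several assumptions that are not restated in this section: that $h_{\varepsilon}(x,v)=\sum_{j}v_{j}\nu_{j}-\varepsilon\log\bigl(\sum_{j}\nu_{j}\exp((v_{j}-c(x,y_{j}))/\varepsilon)\bigr)-\varepsilon$ , that $\nabla_{v}h_{\varepsilon}(x,v)=\nu-\pi(x,v)$ , that $\widehat{W}_{n}=\frac{1}{n}\sum_{k=1}^{n}h_{\varepsilon}(X_{k},\widehat{V}_{k-1})$ , and that $\widehat{\sigma}^{\,2}_{n}$ is the corresponding empirical second moment minus $\widehat{W}_{n}^{2}$ . The projected ascent step stands in for Algorithm 1 and may differ from it in details, and the names `h_and_pi`, `recursive_estimates` and `demo` are hypothetical.

```python
import math
import random


def h_and_pi(x, v, ys, nu, eps):
    """Return h_eps(x, v) and pi(x, v) for c(x, y) = |x - y| (assumed forms)."""
    a = [(vj - abs(x - yj)) / eps for vj, yj in zip(v, ys)]
    m = max(a)  # shift by the maximum for a numerically stable log-sum-exp
    w = [nj * math.exp(aj - m) for nj, aj in zip(nu, a)]
    s = sum(w)
    pi = [wj / s for wj in w]
    h = sum(vj * nj for vj, nj in zip(v, nu)) - eps * (m + math.log(s)) - eps
    return h, pi


def recursive_estimates(sample, ys, nu, eps, n_iter, c=0.51):
    """Projected stochastic ascent; returns (V_n, W_n, sigma2_n, 95% interval)."""
    J = len(ys)
    gamma = eps / (2.0 * min(nu))  # step calibration reported in Section 5
    v = [0.0] * J
    sum_h = 0.0
    sum_h2 = 0.0
    for n in range(1, n_iter + 1):
        x = sample()
        h, pi = h_and_pi(x, v, ys, nu, eps)  # evaluated at the previous iterate
        sum_h += h
        sum_h2 += h * h
        step = gamma / n ** c
        v = [vj + step * (nj - pj) for vj, nj, pj in zip(v, nu, pi)]
        mean_v = sum(v) / J
        v = [vj - mean_v for vj in v]  # projection onto the orthogonal of the ones vector
    w_hat = sum_h / n_iter
    sigma2_hat = max(sum_h2 / n_iter - w_hat ** 2, 0.0)  # clipped against round-off
    half = 1.96 * math.sqrt(sigma2_hat / n_iter)
    return v, w_hat, sigma2_hat, (w_hat - half, w_hat + half)


def demo(seed=0, n_iter=2000):
    """Toy run: mu uniform on [0, 1], nu uniform on three points."""
    rng = random.Random(seed)
    return recursive_estimates(rng.random, [0.1, 0.5, 0.9], [1.0 / 3.0] * 3, 0.1, n_iter)
```

Only the two running sums and the current iterate are stored, so the interval can be refreshed at every iteration at a cost proportional to $J$ .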

Our second main result is devoted to the expected excess risk of the Sinkhorn divergence estimator $\widehat{W}_{n}$ . It follows from (3.28) that

[TABLE]

Hence, the expected excess risk of $\widehat{W}_{n}$ is defined as the non-negative quantity

[TABLE]

It is well known that only assuming concavity of the objective function leads to convergence rates for the expected excess risk $\widehat{R}_{n}$ of the order $1/\sqrt{n}$ for the Robbins-Monro algorithm. This rate of convergence cannot be improved without supplementary assumptions such as the strong concavity of the objective function $H_{\varepsilon}$ , which leads to faster rates of convergence of the order $1/n^{d}$ for some $1/2<d\leq 1$ which depends on the decay of the step $\gamma_{n}=\gamma/n^{c}$ where $1/2<c<1$ . We refer the reader to [6], [7], [28] for a recent overview on the convergence rates of first-order stochastic algorithms.

However, as it was already shown in [26], the objective function $H_{\varepsilon}$ in the semi-dual problem (2.6) cannot be strongly concave, even by restricting the maximization to the subset $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ . Indeed, the gradient $v\mapsto\nabla H_{\varepsilon}(v)$ being bounded on ${\mathbb{R}}^{J}$ , it follows that $H_{\varepsilon}$ is a Lipschitz function, and thus it cannot be strongly concave on $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ , see e.g. Section 3.2 in [7]. Nevertheless, for the stochastic optimization problem (2.6), it is possible to derive rates of convergence faster than $1/\sqrt{n}$ for the expected excess risk $\widehat{R}_{n}$ . To this end, we borrow some ideas related to the so-called notion of generalized self-concordance coming from the recent contribution of Bach [6] and leading to faster rates of convergence for stochastic algorithms with non-strongly concave objective functions.

Theorem 3.6.

Assume that $\widehat{V}_{0}$ is a random vector such that, for any integer $p\geq 1$ , ${\mathbb{E}}[\|\widehat{V}_{0}\|^{p}]$ is finite. Moreover, assume that the step $\gamma_{n}=\gamma/n^{c}$ where $\gamma>0$ and $2/3<c<1$ . In addition, suppose that $0<\varepsilon\leq 1$ and

[TABLE]

Then, there exists a positive constant $C$ such that for any $n\geq 1$ ,

[TABLE]

Remark 3.5.

It is easy to see that the assumption $\varepsilon\leq 1$ implies that $\theta_{\varepsilon}<1/2$ . Consequently, the condition (3.35) is not really restrictive and it is fulfilled by a suitable choice of $\gamma$ depending on $\rho^{\ast}$ . By inequality (A.4) and Remark A.2, one has the following bounds $\min_{1\leq j\leq J}\nu_{j}\leq\varepsilon\rho^{\ast}\leq 1,$ which are used to choose the parameter $\gamma$ in the numerical experiments carried out in Section 5. Finally, it follows from (3.34) together with inequality (3.36) that

[TABLE]

Consequently, if $c>3/4$ , inequality (3.37) shows that the expected excess risk $\widehat{R}_{n}$ may converge to zero faster than $1/\sqrt{n}$ when the sequence $(\widehat{V}_{n})$ is given by Algorithm 1.

4. Further results on the unregularized case

We now focus our attention on the unregularized case where $\varepsilon=0$ . The function $h_{0}$ , defined by (2.11), is not differentiable. Nevertheless, as remarked in [26], it follows from (2.11) that for any $x\in\mathcal{X}$ , a supergradient $\partial_{v}h_{0}(x,v)$ of $h_{0}(x,\cdot)$ at $v$ is

[TABLE]

where $\pi^{0}(x,v)$ denotes the vector of ${\mathbb{R}}^{J}$ with all entries equal to zero except the $j^{*}$ -th which is equal to one, that is $\pi^{0}_{j}(x,v)=\mathbf{1}_{\{j=j^{*}\}}$ , where

[TABLE]
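As an illustration, a supergradient of this form can be computed in a few lines. The Python sketch below assumes that $h_{0}(x,v)=\sum_{j}v_{j}\nu_{j}+\min_{j}\{c(x,y_{j})-v_{j}\}$ and that $j^{*}$ is a minimizer of $j\mapsto c(x,y_{j})-v_{j}$ , so that the supergradient is $\nu-\pi^{0}(x,v)$ ; these forms and the function name `h0_and_supergradient` are assumptions made for the example.

```python
def h0_and_supergradient(x, v, ys, nu, cost):
    """Return (h_0(x, v), one supergradient nu - pi0(x, v), j_star)."""
    gaps = [cost(x, yj) - vj for yj, vj in zip(ys, v)]
    # min() keeps the first minimizer, which is one admissible choice on ties.
    j_star = min(range(len(ys)), key=gaps.__getitem__)
    h0 = sum(vj * nj for vj, nj in zip(v, nu)) + gaps[j_star]
    grad = [nj - (1.0 if j == j_star else 0.0) for j, nj in enumerate(nu)]
    return h0, grad, j_star
```

Under these assumptions, the returned vector `grad` satisfies the supergradient inequality $h_{0}(x,w)\leq h_{0}(x,v)+\langle \mathrm{grad},w-v\rangle$ for every $w\in{\mathbb{R}}^{J}$ , whichever minimizer is selected.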

One can note that for $x\in\mathcal{X}$ and $v\in{\mathbb{R}}^{J}$ , the integer $j^{*}$ may not be unique. In this case, the set of supergradients $\partial_{v}h_{0}(x,v)$ is not a singleton. In contrast with the regularized case where $\varepsilon>0$ , the function $H_{0}$ does not necessarily have a unique maximizer $v^{\ast}$ over $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ . To estimate such a maximizer $v^{\ast}$ , which is assumed to belong to $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ , we shall consider a standard stochastic supergradient ascent given, for all $n\geq 0$ , by

[TABLE]

where the sequence $(\gamma_{n})$ satisfies (3.7), $\alpha$ is a typically small positive parameter and $\partial_{v}h_{0}(X_{n+1},\widehat{V}_{n}^{\,0})$ is any supergradient of $h_{0}(X_{n+1},\cdot)$ at $\widehat{V}_{n}^{\,0}$ of the form (4.1). Therefore, a recursive estimator of $W_{0}(\mu,\nu)$ is given by

[TABLE]

In order to investigate the asymptotic properties of the random sequences $(\widehat{V}_{n}^{\,0})$ and $(\widehat{W}_{n}^{\,0})$ , two issues need to be addressed: the regularity of the function $H_{0}$ and the uniqueness of its maximizer over $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ . In the discrete setting, the function $H_{0}$ is clearly not differentiable. However, in the semi-discrete setting where $\mu$ is absolutely continuous, the differentiability of $H_{0}$ has been proved in [31] under appropriate regularity assumptions on the cost function and the measure $\mu$ . More precisely, by [31, Theorem 2.1], we obtain the following conditions ensuring the $\mathcal{C}^{1}$ -smoothness of $H_{0}$ .

Proposition 4.1.

Assume that $\mathcal{X}={\mathbb{R}}^{d}$ . Moreover, suppose that

(i) For all $1\leq j\leq J$ , the function $x\mapsto c(x,y_{j})$ is continuous.

(ii) The measure $\mu$ is absolutely continuous with bounded and compactly supported probability density function.

(iii) For any $j\neq k$ and for all $t\in{\mathbb{R}}$ , the set $\{x\in{\mathbb{R}}^{d},c(x,y_{j})-c(x,y_{k})=t\}$ has zero Lebesgue measure.

Then, the function $H_{0}$ is continuously differentiable on ${\mathbb{R}}^{J}$ ,

[TABLE]

where $\partial_{v}h_{0}(x,v)$ is any supergradient of $h_{0}(x,\cdot)$ at $v$ of the form (4.1).

Guaranteeing the uniqueness of the maximizer $v^{\ast}$ of $H_{0}$ over $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ is more involved. In particular, in the semi-discrete setting, we are not aware of any sufficient conditions ensuring such a property. Nevertheless, if one assumes the uniqueness of $v^{\ast}$ up to scalar translations, then under the assumptions of Proposition 4.1, it follows that $\nabla H_{0}$ is continuous. Therefore, arguing as in the proofs of Theorem 3.1 and Theorem 3.5, one obtains under the assumptions of Proposition 4.1 that

[TABLE]

and

[TABLE]

Furthermore, under additional assumptions it follows from [34, Theorem 6] that $H_{0}$ is twice continuously differentiable. Nevertheless, in contrast to the regularized case, the second smallest eigenvalue of $A_{0}(v^{\ast})=-\nabla^{2}H_{0}(v^{\ast})$ may be zero. Therefore, the proof of the asymptotic normality for $(\widehat{V}_{n}^{\,0})$ and $(\widehat{W}_{n}^{\,0})$ is much trickier and is left open for future work.

5. Statistical applications and numerical experiments

In this section, we report results on numerical experiments for probability measures $\mu$ and $\nu$ with supports included in ${\mathbb{R}}^{d}$ for $d\geq 2$ using synthetic and real data sets. All numerical experiments are carried out with the statistical computing environment R [39], and they are based on iid samples from $\mu$ . We mainly investigate the numerical behavior of the two recursive estimators $\widehat{W}_{n}$ and $\widehat{\sigma}^{2}_{n}$ . The reader has to keep in mind that the estimators $\widehat{W}_{n}$ and $\widehat{\sigma}^{2}_{n}$ depend on the positive value of the regularization parameter $\varepsilon$ as well as on the positive value of $\alpha$ and the statistical characteristics of $\mu$ and $\nu$ . However, for the sake of simplicity, we have chosen to denote them as $\widehat{W}_{n}$ and $\widehat{\sigma}^{2}_{n}$ . We carry out our numerical experiments for different values of $\varepsilon$ to illustrate the convergence of the recursive algorithms proposed in this paper as $n$ increases. Following the discussion in Section 3 on the calibration of the step size for $\varepsilon>0$ , we took $\gamma_{n}=\gamma/n^{c}$ with $c=0.51$ and $\gamma=\varepsilon/(2\nu_{\min})$ where $\nu_{\min}$ stands for $\nu_{\min}=\min_{1\leq j\leq J}\nu_{j}$ . In the unregularized case $\varepsilon=0$ of Section 4, we took $\gamma=\varepsilon_{\min}/(4\nu_{\min})$ where $\varepsilon_{\min}=0.01$ is the smallest value of regularization used in these numerical experiments. The cost function is chosen as the standard Euclidean distance,

[TABLE]

In our numerical experiments, we have found that Algorithm 1 with $\alpha=0$ and $\widehat{V}_{0}=0$ , and Algorithm 2 with $\alpha=\nu_{\min}/\varepsilon$ and $\widehat{V}_{0}=\boldsymbol{v}_{J}$ share the same numerical behavior for all sufficiently large values of $n$ , that is $n\geq 10^{2}$ . Consequently, we only report here the results obtained with Algorithm 1.

5.1. Discrete setting in dimension two

We first consider a setting in dimension $d=2$ with discrete probability measures, and we investigate the regularized case where $\varepsilon>0$ . The measure $\nu$ is assumed to be the uniform measure on a grid $\mathcal{Y}\subset[0,1]^{2}$ made of $J=25$ regularly spaced points. The measure $\mu$ is obtained by projecting a mixture of Gaussian densities on an $N\times N$ regular grid of $[0,1]^{2}$ . The cardinality $I=N^{2}$ of its support $\mathcal{X}$ varies in the numerical experiments. The two measures are displayed in Figure 1 for different values of $N$ .

The computation of the Sinkhorn divergence $W_{\varepsilon}(\mu,\nu)$ is done via the package Barycenter (https://cran.r-project.org/package=Barycenter). It allows us to solve the semi-dual maximization problem (2.6) using the Sinkhorn algorithm [15] which is a fixed point iteration algorithm for obtaining a solution of the primal problem (2.1) of the form

[TABLE]

where $\boldsymbol{C}\in{\mathbb{R}}^{I\times J}$ is the cost matrix, $(u^{\ast},v^{\ast})\in{\mathbb{R}}_{+}^{I}\times{\mathbb{R}}_{+}^{J}$ is a pair of optimal dual variables and $\exp(\cdot)$ denotes the entrywise exponential. To obtain such a pair of dual variables in an iterative way, the Sinkhorn algorithm alternately scales the row and column sums of matrices written in the form (5.1) to match the marginals $\mu\in{\mathbb{R}}_{+}^{I}$ and $\nu\in{\mathbb{R}}_{+}^{J}$ , that is $u^{n+1}=\mu./(S_{\varepsilon}v^{n})$ and $v^{n+1}=\nu./(S_{\varepsilon}^{T}u^{n+1})$ , where $./$ denotes the elementwise ratio between vectors. Hence, at each iteration, the computational cost of the Sinkhorn algorithm is of the order $I^{2}+J^{2}$ . An advantage of stochastic algorithms for optimal transport is that the computational cost of the recursive estimators $\widehat{W}_{n}$ and $\widehat{\sigma}^{\,2}_{n}$ at each iteration of (3.6) and (3.32) is only of order $J$ , as discussed in detail in [26]. Moreover, the computation of these estimators requires neither the full knowledge of the measure $\mu$ nor the storage of the full cost matrix $\boldsymbol{C}$ . The computational cost at each iteration of the Sinkhorn algorithm can be reduced by using a greedy coordinate descent algorithm, referred to as the Greenkhorn algorithm [3], which consists in updating only one row or column of a matrix written in the form (5.1) by selecting the one that most violates the constraint that its row and column sums should match the desired marginals $\mu$ and $\nu$ . As described in [3], it is possible to implement this algorithm in such a way that the computational cost at each iteration is linear in the dimension of the input measures, that is of order $I+J$ .
A stochastic version of the Greenkhorn algorithm has also been proposed in [1], where, instead of selecting the column or row which most violates the constraint, one row or column is randomly selected according to a probability distribution chosen in such a way that the columns and rows with the highest violation are updated more frequently. Note that the stochastic Greenkhorn algorithm makes use of the full knowledge of $\mu$ , and it is thus a stochastic algorithm of a different nature than the Robbins-Monro algorithm investigated in this paper. In particular, our approach does not use the knowledge of $\mu$ , and the recursive estimators $\widehat{W}_{n}$ and $\widehat{\sigma}^{\,2}_{n}$ have not been considered so far in the literature. In the discrete setting, it is proposed in [26] to use a stochastic averaged gradient algorithm (which uses the knowledge of $\mu$ ) to estimate $v^{\ast}$ , and we refer to [26, Section 3] for detailed experiments on the comparison of this approach to the Sinkhorn algorithm.
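For readers who wish to experiment with the alternating scalings $u^{n+1}=\mu./(S_{\varepsilon}v^{n})$ and $v^{n+1}=\nu./(S_{\varepsilon}^{T}u^{n+1})$ , a minimal Python sketch is given below. It assumes that $S_{\varepsilon}=\exp(-\boldsymbol{C}/\varepsilon)$ entrywise and that the transport plan is $\operatorname{diag}(u)S_{\varepsilon}\operatorname{diag}(v)$ ; it is not the implementation of the Barycenter package, and the function name `sinkhorn` and the fixed number of iterations are illustrative choices.

```python
import math


def sinkhorn(mu, nu, C, eps, n_iter=500):
    """Alternate u = mu ./ (S v) and v = nu ./ (S^T u); return diag(u) S diag(v)."""
    I, J = len(mu), len(nu)
    # Kernel matrix, assumed here to be the entrywise exponential of -C / eps.
    S = [[math.exp(-C[i][j] / eps) for j in range(J)] for i in range(I)]
    u = [1.0] * I
    v = [1.0] * J
    for _ in range(n_iter):
        # Row scaling: match the first marginal mu.
        u = [mu[i] / sum(S[i][j] * v[j] for j in range(J)) for i in range(I)]
        # Column scaling: match the second marginal nu.
        v = [nu[j] / sum(S[i][j] * u[i] for i in range(I)) for j in range(J)]
    return [[u[i] * S[i][j] * v[j] for j in range(J)] for i in range(I)]
```

The sketch requires the full vector $\mu$ and the full cost matrix $\boldsymbol{C}$ as inputs, which is precisely what the recursive estimators avoid. For very small values of $\varepsilon$ , the entries of the kernel may underflow, and a log-domain variant would be preferable.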

In Figure 2, we report numerical results on the comparison between the recursive estimator $\widehat{W}_{n}$ and the numerical approximation of $W_{\varepsilon}(\mu,\nu)$ using either the Greenkhorn algorithm or its stochastic version as a function of the iterations whose computational costs are linear in the dimension of the input measures. The output of the Sinkhorn algorithm is used as the ground truth for $W_{\varepsilon}(\mu,\nu)$ . Using the results from Section 3, one can construct confidence intervals for the Sinkhorn divergence between $\mu$ and $\nu$ by considering

[TABLE]

to be approximately normally distributed. One can see in Figure 2 that the $95\%$ confidence intervals always contain the value $W_{\varepsilon}(\mu,\nu)$ for $n\geq 3\cdot 10^{5}$ and all values of $\varepsilon$ and cardinality $I=N^{2}$ of the support of $\mu$ . The Greenkhorn algorithm and its stochastic version perform similarly. For small values of $N\leq 20$ and large values of $\varepsilon$ , we observe that these algorithms converge in very few iterations. However, for larger values of $N$ , that is larger sizes $I$ of the support of $\mu$ , and for small values of $\varepsilon$ , the recursive estimator $\widehat{W}_{n}$ clearly outperforms such Greenkhorn algorithms in the number of required iterations to obtain a satisfactory approximation of $W_{\varepsilon}(\mu,\nu)$ .

5.2. Semi-discrete setting in dimension $d\geq 2$

Simulated data. We now consider a synthetic example where $\mu$ is an absolutely continuous measure obtained as a mixture of three Gaussian densities with support truncated to $[0,1]^{d}$ for $d\geq 2$ . We let the dimension $d$ grow, and the size $J$ of the support of $\nu$ grows with $d$ . For each $d\geq 2$ , $\nu$ is chosen as the uniform discrete probability measure supported on $J=5^{d}$ points drawn uniformly on the hypercube $[0,1]^{d}$ . We report results for $d\in\{3,4,5\}$ . There exist various algorithms for semi-discrete optimal transport in the unregularized case to evaluate $W_{0}(\mu,\nu)$ . We refer to [34, Section 1.2] for an overview and a discussion of their computational cost. These approaches are based on the knowledge of the measure $\mu$ that is generally projected over a partition of $[0,1]^{d}$ . However, available implementations (https://github.com/mrgt/PyMongeAmpere and https://cran.r-project.org/package=transport) based on the works in [30, 31] are typically restricted to the dimension $d=2$ . For larger values of $d$ , projecting $\mu$ on a sufficiently fine partition becomes computationally prohibitive, and storing the resulting cost matrix $\boldsymbol{C}$ becomes too memory demanding, which makes a direct use of the Sinkhorn or Greenkhorn algorithms infeasible. Moreover, to the best of our knowledge, apart from stochastic approaches as in [26], there is no other method to evaluate $W_{\varepsilon}(\mu,\nu)$ in the semi-discrete setting.

In the following numerical experiments, we briefly study how the recursive estimators $\widehat{W}_{n}$ and $\widehat{\sigma}_{n}$ scale with increasing dimension $d$ and support size $J=5^{d}$ , for $d\in\{3,4,5\}$ , in both unregularized and regularized cases. First, we observe that for various values of the regularization parameter $\varepsilon$ and the dimension $d$ , the confidence intervals obtained via the Gaussian approximation (5.2) give an accurate estimation of the range of variation of $\widehat{W}_{n}$ calculated by Monte Carlo simulations as shown in Figure 3. Note that, in these numerical experiments, we conjecture that the Gaussian approximation (5.2) also holds for $\varepsilon=0$ .

Finally, in Figure 4, we display the evolution of the size $2\times 1.96\widehat{\sigma}_{n}/\sqrt{n}$ of the confidence intervals for $W_{\varepsilon}(\mu,\nu)$ (after $n=4\cdot 10^{4}$ iterations) based on the Gaussian approximation (5.2) as the dimension $d$ increases and $J=5^{\lceil\sqrt{d}\rceil}$ for $2\leq d\leq 20$ . The size of these confidence intervals is clearly an increasing function of $d$ . This suggests that the number $n$ of iterations should increase with $d$ to keep the same level of accuracy when estimating $W_{\varepsilon}(\mu,\nu)$ .
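Since the size of the interval is $2\times 1.96\,\widehat{\sigma}_{n}/\sqrt{n}$ , halving it requires roughly four times as many iterations when $\widehat{\sigma}_{n}$ is stable. The short Python sketch below turns this formula into a rule of thumb for the number of iterations needed to reach a target size; the function names and the numerical values used with them are hypothetical, and the standard deviation estimate is treated as a fixed plug-in value.

```python
import math


def ci_width(sigma_hat, n, z=1.96):
    """Size 2 * z * sigma_hat / sqrt(n) of the Gaussian confidence interval."""
    return 2.0 * z * sigma_hat / math.sqrt(n)


def iterations_for_width(sigma_hat, target_width, z=1.96):
    """Smallest n such that ci_width(sigma_hat, n, z) <= target_width."""
    return math.ceil((2.0 * z * sigma_hat / target_width) ** 2)
```

For instance, `iterations_for_width(0.37, 0.013)` returns the smallest number of iterations for which an interval built with a plug-in standard deviation of $0.37$ has size at most $0.013$ .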

Real data. We consider a dataset containing spatial locations $X_{1},\ldots,X_{N}$ of reported incidents of crime with the exception of murders in Chicago in 2014, publicly available at https://data.cityofchicago.org. These $N$ data points are ordered in a chronological manner from January to December. Victims’ addresses are shown at the block level only. Specific locations are not identified in order to protect the privacy of victims and to have a sufficient amount of data for the statistical analysis. For simplicity, spatial locations of the city of Chicago are represented on the unit square $[0,1]^{2}$ . For the year 2014, $N=16104$ spatial locations of reported incidents of crime in Chicago are available. They are displayed in Figure 5(a). Chicago has $J=23$ Police stations whose locations are shown in Figure 5(b) with a kernel density estimation of the unknown distribution $\mu$ of crime locations.

We assume that Police stations have the same capacity, and they are thus modeled by the uniform discrete measure $\nu$ on these locations. We first report the evolution of the recursive confidence intervals $\widehat{W}_{n}\pm 1.96\widehat{\sigma}_{n}/\sqrt{n}$ for various values of $\varepsilon$ in the unregularized and regularized cases. To evaluate the convergence of our stochastic algorithm, we have also computed the values of $W_{\varepsilon}(\widehat{\mu}_{N},\nu)$ where $\widehat{\mu}_{N}$ is the standard empirical measure approximating $\mu$ .

For the regularized case $\varepsilon>0$ , we used the Sinkhorn algorithm. For $\varepsilon=0$ , we followed the method proposed in [30] that is specific to the Euclidean cost $c(x,y)=\|x-y\|_{2}$ and implemented in the package Transport. One can observe in Figure 5(c) a very good convergence of the algorithm for different values of $\varepsilon$ .

Finally, we consider the problem of estimating an optimal partition of the city of Chicago into 23 districts matching expected locations of crimes with the capacity of Police stations so that the expected cost of travelling from a station to a crime’s location is minimal. This can be done by estimating, in the unregularized case, an optimal map $T$ which pushes forward $\mu$ onto $\nu$ . Since $\mu$ is absolutely continuous, it is well-known [10, 14] that there exists a unique optimal mapping $T:\mathop{\rm supp}\nolimits(\mu)\to\{y_{1},\ldots,y_{J}\}$ which pushes forward $\mu$ onto $\nu$ . This mapping is clearly piecewise constant. It follows from Corollary 1.2 in [31] that for all $1\leq j\leq J$ ,

[TABLE]

where $v^{\ast}\in{\mathbb{R}}^{J}$ is any maximiser of the semi-dual problem (2.6). The sets $\{T^{-1}(y_{j})\}$ are the so-called Laguerre cells that correspond to an important concept from computational geometry (see e.g. [30, 31] and Chapter 5 in [17]). Then, based on a sample $X_{1},\ldots,X_{N}$ from $\mu$ , it is natural to estimate the Laguerre cells by

[TABLE]

where $\widehat{V}_{N,j}^{0}$ stands for the $j$ -th entry of the vector $\widehat{V}_{N}^{0}$ obtained from (4.2). An example of estimated Laguerre cells is given in Figure 5(d). We observe that cells of small size are located near the modes of the estimated distribution of crime locations.
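Assigning locations to estimated Laguerre cells amounts to a weighted nearest-neighbour search. The Python sketch below assumes that the cell of $y_{j}$ collects the points $x$ for which $j$ minimizes $\|x-y_{k}\|_{2}-v_{k}$ over $1\leq k\leq J$ , with $v$ playing the role of $\widehat{V}_{N}^{0}$ ; this convention and the function name `laguerre_labels` are assumptions made for illustration.

```python
import math


def laguerre_labels(points, ys, v):
    """Assign each point x to an index j minimizing ||x - y_j|| - v_j."""
    labels = []
    for x in points:
        gaps = [math.dist(x, yj) - vj for yj, vj in zip(ys, v)]
        # On ties, the first minimizing index is kept.
        labels.append(min(range(len(ys)), key=gaps.__getitem__))
    return labels
```

With $v=0$ , the labels reduce to the usual Voronoi assignment, and increasing one entry $v_{j}$ enlarges the corresponding cell.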

Appendix A

Three keystone lemmas.

The proofs of the main results of this paper rely on three keystone lemmas. The first one is devoted to the spectrum of the Hessian matrix $A_{\varepsilon}(v)$ given by (3.14). Denote by $\lambda_{1},\ldots,\lambda_{J}$ its real eigenvalues and by $v_{1},\ldots,\boldsymbol{v}_{J}$ its associated orthonormal eigenvectors, where $\boldsymbol{v}_{J}=\frac{1}{\sqrt{J}}\mathbf{1}_{J}$ and $\lambda_{J}=0$ . Let

[TABLE]

Lemma A.1.

For any $v\in{\mathbb{R}}^{J}$ , the Hessian matrix $A_{\varepsilon}(v)$ is negative semi-definite with $\operatorname{rank}(A_{\varepsilon}(v))=J-1$ . More precisely,

[TABLE]

where for all $x\in\mathcal{X}$ and for all $1\leq j\leq J-1$ , the positive eigenvalues $\lambda_{j}^{\varepsilon}(x,v)$ are given by

[TABLE]

with

[TABLE]

Moreover, we also have

[TABLE]

which implies that

[TABLE]

Remark A.1.

For any $v\in{\mathbb{R}}^{J}$ , denote

[TABLE]

where $\lambda_{\max}A_{\varepsilon,\alpha}(v)$ stands for the maximum eigenvalue of the Hessian matrix $A_{\varepsilon,\alpha}(v)$ given by (3.15). It is not hard to see that for all $v\in{\mathbb{R}}^{J}$ , the matrix $A_{\varepsilon,\alpha}(v)$ is negative definite. More precisely, its negative eigenvalues are $\lambda_{1},\ldots,\lambda_{J-1},-\alpha$ and

[TABLE]

As a matter of fact, $A_{\varepsilon,\alpha}(v)\boldsymbol{v}_{J}=-\alpha\boldsymbol{v}_{J}$ , while for any $u\in\langle\boldsymbol{v}_{J}\rangle^{\perp}$ , $A_{\varepsilon,\alpha}(v)u=A_{\varepsilon}(v)u$ .

Proof.

First of all, we obtain from (3.5) together with (3.14) that for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

where, for all $x\in\mathcal{X}$ ,

[TABLE]

We deduce from Theorem 1 in [47] that for any $v\in{\mathbb{R}}^{J}$ , $\operatorname{rank}(A_{\varepsilon}(x,v))=J-1$ , and that the positive eigenvalues of $A_{\varepsilon}(x,v)$ are given by (A.2) except the smallest one $\lambda_{J}^{\varepsilon}(x,v)=0$ which is associated, for all $x\in\mathcal{X}$ , to the same eigenvector $\boldsymbol{v}_{J}$ . Consequently, it follows from (A.5) that $A_{\varepsilon}(v)$ is a negative semi-definite matrix such that $\operatorname{rank}(A_{\varepsilon}(v))=J-1$ . In addition, (A.5) clearly leads to (A.1). Furthermore, equality (A.3) follows from (3.14) together with (3.4), which implies that $\nu={\mathbb{E}}\bigl{[}\pi(X,v^{\ast})\bigr{]}$ . Hereafter, we have the decomposition

[TABLE]

where, for all $x\in\mathcal{X}$ ,

[TABLE]

It follows from Theorem 6 and inequality (5.11) in [46] that the second smallest eigenvalue of ${\mathcal{A}}_{\varepsilon}(x,v^{\ast})$ is lower bounded by $\min_{1\leq j\leq J}\nu_{j}$ for all $x\in\mathcal{X}$ . Hence, we deduce inequality (A.4) from this lower bound and the decomposition (A.6), which completes the proof of Lemma A.1. ∎
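The structure used in this proof can be checked numerically on a small example. The Python sketch below assumes that $A_{\varepsilon}(x,v)=\operatorname{diag}(\pi(x,v))-\pi(x,v)\pi(x,v)^{T}$ , that is, the covariance matrix of a one-hot random vector with distribution $\pi(x,v)$ ; this explicit form and all function names are assumptions made for illustration. Under this assumption, every row sums to zero, so that the vector of ones lies in the kernel, and the quadratic form at $z$ equals the variance of $z$ under $\pi(x,v)$ , which is non-negative.

```python
import math


def softmax_weights(scores, nu):
    """Weights pi_j proportional to nu_j * exp(scores_j), computed stably."""
    m = max(scores)
    w = [nj * math.exp(sj - m) for nj, sj in zip(nu, scores)]
    s = sum(w)
    return [wj / s for wj in w]


def covariance_matrix(pi):
    """Matrix diag(pi) - pi pi^T of a one-hot random vector with law pi."""
    J = len(pi)
    return [[(pi[i] if i == j else 0.0) - pi[i] * pi[j] for j in range(J)]
            for i in range(J)]


def quadratic_form(M, z):
    """Return z^T M z for a square matrix M stored as a list of rows."""
    J = len(z)
    return sum(z[i] * M[i][j] * z[j] for i in range(J) for j in range(J))
```

The same variance interpretation reappears in the proof of Lemma A.2, where the mean and the variance of a discrete random variable with distribution $\pi(x,v_{t})$ are used to control the derivatives of $\varphi$ .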

The second lemma deals with the Taylor expansion of the concave function $H_{\varepsilon}$ . It allows us to control the excess risk of our Sinkhorn divergence estimator. Let $g$ be the strictly increasing function defined, for all positive $\eta$ , by

[TABLE]

One can observe that we always have $g(\eta)\leq-\exp(-\eta)$ .

Lemma A.2.

For any $v\in{\mathbb{R}}^{J}$ , we have

[TABLE]

where

[TABLE]

and

[TABLE]

where for all $x\in\mathcal{X}$ and for all $1\leq j\leq J-1$ , the positive eigenvalues $\lambda_{j}^{\varepsilon}(x,v)$ are given by (A.2). Moreover, for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

Moreover, assume that $\|v-v^{\ast}\|\leq A$ for some positive constant $A$ . Then,

[TABLE]

Remark A.2.

On the one hand, we deduce from (A.2) that for all $x\in\mathcal{X}$ and for all $1\leq j\leq J-1$ , $0<\lambda_{j}^{\varepsilon}(x,v)<1$ which means that $\Lambda_{\varepsilon}\leq 1/\varepsilon$ . On the other hand, inequalities (A.7) and (A.10) are also true for the strictly concave function $H_{\varepsilon,\alpha}$ given by (3.9). As a matter of fact, it is only necessary to replace $H_{\varepsilon}$ and $\Lambda_{\varepsilon}$ by $H_{\varepsilon,\alpha}$ and $\max(\Lambda_{\varepsilon},\alpha)$ in (A.7). Moreover, since $\langle v^{\ast},\boldsymbol{v}_{J}\rangle=0$ , one can observe for (A.10) that for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

Inequality (A.10) is typically a consequence of the so-called notion of generalized self-concordance as introduced in [5] for the study of logistic regression. Generalized self-concordance has been widely used in [6] to obtain rates of convergence of order $1/n$ for non-strongly convex functions using a constant step size.

Proof.

The first step of the proof is to establish a second-order Taylor expansion of the concave function $H_{\varepsilon}$ . For any $v\in{\mathbb{R}}^{J}$ and for all $t$ in the interval $[0,1]$ , denote $v_{t}=v^{\ast}+t(v-v^{\ast})$ . Let $\varphi$ be the function defined, for all $t\in[0,1]$ , by

[TABLE]

The second-order Taylor expansion of $\varphi$ with integral remainder is given by

[TABLE]

However, it follows from the chain rule of differentiation that for all $t\in[0,1]$ ,

[TABLE]

Hence, as $\varphi(1)=H_{\varepsilon}(v)$ , $\varphi(0)=H_{\varepsilon}(v^{\ast})$ and $\varphi^{\prime}(0)=\langle v-v^{\ast},\nabla H_{\varepsilon}(v^{\ast})\rangle=0$ , we obtain from (A.12) that for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

We already saw in (A.5) that $-\varepsilon A_{\varepsilon}(v)={\mathbb{E}}\bigl{[}A_{\varepsilon}(X,v)\bigr{]}$ . In addition, for all $x\in\mathcal{X}$ ,

[TABLE]

which implies by (A.8) and (A.9) that $-A_{\varepsilon}(v_{t})\leq\ell_{A_{\varepsilon}}(v_{t})I_{J}\leq\Lambda_{\varepsilon}I_{J}$ . Consequently, we deduce from (A.13) that for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

which clearly leads to

[TABLE]

In order to prove (A.10), we have to compute the third-order derivative of the function $\varphi$ which is given by

[TABLE]

where for all $v,a,b,c\in{\mathbb{R}}^{J}$ ,

[TABLE]

A direct calculation of this third-order derivative is not easy. However, it follows from (3.4) that

[TABLE]

where for all $x\in\mathcal{X}$ , $m(x,v_{t})=\langle v-v^{\ast},\pi(x,v_{t})\rangle$ . Moreover, one can notice that for all $x\in\mathcal{X}$ and for all $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

Consequently, we obtain from the chain rule of differentiation together with (A.14) and (A.15) that

[TABLE]

where for all $x\in\mathcal{X}$ ,

[TABLE]

In the same way, we also obtain from (A.15) and (A.16) that

[TABLE]

where for all $x\in\mathcal{X}$ ,

[TABLE]

Hence, we deduce from the previous calculation that for all $x\in\mathcal{X}$ ,

[TABLE]

and

[TABLE]

One may remark that for all $x\in\mathcal{X}$ , the variance term $\sigma^{2}(x,v_{t})\geq 0$ . More precisely, $m(x,v_{t})$ and $\sigma^{2}(x,v_{t})$ are the mean and the variance of a discrete random variable $Z(x,v_{t})$ with values in $\{v_{1}-v_{1}^{\ast},\ldots,v_{J}-v_{J}^{\ast}\}$ and distribution $(\pi_{1}(x,v_{t}),\dots,\pi_{J}(x,v_{t}))$ . Moreover, the third cumulant $\kappa^{3}(x,v_{t})$ of the random variable $Z(x,v_{t})$ is given, for all $x\in\mathcal{X}$ , by

[TABLE]

Therefore, we obtain from (A.19) and (A.20) that for all $x\in\mathcal{X}$ ,

[TABLE]

It is not hard to see from the Cauchy-Schwarz inequality that

[TABLE]

Consequently, inequality (A.21) ensures that for all $x\in\mathcal{X}$ ,

[TABLE]

which implies, via (A.16) and (A.17), that for all $t\in[0,1]$ ,

[TABLE]

Inequality (A.22) means that the function $\varphi$ satisfies the so-called generalized self-concordance property as defined in Appendix B of [6]. We are now in position to prove (A.10). Let $\Phi$ be the function defined, for all $t\in[0,1]$ , by

[TABLE]

The second-order Taylor expansion of $\Phi$ with integral remainder is given by

[TABLE]

We already saw that for all $t\in[0,1]$ ,

[TABLE]

Hence, as $\Phi(1)=\nabla H_{\varepsilon}(v)$ , $\Phi(0)=\nabla H_{\varepsilon}(v^{\ast})=0$ and $\Phi^{\prime}(0)=\nabla^{2}H_{\varepsilon}(v^{\ast})(v-v^{\ast})$ , we obtain from (A.23) that for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

For any $z\in{\mathbb{R}}^{J}$ , we clearly have $\langle z,\Phi^{\prime\prime}(t)\rangle=\nabla^{3}H_{\varepsilon}(v_{t})[z,v-v^{\ast},v-v^{\ast}]$ . Consequently, it follows from (A.22) that for all $t\in[0,1]$ ,

[TABLE]

Taking $z=\Phi^{\prime\prime}(t)/\|\Phi^{\prime\prime}(t)\|$ in (A.25), we find that for all $t\in[0,1]$,

[TABLE]

Integrating by parts, we deduce from (A.24) and (A.26) that for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

Finally, as $\varphi(1)=H_{\varepsilon}(v)$ , $\varphi(0)=H_{\varepsilon}(v^{\ast})$ , we obtain from (A.7) together with (A.27) that for any $v\in{\mathbb{R}}^{J}$ ,

[TABLE]

which is exactly what we wanted to prove. It only remains to prove inequality (A.11). We shall proceed as in the proof of Lemma 13 in [6]. It follows from (A.22) that for all $t\in[0,1]$ ,

[TABLE]

By integrating (A.28) between $0$ and $t$, we obtain that for all $t\in[0,1]$,

[TABLE]

which leads to

[TABLE]

However, we already saw that $\varphi^{\prime\prime}(t)=(v-v^{\ast})^{T}\nabla^{2}H_{\varepsilon}(v_{t})(v-v^{\ast})$, which implies that $\varphi^{\prime\prime}(0)\leq-\rho^{\ast}\|v-v^{\ast}\|^{2}$. By integrating (A.29) between $0$ and $1$, we find that for any $v\in{\mathbb{R}}^{J}$,

[TABLE]

since $\varphi^{\prime}(0)=0$ and $\varphi^{\prime}(1)=\langle v-v^{\ast},\nabla H_{\varepsilon}(v)\rangle$ . Let $g$ be the strictly increasing function defined, for all positive $\eta$ , by

[TABLE]

We immediately deduce from (A.30) that, as soon as $\|v-v^{\ast}\|\leq A$ for some positive constant $A\leq 1$ ,

[TABLE]

proving that inequality (A.11) holds for $A\leq 1$ . Hereafter, assume that $\|v-v^{\ast}\|\leq A$ where $A>1$ . If $\|v-v^{\ast}\|\leq 1$ , we clearly obtain from (A.30) that

[TABLE]

meaning that inequality (A.11) is satisfied. Moreover, assume that $1<\|v-v^{\ast}\|\leq A$ . We infer from (A.29) that

[TABLE]

which implies that

[TABLE]

The assumption $\|v-v^{\ast}\|\leq A$ clearly implies that $\|v-v^{\ast}\|\geq A^{-1}\|v-v^{\ast}\|^{2}$ . Finally, we deduce from (A.31) that

[TABLE]

It ensures that (A.11) holds for any $A>1$ , completing the proof of Lemma A.2. ∎

The third lemma provides a sharp upper bound for a very simple recursive inequality, which will be useful to control the excess risk of our Sinkhorn divergence estimator.

Lemma A.3.

Let $(Z_{n})$ be a sequence of positive real numbers satisfying, for all $n\geq 0$ , the recursive inequality

[TABLE]

where $a,b,\alpha$ and $\beta$ are positive constants satisfying $a\leq 1$ , $\alpha\leq 1$ , $1<\beta<2$ and $\beta\leq 2\alpha$ with $\beta<a+1$ in the special case where $\alpha=1$ . Then, there exists a positive constant $C$ such that, for any $n\geq 1$ ,

[TABLE]
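The behaviour described by Lemma A.3 can be explored numerically. The sketch below is only an illustration under stated assumptions: since the displayed formulas are not reproduced here, it assumes that the recursion has the standard form $Z_{n+1}\leq(1-a/(n+1)^{\alpha})Z_{n}+b/(n+1)^{\beta}$ and that the conclusion is a bound of order $n^{-(\beta-\alpha)}$. It iterates the recursion with equality and records the rescaled sequence $n^{\beta-\alpha}Z_{n}$, which should remain bounded. The function name and the parameter values are hypothetical.

```python
def simulate_recursion(a, b, alpha, beta, z0, n_steps):
    """Iterate Z_{n+1} = (1 - a/(n+1)^alpha) Z_n + b/(n+1)^beta.

    The form of the recursion is an assumption made for illustration.
    Returns the rescaled values (n+1)^(beta - alpha) * Z_{n+1}.
    """
    z = z0
    scaled = []
    for n in range(n_steps):
        z = (1.0 - a / (n + 1) ** alpha) * z + b / (n + 1) ** beta
        # After the update, z holds Z_{n+1}.
        scaled.append((n + 1) ** (beta - alpha) * z)
    return scaled


# Hypothetical parameters with a <= 1, alpha < 1, 1 < beta < 2, beta <= 2 * alpha.
scaled = simulate_recursion(a=1.0, b=1.0, alpha=0.75, beta=1.5, z0=1.0, n_steps=10000)
print(max(scaled), scaled[-1])
```

If the rescaled sequence stays bounded, its supremum plays the role of the constant $C$ for this particular choice of parameters.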

Proof.

One can observe that the first term on the right-hand side of inequality (A.33) is always non-negative thanks to the condition $a\leq 1$. We shall proceed as in the proof of Theorem 1 in [7]. It follows from (A.33) that for all $n\geq 1$,

[TABLE]

First of all, we focus our attention on the case where $0<\alpha<1$ . We clearly have

[TABLE]

Hence, we obtain from the elementary inequality $1-x\leq\exp(-x)$ that

[TABLE]

where

[TABLE]
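The mechanism behind this step can be checked numerically. The sketch below assumes, for illustration only, a product of the form $\prod_{k=1}^{n}(1-a/k^{\alpha})$ with $0<\alpha<1$ and $0<a\leq 1$. It compares it with the bound $\exp(-a\sum_{k=1}^{n}k^{-\alpha})$ obtained from $1-x\leq\exp(-x)$, and with the further bound obtained from the integral comparison $\sum_{k=1}^{n}k^{-\alpha}\geq((n+1)^{1-\alpha}-1)/(1-\alpha)$. The function name is hypothetical and the constants need not match those of the paper.

```python
import math


def product_and_bounds(a, alpha, n):
    """Compare prod_{k<=n} (1 - a/k^alpha) with two exponential upper bounds."""
    prod = 1.0
    partial_sum = 0.0
    for k in range(1, n + 1):
        prod *= 1.0 - a / k ** alpha
        partial_sum += k ** (-alpha)
    # Bound from the elementary inequality 1 - x <= exp(-x).
    exp_bound = math.exp(-a * partial_sum)
    # Bound from comparing the sum with the integral of x^(-alpha) on [1, n+1].
    integral_bound = math.exp(-a * ((n + 1) ** (1 - alpha) - 1) / (1 - alpha))
    return prod, exp_bound, integral_bound


prod, exp_bound, integral_bound = product_and_bounds(a=0.5, alpha=0.75, n=50)
assert prod <= exp_bound <= integral_bound
```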

Deriving an upper bound for the second term in the right hand side of inequality (A.35) is more involved. To this end, we denote for all $1\leq k\leq n+1$ ,

[TABLE]

with the convention that $P_{n+1}^{n}=1$. For some integer $1\leq m\leq n$ which will be fixed soon, we have the decomposition

[TABLE]

Therefore, noticing that $P_{k+1}^{n}\leq P_{m+1}^{n}$ for all $1\leq k\leq m$ , we obtain that

[TABLE]

However, one can observe that for all $1\leq k\leq n$ ,

[TABLE]

which implies that

[TABLE]

Furthermore, we clearly have

[TABLE]

In addition, by choosing the integer $m$ such that $2n\leq 4m\leq 3n$ with $n\geq 2$ , we obtain that $n^{1-\alpha}-m^{1-\alpha}\geq dn^{1-\alpha}$ where

[TABLE]

Hence, it follows from inequality (A.39) that

[TABLE]

Consequently, we deduce from (A.37) together with (A.38) and (A.40) that the second term in the right hand side of (A.35) is bounded by

[TABLE]

Therefore, we obtain from (A.35), (A.36) and (A.41) that for all $n\geq 2$ ,

[TABLE]

It implies that there exists a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

Hereafter, we assume that $\alpha=1$ . It is not hard to see that

[TABLE]

Consequently,

[TABLE]

Moreover, for all $1\leq k\leq n$ , we also have

[TABLE]

which implies that

[TABLE]

Hence,

[TABLE]

leading to

[TABLE]

since $\beta<a+1$ . Thus, we obtain from (A.35) together with (A.42) and (A.43) that

[TABLE]

Finally, as $a>\beta-1$ , we deduce from (A.44) that there exists a positive constant $C$ such that for all $n\geq 1$ ,

[TABLE]

which completes the proof of Lemma A.3. ∎

Appendix B

Proofs of the main results.

We shall now proceed to the proofs of the main results of the paper. We recall that $(X_{n})$ is a sequence of independent and identically distributed random vectors sharing the same distribution as $X$ . We shall denote by ${\mathcal{F}}_{n}$ the $\sigma$ -algebra of the events occurring up to time $n$ , that is ${\mathcal{F}}_{n}=\sigma(X_{1},\ldots,X_{n})$ .

Proof of Theorem 3.1. We obtain from (3.8) and (3.23) that for all $n\geq 0$ ,

$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}Y_{n+1}\qquad\text{(B.1)}$

where the random vector $Y_{n+1}$ satisfies ${\mathbb{E}}[Y_{n+1}|{\mathcal{F}}_{n}]=\int_{\mathcal{X}}\nabla_{v}h_{\varepsilon}(x,\widehat{V}_{n})d\mu(x)-\alpha\langle\widehat{V}_{n},\boldsymbol{v}_{J}\rangle\boldsymbol{v}_{J}.$ Hence, it follows from (3.9) that

${\mathbb{E}}[Y_{n+1}|{\mathcal{F}}_{n}]=\nabla H_{\varepsilon,\alpha}(\widehat{V}_{n})\qquad\text{(B.2)}$

where, for all $v\in{\mathbb{R}}^{J}$ , $\nabla H_{\varepsilon,\alpha}(v)=\nabla H_{\varepsilon}(v)-\alpha\langle v,\boldsymbol{v}_{J}\rangle\boldsymbol{v}_{J}.$ The gradient $\nabla H_{\varepsilon,\alpha}$ is a continuous function from ${\mathbb{R}}^{J}$ to ${\mathbb{R}}^{J}$ such that $\nabla H_{\varepsilon,\alpha}(v^{\ast})=\nabla H_{\varepsilon}(v^{\ast})-\alpha\langle v^{\ast},\boldsymbol{v}_{J}\rangle\boldsymbol{v}_{J}=0$ . Moreover, for any $v\in{\mathbb{R}}^{J}$ such that $v\neq v^{\ast}$ ,

$\langle v-v^{\ast},\nabla H_{\varepsilon,\alpha}(v)\rangle<0.\qquad\text{(B.3)}$

As a matter of fact, if $v$ belongs to $\langle\boldsymbol{v}_{J}\rangle$ and $v\neq 0$ , then $\langle v-v^{\ast},\nabla H_{\varepsilon}(v)\rangle\leq 0$ and $\langle v,\boldsymbol{v}_{J}\rangle^{2}=||v||^{2}>0$ , which ensures that (B.3) is satisfied on $\langle\boldsymbol{v}_{J}\rangle$ . Moreover, if $v$ belongs to $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ , then $\langle v-v^{\ast},\nabla H_{\varepsilon}(v)\rangle<0$ since $H_{\varepsilon}$ has a unique maximum on $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ . It means that (B.3) also holds true on $\langle\boldsymbol{v}_{J}\rangle^{\perp}$ . Denote $\sigma^{2}(\widehat{V}_{n})={\mathbb{E}}[||Y_{n+1}-\nabla H_{\varepsilon,\alpha}(\widehat{V}_{n})||^{2}|{\mathcal{F}}_{n}]$ and $\Psi_{\varepsilon}(\widehat{V}_{n})={\mathbb{E}}[||Y_{n+1}||^{2}|{\mathcal{F}}_{n}]$ . Obviously, for any $v\in{\mathbb{R}}^{J}$ , $\Psi_{\varepsilon}(v)=\sigma^{2}(v)+||\nabla H_{\varepsilon,\alpha}(v)||^{2}$ . In addition, it follows from the upper bound (3.11) that

[TABLE]

Hence, we deduce from (B.4) that for all $v\in{\mathbb{R}}^{J}$, $\Psi_{\varepsilon}(v)\leq 2(4+\alpha^{2}||v||^{2})\leq K(1+||v||^{2})$ where $K=2\max(4,\alpha^{2})$. Therefore, all the assumptions of Theorem 1.4.26 of Duflo [21] are satisfied and we can conclude from (3.8) that $\widehat{V}_{n}\rightarrow v^{\ast}$ a.s., which completes the proof of Theorem 3.1. ∎
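To make the recursion analysed in this proof concrete, here is a minimal sketch of a Robbins-Monro iteration of this type. It relies on several assumptions that are not restated in this appendix: the stochastic gradient is taken to be $\nu-\pi(x,v)$ with softmax-type weights $\pi_{j}(x,v)\propto\nu_{j}\exp((v_{j}-c(x,y_{j}))/\varepsilon)$, the vector $\boldsymbol{v}_{J}$ is taken to be the normalized vector of ones, and the step is $\gamma_{n}=\gamma/n^{c}$. All function and parameter names are hypothetical.

```python
import math
import random


def softmax_weights(x, v, ys, nu, eps, cost):
    """Weights pi_j(x, v) proportional to nu_j * exp((v_j - c(x, y_j)) / eps)."""
    logits = [math.log(nj) + (vj - cost(x, yj)) / eps
              for vj, yj, nj in zip(v, ys, nu)]
    m = max(logits)  # subtract the maximum for numerical stability
    w = [math.exp(l - m) for l in logits]
    total = sum(w)
    return [wi / total for wi in w]


def robbins_monro(sample_x, ys, nu, eps, alpha, gamma, c, n_iter, cost, seed=0):
    """Stochastic ascent V_{n+1} = V_n + gamma_{n+1} * Y_{n+1} (assumed form)."""
    rng = random.Random(seed)
    J = len(ys)
    v = [0.0] * J
    for n in range(1, n_iter + 1):
        x = sample_x(rng)
        pi = softmax_weights(x, v, ys, nu, eps, cost)
        # <v, v_J> v_J has all coordinates equal to mean(v) when v_J = 1/sqrt(J).
        mean_v = sum(v) / J
        step = gamma / n ** c
        v = [vj + step * (nj - pj - alpha * mean_v)
             for vj, nj, pj in zip(v, nu, pi)]
    return v


# Hypothetical example: standard Gaussian samples, three target points, squared cost.
v_hat = robbins_monro(
    sample_x=lambda rng: rng.gauss(0.0, 1.0),
    ys=[-1.0, 0.0, 1.0], nu=[1 / 3, 1 / 3, 1 / 3],
    eps=0.1, alpha=1.0, gamma=1.0, c=0.75, n_iter=2000,
    cost=lambda x, y: (x - y) ** 2,
)
print(v_hat)
```

Under these assumptions the stochastic gradient $\nu-\pi(x,v)$ sums to zero, so iterates started at the origin remain orthogonal to $\boldsymbol{v}_{J}$ up to rounding errors.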

Proof of Theorem 3.2. We shall now prove the asymptotic normality of the modified Robbins-Monro algorithm (3.8) which can be rewritten as

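$\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\bigl(\nabla H_{\varepsilon,\alpha}(\widehat{V}_{n})+\varepsilon_{n+1}\bigr)\qquad\text{(B.5)}$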

where $\varepsilon_{n+1}=Y_{n+1}-{\mathbb{E}}[Y_{n+1}|{\mathcal{F}}_{n}]$ . First of all, we clearly have ${\mathbb{E}}[\varepsilon_{n+1}|{\mathcal{F}}_{n}]=0$ . Moreover, ${\mathbb{E}}[\varepsilon_{n+1}\varepsilon_{n+1}^{T}|{\mathcal{F}}_{n}]={\mathbb{E}}[Y_{n+1}Y_{n+1}^{T}|{\mathcal{F}}_{n}]-\nabla H_{\varepsilon,\alpha}(\widehat{V}_{n})\nabla H_{\varepsilon,\alpha}(\widehat{V}_{n})^{T}$ where, thanks to (3.1) and (3.23),

[TABLE]

with

$\zeta_{n}=\int_{\mathcal{X}}\pi(x,\widehat{V}_{n})d\mu(x).$

It immediately follows from the above identity that

[TABLE]

However, we can deduce from Theorem 3.1 that

[TABLE]

Hence, it follows from (B.6), (B.7) together with Theorem 3.1 that

[TABLE]

where $\Gamma_{\varepsilon}(v)$ is given by (3.13). Moreover, we obtain from (3.11) and the upper bound (B.4) that

[TABLE]

which implies that

[TABLE]

Therefore, as $\widehat{V}_{n}$ converges almost surely to $v^{\ast}$ , we find from (B.8) that

[TABLE]

Furthermore, we already saw that $\nabla H_{\varepsilon,\alpha}(v)=\nabla H_{\varepsilon}(v)-\alpha\langle v,\boldsymbol{v}_{J}\rangle\boldsymbol{v}_{J}$. Consequently, $\nabla^{2}H_{\varepsilon,\alpha}(v)=A_{\varepsilon,\alpha}(v)=A_{\varepsilon}(v)-\alpha\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}$ and it follows from (A.10) that for all $v\in{\mathbb{R}}^{J}$, $\nabla H_{\varepsilon,\alpha}(v)=A_{\varepsilon,\alpha}(v^{\ast})(v-v^{\ast})+O\bigl(||v-v^{\ast}||^{2}\bigr).$ Finally, under the assumption (3.16) on the maximum eigenvalue of $A_{\varepsilon,\alpha}(v^{\ast})$, we obtain from Theorem 1 of Pelletier [37] or Theorem 2.3 in the more recent contribution of Zhang [51] that $\sqrt{n}\bigl(\widehat{V}_{n}-v^{\ast}\bigr)\stackrel{\mathcal{L}}{\longrightarrow}\mathcal{N}_{J}\bigl(0,\gamma\Sigma^{\ast}\bigr),$ which is exactly what we wanted to prove. ∎

Proof of Theorem 3.3. We already saw from (B.5) that for all $n\geq 0$ , $\widehat{V}_{n+1}=\widehat{V}_{n}+\gamma_{n+1}\bigl{(}\nabla H_{\varepsilon,\alpha}(\widehat{V}_{n})+\varepsilon_{n+1}\bigr{)}$ where ${\mathbb{E}}[\varepsilon_{n+1}|{\mathcal{F}}_{n}]=0$ and

[TABLE]

In addition, we also have from (B.9) that $\sup_{n\geq 0}{\mathbb{E}}[||\varepsilon_{n+1}||^{4}|{\mathcal{F}}_{n}]<\infty$ a.s. Then, the quadratic strong law (3.19) immediately follows from Theorem 3 in [36]. We also deduce the law of iterated logarithm (3.20) from Theorem 1 in [36], which completes the proof of Theorem 3.3. ∎

Proof of Theorem 3.4. The almost sure convergence (3.24) follows from (3.12) by the Cesàro mean convergence theorem [21], while the asymptotic normality (3.25) is a direct consequence of the averaging principle for stochastic algorithms given e.g. by Theorem 2 of Polyak and Juditsky [38]. In order to prove (3.26), one can observe that if $(\overline{V}_{n})$ is the sequence associated with Algorithm 1, then for all $n\geq 0$, $\overline{V}_{n}$ belongs to $\langle\boldsymbol{v}_{J}\rangle^{\perp}$. It follows from Lemma A.1 that the Moore-Penrose inverse of $A_{\varepsilon}(v^{\ast})$ is given by (3.27). Hence, if one denotes by $P_{J}$ the orthogonal projection on $\langle\boldsymbol{v}_{J}\rangle^{\perp}$, the asymptotic normality (3.26) is a direct consequence of the asymptotic normality (3.25) combined with the facts that $P_{J}\bigl(\overline{V}_{n}-v^{\ast}\bigr)=\overline{V}_{n}-v^{\ast}$ and $(A_{\varepsilon,\alpha}(v^{\ast}))^{-1}=A_{\varepsilon}^{{\dagger}}(v^{\ast})-\frac{1}{\alpha}\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}$, which implies that $P_{J}(A_{\varepsilon,\alpha}(v^{\ast}))^{-1}\Gamma_{\varepsilon}(v^{\ast})(A_{\varepsilon,\alpha}(v^{\ast}))^{-1}P_{J}=A_{\varepsilon}^{{\dagger}}(v^{\ast})\Gamma_{\varepsilon}(v^{\ast})A_{\varepsilon}^{{\dagger}}(v^{\ast}).$ ∎
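The inverse formula used at the end of this proof can be verified directly. The short computation below is a sketch under the following assumptions, which are consistent with the role of Lemma A.1 but are not restated here: $A_{\varepsilon}(v^{\ast})$ is symmetric with $A_{\varepsilon}(v^{\ast})\boldsymbol{v}_{J}=0$, $\|\boldsymbol{v}_{J}\|=1$, $P_{J}=I_{J}-\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}$, and the Moore-Penrose inverse satisfies $A_{\varepsilon}A_{\varepsilon}^{\dagger}=P_{J}$ and $A_{\varepsilon}^{\dagger}\boldsymbol{v}_{J}=0$. The argument $v^{\ast}$ is dropped for readability.

```latex
\begin{align*}
A_{\varepsilon,\alpha}\Bigl(A_{\varepsilon}^{\dagger}-\tfrac{1}{\alpha}\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}\Bigr)
&=\bigl(A_{\varepsilon}-\alpha\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}\bigr)
  \Bigl(A_{\varepsilon}^{\dagger}-\tfrac{1}{\alpha}\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}\Bigr)\\
&=A_{\varepsilon}A_{\varepsilon}^{\dagger}
  -\tfrac{1}{\alpha}\underbrace{A_{\varepsilon}\boldsymbol{v}_{J}}_{=0}\boldsymbol{v}_{J}^{T}
  -\alpha\boldsymbol{v}_{J}\underbrace{\boldsymbol{v}_{J}^{T}A_{\varepsilon}^{\dagger}}_{=0}
  +\boldsymbol{v}_{J}\underbrace{\boldsymbol{v}_{J}^{T}\boldsymbol{v}_{J}}_{=1}\boldsymbol{v}_{J}^{T}\\
&=P_{J}+\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}=I_{J},\\[4pt]
P_{J}\Bigl(A_{\varepsilon}^{\dagger}-\tfrac{1}{\alpha}\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}\Bigr)
&=P_{J}A_{\varepsilon}^{\dagger}-\tfrac{1}{\alpha}\underbrace{P_{J}\boldsymbol{v}_{J}}_{=0}\boldsymbol{v}_{J}^{T}
 =A_{\varepsilon}^{\dagger}.
\end{align*}
```

The same computation on the right, using the symmetry of $A_{\varepsilon}^{\dagger}$, gives $\bigl(A_{\varepsilon}^{\dagger}-\tfrac{1}{\alpha}\boldsymbol{v}_{J}\boldsymbol{v}_{J}^{T}\bigr)P_{J}=A_{\varepsilon}^{\dagger}$, which yields the sandwich identity for the asymptotic covariance.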

Proof of Theorem 3.5. We now focus our attention on the Sinkhorn divergence estimator $\widehat{W}_{n}$ which can be separated into two terms,

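$\widehat{W}_{n}=\frac{1}{n}\sum_{k=1}^{n}H_{\varepsilon}(\widehat{V}_{k-1})+\frac{1}{n}\sum_{k=1}^{n}\xi_{k}\qquad\text{(B.10)}$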

where, for all $n\geq 1$ , $\xi_{n}=h_{\varepsilon}(X_{n},\widehat{V}_{n-1})-H_{\varepsilon}(\widehat{V}_{n-1})$ . On the one hand, it follows from (2.9) and (2.10) that $H_{\varepsilon}$ is a continuous function from ${\mathbb{R}}^{J}$ to ${\mathbb{R}}$ such that $H_{\varepsilon}(v^{\ast})=W_{\varepsilon}(\mu,\nu)$ . Consequently, we immediately deduce from (3.12) and the Cesàro mean convergence theorem [21] that

[TABLE]

On the other hand, denote $M_{n}=\sum_{k=1}^{n}\xi_{k}.$ It is not hard to see that $(M_{n})$ is a locally square integrable real martingale. Its predictable quadratic variation is given by

[TABLE]

We deduce once again from the Cesàro mean convergence theorem that

[TABLE]

where the asymptotic variance $\sigma^{2}_{\varepsilon}(\mu,\nu)={\mathbb{E}}\bigl[h_{\varepsilon}^{2}(X,v^{\ast})\bigr]-W_{\varepsilon}^{2}(\mu,\nu)$. Note that, using the convexity of the function $u\mapsto-\log(u)$ and the positivity of the cost $c$, it can be easily shown that, for all $x\in\mathcal{X}$ and $v\in{\mathbb{R}}^{J}$,

[TABLE]

and thus the integrability condition (3.29) combined with the upper bound (B.13) ensures that $\sigma^{2}_{\varepsilon}(\mu,\nu)$ is finite. Hence, we obtain from (B.12) together with the strong law of large numbers for martingales given, e.g. by Theorem 1.3.24 of Duflo [21] that

[TABLE]

Therefore, the almost sure convergence (3.30) clearly follows from the conjunction of (B.10), (B.11) and (B.14). It remains to prove the asymptotic normality (3.31). For that purpose, denote

[TABLE]

We have from (B.10) the decomposition

[TABLE]

We claim that

[TABLE]

while the second term in the right-hand side of equality (B.16) goes to zero almost surely. As a matter of fact, we already saw from (B.12) that

[TABLE]

Consequently, in order to apply the central limit theorem for martingales given e.g. by Corollary 2.1.10 in [21], it is only necessary to check that Lindeberg’s condition is satisfied, that is for all $\eta>0$ ,

[TABLE]

We have for all $\eta>0$ ,

[TABLE]

which implies that

[TABLE]
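For the reader's convenience, here is a sketch of the standard argument that reduces Lindeberg's condition to a fourth-moment bound; the exact displays of the paper may be arranged differently. On the event $\{|\xi_{k}|\geq\eta\sqrt{n}\}$ one has $1\leq\xi_{k}^{2}/(\eta^{2}n)$, so that

```latex
\begin{align*}
\frac{1}{n}\sum_{k=1}^{n}{\mathbb{E}}\bigl[\xi_{k}^{2}\,\mathbf{1}_{\{|\xi_{k}|\geq\eta\sqrt{n}\}}\,\big|\,{\mathcal{F}}_{k-1}\bigr]
&\leq\frac{1}{\eta^{2}n^{2}}\sum_{k=1}^{n}{\mathbb{E}}\bigl[\xi_{k}^{4}\,\big|\,{\mathcal{F}}_{k-1}\bigr]
=\frac{1}{\eta^{2}n}\Bigl(\frac{1}{n}\sum_{k=1}^{n}{\mathbb{E}}\bigl[\xi_{k}^{4}\,\big|\,{\mathcal{F}}_{k-1}\bigr]\Bigr).
\end{align*}
```

Hence, as soon as the Cesàro mean of the conditional fourth moments has a finite almost sure limit, the right-hand side goes to zero almost surely.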

However, we find from the Cesàro mean convergence theorem that

[TABLE]

where $\Lambda_{\varepsilon}(v^{\ast})=\int_{\mathcal{X}}h_{\varepsilon}^{4}(x,v^{\ast})d\mu(x)+\bigl{(}W_{\varepsilon}(\mu,\nu)\bigr{)}^{4}.$ Using once again the upper bound (B.13), it can be checked that the above limiting value is finite thanks to the integrability condition (3.29). Consequently, the above convergence ensures that Lindeberg’s condition is clearly satisfied which leads to the asymptotic normality (B.17). Furthermore, we have from (B.15) that

[TABLE]

However, we saw from (A.7) that $H_{\varepsilon}(v^{\ast})-H_{\varepsilon}(v)\leq\frac{1}{2\varepsilon}\|v-v^{\ast}\|^{2}$ for any $v\in{\mathbb{R}}^{J}$ . Hence, we obtain from (B.19) that

[TABLE]

On the one hand, if the step $\gamma_{n}=\gamma/n$ where $\gamma>0$ satisfies (3.16), we deduce from (3.19) that $\lim_{n\rightarrow\infty}\frac{1}{\log n}\sum_{k=1}^{n}\bigl\|\widehat{V}_{k}-v^{\ast}\bigr\|^{2}=\gamma\,\text{Tr}(\Sigma^{\ast})$ a.s. On the other hand, if the step $\gamma_{n}=\gamma/n^{c}$ where $\gamma>0$ and $1/2<c<1$, we also have from (3.22) that $\lim_{n\rightarrow\infty}\frac{1}{n^{1-c}}\sum_{k=1}^{n}\bigl\|\widehat{V}_{k}-v^{\ast}\bigr\|^{2}=\frac{\gamma}{1-c}\text{Tr}(\Sigma^{\ast})$ a.s. In both cases, it follows from (B.20) that

[TABLE]

Finally, we obtain from (B.16) together with (B.17) and (B.21) the asymptotic normality (3.31), which completes the proof of Theorem 3.5. ∎
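The recursive estimator studied in this proof can be sketched as follows. The code assumes, without restating the definitions of the main text, that $\widehat{W}_{n}$ is the running mean of $h_{\varepsilon}(X_{k},\widehat{V}_{k-1})$ and that $h_{\varepsilon}(x,v)=\sum_{j}v_{j}\nu_{j}-\varepsilon\log\bigl(\sum_{j}\nu_{j}\exp((v_{j}-c(x,y_{j}))/\varepsilon)\bigr)-\varepsilon$, which is the usual semi-dual integrand of entropic optimal transport. The function names, the step $\gamma/n^{c}$ and the example are hypothetical.

```python
import math
import random


def h_and_weights(x, v, ys, nu, eps, cost):
    """Return h_eps(x, v) (assumed form) and the softmax weights pi(x, v)."""
    logits = [math.log(nj) + (vj - cost(x, yj)) / eps
              for vj, yj, nj in zip(v, ys, nu)]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    total = sum(w)
    log_sum_exp = m + math.log(total)
    h = sum(vj * nj for vj, nj in zip(v, nu)) - eps * log_sum_exp - eps
    return h, [wi / total for wi in w]


def recursive_sinkhorn_estimate(sample_x, ys, nu, eps, alpha, gamma, c,
                                n_iter, cost, seed=0):
    """Running mean of h_eps(X_n, V_{n-1}) along a Robbins-Monro sequence."""
    rng = random.Random(seed)
    J = len(ys)
    v = [0.0] * J
    w_hat = 0.0
    for n in range(1, n_iter + 1):
        x = sample_x(rng)
        h, pi = h_and_weights(x, v, ys, nu, eps, cost)  # evaluated at V_{n-1}
        w_hat += (h - w_hat) / n                         # recursive mean update
        mean_v = sum(v) / J
        step = gamma / n ** c
        v = [vj + step * (nj - pj - alpha * mean_v)
             for vj, nj, pj in zip(v, nu, pi)]
    return w_hat, v


# Degenerate sanity check: one target point y = 0 and X = 2 almost surely,
# so that h_eps(x, v) = c(x, y) - eps = 4 - 0.5 for the squared cost.
w_hat, _ = recursive_sinkhorn_estimate(
    sample_x=lambda rng: 2.0, ys=[0.0], nu=[1.0],
    eps=0.5, alpha=1.0, gamma=1.0, c=0.75, n_iter=100,
    cost=lambda x, y: (x - y) ** 2,
)
assert abs(w_hat - 3.5) < 1e-9
```

The estimator is updated with the iterate from the previous step, which is what makes the sequence $(\xi_{n})$ a martingale difference sequence in the decomposition above.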

Proof of Theorem 3.6. The proof of Theorem 3.6 is divided into three steps. The first one deals with a crude bound of the moments of

[TABLE]

The second step is devoted to recursive inequalities involving ${\mathbb{E}}[\Delta_{n}]$ and ${\mathbb{E}}[\Delta_{n}^{2}]$ , while the last step completes the proof of Theorem 3.6. A key ingredient in the proof comes from inequality (A.7) which implies that

[TABLE]

Step 1. Bounding the moments. We claim that for any integer $p\geq 1$, there exists a positive constant $C_{p}$ such that

[TABLE]

We first prove the crude bound (B.23) for $p=1$ . It follows from (B.1) that for all $n\geq 0$ ,

[TABLE]

However, we clearly obtain from inequality (B.4) that

[TABLE]

Consequently, it is not hard to see from (B.24) and the Cauchy-Schwarz inequality that $\Delta_{n}$ is integrable which immediately implies that $\widehat{V}_{n}-v^{\ast}$ is also square integrable. Hence, by taking the conditional expectation on both sides of (B.24), we find from (B.2) and (B.3) that for all $n\geq 0$ ,

[TABLE]

Therefore, by taking the expectation on both sides of the above inequality, we obtain that ${\mathbb{E}}[\Delta_{n+1}]\leq{\mathbb{E}}[\Delta_{n}]+8\gamma^{2}_{n+1}.$ Hence, we infer from (3.7) that

[TABLE]

which shows that (B.23) holds for $p=1$ . Let us now consider the case $p=2$ . It follows from the decomposition (B.24) together with (B.25) and the Cauchy-Schwarz inequality that for all $n\geq 0$ ,

[TABLE]

Hence, by taking the conditional expectation on both sides of the above inequality, we deduce from (B.2) and (B.3) that for all $n\geq 0$ ,

[TABLE]

As before, by taking the expectation on both sides of the above inequality, we obtain that

[TABLE]

leading, via (3.7), to (B.23) for $p=2$ . The proof for the general case $p\geq 3$ is left to the reader inasmuch as it follows the same lines as the one for $p=2$ .

Step 2. Obtaining two recursive inequalities involving ${\mathbb{E}}[\Delta_{n}]$ and ${\mathbb{E}}[\Delta_{n}^{2}]$ . By inserting inequality (B.25) into (B.24), we find that for all $n\geq 0$ ,

[TABLE]

Consequently, by taking the conditional expectation on both sides of (B.26), we obtain from (B.2) that for all $n\geq 0$ ,

[TABLE]

Denote by $\delta_{n}$ the approximation error term when linearizing the gradient of $H_{\varepsilon}$ , $\delta_{n}=\nabla H_{\varepsilon}(\widehat{V}_{n})-\nabla^{2}H_{\varepsilon}(v^{\ast})(\widehat{V}_{n}-v^{\ast}).$ We have the decomposition

[TABLE]

On the one hand, we saw that $(\widehat{V}_{n}-v^{\ast})^{T}\nabla^{2}H_{\varepsilon}(v^{\ast})(\widehat{V}_{n}-v^{\ast})\leq-\rho^{\ast}\bigl{\|}\widehat{V}_{n}-v^{\ast}\bigr{\|}^{2}$ which implies that $\langle\widehat{V}_{n}-v^{\ast},\nabla H_{\varepsilon}(\widehat{V}_{n})\rangle\leq-\rho^{\ast}\Delta_{n}+\langle\widehat{V}_{n}-v^{\ast},\delta_{n}\rangle.$ On the other hand, it follows from inequality (A.10) that $\|\delta_{n}\|\leq\frac{\Lambda_{\varepsilon}}{\varepsilon\sqrt{2}}\Delta_{n}.$ Hence, we deduce from Young’s inequality on the product of positive real numbers that

[TABLE]
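A bound of this type can be derived as follows. This is a sketch under the two estimates stated just above, using the Cauchy-Schwarz inequality and Young's inequality in the form $ab\leq\frac{\rho^{\ast}}{2}a^{2}+\frac{1}{2\rho^{\ast}}b^{2}$ with $a=\Delta_{n}^{1/2}$ and $b=\frac{\Lambda_{\varepsilon}}{\varepsilon\sqrt{2}}\Delta_{n}$. The constants in (B.29) may be arranged differently.

```latex
\begin{align*}
\langle\widehat{V}_{n}-v^{\ast},\delta_{n}\rangle
&\leq\bigl\|\widehat{V}_{n}-v^{\ast}\bigr\|\,\|\delta_{n}\|
 \leq\Delta_{n}^{1/2}\,\frac{\Lambda_{\varepsilon}}{\varepsilon\sqrt{2}}\,\Delta_{n}
 \leq\frac{\rho^{\ast}}{2}\Delta_{n}+\frac{\Lambda_{\varepsilon}^{2}}{4\varepsilon^{2}\rho^{\ast}}\Delta_{n}^{2},\\[4pt]
\langle\widehat{V}_{n}-v^{\ast},\nabla H_{\varepsilon}(\widehat{V}_{n})\rangle
&\leq-\rho^{\ast}\Delta_{n}+\langle\widehat{V}_{n}-v^{\ast},\delta_{n}\rangle
 \leq-\frac{\rho^{\ast}}{2}\Delta_{n}+\frac{\Lambda_{\varepsilon}^{2}}{4\varepsilon^{2}\rho^{\ast}}\Delta_{n}^{2}.
\end{align*}
```

The linear term keeps a negative coefficient, which drives the contraction in the recursion, while the price to pay is a quadratic term in $\Delta_{n}$ that requires a separate control of ${\mathbb{E}}[\Delta_{n}^{2}]$.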

Hereafter, inserting (B.29) into (B.27), we find that for all $n\geq 0$ ,

[TABLE]

By taking the expectation on both sides of (B.30), we obtain that

[TABLE]

We shall now derive a second recursive inequality for ${\mathbb{E}}[\Delta_{n}^{2}]$ . Using once again the upper bound (B.26) together with (B.25) and the Cauchy-Schwarz inequality, we obtain that

[TABLE]

Let $(a_{n})$ be an increasing sequence of positive real numbers tending to infinity as $n$ goes to infinity, such that for all $n\geq 0$ , $a_{n}\geq 1$ . On the event $A_{n}=\{\Delta_{n}\leq a_{n}^{2}\}$ , it follows from inequality (A.11) that for all $n\geq 0$ ,

[TABLE]

where

[TABLE]

Hence, we deduce from (B.32) that for all $n\geq 0$ ,

[TABLE]

Therefore, by taking the expectation on both sides of the above inequality, we obtain a second recursive inequality

[TABLE]

Step 3. Combining the recursive inequalities involving ${\mathbb{E}}[\Delta_{n}]$ and ${\mathbb{E}}[\Delta_{n}^{2}]$ . By successively using the Cauchy-Schwarz and Markov inequalities, we have for any integer $p\geq 1$ ,

[TABLE]
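The generic form of this argument is recalled below as a sketch, for a non-negative random variable $\Delta$ with finite moments and a threshold $a>0$. The exact powers appearing in (B.34) are not reproduced here and may differ.

```latex
\begin{align*}
{\mathbb{E}}\bigl[\Delta^{2}\,\mathbf{1}_{\{\Delta>a^{2}\}}\bigr]
&\leq\bigl({\mathbb{E}}[\Delta^{4}]\bigr)^{1/2}\bigl({\mathbb{P}}(\Delta>a^{2})\bigr)^{1/2}
 &&\text{(Cauchy-Schwarz)}\\
&\leq\bigl({\mathbb{E}}[\Delta^{4}]\bigr)^{1/2}\Bigl(\frac{{\mathbb{E}}[\Delta^{p}]}{a^{2p}}\Bigr)^{1/2}
 =\frac{\bigl({\mathbb{E}}[\Delta^{4}]\,{\mathbb{E}}[\Delta^{p}]\bigr)^{1/2}}{a^{p}}
 &&\text{(Markov)}.
\end{align*}
```

Combined with uniform moment bounds such as (B.23), the contribution of the complement of $A_{n}$ is therefore of order $a_{n}^{-p}$, which can be made as small as needed by taking $p$ large.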

Hereafter, since $\gamma_{n}=\gamma/n^{c}$ where $1/2<c<1$ , it seems convenient to choose

[TABLE]

For this particular choice and thanks to condition (3.35), one has $a_{n}\geq 1$ for any $n\geq 0$, and the first term on the right-hand side of (B.33) is always non-negative. Moreover, using the upper bound (B.34), we finally obtain that there exists a positive constant $B_{2}$ such that for all $n\geq 0$,

[TABLE]

Consequently, by choosing the integer $p$ such that $p(1-c)\geq 2c$ , we immediately obtain from (B.35) that for all $n\geq 0$ ,

[TABLE]

where $b_{2}=2B_{2}$ . Therefore, as $\varepsilon\leq 1$ and $1<2c<2$ , it follows from Lemma A.3 that there exists a positive constant $D_{2}$ such that for any $n\geq 1$ ,

[TABLE]

Furthermore, by inserting inequality (B.37) into (B.31), we obtain that

[TABLE]

where $a=\rho^{\ast}\gamma\leq 1$ thanks to condition (3.35), and $B_{1}=\frac{4^{c-1}\gamma\Lambda^{2}_{\varepsilon}}{\varepsilon^{2}\rho^{\ast}}D_{2}.$ Since $c<1$ , we clearly have $3c-1<2c$ . Hence, we infer from (B.38) that there exists a positive constant $b_{1}$ such that for all $n\geq 0$ ,

[TABLE]

which is a recursive inequality of the same form as (B.36). By choosing $2/3<c<1$, one can clearly see that all the assumptions of Lemma A.3 are satisfied. Consequently, we deduce from (A.34) that there exists a positive constant $D_{1}$ such that for any $n\geq 1$,

[TABLE]

Finally, inserting (B.40) into (B.22) completes the proof of Theorem 3.6. ∎

Bibliography

[1] Abid, B. K., and Gower, R. M. Greedy stochastic algorithms for entropy-regularized optimal transport problems. In Proceedings of the 21th International Conference on Artificial Intelligence and Statistics (Lanzarote, Spain, Apr. 2018).
[2] Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA (2017), pp. 1961–1971.
[3] Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proceedings of the 31st International Conference on Neural Information Processing Systems (USA, 2017), NIPS’17, Curran Associates Inc., pp. 1961–1971.
[4] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (2017), pp. 214–223.
[5] Bach, F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics 4 (2010), 384–414.
[6] Bach, F. R. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research 15, 1 (2014), 595–627.
[7] Bach, F. R., and Moulines, E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain. (2011), pp. 451–459.
[8] Bigot, J., Cazelles, E., and Papadakis, N. Central limit theorems for Sinkhorn divergence between probability distributions on finite spaces and statistical applications. Preprint - arXiv:1711.08947, Nov. 2017.
