Statistical inference for Bures-Wasserstein barycenters
Alexey Kroshnin, Vladimir Spokoiny, Alexandra Suvorikova

TL;DR
This paper introduces the Bures-Wasserstein barycenter as a Fréchet mean of positive semi-definite operators, exploring its existence, uniqueness, and statistical properties with applications to quantum mechanics.
Contribution
It defines the Bures-Wasserstein barycenter for operators, establishes conditions for its existence and uniqueness, and analyzes its convergence and concentration properties in statistical inference.
Findings
Existence and uniqueness conditions for Bures-Wasserstein barycenters.
Convergence and concentration results for empirical barycenters.
Connections to optimal transportation and quantum mechanics applications.
Alexey Kroshnin
Institute for Information Transmission Problems RAS
National Research University Higher School of Economics
Université Claude Bernard Lyon 1
Vladimir Spokoiny
Weierstrass Institute for Applied Analysis and Stochastics
Alexandra Suvorikova
University of Potsdam
Abstract
In this work we introduce the concept of Bures-Wasserstein barycenter $Q_*$, which is essentially a Fréchet mean of some distribution $\mathbb{P}$ supported on a subspace of positive semi-definite Hermitian operators $\mathbb{H}_+(d)$. We allow a barycenter to be constrained to some affine subspace of $\mathbb{H}_+(d)$ and provide conditions ensuring its existence and uniqueness. We also investigate convergence and concentration properties of an empirical counterpart of $Q_*$ in both Frobenius norm and Bures-Wasserstein distance, and explain how the obtained results are connected to optimal transportation theory and can be applied to statistical inference in quantum mechanics.
Keywords: Bures-Wasserstein barycenter, Wasserstein barycenter, Hermitian operators, central limit theorem, concentration.
1 Introduction
The space of finite-dimensional Hermitian operators provides a powerful toolbox for data representation. For instance, in quantum mechanics it is used for the mathematical description of physical properties of a quantum system, also known as observables. The reason is that measurements obtained in a physical experiment should be associated with real-valued quantities; Hermitian operators possess a real-valued spectrum and thus satisfy this requirement. The subspace of real-valued symmetric matrices is also of great interest: its points are widely used for the description of systems in engineering applications, medical studies, neural sciences, evolutionary biology, etc. Usually such data sets are considered to be randomly sampled from an unknown distribution $\mathbb{P}$ (Goodnight and Schwartz (1997); Calsbeek and Goodnight (2009); Álvarez-Esteban et al. (2015); del Barrio et al. (2017); Gonzalez et al. (2017)), and statistical characteristics of $\mathbb{P}$ such as, in particular, mean and variance are of interest for further planning of an experiment and for the analysis of obtained results for further development of natural science models. The current study focuses on the space of positive semi-definite Hermitian matrices and presents a possible approach to the analysis and aggregation of relevant statistical information from data sets for which the linearity assumption might be violated. In this case the classical Euclidean definitions of mean and variance are not sensitive enough to capture the effects of interest. This situation appears very often in multiple contexts. As an example one can consider a data set represented by probability measures which belong to the same scale-location family, e.g. some astronomic measurements, see Alvarez-Esteban et al. (2018), Example 4.6. The non-linearity requires the development of a novel toolbox suitable for further statistical analysis.
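In order to detect non-linear effects, we suggest endowing $\mathbb{H}_+(d)$ with the Bures-Wasserstein distance $d_{BW}$, which was recently introduced in the seminal paper Bhatia et al. (2018). It is defined as follows. For any pair of positive matrices $Q, S \in \mathbb{H}_+(d)$ it is written as:
$$d_{BW}^{2}(Q, S) = \operatorname{tr} Q + \operatorname{tr} S - 2 \operatorname{tr}\bigl(Q^{1/2} S Q^{1/2}\bigr)^{1/2}. \tag{1.1}$$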
It is worth noting that, when restricted to the space of symmetric positive definite matrices, the Bures-Wasserstein distance boils down to the classical 2-Wasserstein distance between measures that belong to the same scale-location family (see e.g. Agueh and Carlier (2011), Section 6 or Álvarez-Esteban et al. (2015)). A more detailed discussion of this particular choice of distance is presented in Section 2.1.
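As a numerical illustration of the distance above, the following minimal Python sketch evaluates $d_{BW}^2(Q, S) = \operatorname{tr} Q + \operatorname{tr} S - 2\operatorname{tr}(Q^{1/2} S Q^{1/2})^{1/2}$, the form given in Bhatia et al. (2018). The helper names `psd_sqrt` and `bw_distance_sq` are hypothetical and NumPy is an assumed dependency; this is not the authors' implementation.

```python
import numpy as np


def psd_sqrt(a):
    """Square root of a Hermitian positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # remove tiny negative eigenvalues caused by round-off
    return (v * np.sqrt(w)) @ v.conj().T


def bw_distance_sq(q, s):
    """Squared Bures-Wasserstein distance between PSD matrices q and s."""
    root_q = psd_sqrt(q)
    cross = psd_sqrt(root_q @ s @ root_q)
    value = np.trace(q).real + np.trace(s).real - 2.0 * np.trace(cross).real
    return max(float(value), 0.0)  # guard against a slightly negative round-off result


# Commuting (diagonal) case: the distance reduces to the sum of squared
# differences of square roots of eigenvalues, (1 - 2)^2 + (2 - 3)^2 = 2.
q = np.diag([1.0, 4.0])
s = np.diag([4.0, 9.0])
print(bw_distance_sq(q, s))
```

In the commuting case the value can be checked by hand, which makes diagonal matrices a convenient sanity test before moving to general Hermitian inputs.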
After choosing a proper distance, we are now ready to introduce the model of information aggregation whose statistical properties are investigated in the current study. Let $\mathbb{P}$ be a probability distribution supported on some set of positive semi-definite Hermitian matrices. Without loss of generality we further assume that $\mathbb{P}$ assigns positive probability to the intersection of its support with the space of positive definite Hermitian matrices, and that the spectrum of its elements is on average bounded away from infinity:
Assumption 1.
[TABLE]
Two statistically important characteristics of $\mathbb{P}$ are the Fréchet mean and the Fréchet variance. The former can be regarded as a typical representative of the data set at hand, whereas the latter appears in the analysis of data variability, see e.g. Del Barrio et al. (2015). We briefly recall both concepts below. For an arbitrary point $Q$ the Fréchet variance of $\mathbb{P}$ is defined as
[TABLE]
The classical Fréchet mean of $\mathbb{P}$ is the set of global minimizers of the Fréchet variance:
[TABLE]
However, in many cases we are interested in a minimizer that belongs to some affine subspace $\mathbb{A}$:
[TABLE]
For instance, such a necessity may arise when considering a random set of quantum density operators. For an introduction to the theory of density operators one may consult Fano (1957). This example is considered in more detail in Section 3.2. Note that the setting (1.3) covers the setting (1.2), so without loss of generality we further address only (1.3). The first crucial question concerns existence and uniqueness of the Fréchet mean. Positive answers to both questions, along with the necessary conditions, are presented in Theorem 2.1. This immediately allows us to define the global Fréchet variance of $\mathbb{P}$ as the value of the Fréchet variance at the Fréchet mean.
Given an i.i.d. sample $S_1, \dots, S_n$, $S_i \sim \mathbb{P}$, one constructs an empirical analogue of the Fréchet variance:
[TABLE]
An empirical Fréchet mean and a global empirical variance also exist and are unique:
[TABLE]
These facts follow from Theorem 2.1. This work studies the convergence of both estimators and investigates their concentration properties. The discussion of the practical applicability of the obtained results is postponed to Section 3. There we explain their relation to optimal transportation theory and present a possible application to statistical analysis in quantum mechanics.
1.1 Contribution of the present study
Central limit theorem and concentration of $Q_n$
The first main result of this study concerns asymptotic normality of the approximation error of population Fréchet mean by its empirical counterpart:
[TABLE]
where the arrow stands for weak convergence and the limiting covariance is an operator acting on the linear subspace associated with the affine subspace. From now on we use bold symbols to denote operators, whereas regular ones stand for matrices or vectors. This convergence result cannot be used directly for the construction of asymptotic confidence sets because it relies on the unknown covariance operator. However, Theorem 2.2 ensures that this covariance can be replaced by its empirical counterpart:
[TABLE]
where denotes an identity operator. Along with asymptotic normality of , we are interested in the limiting distribution of . Corollary 2.1 shows, that
[TABLE]
where is some normally distributed vector. Data-driven asymptotic confidence sets for are obtained by replacement of by its empirical counterpart :
[TABLE]
where the distance between the two laws is measured in a metric which induces weak convergence. Furthermore, we investigate concentration properties of $Q_n$ in both the Frobenius norm and the Bures-Wasserstein metric. The following two bounds hold with high probability:
[TABLE]
where stands for some generic constant. A more detailed discussion is presented in Theorem 2.3 and Corollary 2.2 respectively. It is worth noting that concentration results are obtained under assumption of sub-Gaussianity of in the following sense:
Assumption 2 (Sub-Gaussianity).
Let be sub-Gaussian, i.e.
[TABLE]
with some constants .
All the above-mentioned results are closely connected to convergence and concentration of empirical 2-Wasserstein barycenters. For the sake of transparency we postpone this discussion to Section 3.1.
CLT and concentration for the empirical Fréchet variance
We also show asymptotic normality of the approximation error of the Fréchet variance by its empirical counterpart and prove concentration of the latter:
[TABLE]
[TABLE]
where the latter result holds with high probability, and , are constants which depend on sub-Gaussianity parameters , dimension , and parameter . See Theorem 2.4 and Theorem 2.5 respectively.
The paper is organised as follows. Section 2 explains the obtained results in more detail. Section 3 illustrates the connection to other scientific problems. Finally, Section 4 contains simulations and experiments on both artificial and real data sets.
2 Results
This section presents the obtained results in more detail. The first question we address is the particular choice of the distance.
2.1 Bures-Wasserstein distance
The original Bures metric appears in quantum mechanics in relation to fidelity measure between two quantum states and is used for measurement of quantum entanglement Marian and Marian (2008); Dajka et al. (2011). Let , be two quantum states. Mathematically speaking, this means that
[TABLE]
Fidelity of these states is defined as . It quantifies “closeness” of and , see Jozsa (1994). It is obvious, that in case of (2.1) Bures-Wasserstein distance turns into
[TABLE]
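It is interesting to note that this distance is not only one of the central distances used in quantum mechanics, but also an object of extensive investigation in transportation theory, see Takatsu et al. (2011). Let $\mathcal{N}(0, Q)$ and $\mathcal{N}(0, S)$ be two centred Gaussian distributions. Then the squared 2-Wasserstein distance between them is written as
$$W_2^{2}\bigl(\mathcal{N}(0, Q), \mathcal{N}(0, S)\bigr) = \operatorname{tr} Q + \operatorname{tr} S - 2 \operatorname{tr}\bigl(Q^{1/2} S Q^{1/2}\bigr)^{1/2},$$
which coincides with the squared Bures-Wasserstein distance between $Q$ and $S$.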
The case of Gaussian measures is naturally extended to measures that belong to the same scale-location family, see Alvarez-Esteban et al. (2018). In the last few years the Wasserstein distance has attracted a lot of attention from the data science and machine learning communities, as it takes into account geometrical similarities between objects, see e.g. Gramfort et al. (2015); Flamary et al. (2018); Montavon et al. (2016). Due to this fact the Bures-Wasserstein distance satisfies the requirement of taking into account the non-linearity of the data set under consideration. For more information on optimal transportation theory we recommend Villani (2009).
Following Bhatia et al. (2018), we continue to investigate properties of the Bures-Wasserstein distance. The next lemma presents an alternative analytical expression for it.
Lemma 2.1.
Let and . Then (1.1) can be rewritten as
[TABLE]
where
[TABLE]
By $\bigl(S^{1/2}QS^{1/2}\bigr)^{-1/2}$ we denote the pseudo-inverse matrix $\Bigl(\bigl(S^{1/2}QS^{1/2}\bigr)^{1/2}\Bigr)^{+}$.
Note, that in optimal transportation theory is referred to as an optimal push-forward (optimal transportation map) between two centred normal distributions and . Following optimal transport notations it is denoted as .
For general notes on optimal transportation maps see Brenier (1991); for the particular case of scale-location and Gaussian families one may refer to Alvarez-Esteban et al. (2018); Takatsu et al. (2011). Lemma A.2 establishes differentiability of the optimal map $T_Q^S$, which is one of the key ingredients in the proofs of the main results of the present study. Note that in case of differentiability of is obtained in Rippl et al. (2016). More technical details on the properties of $d_{BW}$ are presented in Section A.2. However, for a better understanding of the proofs of the main results we highly recommend at least looking through Section A.1, which is dedicated to the properties of $T_Q^S$ and its differential $\bm{dT}_Q^S$.
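The relation between the optimal map and the distance can be checked numerically. The sketch below assumes that the map in (2.3) has the standard Gaussian form $T_Q^S = S^{1/2}\bigl(S^{1/2} Q S^{1/2}\bigr)^{-1/2} S^{1/2}$, which is consistent with the pseudo-inverse term in Lemma 2.1, and that the alternative expression of the lemma reads $\|(T_Q^S - I) Q^{1/2}\|_F^2$. Both forms are assumptions about the elided formulas, and the function names are hypothetical.

```python
import numpy as np


def psd_sqrt(a):
    """Square root of a Hermitian positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T


def optimal_map(q, s):
    """Assumed form T = S^{1/2} (S^{1/2} Q S^{1/2})^{-1/2} S^{1/2} with a pseudo-inverse."""
    root_s = psd_sqrt(s)
    middle = psd_sqrt(root_s @ q @ root_s)
    return root_s @ np.linalg.pinv(middle, hermitian=True) @ root_s


# Two random positive definite matrices (illustrative data only).
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 3))
b = rng.standard_normal((3, 3))
q = a @ a.T + np.eye(3)
s = b @ b.T + np.eye(3)

t = optimal_map(q, s)

# The map pushes N(0, Q) to N(0, S): T Q T should reproduce S.
push_error = np.linalg.norm(t @ q @ t - s)

# Transport-cost form of the squared distance versus the trace form (1.1).
cost_form = np.linalg.norm((t - np.eye(3)) @ psd_sqrt(q), "fro") ** 2
root_q = psd_sqrt(q)
trace_form = np.trace(q) + np.trace(s) - 2.0 * np.trace(psd_sqrt(root_q @ s @ root_q))

print(push_error, cost_form, trace_form)
```

For positive definite inputs the pseudo-inverse is an ordinary inverse; it only matters when $S^{1/2} Q S^{1/2}$ is singular.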
2.2 Existence and uniqueness of $Q_*$ and $Q_n$
Along with the investigation of properties of the distance at hand and before moving to more general statistical questions, one should ask whether the Fréchet mean exists and, if so, whether it is unique. Let $\mathbb{M}$ be the linear subspace of $\mathbb{H}(d)$ associated with $\mathbb{A}$, i.e. $\mathbb{A}$ is a shift of $\mathbb{M}$ by some fixed element of $\mathbb{H}(d)$. We further assume that $\mathbb{A}$ has a non-empty intersection with the space of positive definite operators:
Assumption 3.
$\mathbb{A} \cap \mathbb{H}_{++}(d) \neq \emptyset$.
The next theorem ensures existence and uniqueness of the Fréchet mean (1.3).
Theorem 2.1 (Existence and uniqueness of the Fréchet mean).
Under Assumptions 1 and 3 there exists a unique positive-definite barycenter of : . Moreover, it is characterised as the unique solution of the equation
[TABLE]
where is the orthogonal projector onto .
Note, that this result generalises the result for scale-location families in 2-Wasserstein space, presented in Álvarez-Esteban et al. (2015), Theorem 3.10 and originally obtained in a seminal work Agueh and Carlier (2011), Theorem 6.1. Namely, if , then exists, is unique, and is characterised as the unique solution of a fixed-point equation similar to (2.4)
[TABLE]
Existence, uniqueness, and measurability of the estimator defined in (1.4) are direct corollaries of the above theorem. The proof of Theorem 2.1 is presented in Section A.3.
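For the unconstrained case, the fixed-point characterisation $Q = \frac{1}{n}\sum_i \bigl(Q^{1/2} S_i Q^{1/2}\bigr)^{1/2}$ suggests a simple numerical scheme. The sketch below uses the well-known fixed-point iteration for barycenters of Gaussian and scale-location families, $Q \mapsto Q^{-1/2}\bigl(\frac{1}{n}\sum_i (Q^{1/2} S_i Q^{1/2})^{1/2}\bigr)^2 Q^{-1/2}$. This choice of algorithm is an assumption made for illustration: the source does not state how barycenters were computed, the function names are hypothetical, and the constrained case (1.3) is not covered.

```python
import numpy as np


def psd_sqrt(a):
    """Square root of a Hermitian positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.conj().T


def bw_barycenter(mats, n_iter=200, tol=1e-10):
    """Fixed-point iteration for the unconstrained empirical barycenter.

    Assumes the running iterate stays positive definite, which holds when
    at least one input matrix is positive definite.
    """
    q = np.mean(mats, axis=0)  # Euclidean mean as the starting point
    for _ in range(n_iter):
        root = psd_sqrt(q)
        root_inv = np.linalg.inv(root)
        mean_root = np.mean([psd_sqrt(root @ s @ root) for s in mats], axis=0)
        q_new = root_inv @ mean_root @ mean_root @ root_inv
        q_new = (q_new + q_new.conj().T) / 2.0  # re-symmetrise against round-off
        converged = np.linalg.norm(q_new - q) <= tol * np.linalg.norm(q)
        q = q_new
        if converged:
            break
    return q


def fixed_point_residual(q, mats):
    """Frobenius norm of Q - mean_i (Q^{1/2} S_i Q^{1/2})^{1/2}."""
    root = psd_sqrt(q)
    mean_root = np.mean([psd_sqrt(root @ s @ root) for s in mats], axis=0)
    return np.linalg.norm(q - mean_root)


# Illustrative sample of five random positive definite 3 x 3 matrices.
rng = np.random.default_rng(0)
sample = []
for _ in range(5):
    g = rng.standard_normal((3, 3))
    sample.append(g @ g.T + np.eye(3))

q_n = bw_barycenter(sample)
residual = fixed_point_residual(q_n, sample)
print(residual)
```

For commuting matrices the iteration stops after one step, since the barycenter is then the square of the averaged square roots; for instance, the barycenter of `diag(1, 4)` and `diag(9, 16)` is `diag(4, 9)`.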
2.3 Convergence of and
Armed with the properties established above, we are now ready to present the main results of the current study. Theorem 2.2 describes the asymptotic behaviour of the empirical Fréchet mean around its population counterpart.
Theorem 2.2 (Central limit theorem for the Fréchet mean).
Under Assumptions 1 and 3 an approximation error rate of the Fréchet mean by its empirical counterpart is
[TABLE]
where is a self-adjoint linear operator acting from to defined in (A.7). Moreover, if is non-degenerate, then
[TABLE]
where is a data-driven empirical counterpart of defined in (A.8).
Remark 1.
Here denotes a restriction of a quadratic form to a subspace :
[TABLE]
We intentionally postpone the explicit definitions of these two operators, as they require the introduction of many technical details, which would make the description of the main results less transparent. The proof of the theorem relies on the Fréchet differentiability of $T_Q^S$ in the vicinity of $Q_*$:
[TABLE]
where $\bm{dT}_{Q_*}^S$ is the differential of $T_Q^S$ at the point $Q_*$. Here we mean differentiability of $T_Q^S$ with respect to the lower argument $Q$.
It is worth noting that the result (B) of the CLT enables the construction of data-driven asymptotic confidence sets. However, technical problems may arise with the inversion of the empirical covariance. For instance, numerical simulations show that the empirical covariance can be degenerate if $\mathbb{P}$ is supported on a set of diagonal matrices. This immediately raises the question of developing some other methodology for confidence set construction based on re-sampling techniques, which would simplify the process from the computational point of view. We consider this a subject for further research.
Since the Bures-Wasserstein distance is the main tool for the analysis in $\mathbb{H}_+(d)$, the convergence properties of $d_{BW}(Q_n, Q_*)$ are also of great interest. The next result is an almost straightforward corollary of Theorem 2.2.
Corollary 2.1 (Asymptotic distribution of $d_{BW}(Q_n, Q_*)$).
Under conditions of Theorem 2.2 it holds
[TABLE]
where is random matrix, s.t. . Moreover,
[TABLE]
where and .
To illustrate the result, we consider the case of diagonal . This setting allows us to write down the explicit form of the limiting distribution. If , then right-hand side of the above corollary for -case is:
[TABLE]
where . All proofs are collected in Section A.3. Section 4 illustrates asymptotic behaviour of and on both artificial and real data sets.
2.4 Concentration of $Q_n$
The next important issue is the concentration properties of $Q_n$ under the sub-Gaussianity assumption (Assumption 2).
Theorem 2.3 (Concentration of $Q_n$).
Let
[TABLE]
It holds under Assumptions 2 and 3, that
[TABLE]
for any and , where
[TABLE]
operator is defined in (B.3), comes from Proposition 1, and and are defined in Lemma B.3.
Concentration of $Q_n$ in the Bures-Wasserstein distance is a corollary of the above theorem.
Corollary 2.2 (Concentration of $Q_n$ in $d_{BW}$ distance).
Under conditions of Theorem 2.3 the following result holds
[TABLE]
Proofs are collected in Section B.
2.5 Central limit theorem and concentration for the empirical Fréchet variance
In this section we investigate properties of the empirical Fréchet variance defined in (1.4). The next theorem presents a central limit theorem for it.
Theorem 2.4 (Central limit theorem for the empirical Fréchet variance).
Let be s.t. and . Then
[TABLE]
The last important result of the current study describes concentration properties of the empirical Fréchet variance.
Theorem 2.5 (Concentration of the empirical Fréchet variance).
Let Assumption 2 be fulfilled. Then under conditions of Theorem 2.4 it holds:
[TABLE]
with
[TABLE]
There are parameters of sub-exponential r.v. .
Proofs of these two theorems are collected in Section B.1.
3 Connection to other problems
In this section we explain the connection of obtained results to some other problems. Section 3.1 investigates the relation between Bures-Wasserstein barycenter and 2-Wasserstein barycenter of some scale-location family. Section 3.2 illustrates the idea of search of a barycenter on some affine subspace .
3.1 Connection to scale-location families of measures
We first present the concept of a scale-location family of absolutely continuous measures supported on .
Definition 3.1.
Let be a random variable that follows law : , where is a set of all continuous measures with finite second moment. A set of all affine transformations of is
[TABLE]
It is referred to as a scale-location family.
Scale-location families attract a lot of attention in modern data analysis and appear in many practical applications, as this concept is convenient for theoretical analysis and, at the same time, possesses very high modelling power. For example, it is widely used in medical imaging Wassermann et al. (2010), modelling of molecular dynamics Gonzalez et al. (2017), clustering procedures del Barrio et al. (2017), climate modelling Mallasto and Feragen (2017), embedding of complex objects in low-dimensional spaces Muzellec and Cuturi (2018), and so on.
A possible metric that takes into account the non-linearity of the underlying data set is the 2-Wasserstein distance. Let be elements of and let , . We denote their first and second moments as
[TABLE]
It is a well-known fact, that in case of scale-location families depends only on the first and second moments of observed measures:
[TABLE]
For more details on the general class of optimal transportation distances we recommend the excellent books Ambrosio and Gigli (2013) or Villani (2009).
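For Gaussian members of a scale-location family, the dependence on the first two moments takes the well-known closed form $W_2^2 = \|m_1 - m_2\|^2 + d_{BW}^2(S_1, S_2)$, with mean vectors $m_i$ and covariance matrices $S_i$. The sketch below evaluates it; the notation $m_i$, $S_i$ and the function names are chosen here for illustration and need not match the symbols of the elided display.

```python
import numpy as np


def psd_sqrt(a):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T


def w2_sq_gaussian(m1, s1, m2, s2):
    """Squared 2-Wasserstein distance between N(m1, s1) and N(m2, s2)."""
    root = psd_sqrt(s1)
    cross = psd_sqrt(root @ s2 @ root)
    bures = np.trace(s1) + np.trace(s2) - 2.0 * np.trace(cross)
    shift = np.sum((m1 - m2) ** 2)
    return float(shift + max(bures, 0.0))


# One-dimensional check: N(0, 1) versus N(3, 4).
# Location part 3^2 = 9, scale part (1 - 2)^2 = 1, so the total is 10.
m1, s1 = np.array([0.0]), np.array([[1.0]])
m2, s2 = np.array([3.0]), np.array([[4.0]])
value = w2_sq_gaussian(m1, s1, m2, s2)
print(value)
```

The location and scale contributions separate, which is why the barycenter problem for such families reduces to averaging the means and solving a Bures-Wasserstein barycenter problem for the covariances.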
Distribution over scale-location family
In many cases we are interested in scale-location families generated at random. Let be a probability measure supported on some . And let be a generic probability space, s.t. for any there exists an image in : $\mu_{\omega} \stackrel{\operatorname{def}}{=} \mathcal{L}\bigl(P_{\omega}X + p_{\omega}\bigr)$, where $P_{\omega}$ is a scaling parameter and $p_{\omega}$ is a shift parameter. A randomly sampled measure belongs to by construction, and its first and second moments are written as
[TABLE]
where denote the first and the second moments of .
Fréchet variance of at any arbitrary point is written as
[TABLE]
Given an i.i.d. sample from , we define an empirical analogue of :
[TABLE]
Then population and empirical barycenters and are
[TABLE]
Note, that and belong to and are uniquely characterised by their first and second moments and respectively, see e.g. Theorem 3.10 Álvarez-Esteban et al. (2015):
[TABLE]
[TABLE]
It is worth noting that the concept of Wasserstein barycenter, originally presented in the seminal work by Agueh and Carlier (2011), has become a topic of extensive scientific interest in the last few years. The work Bigot et al. (2012) focuses on convergence of a parametric class of barycenters, while Bigot et al. (2017) investigate asymptotic properties of regularised barycenters. The most general results on convergence of empirical barycenters are obtained in Ahidar-Coutrix et al. (2018). This work provides rates of convergence for empirical barycenters of a Borel probability measure on a metric space either under a weak curvature constraint on the underlying space or for the case of a non-negatively curved space on which geodesics emanating from a barycenter can be extended. Theorem 2.2 specifies the results obtained in Ahidar-Coutrix et al. (2018) for the case of scale-location families. Corollary 2.1 partially answers an (implicit) question raised by the work Le Gouic and Loubes (2015) concerning the rate of convergence of . Namely, for the case of scale-location families it is of order . However, the above-mentioned work covers only the case (1.2). The paper Kroshnin (2018) obtains an analogue of the law of large numbers for the case of arbitrary cost functions for barycenters on some affine subspace (1.3). A result similar in spirit to Theorem 2.5 is obtained in Del Barrio et al. (2016). However, there the authors consider only the space of probability measures supported on the real line (i.e. ) endowed with the 2-Wasserstein distance. To the best of our knowledge, there are no results similar to the concentration Theorem 2.3 and Theorem 2.2 in the case of the 2-Wasserstein distance.
3.2 Connection to quantum mechanics
This section illustrates the idea of a barycenter restricted to some affine subspace. We first briefly recall the concept of quantum densities. The quantum density operator is used in quantum mechanics as a possible description of the statistical state of a quantum system. It might be considered as an analogue of a phase-space density in classical statistical mechanics. The formalism was introduced by John von Neumann in 1927. In essence, a density matrix is a Hermitian positive semi-definite operator with unit trace.
Given a random ensemble of density matrices, one is able to recover the mean using averaging in the classical Euclidean sense. However, the Bures-Wasserstein barycenter suggests an alternative way to define the “most typical” representative (1.3) in terms of the fidelity measure (2.2). We consider the following statistical setting. Let be some mechanism which generates quantum states . Given an i.i.d. sample we write the population and empirical variance of as
[TABLE]
Then population and empirical barycenters in the class of all -dimensional density operators are defined as
[TABLE]
It can be easily shown that by “taking the global Fréchet barycenter”, or in other words neglecting the unit-trace condition, we end up with the global barycenter, which is the solution of the fixed-point equation already mentioned in Section 2: . This is a contraction mapping. Thus and is not a density operator. In other words, the unit-trace condition ensures that the population and empirical barycenters also belong to the class of density operators. Taking into account the results obtained in Section 2, $Q_n$ is a natural consistent estimator of $Q_*$ with known rate of convergence and deviation properties.
4 Experiments on simulated and real data sets
4.1 Simulated data
In this section we consider a simulated data set. To generate a covariance matrix , we generate at random an orthogonal matrix and a diagonal matrix , s.t. . Fig. 1 - Fig. 3 illustrate the convergence of , , and presented in Theorem 2.2, Corollary 2.1, and Theorem 2.4 respectively. The numerical experiments were performed using R. The population barycenter was computed using a sample of observed covariance matrices. A solid line depicts the density of the limiting distribution, whereas dashed lines correspond to densities for different sample sizes for the Bures-Wasserstein barycenter with . Simulations were carried out for matrices of size and .
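The sampling scheme described above can be sketched as follows, assuming the usual construction $S = U \Lambda U^{\top}$ from a random orthogonal $U$ and a diagonal $\Lambda$. The original experiments were run in R; this Python version is only an illustration. The eigenvalue law (uniform on a fixed interval), the dimension, and the function name are hypothetical choices, since the source does not preserve them.

```python
import numpy as np


def random_covariance(d, rng, low=0.5, high=2.0):
    """Draw S = U diag(lam) U^T with a random orthogonal U.

    The uniform eigenvalue law on [low, high] is a hypothetical choice
    made for illustration only.
    """
    g = rng.standard_normal((d, d))
    u, r = np.linalg.qr(g)
    u = u * np.sign(np.diag(r))  # fix column signs so that U is uniformly distributed
    lam = rng.uniform(low, high, size=d)
    s = (u * lam) @ u.T  # equals U diag(lam) U^T
    return s, u, lam


rng = np.random.default_rng(0)
s, u, lam = random_covariance(3, rng)
print(np.linalg.eigvalsh(s), np.sort(lam))
```

By construction the eigenvalues of the sampled matrix are exactly the entries of the diagonal factor, so the spectrum of the simulated covariances is controlled directly by the chosen eigenvalue law.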
4.2 Data aggregation in climate modelling
In this section we carry out experiments on a family of Gaussian processes, using a climate-related data set collected in Siberia (Russia) between 1930 and 2009, see Bulygina and Razuvaev (2012); Tatusko (1990). We set to be a family of Gaussian curves that describe the daily minimum temperatures within one year, measured at a set of 30 randomly sampled meteorological stations. Each curve is obtained by means of regression and maximum likelihood estimation and is sampled in points. More details on this data set are provided in Mallasto and Feragen (2017). The scale-location family under consideration is written as
[TABLE]
where is a Gaussian process, characterised by mean and covariance inherent to a year , . We let for all . A Gaussian process is the population Wasserstein barycenter of . It is characterised by
[TABLE]
A family of approximating processes with parameters (3.3) is constructed by means of re-sampling with replacement of the original data set. Sample size varies in range . Fig. 5 and Fig. 5 present densities of and respectively.
Appendix A Proof of Central Limit Theorem
A.1 Properties of
Proof of Lemma 2.1.
First, we prove that optimal is self-adjoint. Indeed, assume the opposite, then
[TABLE]
and thus . Therefore
[TABLE]
If is Hermitian but not positive semi-definite, then , , hence again .
Finally, if , then it is straightforward to check that given by (2.3) and
[TABLE]
∎
The proof of the Central Limit theorem mainly relies on the differentiability of the map (2.3). Lemma A.2 shows that can be linearised in the vicinity of :
[TABLE]
where is a self-adjoint negative-definite operator and stands for an operator norm. Properties of are investigated in Lemma A.3. Let us introduce some notation: if is a functional of a matrix , then we denote its differential as .
Lemma A.1.
Map is differentiable on , and its differential is given by
[TABLE]
where is the eigenvalue decomposition.
Proof.
First, let us consider the map . It is smooth and its differential
[TABLE]
is non-degenerated:
[TABLE]
whenever . From now on denotes a scalar product associated to Frobenius norm.
Now applying the inverse function theorem we obtain that the inverse map is also smooth and its differential enjoys the following equation
[TABLE]
thus
[TABLE]
[TABLE]
and
[TABLE]
∎
Lemma A.2 (Fréchet-differentiability of the map ).
For any the map can be linearised in the vicinity of as
[TABLE]
where
[TABLE]
is an eigenvalue decomposition of
[TABLE]
Proof.
The proof mainly relies on the differentiation of the pseudo-inverse term $\bigl(S^{1/2}QS^{1/2}\bigr)^{-1/2}$, since
[TABLE]
Obviously we can consider only restriction to and therefore assume w.l.o.g. . As , by Lemma A.1 and von Neumann series expansion we obtain for infinitesimal and corresponding that
[TABLE]
Then the differential is written as
[TABLE]
Therefore,
[TABLE]
where is defined by (A.1). ∎
Lemmas A.3 and A.4 are technical and explore properties of .
Lemma A.3.
For any , , the operator defined in (A.1) has the following properties:
(I) it is self-adjoint;
(II) it is negative semi-definite;
(III) it enjoys the following bounds:
[TABLE]
(IV) it is homogeneous w.r.t. with degree and w.r.t. with degree , i.e. and for any ;
(V) it is monotone w.r.t. (once range is fixed): in the sense of self-adjoint operators on whenever and ; in particular, is monotone w.r.t. for fixed .
Proof.
Slightly changing notations, we rewrite (A.1) as
[TABLE]
where matrices and come from Lemma A.2 and
[TABLE]
(I) Self-adjointness
Consider a scalar product
[TABLE]
We now introduce the following notation
[TABLE]
Then the above equality can be continued as follows:
[TABLE]
where . Thus the operator is self-adjoint.
(II) Boundedness and (III) eigenvalues
Denoting by (i.e. now ) and taking into account the above expansion of an inner product, one obtains
[TABLE]
Note that the function is monotonically increasing in both arguments and , thus
[TABLE]
For the sake of simplicity we introduce a new variable
[TABLE]
its Frobenius norm is written as
[TABLE]
Moreover, the following inequality for trace holds:
[TABLE]
Here is the orthogonal projector onto range of .
Then combining (A.2) with (A.3), the upper and lower bounds can be obtained as follows:
[TABLE]
Note that if is degenerate, the lower bound becomes trivial.
(IV) Homogeneity and (V) monotonicity
Homogeneity follows directly from definition (A.1). Now we prove monotonicity. As range of is fixed, we may assume . Consider
[TABLE]
with replacement to be change of variables. As soon as is supposed to be fixed, it is enough to show that the differential is monotone in . Notice that the operator at point is equal to the differential of the inverse map at point :
[TABLE]
In turn, can be expressed as
[TABLE]
the right part of the above equation is self-adjoint, negative-definite and
[TABLE]
Choose (thus ) and let for . Then for any fixed
[TABLE]
i.e. and hence for the differential of the inverse inequality holds: . This entails monotonicity of . ∎
Corollary A.1.
We define the following rescaled operator
[TABLE]
Then the following bound on its eigenvalues holds:
[TABLE]
Proof.
Notice that inequalities
[TABLE]
are a trivial consequence of Lemma A.3 (III). Now defining for any
[TABLE]
we obtain from (A.2) that
[TABLE]
Therefore, the above inequalities are sharp. ∎
Lemma A.4.
For any , consider
[TABLE]
Then
[TABLE]
Moreover, if , then
[TABLE]
Remark 2.
The above inequality might seem confusing due to the fact that , however this is explained by the fact that is negative definite.
Proof.
Notice that
[TABLE]
Monotonicity and homogeneity with degree of (see Lemma A.3) yield
[TABLE]
and
[TABLE]
Therefore,
[TABLE]
and respectively,
[TABLE]
The rest of the lemma follows from the fact that
[TABLE]
and inequalities
[TABLE]
∎
A.2 Properties of
The next lemma ensures strict convexity of . In essence, the proof mainly relies on Theorem 7 Bhatia et al. (2018).
Lemma A.5.
For any function is convex on . Moreover, if , then it is strictly convex.
Proof.
According to (Bhatia et al., 2018, Theorem 7) function is strictly concave on , hence function
[TABLE]
is convex on for any positive semi-definite . Moreover, if , then is an injective linear map, and therefore is strictly convex. ∎
Further we present differentiability of and its quadratic approximation.
Lemma A.6.
For any , function is twice differentiable in with
[TABLE]
Moreover, the following quadratic approximation holds: for any
[TABLE]
with defined in (A.5).
Proof.
Note that
[TABLE]
where comes from Lemma A.2. Furthermore, Lemma A.1 implies that
[TABLE]
Consequently, is differentiable and
[TABLE]
Applying Lemma A.2 one obtains
[TABLE]
Quadratic approximation
Let , , . The Taylor expansion in integral form applied to implies
[TABLE]
Following the same ideas as in the proof of Lemma A.4 one obtains that
[TABLE]
and
[TABLE]
Thus
[TABLE]
∎
A.3 Central limit theorem for
First let us prove uniqueness and positive-definiteness of Bures–Wasserstein barycenter.
Proof of Theorem 2.1.
Since
[TABLE]
and as , the barycenter always exists by continuity and compactness argument. In case applying Lemma A.5 we obtain strict convexity of the integral
[TABLE]
and therefore, uniqueness of the minimizer .
To prove that consider arbitrary degenerated , (which exists by Assumption 3) and . Let us define . We are going to show, that
[TABLE]
Consider the eigendecomposition , , where . Accordingly, we write in block form
[TABLE]
Cramer’s rule for the inverse matrix and Laplace’s formula yield
[TABLE]
therefore
[TABLE]
Consequently,
[TABLE]
In the same way one can obtain
[TABLE]
Consequently,
[TABLE]
Since by Assumption 3 and is convex, we conclude that
[TABLE]
thus cannot be a barycenter of . This yields .
Since is convex and the barycenter of is positive-definite and unique, it is characterized as a stationary point of the Fréchet variation on the subspace , i.e. as a solution to the equation
[TABLE]
as required. ∎
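For intuition, the stationarity characterization can be explored numerically in the unrestricted case, where the affine subspace is assumed to be the whole space of symmetric matrices and the condition is assumed to reduce to "the average optimal map equals the identity". The sketch below uses the fixed-point iteration commonly applied to barycenters of centred Gaussians, $Q \leftarrow Q^{-1/2}\bigl(\sum_i w_i (Q^{1/2} S_i Q^{1/2})^{1/2}\bigr)^2 Q^{-1/2}$. This is a swapped-in numerical technique, not a procedure from the paper, and the names `bw_barycenter` and `transport_map` are hypothetical.

```python
import numpy as np


def psd_sqrt(a):
    """Symmetric square root of a real symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # guard against round-off
    return (v * np.sqrt(w)) @ v.T


def transport_map(q, s):
    """Assumed form of the optimal map from Q to S: S^{1/2}(S^{1/2} Q S^{1/2})^{-1/2} S^{1/2}."""
    r = psd_sqrt(s)
    return r @ np.linalg.inv(psd_sqrt(r @ q @ r)) @ r


def bw_barycenter(mats, weights=None, iters=500, tol=1e-12):
    """Fixed-point iteration for the unrestricted barycenter of positive definite matrices."""
    n = len(mats)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    q = np.eye(mats[0].shape[0])  # positive definite starting point
    for _ in range(iters):
        r = psd_sqrt(q)
        r_inv = np.linalg.inv(r)
        m = sum(wi * psd_sqrt(r @ s @ r) for wi, s in zip(w, mats))
        q_new = r_inv @ m @ m @ r_inv
        q_new = 0.5 * (q_new + q_new.T)  # symmetrize against round-off
        converged = np.linalg.norm(q_new - q) <= tol
        q = q_new
        if converged:
            break
    return q


rng = np.random.default_rng(2)
samples = []
for _ in range(5):
    a = rng.standard_normal((3, 3))
    samples.append(a @ a.T + np.eye(3))  # illustrative positive definite data

q_hat = bw_barycenter(samples)
mean_map = sum(transport_map(q_hat, s) for s in samples) / len(samples)
print("stationarity residual:", np.linalg.norm(mean_map - np.eye(3)))
```

A small residual indicates that the computed matrix satisfies the assumed stationarity condition. For commuting inputs such as `np.diag([1, 4])` and `np.diag([9, 16])`, the iteration returns the squared average of the square roots, here `np.diag([4, 9])`.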
The proof of the CLT makes extensive use of covariance operators on the space of optimal transportation maps and on the space of covariance matrices.
Covariance operator on the space of optimal maps
Consider and . We define covariance of , its empirical counterpart , and its data-driven estimator as follows
[TABLE]
Covariance operators on the space of covariance matrices
Let be an empirical barycenter. The covariance of and its empirical counterpart are defined as
[TABLE]
[TABLE]
where
[TABLE]
Now we are ready to prove the central limit theorem for the empirical barycenter (Theorem 2.2). However, for the sake of transparency we provide below a complete statement.
Theorem (Central limit theorem for the covariance of the empirical barycenter).
The approximation error rate of the Fréchet mean by its empirical counterpart is
[TABLE]
Moreover, if is non-degenerate, then
[TABLE]
Proof of Theorem 2.2.
The proof consists of two parts: proof of (A) and (B).
Proof of (A)
As are convex functions, they converge a.s. uniformly to a strictly convex function on any compact set by the uniform law of large numbers. Therefore, their minimizers also converge a.s. In particular, with dominating probability.
Applying the expansion from Lemma A.2 at point implies
[TABLE]
where and is a self-adjoint operator on s.t. $\langle X, \alpha(X; Q_n, S)\rangle = o\bigl(\langle X, \bm{dT}_{Q_*}^{S}(X)\rangle\bigr)$ as uniformly in and according to Lemma A.4. Note that the condition for being a barycenter is . This fact, together with averaging of (A.11) over , gives:
[TABLE]
where
[TABLE]
and is defined in (A.9). Recall that (A.9) is a population counterpart of . This operator is well defined, since by Lemma A.3 one can show that it is self-adjoint, positive definite and bounded:
[TABLE]
Since by the law of large numbers and $\alpha_n(X) = o\bigl(\|\bm{F}_n(X)\|\bigr)$, we obtain from (A.12)
[TABLE]
where is a bounded linear operator, because is negative definite for any by Lemma A.3. The result (A) follows immediately from the CLT for .
Proof of (B)
Note that result (A) is equivalent to the fact that
[TABLE]
To ensure convergence of we need to show that
- (proved in Lemma B.2);
- .
Convergence of to
Monotonicity and homogeneity with degree of (see Lemma A.3) yield
[TABLE]
where comes from (A.5). This naturally leads to the following relation:
[TABLE]
Note that and . This immediately implies
[TABLE]
Since this implies .
The above results ensure validity of the substitution by . This yields (B). ∎
The asymptotic convergence result for is a straightforward corollary of the above theorem.
Proof of Corollary 2.1.
Since a.s., Lemma A.6 implies
[TABLE]
By Theorem 2.2 is asymptotically normal and centred, therefore
[TABLE]
where and .
Note that , , and . The latter result follows from Lemma A.3 (IV, V):
[TABLE]
This yields
[TABLE]
where . This, in turn, entails
[TABLE]
Appendix B Concentration of barycenters
The next lemma is a key ingredient in the proof of the concentration result for .
Lemma B.1.
Consider
[TABLE]
where
[TABLE]
Then
[TABLE]
whenever and .
Proof.
Let us define for . Due to Lemmas A.3 and A.4 we have for any
[TABLE]
Therefore,
[TABLE]
At the same time,
[TABLE]
Hence
[TABLE]
Rewriting the inequality above we obtain
[TABLE]
provided that . ∎
Proposition 1 (Concentration of ; Hsu et al. (2012), Theorem 1).
Under Assumption 2 it holds
[TABLE]
Before proving concentration results, we define operator as:
[TABLE]
Proof of Theorem 2.3.
Let be s.t. the following upper bound on from Lemma B.3 holds:
[TABLE]
It is easy to see that this condition is fulfilled for under a proper choice of the generic constant in the definition of . Then with the following bound holds
[TABLE]
The above facts together with definition of B.1 yield
[TABLE]
Combining the above bounds with Proposition 1, we obtain:
[TABLE]
Now it follows from Lemma B.1 that
[TABLE]
whenever . Here we used that if at least one of the matrices is non-degenerate. ∎
Proof of Corollary 2.2.
To prove this result we use Lemma A.6 and choose , . Thus we obtain
[TABLE]
Hence by Theorem 2.3
[TABLE]
with probability at least . ∎
B.1 Central limit theorem and concentration for
Proof of Theorem 2.4.
By definition, the empirical Fréchet variance is
[TABLE]
Lemma A.6 ensures the following bound on :
[TABLE]
with defined in (2.4). The above quadratic bound together with , and yield:
[TABLE]
On the other hand, by the classical central limit theorem we obtain:
[TABLE]
∎
Proof of Theorem 2.5.
Following the proof of Theorem 2.4 we consider :
[TABLE]
Following the proof of Theorem 2.3, we obtain that with the following upper bounds hold:
[TABLE]
Thus
[TABLE]
and consequently
[TABLE]
Now we consider the difference . According to Assumption 2 and therefore are sub-exponential r.v. with some parameters . Then Lemma B.4 ensures
[TABLE]
with probability . Combining the two bounds above, we obtain:
[TABLE]
with probability
[TABLE]
Choosing , we obtain
[TABLE]
∎
B.2 Auxiliary results
Lemma B.2.
Let ; then
[TABLE]
where
[TABLE]
where is the condition number of the matrix and is the -Schatten (nuclear) norm of an operator.
Proof.
Note that for any the following decomposition holds
[TABLE]
Summing over yields
[TABLE]
Note that each
[TABLE]
Lemmas A.3 (III) and A.4 yield
[TABLE]
hence . The above expression together with (B.6) and Cauchy-Schwarz inequality lead to the upper bound on :
[TABLE]
∎
Next we present concentration of around . Denote by the Orlicz norm with Young function , i.e.
[TABLE]
Then sub-Gaussianity of a r.v. is equivalent to and it ensures
[TABLE]
Lemma B.3 (Concentration of , Proposition 2 in Koltchinskii et al. (2011)).
There exists a constant , s.t. for all it holds with probability at least
[TABLE]
where , .
Lemma B.4 (Sub-exponential tail bounds).
Suppose that is sub-exponential with parameters . Then
[TABLE]
References
- Agueh and Carlier (2011) Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
- Ahidar-Coutrix et al. (2018) Adil Ahidar-Coutrix, Thibaut Le Gouic, and Quentin Paris. On the rate of convergence of empirical barycentres in metric spaces: curvature, convexity and extendible geodesics. arXiv preprint arXiv:1806.02740, 2018.
- Álvarez-Esteban et al. (2015) P. C. Álvarez-Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. Wide consensus for parallelized inference. arXiv e-prints, November 2015.
- Álvarez-Esteban et al. (2018) Pedro C. Álvarez-Esteban, Eustasio del Barrio, Juan A. Cuesta-Albertos, Carlos Matrán, et al. Wide consensus aggregation in the Wasserstein space. Application to location-scatter families. Bernoulli, 24(4A):3147–3179, 2018.
- Ambrosio and Gigli (2013) Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pages 1–155. Springer, 2013.
- Bhatia et al. (2018) Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.
- Bigot et al. (2012) Jérémie Bigot, Thierry Klein, et al. Consistent estimation of a population barycenter in the Wasserstein space. arXiv e-prints, 2012.
- Bigot et al. (2017) Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Penalized barycenters in the Wasserstein space. 2017.
