Matrix versions of the Hellinger distance

Rajendra Bhatia; Stephane Gaubert; Tanvi Jain

arXiv:1901.01378·math-ph·April 9, 2020

TL;DR

This paper introduces and analyzes matrix distance functions based on different geometric means, exploring their properties and applications to barycenter computation in positive definite matrices.

Contribution

It extends the concept of matrix Hellinger distances by studying new divergence measures derived from various matrix means, including the Pusz-Woronowicz and log Euclidean means.

Findings

01. Certain divergences are strictly convex functions.

02. Characterizations of barycenters with respect to these divergences.

03. Connections between these divergences and known metrics like Bures-Wasserstein.

Abstract

On the space of positive definite matrices we consider distance functions of the form $d(A,B)=\left[\text{\rm tr}\mathcal{A}(A,B)-\text{\rm tr}\mathcal{G}(A,B)\right]^{1/2},$ where $\mathcal{A}(A,B)$ is the arithmetic mean and $\mathcal{G}(A,B)$ is one of the different versions of the geometric mean. When $\mathcal{G}(A,B)=A^{1/2}B^{1/2}$ this distance is $\|A^{1/2}-B^{1/2}\|_{2},$ and when $\mathcal{G}(A,B)=(A^{1/2}BA^{1/2})^{1/2}$ it is the Bures-Wasserstein metric. We study two other cases: $\mathcal{G}(A,B)=A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2},$ the Pusz-Woronowicz geometric mean, and $\mathcal{G}(A,B)=\exp\big(\frac{\log A+\log B}{2}\big),$ the log Euclidean mean. With these choices $d(A,B)$ is no longer a metric, but it turns out that $d^{2}(A,B)$ is a divergence. We establish some (strict) convexity properties of these divergences. We obtain characterisations of barycentres of $m$ positive definite…

Equations

$$d(p,q)=\|\sqrt{p}-\sqrt{q}\|_{2}=\Big[\sum(\sqrt{p_{i}}-\sqrt{q_{i}})^{2}\Big]^{1/2}=\Big[\sum(p_{i}+q_{i})-2\sum\sqrt{p_{i}q_{i}}\Big]^{1/2}.$$

$$d_{H}^{2}(p,q)=\text{\rm tr}\,\mathcal{A}(p,q)-\text{\rm tr}\,\mathcal{G}(p,q),$$

$$\|A\|_{2}=(\text{\rm tr}\,A^{*}A)^{1/2}=\Big(\sum|a_{ij}|^{2}\Big)^{1/2}.$$

$$d_{1}(A,B)=\|A^{1/2}-B^{1/2}\|_{2}=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}\,A^{1/2}B^{1/2}\right]^{1/2}.$$

$$d_{2}(A,B)=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}(A^{1/2}BA^{1/2})^{1/2}\right]^{1/2}=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}(AB)^{1/2}\right]^{1/2}.$$

$$d_{2}(A,B)=\min_{U}\|A^{1/2}-B^{1/2}U\|_{2},$$

$$A\#B=A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2}.$$

$$\mathcal{L}(A,B)=\exp\Big(\frac{\log A+\log B}{2}\Big).$$

$$d_{3}(A,B)=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}(A\#B)\right]^{1/2},$$

$$d_{4}(A,B)=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}\,\mathcal{L}(A,B)\right]^{1/2}.$$

$$D\Phi(A,X)\big|_{X=A}=0.$$

$$D^{2}\Phi(A,X)\big|_{X=A}(Y,Y)\geqslant 0\quad\text{for all Hermitian }Y.$$

$$\Phi(A,B)=\varphi(A)-\varphi(B)-D\varphi(B)(A-B),$$

$$\Phi_{3}(A,B)=d_{3}^{2}(A,B)\quad\text{and}\quad\Phi_{4}(A,B)=d_{4}^{2}(A,B)$$

$$\min_{X>0}\sum_{j=1}^{m}w_{j}d^{2}(X,A_{j})$$

$$Q_{1/2}=\Big(\sum_{j=1}^{m}w_{j}A_{j}^{1/2}\Big)^{2}.$$

$$X=\sum_{j=1}^{m}w_{j}(X^{1/2}A_{j}X^{1/2})^{1/2}.$$

$$X=\sum_{j=1}^{m}w_{j}\,\mathcal{G}(X,A_{j}),$$

$$X^{2}=\frac{2}{\pi}\sum_{j=1}^{m}w_{j}\int_{0}^{\infty}(\lambda X^{-1}+A_{j}^{-1})^{-2}\sqrt{\lambda}\,\text{\rm d}\lambda.$$

$$X=\sum_{j=1}^{m}w_{j}(X\#A_{j}).$$

$$X=\sum_{j=1}^{m}w_{j}\,\mathcal{L}(X,A_{j}).$$

$$\min_{X>0}\sum_{j=1}^{m}w_{j}\delta^{2}(X,A_{j}),$$

$$\delta(A,B)=\|\log A^{-1/2}BA^{-1/2}\|_{2}$$

$$\delta_{S}(A,B):=\Big[\log\det\Big(\frac{A+B}{2}\Big)-\frac{1}{2}(\log\det A+\log\det B)\Big]^{1/2}$$

$$\text{\rm tr}(A\#B)\leqslant\text{\rm tr}\,\mathcal{L}(A,B)\leqslant\text{\rm tr}(A^{1/2}B^{1/2})\leqslant\text{\rm tr}(AB)^{1/2}.$$

$$d_{3}^{2}(A,B)\geqslant d_{4}^{2}(A,B)\geqslant d_{1}^{2}(A,B)\geqslant d_{2}^{2}(A,B).$$

$$g(X)=A\#X.$$

$$Dg(X)(Y)=\int_{0}^{\infty}(\lambda+XA^{-1})^{-1}\,Y\,(\lambda+A^{-1}X)^{-1}\,\text{\rm d}\nu(\lambda),$$

$$x^{1/2}=\frac{1}{\sqrt{2}}+\int_{0}^{\infty}\Big(\frac{\lambda}{\lambda^{2}+1}-\frac{1}{\lambda+x}\Big)\text{\rm d}\nu(\lambda),$$

$$DX^{1/2}(Y)=\int_{0}^{\infty}(\lambda+X)^{-1}\,Y\,(\lambda+X)^{-1}\,\text{\rm d}\nu(\lambda),$$

Full text

Matrix versions of the Hellinger distance

Rajendra Bhatia

Ashoka University, Sonepat, Haryana, 131029, India

Stephane Gaubert

INRIA and CMAP, Ecole Polytechnique, CNRS, 91128 Palaiseau, France

Tanvi Jain

Indian Statistical Institute, New Delhi 110016, India

Abstract.

On the space of positive definite matrices we consider distance functions of the form $d(A,B)=\left[\text{\rm tr}\mathcal{A}(A,B)-\text{\rm tr}\mathcal{G}(A,B)\right]^{1/2},$ where $\mathcal{A}(A,B)$ is the arithmetic mean and $\mathcal{G}(A,B)$ is one of the different versions of the geometric mean. When $\mathcal{G}(A,B)=A^{1/2}B^{1/2}$ this distance is $\|A^{1/2}-B^{1/2}\|_{2},$ and when $\mathcal{G}(A,B)=(A^{1/2}BA^{1/2})^{1/2}$ it is the Bures-Wasserstein metric. We study two other cases: $\mathcal{G}(A,B)=A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2},$ the Pusz-Woronowicz geometric mean, and $\mathcal{G}(A,B)=\exp\big{(}\frac{\log A+\log B}{2}\big{)},$ the log Euclidean mean. With these choices $d(A,B)$ is no longer a metric, but it turns out that $d^{2}(A,B)$ is a divergence. We establish some (strict) convexity properties of these divergences. We obtain characterisations of barycentres of $m$ positive definite matrices with respect to these distance measures.

Key words and phrases:

Geometric mean, matrix divergence, Bregman divergence, relative entropy, strict convexity, barycentre.

2010 Mathematics Subject Classification:

15B48, 49K35, 94A17, 81P45.

1. Introduction

Let $p$ and $q$ be two discrete probability distributions; i.e. $p=(p_{1},\ldots,p_{n})$ and $q=(q_{1},\ldots,q_{n})$ are $n$ -vectors with nonnegative coordinates such that $\sum p_{i}=\sum q_{i}=1.$ The Hellinger distance between $p$ and $q$ is the Euclidean norm of the difference between the square roots of $p$ and $q$ ; i.e.

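$$d(p,q)=\|\sqrt{p}-\sqrt{q}\|_{2}=\Big[\sum(\sqrt{p_{i}}-\sqrt{q_{i}})^{2}\Big]^{1/2}=\Big[\sum(p_{i}+q_{i})-2\sum\sqrt{p_{i}q_{i}}\Big]^{1/2}. \tag{1}$$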

This distance, and its continuous version, are much used in statistics, where it is customary to take $d_{H}(p,q)=\frac{1}{\sqrt{2}}d(p,q)$ as the definition of the Hellinger distance. We have then

$$d_{H}^{2}(p,q)=\text{\rm tr}\,\mathcal{A}(p,q)-\text{\rm tr}\,\mathcal{G}(p,q),$$

where $\mathcal{A}(p,q)$ is the arithmetic mean of the vectors $p$ and $q,$ $\mathcal{G}(p,q)$ is their geometric mean, and $\text{\rm tr}\,x$ stands for $\sum x_{i}.$
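The identity between the square-root form of the distance and the difference of arithmetic and geometric means can be checked numerically. The following is a minimal sketch; the function names and the sample vectors are illustrative assumptions and do not come from the paper.

```python
import math

def hellinger(p, q):
    """Euclidean norm of sqrt(p) - sqrt(q) for two probability vectors."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def mean_gap(p, q):
    """tr A(p, q) - tr G(p, q): summed arithmetic means minus summed geometric means."""
    arithmetic = sum((a + b) / 2 for a, b in zip(p, q))
    geometric = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return arithmetic - geometric

# Hypothetical sample distributions (each sums to 1).
p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]

d = hellinger(p, q)
d_H_squared = d ** 2 / 2      # d_H = d / sqrt(2)
gap = mean_gap(p, q)
print(d, d_H_squared, gap)    # the last two values agree up to rounding
```

The normalisation by $1/\sqrt{2}$ is what turns the factor $2$ in front of $\sum\sqrt{p_{i}q_{i}}$ into a plain difference of the two means.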

A matrix/noncommutative/quantum version would seek to replace the probability vectors $p$ and $q$ by density matrices $A$ and $B$ ; i.e., positive semidefinite matrices $A,B$ with $\text{\rm tr}\,A=\text{\rm tr}\,B=1.$ In the discussion that follows, the restriction on trace is not needed, and so we let $A$ and $B$ be any two positive semidefinite matrices. On the other hand, a part of our analysis requires $A$ and $B$ to be positive definite. This will be clear from the context. We let $\mathbb{P}$ be the set of $n\times n$ complex positive definite matrices. The notation $A\geqslant 0$ means that $A$ is positive (semi) definite.

Here we run into the essential difference between the matrix and the scalar case. For positive definite matrices $A$ and $B,$ there is only one possible arithmetic mean, $\mathcal{A}(A,B)=(A+B)/2.$ However, the geometric mean $\mathcal{G}(A,B)$ could have different meanings. Each of these leads to a different version of the Hellinger distance on matrices. In this paper we study some of these distances and their properties.

The Euclidean inner product on $n\times n$ matrices is defined as $\langle A,B\rangle=\text{\rm tr}\,A^{*}B.$ The associated Euclidean norm is

$$\|A\|_{2}=(\text{\rm tr}\,A^{*}A)^{1/2}=\Big(\sum|a_{ij}|^{2}\Big)^{1/2}.$$

Recall that the matrices $AB$ and $BA$ have the same eigenvalues. Thus if $A$ and $B$ are positive definite, then $AB$ is not positive definite unless $A$ and $B$ commute. However, the eigenvalues of $AB$ are all positive as they are the same as the eigenvalues of $A^{1/2}BA^{1/2}.$ Also every matrix with positive eigenvalues has a unique square root with positive eigenvalues. If $A,B$ are positive definite, then we denote by $(AB)^{1/2}$ the square root that has positive eigenvalues. Since $(AB)^{1/2}=A^{1/2}(A^{1/2}BA^{1/2})^{1/2}A^{-1/2},$ the matrices $(AB)^{1/2}$ and $(A^{1/2}BA^{1/2})^{1/2}$ are similar, and hence have the same eigenvalues.

The straightforward generalisation of (1) for positive definite matrices $A,B$ is evidently

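$$d_{1}(A,B)=\|A^{1/2}-B^{1/2}\|_{2}=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}\,A^{1/2}B^{1/2}\right]^{1/2}. \tag{3}$$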

Another version could be

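$$d_{2}(A,B)=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}(A^{1/2}BA^{1/2})^{1/2}\right]^{1/2}=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}(AB)^{1/2}\right]^{1/2}. \tag{4}$$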

While it is clear from (3) that $d_{1}$ is a metric on $\mathbb{P},$ it is not obvious that $d_{2}$ is a metric. It turns out that

$$d_{2}(A,B)=\min_{U}\|A^{1/2}-B^{1/2}U\|_{2}, \tag{5}$$

where the minimum is taken over all unitary matrices $U.$ It follows from this that $d_{2}$ is a metric. This is called the Bures distance in the quantum information literature and the Wasserstein metric in the literature on optimal transport. It plays an important role in both these subjects. We refer the reader to [18] for a recent exposition, and to [12, 26, 28, 36] for earlier work. The quantity $F(A,B)=\text{\rm tr}(A^{1/2}BA^{1/2})^{1/2}$ is called the fidelity between the states $A$ and $B.$ In the special case when $A=uu^{*},$ $B=vv^{*}$ are pure states, we have $F(A,B)=|u^{*}v|$ and $d_{2}(A,B)=\sqrt{2}(1-|u^{*}v|)^{1/2}.$ For qubit states this is the distance on the Bloch sphere.
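The pure-state formula $d_{2}(A,B)=\sqrt{2}(1-|u^{*}v|)^{1/2}$ can be verified numerically. The sketch below is restricted, as an assumption made for simplicity, to real unit vectors in the plane, so that all matrices are real symmetric $2\times 2$ and their square roots follow from an explicit spectral decomposition. All helper names are hypothetical.

```python
import math

def mul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def psd_sqrt(M):
    """Square root of a real symmetric positive semidefinite 2x2 matrix."""
    a, b, d = M[0][0], (M[0][1] + M[1][0]) / 2, M[1][1]
    mean, gap = (a + d) / 2, math.hypot((a - d) / 2, b)
    lo, hi = max(mean - gap, 0.0), mean + gap       # clamp rounding noise below zero
    if gap == 0.0:                                   # multiple of the identity
        return [[math.sqrt(hi), 0.0], [0.0, math.sqrt(hi)]]
    # P is the spectral projector belonging to the larger eigenvalue.
    P = [[(a - (mean - gap)) / (2 * gap), b / (2 * gap)],
         [b / (2 * gap), (d - (mean - gap)) / (2 * gap)]]
    s_lo, s_hi = math.sqrt(lo), math.sqrt(hi)
    return [[s_lo * (i == j) + (s_hi - s_lo) * P[i][j] for j in range(2)] for i in range(2)]

def fidelity(A, B):
    """F(A, B) = tr (A^{1/2} B A^{1/2})^{1/2}."""
    rA = psd_sqrt(A)
    R = psd_sqrt(mul(mul(rA, B), rA))
    return R[0][0] + R[1][1]

def bures_wasserstein(A, B):
    """d_2(A, B) = [tr(A + B) - 2 F(A, B)]^{1/2}."""
    total = A[0][0] + A[1][1] + B[0][0] + B[1][1]
    return math.sqrt(max(total - 2 * fidelity(A, B), 0.0))

def pure_state(theta):
    """Projector u u^T onto the real unit vector u = (cos theta, sin theta)."""
    u = (math.cos(theta), math.sin(theta))
    return [[u[i] * u[j] for j in range(2)] for i in range(2)]

A, B = pure_state(0.3), pure_state(1.1)              # hypothetical sample angles
overlap = abs(math.cos(1.1 - 0.3))                   # |u^T v|
closed_form = math.sqrt(2) * math.sqrt(1 - overlap)
print(bures_wasserstein(A, B), closed_form)          # agree up to rounding
```

Because pure states are singular, the smaller eigenvalue is clamped at zero before taking square roots; agreement with the closed form is therefore only expected up to a small numerical tolerance.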

For various reasons, theoretical and practical, the most accepted definition of geometric mean of $A,B$ is the entity

$$A\#B=A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2}. \tag{6}$$

This formula was introduced by Pusz and Woronowicz [32]. When $A$ and $B$ commute $A\#B$ reduces to $A^{1/2}B^{1/2}.$ The mean $A\#B$ has been studied extensively for several years and has remarkable properties that make it useful in diverse areas. One of them is its connection with operator inequalities related to monotonicity and convexity theorems for the quantum entropy. See Chapter 4 of [15] for a detailed exposition. Another object of interest has been the log Euclidean mean $\mathcal{L}(A,B)$ defined as

$$\mathcal{L}(A,B)=\exp\Big(\frac{\log A+\log B}{2}\Big). \tag{7}$$

This mean too reduces to $A^{1/2}B^{1/2}$ when $A$ and $B$ commute, and has been used in various contexts [7], though it lacks some pleasing properties that $A\#B$ has.

Thus it is natural to consider two more matrix versions of the Hellinger distance, viz,

$$d_{3}(A,B)=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}(A\#B)\right]^{1/2}, \tag{8}$$

and

$$d_{4}(A,B)=\left[\text{\rm tr}(A+B)-2\,\text{\rm tr}\,\mathcal{L}(A,B)\right]^{1/2}. \tag{9}$$

In view of what has been discussed, we may expect that $d_{3}$ and $d_{4}$ are metrics on $\mathbb{P}.$ However, it turns out that neither of them obeys the triangle inequality. Examples are given in Section 2. Nevertheless, this is compensated by the fact that the squares of $d_{3}$ and $d_{4}$ both are divergences, and hence they can serve as good distance measures.

A smooth function $\Phi$ from $\mathbb{P}\times\mathbb{P}$ to the set of nonnegative real numbers, $\mathbb{R}_{+}$, is called a divergence if

$(i)$

$\Phi(A,B)=0$ if and only if $A=B.$

$(ii)$

The first derivative $D\Phi$ with respect to the second variable vanishes on the diagonal; i.e.,

$$D\Phi(A,X)\big|_{X=A}=0. \tag{10}$$

$(iii)$

The second derivative $D^{2}\Phi$ is positive on the diagonal; i.e.,

$$D^{2}\Phi(A,X)\big|_{X=A}(Y,Y)\geqslant 0\quad\text{for all Hermitian }Y. \tag{11}$$

See [4], Sections 1.2 and 1.3.

The prototypical example is the Euclidean divergence $\Phi(A,B)=\|A-B\|_{2}^{2}.$ The functions $d_{1}^{2}(A,B)$ and $d_{2}^{2}(A,B)$ are also divergences. Another well-known example is the Kullback-Leibler divergence [4]. A special kind of divergence is the Bregman divergence corresponding to a strictly convex differentiable function $\varphi:\mathbb{P}\to\mathbb{R}.$ If $\varphi$ is such a function, then

$$\Phi(A,B)=\varphi(A)-\varphi(B)-D\varphi(B)(A-B), \tag{12}$$

is called the Bregman divergence corresponding to $\varphi.$ Not every divergence arises in this way. In particular, $d_{H}^{2}(p,q),$ the square of the Hellinger distance, on probability vectors is not a Bregman divergence.

Now we describe our main results. We will show that both the functions

$$\Phi_{3}(A,B)=d_{3}^{2}(A,B)\quad\text{and}\quad\Phi_{4}(A,B)=d_{4}^{2}(A,B)$$

are divergences. We will show that $\Phi_{3}$ and $\Phi_{4}$ are jointly convex in the variables $A$ and $B,$ and strictly convex in each of the variables separately. One consequence of this is that for every $m$-tuple $A_{1},\ldots,A_{m}$ in $\mathbb{P}$ and positive weights $w_{1},\ldots,w_{m}$ the minimisation problem

$$\min_{X>0}\sum_{j=1}^{m}w_{j}d^{2}(X,A_{j}) \tag{13}$$

has a unique solution when $d=d_{3}$ or $d_{4}.$ When $d=d_{1}$ the minimum in (13) is attained at the $1/2$ -power mean

$$Q_{1/2}=\Big(\sum_{j=1}^{m}w_{j}A_{j}^{1/2}\Big)^{2}. \tag{14}$$

This is one of the much studied family of classical power means. When $d=d_{2},$ the minimiser in (13) is the Wasserstein mean [2, 18]. This is the unique solution of the matrix equation

$$X=\sum_{j=1}^{m}w_{j}(X^{1/2}A_{j}X^{1/2})^{1/2}. \tag{15}$$

This mean has major applications in optimal transport, statistics, quantum information and other areas. Means with respect to various divergences have also been of interest in information theory. See e.g., [8, 30]. An inspection of (14) and (15) shows a common feature. Both for $d_{1}$ and $d_{2}$ the minimiser in (13) is the solution of the equation

$$X=\sum_{j=1}^{m}w_{j}\,\mathcal{G}(X,A_{j}), \tag{16}$$

where $\mathcal{G}$ is the version of the geometric mean chosen in the definition of $d.$ That is, $\mathcal{G}(A,B)=A^{1/2}B^{1/2}$ in the case of $d_{1},$ and $\mathcal{G}(A,B)=(A^{1/2}BA^{1/2})^{1/2}$ in the case of $d_{2}.$ It turns out that this is also the case for $d_{4}$ but not for $d_{3}.$ When $d=d_{3}$ the minimisation problem (13) has a unique solution $X$ which is also the solution of the matrix equation

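$$X^{2}=\frac{2}{\pi}\sum_{j=1}^{m}w_{j}\int_{0}^{\infty}(\lambda X^{-1}+A_{j}^{-1})^{-2}\sqrt{\lambda}\,\text{\rm d}\lambda. \tag{17}$$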

This, in general, is different from the solution of the matrix equation

$$X=\sum_{j=1}^{m}w_{j}(X\#A_{j}).$$

When $d=d_{4},$ the problem (13) has a unique solution $X$ which is also the solution of the matrix equation

$$X=\sum_{j=1}^{m}w_{j}\,\mathcal{L}(X,A_{j}).$$
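The statement that the $1/2$-power mean minimises the objective in (13) when $d=d_{1}$ can be illustrated numerically. The sketch below makes two assumptions that are not stated in the text: the weights are normalised to sum to $1$, and the matrices are real symmetric $2\times 2$, so that matrix functions can be computed from an explicit spectral decomposition. The helper names and sample data are hypothetical.

```python
import math

def mul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def sym_fun(M, f):
    """Apply f to the eigenvalues of a real symmetric 2x2 matrix M."""
    a, b, d = M[0][0], (M[0][1] + M[1][0]) / 2, M[1][1]
    mean, gap = (a + d) / 2, math.hypot((a - d) / 2, b)
    if gap == 0.0:                                   # multiple of the identity
        return [[f(mean), 0.0], [0.0, f(mean)]]
    lo, hi = mean - gap, mean + gap
    P = [[(a - lo) / (2 * gap), b / (2 * gap)],      # projector for eigenvalue hi
         [b / (2 * gap), (d - lo) / (2 * gap)]]
    f_lo, f_hi = f(lo), f(hi)
    return [[f_lo * (i == j) + (f_hi - f_lo) * P[i][j] for j in range(2)] for i in range(2)]

def d1_squared(X, A):
    """|| X^{1/2} - A^{1/2} ||_2^2 (squared Frobenius norm)."""
    rX, rA = sym_fun(X, math.sqrt), sym_fun(A, math.sqrt)
    return sum((rX[i][j] - rA[i][j]) ** 2 for i in range(2) for j in range(2))

def objective(X, mats, weights):
    """Weighted sum of squared d_1 distances from X to the given matrices."""
    return sum(w * d1_squared(X, A) for w, A in zip(weights, mats))

def power_mean_half(mats, weights):
    """Q_{1/2} = (sum_j w_j A_j^{1/2})^2."""
    roots = [sym_fun(A, math.sqrt) for A in mats]
    S = [[sum(w * R[i][j] for w, R in zip(weights, roots)) for j in range(2)] for i in range(2)]
    return mul(S, S)

# Hypothetical positive definite sample matrices and normalised weights.
mats = [[[2.0, 1.0], [1.0, 3.0]], [[1.0, -0.5], [-0.5, 4.0]], [[5.0, 0.0], [0.0, 1.0]]]
weights = [0.5, 0.3, 0.2]

Q = power_mean_half(mats, weights)
best = objective(Q, mats, weights)

# Compare against a few small symmetric perturbations of Q.
shifts = [[[0.1, 0.0], [0.0, 0.0]], [[0.0, 0.1], [0.1, 0.0]], [[0.0, 0.0], [0.0, 0.1]]]
perturbed = []
for E in shifts:
    for sign in (1.0, -1.0):
        X = [[Q[i][j] + sign * E[i][j] for j in range(2)] for i in range(2)]
        perturbed.append(objective(X, mats, weights))
print(best, min(perturbed))
```

With normalised weights the objective equals $\|X^{1/2}-\sum_{j}w_{j}A_{j}^{1/2}\|_{2}^{2}$ plus a constant, which is why every perturbed value exceeds the value at $Q_{1/2}$.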

In the past few years there has been extensive work on the Cartan mean (also known as Karcher or Riemann mean) of positive definite matrices.

This is the solution of the minimisation problem

$$\min_{X>0}\sum_{j=1}^{m}w_{j}\delta^{2}(X,A_{j}),$$

where

$$\delta(A,B)=\|\log A^{-1/2}BA^{-1/2}\|_{2}$$

is the Cartan metric on the manifold $\mathbb{P}$. This mean from classical differential geometry has found several important applications [9, 15, 16, 24, 29].

Our analysis of $\Phi_{4}$ leads to some interesting facts about quantum relative entropy. We observe that the convex function $\varphi(A)=\text{\rm tr}\left(A\log A-A\right)$ leads to the Bregman divergence $\Phi(A,B)=\text{\rm tr}\,A(\log A-\log B)-\text{\rm tr}(A-B),$ and the log Euclidean mean is the barycentre with respect to this Bregman divergence. As a related issue, we explore properties of barycentres with respect to general matrix Bregman divergences, and point out similarities and crucial differences between the scalar and matrix case.

Convexity properties of matrix Bregman divergences have been studied in [11, 31], and matrix approximation problems with divergences in [23]. Means with respect to matrix divergences are studied in [22]. In [35] Sra studied a related distance function

$$\delta_{S}(A,B):=\Big[\log\det\Big(\frac{A+B}{2}\Big)-\frac{1}{2}(\log\det A+\log\det B)\Big]^{1/2}$$

and showed that this is a metric on $\mathbb{P}$ . Several parallels between this metric and the Cartan metric are pointed out in [35].

2. Convexity and derivative computations

Inequalities for traces of matrix expressions have a long history. For the different geometric means mentioned in Section 1, we know [17] that

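$$\text{\rm tr}(A\#B)\leqslant\text{\rm tr}\,\mathcal{L}(A,B)\leqslant\text{\rm tr}(A^{1/2}B^{1/2})\leqslant\text{\rm tr}(AB)^{1/2}.$$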

It follows that

$$d_{3}^{2}(A,B)\geqslant d_{4}^{2}(A,B)\geqslant d_{1}^{2}(A,B)\geqslant d_{2}^{2}(A,B).$$

Since $d_{1}$ is a metric, this implies that $d_{3}^{2}(A,B)=0$ if and only if $A=B.$ The same is true for $d_{4}^{2}(A,B).$ Thus $\Phi_{3}$ and $\Phi_{4}$ satisfy the first condition in the definition of a divergence. To prove $\Phi_{3}$ is a divergence we need to compute its first and second derivatives. These results are of independent interest.
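The ordering of the four squared distances can be observed on a concrete pair of matrices. The sketch below evaluates $d_{1}^{2},\ldots,d_{4}^{2}$ for real symmetric positive definite $2\times 2$ matrices (a simplifying assumption), using an explicit spectral decomposition for the matrix square root, logarithm and exponential. The helper names and sample matrices are hypothetical.

```python
import math

def mul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def tr(X):
    return X[0][0] + X[1][1]

def sym_fun(M, f):
    """Apply f to the eigenvalues of a real symmetric 2x2 matrix M."""
    a, b, d = M[0][0], (M[0][1] + M[1][0]) / 2, M[1][1]
    mean, gap = (a + d) / 2, math.hypot((a - d) / 2, b)
    if gap == 0.0:                                   # multiple of the identity
        return [[f(mean), 0.0], [0.0, f(mean)]]
    lo, hi = mean - gap, mean + gap
    P = [[(a - lo) / (2 * gap), b / (2 * gap)],      # projector for eigenvalue hi
         [b / (2 * gap), (d - lo) / (2 * gap)]]
    f_lo, f_hi = f(lo), f(hi)
    return [[f_lo * (i == j) + (f_hi - f_lo) * P[i][j] for j in range(2)] for i in range(2)]

def squared_distances(A, B):
    """Return (d_1^2, d_2^2, d_3^2, d_4^2) for positive definite A and B."""
    rA, rB = sym_fun(A, math.sqrt), sym_fun(B, math.sqrt)
    irA = sym_fun(A, lambda x: 1 / math.sqrt(x))
    logA, logB = sym_fun(A, math.log), sym_fun(B, math.log)
    g1 = tr(mul(rA, rB))                                   # tr A^{1/2} B^{1/2}
    g2 = tr(sym_fun(mul(mul(rA, B), rA), math.sqrt))       # tr (A^{1/2} B A^{1/2})^{1/2}
    inner = sym_fun(mul(mul(irA, B), irA), math.sqrt)
    g3 = tr(mul(mul(rA, inner), rA))                       # tr (A # B)
    half = [[(logA[i][j] + logB[i][j]) / 2 for j in range(2)] for i in range(2)]
    g4 = tr(sym_fun(half, math.exp))                       # tr L(A, B)
    total = tr(A) + tr(B)
    return tuple(total - 2 * g for g in (g1, g2, g3, g4))

# Hypothetical non-commuting sample pair.
A = [[2.0, 1.0], [1.0, 3.0]]
B = [[1.0, -0.5], [-0.5, 4.0]]
d1, d2, d3, d4 = squared_distances(A, B)
print(d3, d4, d1, d2)      # listed from largest to smallest

# For commuting (here diagonal) matrices all four values coincide.
C = [[2.0, 0.0], [0.0, 5.0]]
D = [[3.0, 0.0], [0.0, 1.0]]
commuting = squared_distances(C, D)
print(commuting)
```

The commuting case serves as a consistency check, since all four geometric means then reduce to $A^{1/2}B^{1/2}$.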

Proposition 1.

Let $A$ be a positive definite matrix. Let $g$ be the map on $\mathbb{P}$ defined as

$$g(X)=A\#X.$$

Then the derivative of $g$ is given by the formula

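$$Dg(X)(Y)=\int_{0}^{\infty}(\lambda+XA^{-1})^{-1}\,Y\,(\lambda+A^{-1}X)^{-1}\,\text{\rm d}\nu(\lambda), \tag{23}$$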

where $\text{\rm d}\nu(\lambda)=\frac{1}{\pi}\lambda^{1/2}\text{\rm d}\lambda.$

Proof.

We will use the integral representation

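$$x^{1/2}=\frac{1}{\sqrt{2}}+\int_{0}^{\infty}\Big(\frac{\lambda}{\lambda^{2}+1}-\frac{1}{\lambda+x}\Big)\text{\rm d}\nu(\lambda),$$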

where $\text{\rm d}\nu(\lambda)=\frac{1}{\pi}\lambda^{1/2}\text{\rm d}\lambda.$ See [14] p.143. Using this we see that the derivative of the function $X\to X^{1/2}$ is the linear map

$$DX^{1/2}(Y)=\int_{0}^{\infty}(\lambda+X)^{-1}\,Y\,(\lambda+X)^{-1}\,\text{\rm d}\nu(\lambda),$$

where $Y$ is any Hermitian matrix. This shows that

[TABLE]

This proves the proposition.

Theorem 2.

Let $D\Phi_{3}$ and $D^{2}\Phi_{3}$ be the first and the second derivatives of $\Phi_{3}.$ Then

[TABLE]

(In other words, the gradient of $\Phi_{3}$ at every diagonal point is $0$ and the Hessian is positive.)

Proof.

For a fixed $A,$ let $g$ be the map on $\mathbb{P}$ defined as $g(X)=A\#X.$ When $X=A,$ the expression in (23) reduces to

[TABLE]

Recalling that $\Phi_{3}(A,X)=\text{\rm tr}(A+X)-2\text{\rm tr}g(X),$ we see that

[TABLE]

This establishes (26). Next note that for the second derivative we have

[TABLE]

From (23) we see that

[TABLE]

By definition

[TABLE]

Hence, from (29) we see that $D^{2}(\text{\rm tr}\,g(X))(Y,Z)$ is equal to

[TABLE]

When $X=A$ and $Z=Y,$ this reduces to give

[TABLE]

This proves (27).

Consider maps $f$ defined on $\mathbb{P}$ and taking values in $\mathbb{P}$ or $\mathbb{R}_{++}$ (the set of positive real numbers). We say that $f$ is concave if for all $X,Y$ in $\mathbb{P}$ and $0\leqslant\alpha\leqslant 1$

[TABLE]

It is strictly concave if the two sides of (31) are equal only if $X=Y.$ A map $f$ from $\mathbb{P}\times\mathbb{P}$ into $\mathbb{P}$ or $\mathbb{R}_{+}$ is called jointly concave if for all $X_{1},X_{2},Y_{1},Y_{2}$ in $\mathbb{P}$ and $0\leqslant\alpha\leqslant 1,$

[TABLE]

It is a basic fact in the theory of the geometric mean that $A\#B$ is jointly concave in $A$ and $B$; see [5, 6]. However, it is not strictly jointly concave. Indeed, even the function $f(a,b)=\sqrt{ab}$ on $\mathbb{R}_{+}\times\mathbb{R}_{+}$ is not strictly jointly concave (its restriction to the diagonal is linear). Our next theorem says that in each of the variables separately, the geometric mean is strictly concave.

Theorem 3.

For each $A$ the function

[TABLE]

is strictly concave on $\mathbb{P}.$ This implies that the function $g(X)=A\#X$ is also strictly concave.

Proof.

Suppose

[TABLE]

We have to show that this implies $X=Y.$ Rewrite the above equality as

[TABLE]

By the concavity of $A\#X,$ the expression inside the braces is positive semidefinite. The trace of such a matrix is zero if and only if the matrix itself is zero. Hence

[TABLE]

Using the definition (6) this can be written as

[TABLE]

Cancel the factors $A^{1/2}$ occurring on both sides, then square both sides, and rearrange terms to get

[TABLE]

This is the same as saying

[TABLE]

The square of a Hermitian matrix $Z$ is zero only if $Z=0.$ Hence, we have

[TABLE]

From this it follows that $X=Y.$

Finally, if $X,Y$ are two elements of $\mathbb{P}$ such that $g((X+Y)/2)=(g(X)+g(Y))/2$, then taking traces on both sides we have $f((X+Y)/2)=(f(X)+f(Y))/2.$ We have seen that this implies $X=Y$.

As a consequence, we observe that

[TABLE]

is jointly convex in $A$ and $B$ and is strictly convex in each of the variables separately.

Now we turn to the analysis of $\Phi_{4}$ on the same lines as above. The arguments we present in this case are quite different. From (22) we know that

[TABLE]

We also know that

[TABLE]

and

[TABLE]

Together, these three relations lead to the conclusion that

[TABLE]

Thus $\Phi_{4}$ satisfies condition (10).

By a theorem of Bhagwat and Subramanian [13]

[TABLE]

One of the several remarkable concavity theorems of Carlen and Lieb, [20, 21] says that the expression $\text{\rm tr}\left(\sum A_{j}^{p}\right)^{1/p}$ is jointly concave in $A_{1},\ldots,A_{m},$ when $0<p\leqslant 1,$ and jointly convex when $1\leqslant p\leqslant 2.$ Using equation (32) we obtain from this the joint concavity of $\text{\rm tr}\mathcal{L}(A,B).$ As a consequence $\Phi_{4}(A,B)$ is jointly convex in $A,B.$ Hence we have proved the following theorem.

Theorem 4.

The function $\Phi_{4}$ is a divergence on $\mathbb{P}.$

We have shown that $\Phi_{3}$ and $\Phi_{4}$ are divergences. But unlike $\Phi_{1}$ and $\Phi_{2}$ they are not the squares of metrics on $\mathbb{P},$ i.e., $d_{3}$ and $d_{4}$ are not metrics. The following two examples show that $d_{3}$ and $d_{4}$ do not satisfy the triangle inequality.

Let

[TABLE]

Then $d_{3}(A,B)\approx 5.0347$ and $d_{3}(A,C)+d_{3}(C,B)\approx 4.6768.$ This example is a small modification of one suggested to us by Suvrit Sra, to whom we are thankful. Let

[TABLE]

Then $d_{4}(A,B)\approx 3.3349$ and $d_{4}(A,C)+d_{4}(C,B)\approx 3.3146.$

Next we study some more properties of $\Phi_{4}$ , like its strict convexity in each of the arguments, and its connections with matrix entropy. To put these in context we recall some facts about Bregman divergence.

Let $\varphi:\mathbb{R}_{+}\to\mathbb{R}$ be a smooth strictly convex function and let

[TABLE]

be the associated Bregman divergence. Then $\Phi$ is strictly convex in the variable $x$ but need not be convex in $y.$ (See, e.g., [23] Section 2.2.)

Given $x_{1},\ldots,x_{m}$ in $\mathbb{R}_{+},$ the minimiser

[TABLE]

always turns out to be the arithmetic mean

[TABLE]

independent of the mother function $\varphi.$

In fact, this property characterises Bregman divergences; see [23, 8]. We can also consider the problem

[TABLE]

In this case, a calculation shows that the solution is the quasi-arithmetic mean (the Kolmogorov mean) associated with the function $\varphi^{\prime}.$ More precisely, the solution of (35), which we may think of as the mean, or the barycentre, of the points $x_{1},\ldots,x_{m}$ with respect to the divergence $\Phi$ is

[TABLE]

We wish to study the matrix version of the problems (34) and (35). Here we run into a basic difference between the one-variable and the several-variables cases. It is natural to replace the derivative $\varphi^{\prime}$ in (36) by the gradient $\nabla\varphi$ in the several-variables case. If $\varphi$ is a differentiable strictly convex function defined on an open interval $I$ of $\mathbb{R}$ , then, its derivative $\varphi^{\prime}$ is a strictly monotone continuous function, and hence a homeomorphism from $I$ to its image $\varphi^{\prime}(I)$ . In particular, $(\varphi^{\prime})^{-1}$ is defined. The appropriate generalisation of these facts to the several-variable case requires the notion of a Legendre type function.

Definition (Section 26 in [33] or Def. 2.8 in [10]).

Suppose $\varphi$ is a convex lower-semicontinuous function from $\mathbb{R}^{n}$ to $\mathbb{R}\cup\{+\infty\}$, and let $\operatorname{dom}\varphi:=\{x\in\mathbb{R}^{n}\mid\varphi(x)<+\infty\}$. We say that $\varphi$ is of Legendre type if it satisfies

(i)

$\operatorname{int}\operatorname{dom}\varphi\neq\varnothing$,

(ii)

$\varphi$ is differentiable on $\operatorname{int}\operatorname{dom}\varphi$,

(iii)

$\varphi$ is strictly convex on $\operatorname{int}\operatorname{dom}\varphi$,

(iv)

$\text{\rm lim}_{t\to 0^{+}}\langle\nabla\varphi(x+t(y-x)),y-x\rangle=-\infty$, for all $x\in\operatorname{bdry}(\operatorname{dom}(\varphi))$ and $y\in\operatorname{int}\operatorname{dom}\varphi$.

If $\varphi$ is of Legendre type, the gradient mapping $\nabla\varphi$ is a homeomorphism from $\operatorname{int}\operatorname{dom}\varphi$ to $\operatorname{int}\operatorname{dom}\varphi^{\star}$ , where $\varphi^{\star}$ denotes the Legendre-Fenchel conjugate of $\varphi$ . See Theorem 26.5 in [33].

Lemma 5.

If $\varphi$ is of Legendre type, and $\Phi$ is the Bregman divergence associated with $\varphi$ , and $a_{1},\dots,a_{m}\in\operatorname{int}\operatorname{dom}\varphi$ , then the function

[TABLE]

achieves its minimum at a unique point, which belongs to $\operatorname{int}\operatorname{dom}\varphi$ .

The proof is given in Appendix A. We shall apply this lemma in the situation where $\varphi$ is a convex function defined only on $\mathbb{P}$ and taking finite values on this set. The map $\varphi$ trivially extends to a convex lower-semicontinuous function defined on the whole space of Hermitian matrices: set $\varphi(X):=\liminf_{Y\to X,\;Y\in\mathbb{P}}\varphi(Y)$ for $X\in\operatorname{bdry}(\mathbb{P})$, and $\varphi(X)=+\infty$ if $X\notin\mathbb{P}\cup\operatorname{bdry}(\mathbb{P})$. We shall say that the original function $\varphi$ defined on $\mathbb{P}$ is of Legendre type if its extension is of Legendre type.

Theorem 6.

Let $\varphi$ be a differentiable strictly convex function from $\mathbb{P}$ to $\mathbb{R},$ and let $\Phi$ be the Bregman divergence corresponding to $\varphi.$ Then:

(i)

The minimiser in the problem

[TABLE]

is the arithmetic mean $\sum\limits_{j=1}^{m}\frac{1}{m}A_{j}.$

(ii)

If, in addition, $\varphi$ is of Legendre type, then the problem

[TABLE]

has a unique solution, and this is given by

[TABLE]

(iii)

If $\psi$ is any differentiable strictly convex function from $\mathbb{R}_{++}$ to $\mathbb{R}$ and $\Phi$ is the Bregman divergence on $\mathbb{P}$ corresponding to the function $\varphi(X):=\text{\rm tr}\psi(X)$ on $\mathbb{P}$ , then the solution of the minimisation problem (38) is

[TABLE]

Proof.

(i). Since $\Phi$ is given by (12),

[TABLE]

where $\overline{A}$ denotes the arithmetic mean $\sum\limits_{j=1}^{m}\frac{1}{m}A_{j}.$ Hence

[TABLE]

Since $\varphi$ is strictly convex, for every $X\neq\overline{A}$

[TABLE]

This implies that

[TABLE]

which shows that $\overline{A}$ is the unique minimiser of the problem (37).

(ii). Let $\Psi$ be the map from $\mathbb{P}$ to $\mathbb{R}_{+}$ defined as

[TABLE]

Then

[TABLE]

Lemma 5 shows that the minimum of the map $\Psi$ on the set $\mathbb{P}$ is achieved at some point $X\in\mathbb{P}$, and by the first order optimality condition, $D\Psi(X)=0$, showing that $X$ satisfies (39).

(iii). If $\psi$ is a differentiable convex function on $\mathbb{R}_{++}$ and $\Phi$ is the Bregman divergence corresponding to $\varphi=\text{\rm tr}\psi,$ then $\nabla\varphi(X)=\psi^{\prime}(X).$ Hence, to show that the minimisation problem (38) has a solution, it suffices to show that the first order optimality condition

[TABLE]

is satisfied for some $X$ in $\mathbb{P}$ . Since $\psi$ is strictly convex, as noted above, $\psi^{\prime}$ is strictly increasing and is a homeomorphism from $\mathbb{R}_{++}$ to the interval $J:=\psi^{\prime}(\mathbb{R}_{++})$ . The spectrum of each matrix $\psi^{\prime}(A_{j})$ belongs to $J$ , and so the spectrum of $\sum\limits_{j=1}^{m}\frac{1}{m}\psi^{\prime}(A_{j})$ also belongs to $J$ , which implies that (41) is solvable.

The assumption that $\varphi$ is of Legendre type is not needed in the tracial case (statement (iii)). 11 in Appendix B shows that this assumption cannot be dispensed with in the case of statement (ii).

The much studied convex function

[TABLE]

on $\mathbb{R}_{+}$ leads to the Bregman divergence

[TABLE]

This is called the Kullback-Leibler divergence. Since $\varphi^{\prime}(x)=\log x,$ the solution of the minimisation problem (35) in this case is

[TABLE]

the geometric mean of $x_{1},\ldots,x_{m}.$
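The two scalar minimisation problems can be compared numerically for the Kullback-Leibler case. The sketch below assumes the mother function $\varphi(x)=x\log x-x$ (so that $\varphi^{\prime}(x)=\log x$) and the convention $\Phi(x,y)=\varphi(x)-\varphi(y)-\varphi^{\prime}(y)(x-y)$; it further assumes that (34) minimises over the second argument and (35) over the first, which is consistent with the minimisers stated in the text. The golden-section routine, the search interval and the sample points are illustrative choices.

```python
import math

def phi(x):
    """Mother function phi(x) = x log x - x, with phi'(x) = log x."""
    return x * math.log(x) - x

def bregman(x, y):
    """Phi(x, y) = phi(x) - phi(y) - phi'(y) (x - y)."""
    return phi(x) - phi(y) - math.log(y) * (x - y)

def golden_argmin(f, lo, hi, tol=1e-10):
    """Golden-section search for the minimiser of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if f(left) < f(right):
            hi = right          # minimiser lies in [lo, right]
        else:
            lo = left           # minimiser lies in [left, hi]
    return (lo + hi) / 2

xs = [1.0, 2.0, 4.0]            # hypothetical sample points

# Minimise over the second argument: sum_j Phi(x_j, x).
second_slot = golden_argmin(lambda x: sum(bregman(xj, x) for xj in xs), 0.5, 8.0)
# Minimise over the first argument: sum_j Phi(x, x_j).
first_slot = golden_argmin(lambda x: sum(bregman(x, xj) for xj in xs), 0.5, 8.0)

arithmetic_mean = sum(xs) / len(xs)
geometric_mean = math.exp(sum(math.log(xj) for xj in xs) / len(xs))
print(second_slot, arithmetic_mean)   # agree up to the search tolerance
print(first_slot, geometric_mean)     # agree up to the search tolerance
```

Swapping in another strictly convex $\varphi$ would leave the first comparison unchanged and replace the geometric mean in the second by the corresponding quasi-arithmetic mean.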

As a matrix analogue of (42) one considers the function on $\mathbb{P}$ defined as

[TABLE]

The associated Bregman divergence then is

[TABLE]

(See [4], p.12). The quantity

[TABLE]

is called the relative entropy and has been of great interest in quantum information. Given $A_{1},\ldots,A_{m}$ in $\mathbb{P},$ their barycentre with respect to the divergence $\Phi,$ i.e., the solution of the minimisation problem (38) is the log Euclidean mean

[TABLE]

It is also of interest to compute the variance of the points $A_{1},\ldots,A_{m}$ with respect to $\Phi,$ i.e., the minimum value of the objective function in (38). This is the quantity

[TABLE]

For the divergence $\Phi$ in (45), $\mu_{\Phi}$ is the log Euclidean mean $\mathcal{L}$ given in (47). So

[TABLE]

In other words

[TABLE]

the difference between the traces of the arithmetic and the log Euclidean means of $A_{1},\ldots,A_{m}.$
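Both statements, that the log Euclidean mean is the barycentre for the relative-entropy divergence and that the variance equals the gap between the traces of the arithmetic and log Euclidean means, can be checked on small examples. The sketch below assumes $\Phi(X,A)=\text{tr}\,X(\log X-\log A)-\text{tr}(X-A)$ with equal weights $1/m$ and minimisation over the first argument, and works with real symmetric $2\times 2$ matrices for simplicity. Helper names and sample matrices are hypothetical.

```python
import math

def mul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def tr(X):
    return X[0][0] + X[1][1]

def sym_fun(M, f):
    """Apply f to the eigenvalues of a real symmetric 2x2 matrix M."""
    a, b, d = M[0][0], (M[0][1] + M[1][0]) / 2, M[1][1]
    mean, gap = (a + d) / 2, math.hypot((a - d) / 2, b)
    if gap == 0.0:                                   # multiple of the identity
        return [[f(mean), 0.0], [0.0, f(mean)]]
    lo, hi = mean - gap, mean + gap
    P = [[(a - lo) / (2 * gap), b / (2 * gap)],      # projector for eigenvalue hi
         [b / (2 * gap), (d - lo) / (2 * gap)]]
    f_lo, f_hi = f(lo), f(hi)
    return [[f_lo * (i == j) + (f_hi - f_lo) * P[i][j] for j in range(2)] for i in range(2)]

def divergence(X, A):
    """Phi(X, A) = tr X (log X - log A) - tr (X - A)."""
    logX, logA = sym_fun(X, math.log), sym_fun(A, math.log)
    diff = [[logX[i][j] - logA[i][j] for j in range(2)] for i in range(2)]
    return tr(mul(X, diff)) - (tr(X) - tr(A))

def log_euclidean_mean(mats):
    """exp of the average of the matrix logarithms."""
    logs = [sym_fun(A, math.log) for A in mats]
    avg = [[sum(L[i][j] for L in logs) / len(mats) for j in range(2)] for i in range(2)]
    return sym_fun(avg, math.exp)

def objective(X, mats):
    """Average divergence from X to the given matrices."""
    return sum(divergence(X, A) for A in mats) / len(mats)

# Hypothetical positive definite sample matrices.
mats = [[[2.0, 1.0], [1.0, 3.0]], [[1.0, -0.5], [-0.5, 4.0]], [[5.0, 0.0], [0.0, 1.0]]]

L = log_euclidean_mean(mats)
variance = objective(L, mats)
trace_gap = sum(tr(A) for A in mats) / len(mats) - tr(L)
print(variance, trace_gap)       # agree up to rounding

# The objective increases under small symmetric perturbations of L.
shifts = [[[0.1, 0.0], [0.0, 0.0]], [[0.0, 0.1], [0.1, 0.0]], [[0.0, 0.0], [0.0, 0.1]]]
perturbed = []
for E in shifts:
    for sign in (1.0, -1.0):
        X = [[L[i][j] + sign * E[i][j] for j in range(2)] for i in range(2)]
        perturbed.append(objective(X, mats))
print(min(perturbed))
```

The agreement of `variance` and `trace_gap` reflects the cancellation $\text{tr}\,\mathcal{L}(\log\mathcal{L}-\frac{1}{m}\sum_{j}\log A_{j})=0$ at the barycentre.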

In particular, the divergence $\Phi_{4}(A,B)$ can be characterised using (49), as the minimum value

[TABLE]

where $\Phi$ is defined by (45). Using this characterisation we can show that the function $\Phi_{4}(A,B)$ is strictly convex in each of the variables separately. To this end, we recall the following lemma of convex analysis, showing that the “marginal” of a jointly convex function is convex; compare with Proposition 2.22 of [34] where a similar result (without the strictness conclusion) is provided.

Lemma 7.

Let $f(x,y)$ be a jointly convex function which is strictly convex in each of its variables separately. Suppose for each $a,b$

[TABLE]

exists. Then the function $g(a,b)$ is jointly convex, and is strictly convex in each of the variables separately.

Proof.

Given $a_{1},a_{2},b_{1},b_{2},$ choose $x_{1}$ and $x_{2}$ such that

[TABLE]

and

[TABLE]

Then

[TABLE]

This shows that $g$ is jointly convex. Now we show that it is strictly convex in the first variable. Let $a_{1},a_{2},b$ be any three points with $a_{1}\neq a_{2}.$ Choose $x_{1}$ and $x_{2}$ such that

[TABLE]

and

[TABLE]

Two cases arise. If $x_{1}=x_{2}=x,$ then

[TABLE]

because of strict convexity of $f$ in the second variable. This implies that

[TABLE]

If $x_{1}\neq x_{2},$ then by strict convexity of $f$ in the first variable,

[TABLE]

and by joint convexity of $f$

[TABLE]

Adding the last two inequalities we get

[TABLE]

Thus $g(a,b)$ is strictly convex in the first variable, and by symmetry it is so in the second variable.

Theorem 8.

For each $A,$ the function $f(X)=\Phi_{4}(X,A)$ is strictly convex on $\mathbb{P}.$

Proof.

One of the fundamental, and best known, properties of the relative entropy $S(A|B)$ is that it is a jointly convex function of $A$ and $B.$ (See, e.g., Section IX.6 in [14].) It is also known that if $\varphi$ is a strictly convex function on $\mathbb{R}_{+},$ then the function $\text{\rm tr}\,\varphi(X)$ is strictly convex on $\mathbb{P}.$ (See, e.g., Theorem 4 in [19].) It follows from this that $S(A|B)$ is strictly convex in each of the variables separately. Combining these properties of $S(A|B),$ Lemma 7 and the characterisation of $\Phi_{4}(A,B)$ as the minimum value in (50) we obtain Theorem 8.

It might be pertinent to add here that the question of equality in the joint convexity inequality

[TABLE]

has been addressed in [25] and [27]. In [27] Jencova and Ruskai show that the equality holds in (52) if and only if

[TABLE]

On the other hand, Hiai et al [25] show that equality holds in (52) if and only if

[TABLE]

We are thankful to F. Hiai for making us aware of these results.

3. Barycentres

If $f$ is a convex function on an open convex set, then a critical point of $f$ is the global minimum of $f.$ If $f$ is strictly convex, then $f$ can have at most one such critical point. In this section we show that for $d=d_{3}$ and $d_{4},$ the objective function in (13) has a critical point, and hence in both cases the problem (13) has a unique solution.

Theorem 9.

When $d=d_{3},$ the minimum in (13) is attained at a unique point $X$ which is the solution of the matrix equation (17)

[TABLE]

This minimiser is the $1/2$ -power mean $Q_{1/2}$ given by (14) if $Q_{1/2}$ commutes with every $A_{j}.$ In particular, the minimiser is $Q_{1/2}$ if

(i) all $A_{j}$'s commute, or

(ii) $Q_{1/2}=I.$

Proof.

For a fixed positive definite matrix $A,$ define the map $G_{A}$ as

[TABLE]

By Proposition 1, we have

[TABLE]

The objective function in (13) is

[TABLE]

Using the definition of $\Phi_{3}$ we have

[TABLE]

Then using the above expression for $DG_{A_{j}}(X)$ we see that

[TABLE]

At the last step above we use the cyclicity of the trace function. Hence $X_{0}$ is a critical point of $f$ if and only if $X_{0}$ satisfies the matrix equation

[TABLE]

Taking congruence with $X$ on both sides we see that (53) is equivalent to (17).

We now show that there exists a positive definite matrix $X_{0}$ that satisfies (17). Let $\alpha,\beta>0$ be such that $\alpha I\leqslant A_{j}\leqslant\beta I$ for all $j=1,\ldots,m,$ and let $\mathcal{K}$ be the compact convex set $\mathcal{K}=\{X\in\mathbb{P}(n):\alpha I\leqslant X\leqslant\beta I\}.$ Define the map $F:\mathcal{K}\to\mathbb{P}(n)$ as

[TABLE]

Since $X,A_{j}\in\mathcal{K},$ we have $(\lambda+1)\alpha^{-1}I\geqslant\lambda X^{-1}+A_{j}^{-1}\geqslant(\lambda+1)\beta^{-1}I.$ Thus we have $\alpha^{2}/(\lambda+1)^{2}\,I\leqslant(\lambda X^{-1}+A_{j}^{-1})^{-2}\leqslant\beta^{2}/(\lambda+1)^{2}\,I.$ We know that $\int_{0}^{\infty}\text{\rm d}\nu(\lambda)/(\lambda+1)^{2}=1/2.$ This gives $F(X)\in\mathcal{K}.$ By the Brouwer fixed point theorem, we get that $F$ has a fixed point $X_{0}$ in $\mathcal{K}.$ This fixed point $X_{0}$ is the solution of (17).

Suppose $Q_{1/2}$ commutes with every $A_{j},$ $1\leqslant j\leqslant m.$ We show that $Q_{1/2}$ satisfies (17). Differentiating (24) we get

[TABLE]

Using $Q_{1/2}A_{j}^{-1}=A_{j}^{-1}Q_{1/2}$ in (53) and using (54) we get

[TABLE]

This proves the second statement of the theorem. If (i) holds, it follows from (14) that $Q_{1/2}$ commutes with the $A_{j}$'s. The same is trivially true if (ii) holds.
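The commuting case (i) of Theorem 9 can be illustrated numerically. The sketch below makes two assumptions that are not reproduced in this section: that the objective is $\sum_{j}w_{j}\big[\text{\rm tr}\,\frac{A_{j}+X}{2}-\text{\rm tr}\,(A_{j}\#X)\big]$ with the Pusz-Woronowicz mean from the abstract, and that the $1/2$-power mean in (14) has the standard form $Q_{1/2}=\big(\sum_{j}w_{j}A_{j}^{1/2}\big)^{2}$. It builds commuting $A_{j}$ from a shared eigenbasis and checks that small random perturbations of $Q_{1/2}$ do not lower the objective. All helper names are hypothetical.

```python
import numpy as np

def sym_fun(M, fun):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return (V * fun(w)) @ V.T

def geometric_mean(A, B):
    """A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    Ah = sym_fun(A, np.sqrt)
    Aih = sym_fun(A, lambda w: 1.0 / np.sqrt(w))
    return Ah @ sym_fun(Aih @ B @ Aih, np.sqrt) @ Ah

def objective(X, As, ws):
    """Assumed d_3 objective: sum_j w_j [tr (A_j + X)/2 - tr (A_j # X)]."""
    return sum(w * (0.5 * np.trace(A + X) - np.trace(geometric_mean(A, X)))
               for w, A in zip(ws, As))

rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # shared orthonormal eigenbasis
As = [V @ np.diag(rng.uniform(1.0, 4.0, 3)) @ V.T for _ in range(3)]
ws = [0.2, 0.3, 0.5]

root = sum(w * sym_fun(A, np.sqrt) for w, A in zip(ws, As))
Q = root @ root    # assumed 1/2-power mean (sum_j w_j A_j^{1/2})^2

base = objective(Q, As, ws)
increments = []
for _ in range(20):
    H = rng.standard_normal((3, 3))
    H = 0.5 * (H + H.T)
    H /= np.linalg.norm(H, 2)   # unit spectral norm keeps Q + 0.1 H positive definite
    increments.append(objective(Q + 0.1 * H, As, ws) - base)
print(min(increments))
```

Since the eigenvalues of each $A_{j}$ are drawn from $[1,4]$, those of $Q_{1/2}$ also lie in $[1,4]$, so every perturbed matrix stays in $\mathbb{P}$.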

Theorem 10.

When $d=d_{4}$ the minimum in (13) is attained at a unique point $X$ which satisfies the matrix equation (19)

[TABLE]

Proof.

Start with the integral representation

[TABLE]

This shows that for all $X>0$ and all Hermitian $Y$ we have

[TABLE]

For a fixed $A,$ let

[TABLE]

Then

[TABLE]

The log Euclidean mean $\mathcal{L}(A,X)=\text{\rm e}^{g(X)}.$ So, by the chain rule and Dyson’s formula (see [14] p. 311), we have

[TABLE]

This shows that

[TABLE]

using the cyclicity of trace. Using (55) and the cyclicity once again, we obtain

[TABLE]

Hence, for the function

[TABLE]

we have

[TABLE]

The objective function in (13) is

[TABLE]

So, we have

[TABLE]

where

[TABLE]

This shows that $Df(X)=0$ if and only if

[TABLE]

Choose an orthonormal basis in which $X=\operatorname{diag}(x_{1},\ldots,x_{n}),$ and let $Z=\begin{bmatrix}z_{ij}\end{bmatrix}$ in this basis. Then the condition (57) says that

[TABLE]

This shows that $Z$ is diagonal, and

[TABLE]

Thus $X=Z=\sum\limits_{j=1}^{m}w_{j}\mathcal{L}(A_{j},X),$ as claimed.

We should also show that the equation (19) has a unique solution. Let $\alpha,\beta$ be positive numbers such that $\alpha I\leqslant A_{j}\leqslant\beta I$ for all $1\leqslant j\leqslant m.$ Let $\mathcal{K}$ be the compact convex set $\mathcal{K}=\{X\in\mathbb{P}:\alpha I\leqslant X\leqslant\beta I\}.$ The function $\log X$ is operator monotone. So for all $X$ in $\mathcal{K}$ we have $(\log\alpha)I\leqslant\log X\leqslant(\log\beta)I.$ Hence $\mathcal{L}(X,A_{j})$ is in $\mathcal{K}$ for all $1\leqslant j\leqslant m.$ This shows that the continuous function $F(X)=\sum\limits_{j=1}^{m}w_{j}\mathcal{L}(X,A_{j})$ maps $\mathcal{K}$ into itself. By Brouwer’s fixed point theorem $F$ has a fixed point $X$ in $\mathcal{K}.$ This $X$ is a solution of (19), and hence a critical point of the objective function in (13). Since that function is strictly convex by Theorem 8, it has at most one critical point, and so the solution of (19) is unique.
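The fixed point equation $X=\sum_{j}w_{j}\mathcal{L}(A_{j},X)$ suggests a simple numerical experiment. The sketch below runs the plain iteration $X\leftarrow F(X)$ starting from the arithmetic mean. This is an assumption made for illustration: the proof above only uses Brouwer's theorem for existence and makes no claim that this iteration converges. In the commuting (diagonal) case the iteration reduces to the scalar recursion $x\leftarrow c\sqrt{x}$ with $c=\sum_{j}w_{j}\sqrt{a_{j}}$, which converges to $c^{2}$. The function names are hypothetical.

```python
import numpy as np

def sym_fun(M, fun):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return (V * fun(w)) @ V.T

def log_euclidean_mean(A, X):
    """exp((log A + log X) / 2) for positive definite A, X."""
    return sym_fun(0.5 * (sym_fun(A, np.log) + sym_fun(X, np.log)), np.exp)

def F(X, As, ws):
    """Right-hand side of the fixed point equation: sum_j w_j L(A_j, X)."""
    return sum(w * log_euclidean_mean(A, X) for w, A in zip(ws, As))

def solve_fixed_point(As, ws, iters=200):
    """Plain iteration X <- F(X); convergence is assumed here, not proved."""
    X = sum(w * A for w, A in zip(ws, As))   # arithmetic mean, which lies in K
    for _ in range(iters):
        X = F(X, As, ws)
    return X

# Commuting case: the limit should be (sum_j w_j A_j^{1/2})^2 = diag(4, 9).
As_diag = [np.diag([1.0, 4.0]), np.diag([9.0, 16.0])]
X_diag = solve_fixed_point(As_diag, [0.5, 0.5])
print(X_diag)

# Non-commuting case: report how well the final iterate satisfies the equation.
rng = np.random.default_rng(2)
As = []
for _ in range(3):
    G = rng.standard_normal((3, 3))
    As.append(G @ G.T + 0.5 * np.eye(3))
ws = [0.2, 0.3, 0.5]
X = solve_fixed_point(As, ws)
print(np.linalg.norm(X - F(X, As, ws)))
```

The printed residual for the non-commuting data indicates whether the iteration settled on a solution for that particular sample.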

Finally, we remark that in the case of $d_{1},$ the barycentre is given explicitly by the formula (14). For $d_{2},$ $d_{3},$ $d_{4}$ it has been given implicitly as the solution of the equations (15), (17), (19), respectively. When $m=2$ and $w_{1}=w_{2}=1/2,$ the solution of (15) is the Wasserstein mean of $A_{1}$ and $A_{2}$ defined as

[TABLE]

See [18].
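The remark on the case $m=2$ can be checked numerically. Neither equation (15) nor the displayed formula is reproduced in this section, so the sketch below relies on two forms recalled from the literature and labelled here as assumptions: the fixed point equation $X=\sum_{j}w_{j}(X^{1/2}A_{j}X^{1/2})^{1/2}$ for (15), and the expression $\frac{1}{4}\big[A+B+(AB)^{1/2}+(BA)^{1/2}\big]$ for the Wasserstein mean. Helper names are hypothetical.

```python
import numpy as np

def sym_fun(M, fun):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    w, V = np.linalg.eigh(0.5 * (M + M.T))
    return (V * fun(w)) @ V.T

def sqrt_of_product(A, B):
    """(AB)^{1/2} with positive eigenvalues: A^{1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}."""
    Ah = sym_fun(A, np.sqrt)
    Aih = sym_fun(A, lambda w: 1.0 / np.sqrt(w))
    return Ah @ sym_fun(Ah @ B @ Ah, np.sqrt) @ Aih

def wasserstein_mean(A, B):
    """Assumed formula (A + B + (AB)^{1/2} + (BA)^{1/2}) / 4."""
    R = sqrt_of_product(A, B)
    return 0.25 * (A + B + R + R.T)   # R.T equals (BA)^{1/2} for real symmetric A, B

def residual(X, A, B):
    """Norm of X - [(X^{1/2} A X^{1/2})^{1/2} + (X^{1/2} B X^{1/2})^{1/2}] / 2."""
    Xh = sym_fun(X, np.sqrt)
    rhs = 0.5 * (sym_fun(Xh @ A @ Xh, np.sqrt) + sym_fun(Xh @ B @ Xh, np.sqrt))
    return np.linalg.norm(X - rhs)

rng = np.random.default_rng(3)
G1, G2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A = G1 @ G1.T + 0.5 * np.eye(3)
B = G2 @ G2.T + 0.5 * np.eye(3)

X = wasserstein_mean(A, B)
res = residual(X, A, B)
print(res)
```

A residual at the level of rounding error is consistent with the closed formula solving the assumed form of (15) with equal weights.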

Acknowledgements: The authors thank F. Hiai and S. Sra for helpful comments and references, and the anonymous referee for a careful reading of the manuscript. The first author is grateful to INRIA and École polytechnique, Palaiseau for visits that facilitated this work, and to CSIR(India) for the award of a Bhatnagar Fellowship.

Appendix A Proof of Lemma 5

We make a variation of the proof of Theorem 3.12 in [10], dealing with a related problem (the minimisation of $\Phi$ over a closed convex set).

Since $\varphi$ is of Legendre type, Theorem 3.7(iii) of [10] shows that for all $a\in\operatorname{int}\operatorname{dom}\varphi$ , the map $x\mapsto\Phi(x,a)$ is coercive, meaning that $\text{\rm lim}_{\|x\|\to\infty}\Phi(x,a)=+\infty$ . A sum of coercive functions is coercive, and so the map

[TABLE]

is coercive. The infimum of a coercive lower-semicontinuous function on a closed non-empty set is attained, so there is an element $\bar{x}\in\operatorname{clo}\operatorname{int}\operatorname{dom}\varphi$ such that $\inf_{x\in\operatorname{clo}\operatorname{int}\operatorname{dom}\varphi}\Psi(x)=\Psi(\bar{x})<+\infty.$ Suppose that $\bar{x}$ belongs to the boundary of $\operatorname{int}\operatorname{dom}\varphi.$ Let us fix an arbitrary $z\in\operatorname{int}\operatorname{dom}\varphi,$ and let $g(t):=\Psi((1-t)\bar{x}+tz),$ defined for $t\in[0,1).$ We have

[TABLE]

Using property (iv) of the definition of Legendre type functions, we get that $\text{\rm lim}_{t\to 0^{+}}g^{\prime}(t)=-\infty,$ which entails that $g(t)<g(0)=\Psi(\bar{x})$ for $t$ small enough. Since $(1-t)\bar{x}+tz\in\operatorname{int}\operatorname{dom}\varphi$ for all $t\in(0,1),$ this contradicts the optimality of $\bar{x}.$ So $\bar{x}\in\operatorname{int}\operatorname{dom}\varphi,$ which proves Lemma 5.

Appendix B Examples

In the last statement of Theorem 6, dealing with tracial convex functions, we required $\varphi$ to be differentiable and strictly convex on $\mathbb{P}.$ In the second statement, dealing with the non-tracial case, we made a stronger assumption, requiring $\varphi$ to be of Legendre type. We now give an example showing that the Legendre condition cannot be dispensed with. To this end, it is convenient to construct first an example showing the tightness of Lemma 5.

Need for the Legendre condition in Lemma 5

Let us fix $N>3$ , let $e=(1,1)^{\top}\in\mathbb{R}^{2}$ ,

[TABLE]

and consider the affine transformation $g(x)=e+Lx$ . Let $a=(N,0)^{\top}$ , $b=(0,N)^{\top}$ , and

[TABLE]

Observe that $\bar{a},\bar{b}\in\mathbb{R}_{++}^{2}$ since $N>3$ .

Consider now, for $p>1,$ the map $\varphi(x):=\|x\|_{p}^{p}=|x_{1}|^{p}+|x_{2}|^{p}$ defined on $\mathbb{R}^{2}$ and $\bar{\varphi}(x)=\varphi(g(x)).$ Observe that $\varphi$ is strictly convex and differentiable. Let $\bar{\Phi}$ denote the Bregman divergence associated with $\bar{\varphi},$ and let $\bar{\Psi}(x):=\frac{1}{2}(\bar{\Phi}(x,\bar{a})+\bar{\Phi}(x,\bar{b})).$ We claim that $0$ is the unique point of minimum of $\bar{\Psi}$ over $\mathbb{R}_{+}^{2}.$ Indeed,

[TABLE]

from which we get

[TABLE]

It follows that $\nabla\bar{\Psi}(0)\in\mathbb{R}_{++}^{2}$ if $p>1$ is chosen close enough to $1$ , so that $1-N^{p-1}/2>0$ . Then, since $\bar{\Psi}$ is convex, we have

[TABLE]

showing the claim.

Consider now the modification $\hat{\varphi}$ of $\bar{\varphi},$ so that $\hat{\varphi}(x)=\bar{\varphi}(x)$ for $x\in\mathbb{R}_{+}^{2},$ and $\hat{\varphi}(x)=+\infty$ otherwise. The function $\hat{\varphi}$ is strictly convex, lower-semicontinuous, and differentiable on the interior of its domain, but not of Legendre type, and the conclusion of Lemma 5 does not apply to it.

The geometric intuition leading to this example is described in the figure.

Need for the Legendre condition in Theorem 6

We next construct an example showing that the Legendre condition in the second statement of Theorem 6 cannot be dispensed with. Observe that the inverse of the linear operator $L$ in (60) is given by

[TABLE]

In particular, it is a nonnegative matrix.

We set $\tau=\left(\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}\right)$ , and consider the “quantum” analogue of $L$ , i.e.,

[TABLE]

Then,

[TABLE]

is a completely positive map leaving $\mathbb{P}$ invariant. The analogue of the map $g$ is

[TABLE]

where $I$ denotes the identity matrix.

We now consider the map $\varphi(X):=\|X\|_{p}^{p}=\operatorname{tr}(|X|^{p})$ defined on the space of Hermitian matrices. The function $\varphi$ is differentiable and strictly convex, still assuming that $p>1$ . We set $\bar{A}:=\operatorname{diag}(\bar{a})\in\mathbb{P}$ , $\bar{B}:=\operatorname{diag}(\bar{b})\in\mathbb{P}$ , and now define $\bar{\Phi}$ to be the Bregman divergence associated with $\bar{\varphi}:=\varphi\circ G$ . Let

[TABLE]

We then have the following result.

Proposition 11.

The minimum of the function $\bar{\Psi}$ on the closure of $\mathbb{P}$ is achieved at the point $0.$ Moreover, the equation

[TABLE]

has no solution $X$ in $\mathbb{P}$ .

Proof.

From [3] (Theorem 2.1) or [1] (Theorem 2.3), we have

[TABLE]

where $X=U|X|$ is the polar decomposition of $X$ . In particular, if $X$ is diagonal and positive semidefinite,

[TABLE]

Then, by a computation similar to the one in the scalar case above, we get

[TABLE]

We conclude, as in (61), that

[TABLE]

where now $\langle\cdot,\cdot\rangle$ is the Frobenius scalar product on the space of Hermitian matrices. It follows that $0$ is the unique point of minimum of $\bar{\Psi}$ on $\operatorname{clo}\mathbb{P}.$

Moreover, if the equation (62) had a solution $X\in\mathbb{P},$ the first order optimality condition for the minimisation of the convex function $\bar{\Psi}$ over $\mathbb{P}$ would be satisfied, showing that $\bar{\Psi}(Y)\geqslant\bar{\Psi}(X)$ for all $Y\in\mathbb{P},$ and by density, $\bar{\Psi}(0)\geqslant\bar{\Psi}(X),$ contradicting the fact that $0$ is the unique point of minimum of $\bar{\Psi}$ over $\operatorname{clo}\mathbb{P}.$

Note added to the second version: In the earlier version of this paper posted on January 5, 2019 that appeared in Letters in Mathematical Physics, 109 (2019) 1777-1804, we made an unfortunate error. Theorem 9 in that version wrongly claimed that for the case $d=d_{3}$ the solution of the minimisation problem (13) is also the solution of the matrix equation (18). The mistake in the statement and in the proof has been pointed out in J. Pitrik and D. Virosztek, Quantum Hellinger distances revisited, arXiv:1903.10455v3. In that paper some more general divergence functions are considered, the barycentre equations are derived, and an example is given to show that the solutions of the matrix equations (17) and (18) need not be the same.

Bibliography

[1] T.J. Abatzoglou, Norm derivatives on spaces of operators, Math. Ann., 239 (1979), 129-135.
[2] M. Agueh and G. Carlier, Barycenters in the Wasserstein space, SIAM J. Math. Anal. Appl. 43 (2011), 904-924.
[3] J.G. Aiken, J.A. Erdos, J.A. Goldstein, Unitary approximation of positive operators, Illinois J. Math., 24 (1980), 61-72.
[4] S. Amari, Information Geometry and its Applications, Springer (Tokyo), 2016.
[5] T. Ando, Concavity of certain maps on positive definite matrices and applications to Hadamard products, Linear Algebra Appl. 26 (1979), 203-241.
[6] T. Ando, C.-K. Li and R. Mathias, Geometric means, Linear Algebra Appl. 385 (2004), 305-334.
[7] V. Arsigny, P. Fillard, X. Pennec and N. Ayache, Geometric means in a novel vector space structure on symmetric positive-definite matrices, SIAM J. Math. Anal. Appl. 29 (2007), 328-347.
[8] A. Banerjee, S. Merugu, I. S. Dhillon and J. Ghosh, Clustering with Bregman divergences, J. Mach. Learn. Res. 6 (2005), 1705-1749.
