Matricial Wasserstein-1 Distance
Yongxin Chen, Tryphon T. Georgiou, Lipeng Ning, and Allen Tannenbaum

TL;DR
This paper introduces a matrix-valued extension of the Wasserstein 1-metric, enabling efficient computation and unbalanced mass transport interpretation for matrix probability densities.
Contribution
It develops a novel matrix analogue of the Wasserstein 1-metric using duality theory, expanding the applicability of optimal transport to matrix-valued data.
Findings
Provides a duality-based formulation for matrix Wasserstein-1 distance
Enables easier computation of matrix optimal transport distances
Offers an unbalanced interpretation of mass transport for matrices
Abstract
In this note, we propose an extension of the Wasserstein 1-metric ($W_1$) for matrix probability densities, matrix-valued density measures, and an unbalanced interpretation of mass transport. The key is using duality theory, in particular, a "dual of the dual" formulation of $W_1$. This matrix analogue of the Earth Mover's Distance has several attractive features including ease of computation.
| 77.85 | 77.76 | 137.36 | |
| 249.40 | 162.03 | 199.78 | |
| 210.93 | 110.25 | 113.46 |
Y. Chen is with the Department of Medical Physics, Memorial Sloan Kettering Cancer Center, NY; email: [email protected]. T. Georgiou is with the Department of Mechanical and Aerospace Engineering, University of California, Irvine, CA; email: [email protected]. L. Ning is with Brigham and Women’s Hospital (Harvard Medical School), MA; email: [email protected]. A. Tannenbaum is with the Departments of Computer Science and Applied Mathematics & Statistics, Stony Brook University, NY; email: [email protected]
I Introduction
Optimal mass transport (OMT) has proven to be a powerful methodology for numerous problems in physics, probability, information theory, fluid mechanics, econometrics, systems and control, computer vision, and signal/image processing [14, 26, 1, 30, 29, 20]. Developments along purely controls-related issues ensued when it was recognized that mass transport may be naturally reformulated as a stochastic control problem; see [15, 21, 16, 11, 5, 6, 7, 8, 9] and the references therein.
Historically, the problem of OMT [26, 30] began with the question of minimizing the effort of transporting one distribution to another, typically with a cost proportional to the Euclidean distance between starting and ending points of the mass being transported. However, the control-theoretic reformulation [1], which was at the root of the aforementioned developments, was based on the choice of a quadratic cost. The quadratic cost allowed the interpretation of the transport effort as an action integral and gave rise to a Riemannian structure on the space of distributions [18, 13, 25]. The originality in our present work is two-fold. First, we formulate the transport problem with an $L_1$ cost in a similar manner, as a control problem with an $L_1$-path cost functional. Second, we develop theory for shaping flows of matrix-valued distributions, which is a non-trivial generalization of classical OMT.
The relevance of OMT on flows of matrix-valued distributions was already recognized in [22, 23] and was cast as a control problem as well, albeit in a quadratic-cost setting. At that point, interest in the geometry of matrix-valued distributions stemmed from applications to spectral analysis of vector-valued time series (see [22] and the references therein). Yet soon it became apparent that flows of matrix-valued distributions represent evolution of quantum systems. In fact, there has been a burst of activity in applying ideas of quantum mechanics to OMT of matrix-valued densities, as well as in utilizing an OMT framework to study the dynamics of quantum systems. Three groups [3, 4, 19] independently and simultaneously developed quantum mechanical frameworks for defining a Wasserstein-2 distance on matrix-valued densities (normalized to have trace 1), via a variational formalism generalizing the work of [1]. We note that [3, 4, 19] develop matrix-valued generalizations of the Wasserstein 2-metric ($W_2$) and explore the Riemannian-like structure for studying the entropic flows of quantum states.
Thus, in our present note, we develop a natural extension of the Wasserstein 1-metric to matrix-valued densities and matrix-valued measures. Our point of view is somewhat different from the earlier works on matricial Wasserstein-2 metrics. We mainly use duality theory [12, 30]. Further, we do not employ the Benamou and Brenier [1] control formulation of OMT, but rather the Kantorovich-Rubinstein duality. This new scheme is computationally more attractive and, moreover, it is especially appealing when specialized to weighted graphs (discrete spaces) that are sparse (few edges), as is the case for many real-world networks [27, 28, 31].
The present paper is structured as follows. Section II is a quick review of several different formulations of the Wasserstein-1 distance in the scalar setting. Using the quantum gradient operator defined in Section III, we generalize the Wasserstein-1 metric to the space of density matrices in Section IV. The case where the two marginal matrices have different traces is discussed in Section V. We finally extend the framework to deal with matrix-valued densities in Section VI, which may find applications in multivariate spectral analysis as well as in comparing stable multi-input multi-output (MIMO) systems. The paper concludes with an academic example in Section VII.
II Optimal mass transport
We begin with duality theory, explained for scalar densities, upon which our matricial generalization of the Wasserstein-1 metric is based.
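Given two probability densities $\rho_0$ and $\rho_1$ on $\mathbb{R}^m$, the Wasserstein-1 distance between them is
$$W_1(\rho_0,\rho_1) := \inf_{\pi\in\Pi(\rho_0,\rho_1)} \int_{\mathbb{R}^m\times\mathbb{R}^m} \|x-y\|\,\pi(dx,dy), \tag{1}$$
where $\Pi(\rho_0,\rho_1)$ denotes the set of couplings between $\rho_0$ and $\rho_1$. The Wasserstein-1 distance has a dual formulation via the following result due to Kantorovich and Rubinstein [12, 26, 30]:
$$W_1(\rho_0,\rho_1) = \sup_{f}\left\{\int_{\mathbb{R}^m} f(x)\,(\rho_0(x)-\rho_1(x))\,dx \;\middle|\; \|f\|_{\rm Lip}\le 1\right\}, \tag{2}$$
where $\|f\|_{\rm Lip}$ denotes the Lipschitz constant. When $f$ is differentiable, $\|f\|_{\rm Lip}=\|\nabla_x f\|_\infty$. It follows that,
$$W_1(\rho_0,\rho_1) = \sup_{f}\left\{\int_{\mathbb{R}^m} f(x)\,(\rho_0(x)-\rho_1(x))\,dx \;\middle|\; \|\nabla_x f\|\le 1\right\}. \tag{3}$$
Starting from (3), by once again considering the dual, we readily obtain the very important reformulation
$$W_1(\rho_0,\rho_1) = \inf_{u}\left\{\int_{\mathbb{R}^m} \|u(x)\|\,dx \;\middle|\; \rho_0-\rho_1+\nabla_x\cdot u = 0\right\}, \tag{4}$$
where the (Lagrange) optimization variable $u$ now represents flux. Alternatively, this can be written as the control-optimization problem in the Benamou-Brenier style [1]
$$W_1(\rho_0,\rho_1) = \inf_{\rho,v}\left\{\int_0^1\!\!\int_{\mathbb{R}^m} \rho(t,x)\,\|v(t,x)\|\,dx\,dt \;\middle|\; \frac{\partial\rho}{\partial t}+\nabla_x\cdot(\rho v)=0,\ \rho(0,\cdot)=\rho_0,\ \rho(1,\cdot)=\rho_1\right\}. \tag{5}$$
This “dual of the dual” formulation turns the Kantorovich-Rubinstein duality into a control problem to determine a suitable velocity (control vector) $v$. We remark that, when applied to discrete spaces (graphs), this formulation leads to a very substantial computational benefit in the case of sparse graphs: (1) involves solving systems of the order of the square of the number of nodes, while (4) involves solving systems of the order of the number of edges.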
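The computational remark above can be illustrated on a small graph. The following is a minimal sketch, assuming SciPy's `linprog` is available; the function name `w1_on_graph`, the path-graph example, and the sign convention chosen for the incidence matrix are illustrative assumptions, not taken from the paper. It solves the flux formulation as a linear program with two nonnegative variables per edge, whereas the coupling formulation would need one variable per pair of nodes.

```python
import numpy as np
from scipy.optimize import linprog


def w1_on_graph(rho0, rho1, edges, weights=None):
    """Wasserstein-1 distance on a graph via the edge-flux linear program.

    Assumed convention (illustrative): a positive flux u_e on edge e = (i, j)
    moves mass from node i to node j, so the constraint reads B u = rho1 - rho0,
    where B is the signed node-edge incidence matrix. Flipping the sign of u
    leaves the optimal value unchanged.
    """
    rho0 = np.asarray(rho0, dtype=float)
    rho1 = np.asarray(rho1, dtype=float)
    n_nodes, n_edges = rho0.size, len(edges)
    w = np.ones(n_edges) if weights is None else np.asarray(weights, dtype=float)

    B = np.zeros((n_nodes, n_edges))
    for e, (i, j) in enumerate(edges):
        B[i, e] = -1.0  # mass leaves the tail node
        B[j, e] = 1.0   # mass arrives at the head node

    # Split u = u_plus - u_minus with both parts >= 0, so that the cost
    # sum_e w_e * |u_e| becomes linear in the new variables.
    cost = np.concatenate([w, w])
    A_eq = np.hstack([B, -B])
    res = linprog(cost, A_eq=A_eq, b_eq=rho1 - rho0,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise ValueError(res.message)
    return res.fun


# Path graph 0 - 1 - 2 - 3 with unit edge lengths.
edges = [(0, 1), (1, 2), (2, 3)]
d_far = w1_on_graph([1, 0, 0, 0], [0, 0, 0, 1], edges)
d_split = w1_on_graph([0.5, 0.5, 0, 0], [0, 0, 0.5, 0.5], edges)
print(d_far, d_split)
```

On a path graph the result should agree with the one-dimensional formula (the sum of absolute differences of the cumulative masses), which gives a quick sanity check. For a sparse graph the number of unknowns grows with the number of edges rather than with the square of the number of nodes.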
III Gradient on space of Hermitian matrices
We closely follow the treatment in [4]. In particular, we will need a notion of gradient on the space of Hermitian matrices and its dual, i.e. the divergence.
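Denote by $\mathcal{H}$ and $\mathcal{S}$ the set of Hermitian and skew-Hermitian matrices, respectively. We will assume that all of our matrices are of fixed size $n\times n$. Next, we denote the space of block-column vectors consisting of $N$ elements in $\mathcal{H}$ and $\mathcal{S}$ as $\mathcal{H}^N$ and $\mathcal{S}^N$, respectively. We also let $\mathcal{H}_+$ and $\mathcal{H}_{++}$ denote the cones of nonnegative and positive-definite matrices, respectively, and
$$\mathcal{D}_+ := \{\rho\in\mathcal{H}_{++} \mid \operatorname{tr}(\rho)=1\}.$$
We note that the tangent space of $\mathcal{D}_+$ at any $\rho\in\mathcal{D}_+$ is given by
$$T_\rho = \{\delta\in\mathcal{H} \mid \operatorname{tr}(\delta)=0\},$$
and we use the standard notion of inner product, namely
$$\langle X,Y\rangle = \operatorname{tr}(X^*Y),$$
for both $\mathcal{H}$ and $\mathcal{S}$. For $X,Y\in\mathcal{H}^N$ ($\mathcal{S}^N$),
$$\langle X,Y\rangle = \sum_{k=1}^N \operatorname{tr}(X_k^*Y_k).$$
Given $X=[X_1^*,\ldots,X_N^*]^*\in\mathcal{H}^N$ ($\mathcal{S}^N$), $Y\in\mathcal{H}$ ($\mathcal{S}$), set
$$XY = \begin{bmatrix}X_1\\ \vdots\\ X_N\end{bmatrix}Y := \begin{bmatrix}X_1Y\\ \vdots\\ X_NY\end{bmatrix},$$
and
$$YX = Y\begin{bmatrix}X_1\\ \vdots\\ X_N\end{bmatrix} := \begin{bmatrix}YX_1\\ \vdots\\ YX_N\end{bmatrix}.$$
For a given $L\in\mathcal{H}^N$ we define
$$\nabla_L:\ \mathcal{H}\to\mathcal{S}^N,\quad X\mapsto\begin{bmatrix}L_1X-XL_1\\ \vdots\\ L_NX-XL_N\end{bmatrix}$$
to be the gradient operator. By analogy with the ordinary multivariable calculus, we refer to its dual with respect to the Hilbert-Schmidt inner product as the (negative) divergence operator, and this is
$$\nabla_L^*:\ \mathcal{S}^N\to\mathcal{H},\quad Y=\begin{bmatrix}Y_1\\ \vdots\\ Y_N\end{bmatrix}\mapsto\sum_{k=1}^N \left(L_kY_k-Y_kL_k\right),$$
i.e., $\nabla_L^*$ is defined by means of the identity
$$\langle\nabla_LX,\,Y\rangle = \langle X,\,\nabla_L^*Y\rangle.$$
A standing assumption throughout is that the null space of $\nabla_L$, denoted by $\ker(\nabla_L)$, contains only scalar multiples of the identity matrix.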
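The adjoint relation between the gradient and the (negative) divergence can be checked numerically. The sketch below assumes the commutator forms commonly used for these operators, namely blocks $L_kX-XL_k$ for the gradient and $\sum_k (L_kY_k-Y_kL_k)$ for its dual; the helper names (`grad_L`, `div_L`, `inner`) and the random test data are hypothetical and chosen only for illustration.

```python
import numpy as np


def grad_L(L, X):
    """Gradient: Hermitian X -> stack of skew-Hermitian blocks L_k X - X L_k."""
    return np.stack([Lk @ X - X @ Lk for Lk in L])


def div_L(L, Y):
    """Dual of grad_L: stack of blocks Y_k -> sum_k (L_k Y_k - Y_k L_k)."""
    return sum(Lk @ Yk - Yk @ Lk for Lk, Yk in zip(L, Y))


def inner(A, B):
    """Hilbert-Schmidt inner product tr(A^* B), summed over blocks if stacked."""
    return np.vdot(A, B).real


def rand_complex(n, rng):
    return rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))


def herm(A):
    return (A + A.conj().T) / 2


def skew(A):
    return (A - A.conj().T) / 2


rng = np.random.default_rng(0)
n, N = 3, 2
L = [herm(rand_complex(n, rng)) for _ in range(N)]
X = herm(rand_complex(n, rng))
Y = np.stack([skew(rand_complex(n, rng)) for _ in range(N)])

lhs = inner(grad_L(L, X), Y)   # <grad_L X, Y>
rhs = inner(X, div_L(L, Y))    # <X, grad_L^* Y>
G = grad_L(L, X)
print(np.isclose(lhs, rhs))
```

The identity matrix commutes with every $L_k$, so it always lies in the null space of the gradient; the standing assumption asks that nothing else does, up to scaling.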
IV Wasserstein-1 distance for density matrices
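In this section, we show that both (3) and (4) have natural counterparts for probability density matrices, i.e. matrices in $\mathcal{D}_+$. This set-up obviously works for matrices in $\mathcal{H}_+$ of equal trace.
We treat (3) as our starting definition and define the $W_1$ distance in the space of density matrices as
$$W_1(\rho_0,\rho_1) := \sup_{f\in\mathcal{H}}\left\{\operatorname{tr}[f(\rho_0-\rho_1)] \;\middle|\; \|\nabla_L f\|\le 1\right\}. \tag{11}$$
Here $\|\cdot\|$ is the operator norm. The above is well-defined since by assumption, the null space of $\nabla_L$ is spanned by the identity matrix $I$, and adding a multiple of $I$ to $f$ does not change the objective because $\operatorname{tr}(\rho_0-\rho_1)=0$. As above, we have that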
[TABLE]
This should be compared to the Connes spectral distance [10], which is given by
[TABLE]
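It is not difficult to see that the dual of (11) is
$$\inf_{u\in\mathcal{S}^N}\left\{\|u\|_* \;\middle|\; \rho_0-\rho_1-\nabla_L^*u=0\right\}, \tag{12}$$
which is the counterpart of (4). Here $\|\cdot\|_*$ denotes the nuclear norm [2]. In particular, we have the following theorems.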
Theorem 1
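Notation as above. Then
$$W_1(\rho_0,\rho_1) = \inf_{u\in\mathcal{S}^N}\left\{\|u\|_* \;\middle|\; \rho_0-\rho_1-\nabla_L^*u=0\right\}.$$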
Proof:
We start from (12) and use the fact that
[TABLE]
It follows
[TABLE]
This implies that (11) and (12) are dual to each other. Since both of them are strictly feasible, the duality gap is zero. Therefore the optimal values of (11) and (12) coincide. ∎
Theorem 2
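The distance $W_1$ defined as in (11) is a metric on the space of density matrices $\mathcal{D}_+$.
Proof:
Obviously $W_1(\rho_0,\rho_1)\ge 0$ holds with equality if and only if $\rho_0=\rho_1$. The symmetry property $W_1(\rho_0,\rho_1)=W_1(\rho_1,\rho_0)$ is also clear from the definition. Here we prove the triangle inequality. That is, for any $\rho_0,\rho_1,\rho_2\in\mathcal{D}_+$, we have
$$W_1(\rho_0,\rho_2)\le W_1(\rho_0,\rho_1)+W_1(\rho_1,\rho_2).$$
It is easier to see this from the dual formulation (12). Let $u_1$, $u_2$ be the optimal fluxes for $(\rho_0,\rho_1)$ and $(\rho_1,\rho_2)$ respectively. Then $u_1+u_2$ is a feasible flux for $(\rho_0,\rho_2)$, namely,
$$\rho_0-\rho_2-\nabla_L^*(u_1+u_2)=0.$$
It follows that
$$W_1(\rho_0,\rho_2)\le\|u_1+u_2\|_*\le\|u_1\|_*+\|u_2\|_*=W_1(\rho_0,\rho_1)+W_1(\rho_1,\rho_2),$$
which completes the proof. ∎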
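The relation between the two formulations of this section can be probed numerically without an optimization solver. The sketch below is only a weak-duality check under stated assumptions: the gradient is taken to be the block column of commutators $L_kX-XL_k$, the flux constraint is taken to be $\rho_0-\rho_1-\nabla_L^*u=0$, and norms of block columns are computed on the stacked $(Nn)\times n$ matrix. All names and the random data are hypothetical. Any feasible $f$ gives a lower bound and any feasible $u$ gives an upper bound on the distance.

```python
import numpy as np


def grad_L(L, X):
    return np.stack([Lk @ X - X @ Lk for Lk in L])


def div_L(L, Y):
    return sum(Lk @ Yk - Yk @ Lk for Lk, Yk in zip(L, Y))


def as_block_column(Y):
    """Stack N blocks of size n x n into one (N*n) x n matrix."""
    return np.concatenate(list(Y), axis=0)


def rand_complex(n, rng):
    return rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))


def herm(A):
    return (A + A.conj().T) / 2


def skew(A):
    return (A - A.conj().T) / 2


rng = np.random.default_rng(1)
n, N = 3, 2
L = [herm(rand_complex(n, rng)) for _ in range(N)]

# Build a feasible pair by construction: pick a flux u, then define
# rho0 = rho1 + grad_L^* u, scaled so that rho0 stays positive definite.
u = np.stack([skew(rand_complex(n, rng)) for _ in range(N)])
delta = div_L(L, u)
scale = 1.0 / (2 * n * np.linalg.norm(delta, 2))
u, delta = scale * u, scale * delta
rho1 = np.eye(n) / n
rho0 = rho1 + delta

# Pick a test matrix f and normalize it so that ||grad_L f|| = 1.
f = herm(rand_complex(n, rng))
f = f / np.linalg.norm(as_block_column(grad_L(L, f)), 2)

lower = np.trace(f @ (rho0 - rho1)).real           # value of a feasible f
upper = np.linalg.norm(as_block_column(u), "nuc")  # cost of a feasible u
print(lower <= upper)
```

The inequality follows from the adjoint identity together with the duality between the operator norm and the nuclear norm; computing the distance itself requires solving either problem to optimality, for example with a semidefinite programming solver.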
V Wasserstein-1 distance: the unbalanced case
In this section, we extend the definition of Wasserstein-1 distance to the space of nonnegative matrices $\mathcal{H}_+$, i.e., we remove the constraint of both matrices having equal traces. Compare also with some very interesting recent work [17] on fast computational methods for $W_1$ in the unbalanced scalar case.
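In order to compare matrices of unequal trace we relax the constraint in (12), which forces $\operatorname{tr}(\rho_0)=\operatorname{tr}(\rho_1)$, by introducing a “source” term $v\in\mathcal{H}$. That is, we replace our continuity equation (12) with
$$\rho_0-\rho_1-\nabla_L^*u-v=0. \tag{13}$$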
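With this added source, we define a Wasserstein-1 distance in $\mathcal{H}_+$ as follows. Given $\rho_0,\rho_1\in\mathcal{H}_+$, we define
$$V_1(\rho_0,\rho_1) := \inf_{u\in\mathcal{S}^N,\ v\in\mathcal{H}}\left\{\|u\|_*+\alpha\|v\|_* \;\middle|\; \rho_0-\rho_1-\nabla_L^*u-v=0\right\}. \tag{14}$$
Here $\alpha>0$ measures the relative significance between $u$ and $v$.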
Another natural way to compare is by finding having equal trace that are close to in some norm (here taken to be the nuclear norm), as well as close to one another. More specifically, we seek to minimize
[TABLE]
Putting the two terms together we obtain the following definition of Wasserstein-1 distance
[TABLE]
It turns out these two relaxations of $W_1$ are in fact equivalent.
Theorem 3
With notation and assumptions as above,
[TABLE]
Proof:
Clearly, . On the other hand, let be a minimizer of (14), and with , i.e., are the negative and positive parts of respectively, then together with is a feasible solution to (16). With this solution,
[TABLE]
which implies that . This completes the proof. ∎
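The construction used in this proof, splitting the source term into its positive and negative parts and absorbing them into the two marginals, can be sketched as follows. This is an illustration under assumptions: the relaxed constraint is taken to be $\rho_0-\rho_1-\nabla_L^*u-v=0$, the helper names `pos_neg_parts` and `balance` are hypothetical, and the transported part $\nabla_L^*u$ defaults to zero only to keep the example short.

```python
import numpy as np


def pos_neg_parts(v):
    """Return (v_plus, v_minus), both positive semidefinite, with v = v_plus - v_minus."""
    w, Q = np.linalg.eigh(v)
    v_plus = (Q * np.clip(w, 0.0, None)) @ Q.conj().T
    v_minus = (Q * np.clip(-w, 0.0, None)) @ Q.conj().T
    return v_plus, v_minus


def balance(rho0, rho1, transported=None):
    """Absorb the source term v = rho0 - rho1 - transported into the marginals.

    `transported` stands for the traceless Hermitian matrix grad_L^* u produced
    by some flux u; it defaults to zero (no transport), an assumption made
    only for this sketch.
    """
    if transported is None:
        transported = np.zeros_like(rho0)
    v = rho0 - rho1 - transported
    v_plus, v_minus = pos_neg_parts(v)
    # Adding the negative part to rho0 and the positive part to rho1 keeps both
    # matrices positive semidefinite and equalizes their traces.
    return rho0 + v_minus, rho1 + v_plus, v


rho0 = np.array([[2.0, 0.5], [0.5, 1.0]])   # trace 3
rho1 = np.array([[1.0, 0.0], [0.0, 3.0]])   # trace 4
t0, t1, v = balance(rho0, rho1)

# The nuclear norm of v splits across the two adjustments.
nuc_v = np.linalg.norm(v, "nuc")
nuc_split = np.linalg.norm(t0 - rho0, "nuc") + np.linalg.norm(t1 - rho1, "nuc")
print(np.isclose(nuc_v, nuc_split), np.isclose(np.trace(t0), np.trace(t1)))
```

Because the positive and negative parts of a Hermitian matrix have orthogonal ranges, the nuclear norm of the source term equals the sum of the nuclear norms of the two adjustments, which is the step that makes the two costs match.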
Theorem 4
The formula (14) defines a metric on $\mathcal{H}_+$.
Proof:
The proof follows exactly the same lines as in Theorem 2. ∎
Using the technique of Lagrangian multipliers one can deduce the dual formulation of (12) and establish the following:
Theorem 5
Notation as above. Then
[TABLE]
Proof:
Straight calculation gives
[TABLE]
This together with the strong duality completes the proof. ∎
VI Wasserstein-1 distance for matrix-valued densities
With little effort we are able to generalize the definition of Wasserstein-1 distance to the space of matrix-valued densities. Examples of matrix-valued densities include power spectra of multivariate time series, stress tensors, diffusion tensors and so on; this motivates considering matrix-valued distributions over spatial domains of possibly more than one dimension.
Given two matrix-valued densities satisfying
[TABLE]
we can define their Wasserstein-1 distance as
[TABLE]
or through its dual
[TABLE]
For more general densities where condition (19) may not be valid, we define
[TABLE]
or, equivalently,
[TABLE]
One can introduce positive coefficients to trade-off the relative importance of and in establishing correspondence between the two distributions as follows:
[TABLE]
or, equivalently,
[TABLE]
VII Example
We use our framework to compare power spectra of multivariate time series (in discrete time). Evidently, the distance between two power spectra induces a distance between corresponding linear modeling filters and, thereby, can be used to compare (stable) MIMO systems [22].
Consider the three power spectra as shown in Figure 1 (in different colors). What is shown in the three subplots are power spectra of two time series (in subplots (a) and (c)) and their cross-spectrum (in subplot (b)) as functions of time (the phase of the cross spectra are not shown). Thus, the three different colors represent the three different matrix-valued power spectra given by:
[TABLE]
where
[TABLE]
The distances between each pair for different values, , and the choice with
[TABLE]
are tabulated in Table I. We observe that when the penalty on the rotation part is large (), we have and . On the other hand, when the penalty on translation is large relative to the cost of rotation (), we have and . These findings are in agreement with the intuition when observing the relative frequency directionality of power in the three spectra. More specifically, requires a significant drift in directionality before we can match it with the other two, while this is less important when comparing and . For this latter case, it is the actual frequency where the power resides that distinguishes the two while the directionality is more in agreement.
What this example underscores is the ability of the metric to be tailored to applications where we need to trade off and compromise, in a principled way, between two vastly different features of matrix-valued distributions, i.e., spatial location versus directionality of the “intensity.” What was achieved in this paper is the construction of a suitable and easily computable metric that can be utilized for this purpose.
VIII Future research
We introduced a generalization of the scalar $W_1$ distance to matrices and matrix-valued measures. This new metric is computationally simpler and more attractive than earlier metrics based on quadratic cost criteria. In fact, our “dual of the dual” formulation makes the metric especially attractive when comparing matrix-valued data on a discrete space (graph, network).
We note that the Wasserstein 1-metric has been used as a tool in defining curvature [24] and in analyzing the robustness of complex networks derived from scalar-valued data [27, 28]. The formalism presented in the current work suggests alternative notions of curvature and robustness when the nodes of a network carry matrix-valued data, e.g., in diffusion tensor imaging. We plan to pursue such issues in future work.
Acknowledgements
This project was supported by AFOSR grants (FA9550-15-1-0045 and FA9550-17-1-0435), grants from the National Center for Research Resources (P41-RR-013218) and the National Institute of Biomedical Imaging and Bioengineering (P41-EB-015902), National Science Foundation (NSF), and National Institutes of Health (P30-CA-008748 and 1U24CA18092401A1).
- [1] J.-D. Benamou and Y. Brenier, “A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem,” Numerische Mathematik 84 (2000), pp. 375-393.
- [2] E. Candes and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,” IEEE Trans. Inform. Theory 56:5 (2009), pp. 2053-2080.
- [3] E. Carlen and J. Maas, “Gradient flow and entropy inequalities for quantum Markov semigroups with detailed balance,” https://arxiv.org/abs/1609.01254, 2016.
- [4] Y. Chen, T. T. Georgiou, A. Tannenbaum, “Matrix optimal mass transport: a quantum mechanical approach,” https://arxiv.org/abs/1610.03041, 2016.
- [5] Y. Chen, T. T. Georgiou, M. Pavon, “On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint,” Journal of Optimization Theory and Applications 169:2 (2016), pp. 671-691.
- [6] Y. Chen, T. T. Georgiou, M. Pavon, “Optimal transport over a linear dynamical system,” IEEE Transactions on Automatic Control, to appear.
- [7] Y. Chen, T. T. Georgiou, M. Pavon, “Optimal steering of a linear stochastic system to a final probability distribution, Part I,” IEEE Transactions on Automatic Control 61:5 (2016), pp. 1158-1169.
- [8] Y. Chen, T. T. Georgiou, M. Pavon, “Optimal steering of a linear stochastic system to a final probability distribution, Part II,” IEEE Transactions on Automatic Control 61:5 (2016), pp. 1170-1180.
