Hausdorff and Wasserstein metrics on graphs and other structured data
Evan Patterson

TL;DR
This paper extends optimal transport metrics, including Wasserstein and Hausdorff, from set matching to structured data like graphs, enabling efficient, structure-preserving comparisons across diverse data types.
Contribution
It introduces a framework for Wasserstein and Hausdorff metrics on structured data, generalizing optimal transport to graphs and other structures via category theory.
Findings
Wasserstein metrics on structured data are convex relaxations of Hausdorff metrics.
The Wasserstein metric on $C$-sets is computable via linear programming.
The approach applies to various graph types and other structured data.
Abstract
Optimal transport is widely used in pure and applied mathematics to find probabilistic solutions to hard combinatorial matching problems. We extend the Wasserstein metric and other elements of optimal transport from the matching of sets to the matching of graphs and other structured data. This structure-preserving form of optimal transport relaxes the usual notion of homomorphism between structures. It applies to graphs, directed and undirected, labeled and unlabeled, and to any other structure that can be realized as a $\mathcal{C}$-set for some finitely presented category $\mathcal{C}$. We construct both Hausdorff-style and Wasserstein-style metrics on $\mathcal{C}$-sets and we show that the latter are convex relaxations of the former. Like the classical Wasserstein metric, the Wasserstein metric on $\mathcal{C}$-sets is the value of a linear program and is therefore efficiently computable.
Stanford University, Statistics Department
1. Introduction
How do you measure the distance between two graphs or, in a broader sense, quantify the similarity or dissimilarity of two graphs? Metrics and other dissimilarity measures are useful tools for mathematicians who study graphs, and also for practitioners in statistics, machine learning, and other fields who analyze graph-structured data. Many distances and dissimilarities have been proposed as part of the general study of graph matching [16, 39], yet current methods tend to suffer from one of two problems. Methods that fully exploit the graph structure, such as graph edit distances and related distances based on maximum common subgraphs [6], are generally NP-hard to compute and must be approximated by heuristic search algorithms. Methods based on efficiently computable graph substructures, such as random walks [26], shortest paths [4], or graphlets [43], are computationally tractable by design but are only sensitive to the particular substructure under consideration. Most kernel methods for graphs or other discrete structures fall into this category [21, 53]. Our aim is to construct a metric on graphs that fully accounts for the graph structure, but attains computational tractability in a principled way through convex relaxation.
The fundamental difficulty in graph matching is that the optimal correspondence of vertices between two graphs is unknown and must be estimated from a combinatorially large set of possibilities. The theory of optimal transport [51, 52, 37], now routinely used to find probabilistic matchings of metric spaces [35], suggests itself as a general strategy to circumvent this combinatorial problem. Several authors have proposed specific methods to match graphs or other structured data using optimal transport [2, 11, 36, 49].
Simplifying somewhat, current applications of optimal transport to graph matching draw on two major ideas, the Wasserstein distance between measures supported on a common metric space and the Gromov-Wasserstein distance between metric measure spaces [48, 35]. If you have a way of embedding the vertices of two graphs into a common metric space, say a Euclidean space, then you can compute the Wasserstein distance between these two subspaces [36]. Alternatively, if you have a way of converting each vertex set into its own metric space, then you can compute the Gromov-Wasserstein distance between these disjoint spaces. In the latter case, the distance between two vertices in a graph is often defined to be the length of the shortest path between them, although there are other possibilities. The two approaches, via the Wasserstein and Gromov-Wasserstein distances, can also be combined [49].
Methods of this style reduce the problem of matching graphs to that of matching metric spaces, and then apply the usual tools of optimal transport for metric matching. While this may suffice for some purposes, it is conceptually unsatisfying for the simple reason that graphs are not identical with metric spaces. Any information that cannot be encoded in the metric is lost to optimal transport. Thus, if we take the shortest path distance on vertices, the optimal coupling of vertices depends on the graph’s edges only through the lengths of the shortest paths.
Here we describe a form of optimal transport between graphs that makes no such reduction. Probabilistic mappings are established between both vertex and edge sets, and compatibility between the mappings is enforced according to the nearest analogue of graph homomorphism. These probabilistic graph homomorphisms are defined as solutions to linear programs and are therefore efficiently computable.
Our methodology is not ultimately about graphs, but about how the idea of a homomorphism, or structure-preserving map, can be deformed both probabilistically and metrically. We set forth a general notion of structure-preserving optimal transport that applies to a limited but important class of structures. This class encompasses directed, undirected, and bipartite graphs; graphs with vertex attributes, edge attributes, or both; simplicial sets, the higher-dimensional generalization of graphs; other variants of graphs, such as hypergraphs; and unrelated structures. The ensuing optimization problems are in all cases linear programs.
Other authors have proposed convex or otherwise tractable relaxations of graph matching, based on spectral methods [10], semidefinite programming [42], and doubly stochastic matrices [1]. Closest to ours is the last method, which relaxes the vertex permutation of a graph isomorphism into a doubly stochastic matrix. Unlike ours, this method does not straightforwardly generalize from graph isomorphism to graph homomorphism or to metrics on graphs, nor to graphs with vertex labels or to structures other than graphs.
Let us outline in more detail the major concepts and sections of the paper. In the mainly expository Section 2, we review $\mathcal{C}$-sets, the class of structures treated throughout. We give numerous examples of $\mathcal{C}$-sets, including several kinds of graphs. We also introduce their functorial semantics, a device we use incessantly to equip $\mathcal{C}$-sets with probabilistic and metric structure, beginning in Section 3. There we relax the $\mathcal{C}$-set homomorphism problem by replacing functions with their probabilistic analogue, Markov kernels. (Footnote 1: Readers unfamiliar with Markov kernels will find a brief review in Appendix A.) We arrive at a feasibility problem reminiscent of optimal transport, although it is expressed using Markov kernels instead of couplings.
We devote the rest of the paper to studying metrics on $\mathcal{C}$-sets, exact and probabilistic. Section 4 isolates the purely metric aspects of the problem. We establish a general method for lifting metrics on the hom-sets of a category $\mathcal{S}$ to a metric on $\mathcal{C}$-sets in $\mathcal{S}$, and we instantiate the theorem to define a Hausdorff-style metric on $\mathcal{C}$-sets. This metric, which generalizes the classical Hausdorff metric on subsets of a metric space, is generally hard to compute, but may be of theoretical interest.
We then set out to construct a Wasserstein-style metric on $\mathcal{C}$-sets, using the same framework. To do this, we must take a detour in Section 5 to define a Wasserstein metric on Markov kernels. The definition strikes us as very natural, though we can find no source for it in the literature. It is possibly of independent interest, being the probabilistic analogue of the $L^p$ metrics and the functional analogue of the usual Wasserstein metric. Finally, in Section 6, we bring together these threads to construct a Wasserstein metric on $\mathcal{C}$-sets. It is a convex relaxation of the Hausdorff metric on $\mathcal{C}$-sets and it is, like the classical optimal transport problem, expressible as a linear program.
The relations between the major concepts are summarized in Figure 1.1, which also serves as a dependency graph for the sections of the paper. If the meaning of this diagram is not now entirely clear, we hope that it will become so by the end.
2. Graphs, $\mathcal{C}$-sets, and functorial semantics
Graphs belong to a class of algebraic structures known as $\mathcal{C}$-sets. As a logical system, this class is extremely simple, yet it is broad enough to encompass a range of useful and important structures, such as directed and undirected graphs and their higher-dimensional generalizations. It also easily accommodates the attachment of extra data to arbitrary substructures, as in vertex- or edge-attributed graphs.
In this section we describe the essential elements of $\mathcal{C}$-sets, their morphisms, and their functorial semantics. We give many examples that should be useful in applications. Most of what we say appears in the literature on category theory [32, 33, 34, 38, 46, 47], but we assume of the reader nothing more than the definitions of a category, a functor, and a natural transformation.
Definition 2.1 ($\mathcal{C}$-sets).
Let $\mathcal{C}$ be a small category. A $\mathcal{C}$-set is a functor $X: \mathcal{C} \to \mathbf{Set}$ from the category $\mathcal{C}$ to the category $\mathbf{Set}$ of sets and functions. (Footnote 2: $\mathcal{C}$-sets are more commonly called “presheaves” and defined to be contravariant functors $\mathcal{C}^{\mathrm{op}} \to \mathbf{Set}$. The name “$\mathcal{C}$-sets” [38] arises from category actions, which generalize group actions ($G$-sets).)
Thus, a $\mathcal{C}$-set $X$ consists of, for every object $c$ in $\mathcal{C}$, a set $X(c)$, and for every morphism $f: c \to c'$ in $\mathcal{C}$, a function $X(f): X(c) \to X(c')$, such that the assignment of functions preserves composition and identities.
Our categories $\mathcal{C}$ will always have finite presentations, which we regard as logical theories. A $\mathcal{C}$-set is then an instance or model of that theory. A few examples will bring this out.
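Example 2.2 (Graphs).
The theory of graphs is the category $\mathrm{Th}(\mathrm{Graph})$ with two objects, $E$ and $V$, and two parallel morphisms, $\mathrm{src}$ and $\mathrm{tgt}$:
$$\mathrm{Th}(\mathrm{Graph}) = \left\{ E \underset{\mathrm{tgt}}{\overset{\mathrm{src}}{\rightrightarrows}} V \right\}$$
A functor $X: \mathrm{Th}(\mathrm{Graph}) \to \mathbf{Set}$ consists of a vertex set $X(V)$ and an edge set $X(E)$, together with source and target maps $X(\mathrm{src}), X(\mathrm{tgt}): X(E) \to X(V)$, which assign the source and target vertices of each edge. Thus, a $\mathrm{Th}(\mathrm{Graph})$-set is simply a graph.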
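To make the definition concrete, here is a minimal sketch of a finite graph stored as the data of such a functor: a vertex set, an edge set, and two functions from edges to vertices. The class name `Graph` and its fields are illustrative choices, not notation from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """A finite graph as a set-valued functor on the theory of graphs.

    Vertices are 0..n_vertices-1 and edges are 0..len(src)-1. The tuples
    src and tgt play the roles of the source and target maps: src[e] and
    tgt[e] are the endpoints of edge e.
    """
    n_vertices: int
    src: tuple
    tgt: tuple

    def __post_init__(self):
        # Both maps must be defined on the same edge set ...
        if len(self.src) != len(self.tgt):
            raise ValueError("src and tgt must have one entry per edge")
        # ... and must land in the vertex set.
        for v in tuple(self.src) + tuple(self.tgt):
            if not 0 <= v < self.n_vertices:
                raise ValueError("edge endpoint is not a vertex")

    @property
    def n_edges(self):
        return len(self.src)


# Two parallel edges 0 -> 1 and a self-loop at 1. Both are allowed, since a
# graph in this sense is a directed multigraph.
g = Graph(n_vertices=2, src=(0, 0, 1), tgt=(1, 1, 1))
```

Nothing beyond well-definedness of the two maps needs checking, because the theory of graphs imposes no equations on its morphisms.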
In this paper, a “graph” without qualification is a directed graph, possibly with multiple edges and self-loops (Figure 2.1). (Footnote 3: Such graphs are also called “directed pseudographs” by graph theorists and “quivers” by representation theorists.) Different kinds of graphs arise as $\mathcal{C}$-sets for different categories $\mathcal{C}$.
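Example 2.3 (Symmetric graphs).
The theory of symmetric graphs extends the theory of graphs with an edge involution:
$$\left\{ E \underset{\mathrm{tgt}}{\overset{\mathrm{src}}{\rightrightarrows}} V, \quad \mathrm{inv}: E \to E \;\middle|\; \mathrm{inv} \circ \mathrm{inv} = 1_E, \quad \mathrm{src} \circ \mathrm{inv} = \mathrm{tgt}, \quad \mathrm{tgt} \circ \mathrm{inv} = \mathrm{src} \right\}$$
A set-valued functor $X$ on this category, or symmetric graph, is a graph endowed with an involution on edges, that is, a self-inverse, orientation-reversing edge map $X(\mathrm{inv}): X(E) \to X(E)$. Loosely speaking, a symmetric graph is a graph in which every edge has a matching edge going in the opposite direction (Figure 2.2). Symmetric graphs are essentially the same as undirected graphs. (Footnote 4: In the absence of self-loops, symmetric graphs correspond one-to-one with undirected graphs, but a self-loop in an undirected graph has two possible representations in a symmetric graph: it can be fixed or not under the involution.)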
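Before giving further examples, we define the notion of homomorphism appropriate for $\mathcal{C}$-sets.
Definition 2.4 ($\mathcal{C}$-set morphisms).
Let $\mathcal{C}$ be a small category and let $X$ and $Y$ be $\mathcal{C}$-sets. A morphism of $\mathcal{C}$-sets from $X$ to $Y$ is a natural transformation $\phi: X \to Y$.
Thus, a $\mathcal{C}$-set morphism $\phi: X \to Y$ assigns a function $\phi_c: X(c) \to Y(c)$ to each object $c$ in $\mathcal{C}$ in such a way that for every morphism $f: c \to c'$ in $\mathcal{C}$, there is a commutative diagram
$$
\begin{array}{ccc}
X(c) & \xrightarrow{\;X(f)\;} & X(c') \\
{\scriptstyle \phi_c} \downarrow & & \downarrow {\scriptstyle \phi_{c'}} \\
Y(c) & \xrightarrow{\;Y(f)\;} & Y(c')
\end{array}
$$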
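A morphism of graphs $\phi: X \to Y$, according to this definition, is a graph homomorphism as ordinarily understood, consisting of a vertex map $\phi_V: X(V) \to Y(V)$ and an edge map $\phi_E: X(E) \to Y(E)$ that preserves the assignment of source and target vertices:
$$
\begin{array}{ccc}
X(E) & \xrightarrow{\;\phi_E\;} & Y(E) \\
{\scriptstyle \mathrm{src}} \downarrow & & \downarrow {\scriptstyle \mathrm{src}} \\
X(V) & \xrightarrow{\;\phi_V\;} & Y(V)
\end{array}
\qquad\qquad
\begin{array}{ccc}
X(E) & \xrightarrow{\;\phi_E\;} & Y(E) \\
{\scriptstyle \mathrm{tgt}} \downarrow & & \downarrow {\scriptstyle \mathrm{tgt}} \\
X(V) & \xrightarrow{\;\phi_V\;} & Y(V)
\end{array}
$$
In the commutative diagrams, we adopt the convention of writing simply $f$ for $X(f)$ or $Y(f)$ where no confusion can arise. Similarly, a morphism of symmetric graphs is a graph homomorphism that preserves the edge involution.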
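As an illustration, the two naturality squares for graphs reduce to a pair of equalities per edge. The sketch below checks them for finite graphs given as lists `src` and `tgt` indexed by edge; the function and variable names are hypothetical.

```python
def is_graph_homomorphism(src_x, tgt_x, src_y, tgt_y, phi_v, phi_e):
    """Check the two naturality squares for a candidate graph morphism.

    phi_v[v] is the image of vertex v and phi_e[e] the image of edge e.
    The squares commute when mapping an edge and then taking its endpoint
    agrees with taking the endpoint and then mapping the vertex.
    """
    return all(
        phi_v[src_x[e]] == src_y[phi_e[e]] and phi_v[tgt_x[e]] == tgt_y[phi_e[e]]
        for e in range(len(src_x))
    )


# X is the path 0 -> 1 -> 2 and Y is the directed 2-cycle 0 -> 1 -> 0.
src_x, tgt_x = [0, 1], [1, 2]
src_y, tgt_y = [0, 1], [1, 0]

# Wrapping the path around the cycle is a homomorphism ...
wrap = is_graph_homomorphism(src_x, tgt_x, src_y, tgt_y,
                             phi_v=[0, 1, 0], phi_e=[0, 1])
# ... but sending the last vertex to 1 breaks the target square.
broken = is_graph_homomorphism(src_x, tgt_x, src_y, tgt_y,
                               phi_v=[0, 1, 1], phi_e=[0, 1])
```

Verifying a given candidate is cheap; the difficulty discussed later lies in searching over all candidates.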
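The next example shows that different categories $\mathcal{C}$ can define essentially the same $\mathcal{C}$-sets while yielding genuinely different $\mathcal{C}$-set morphisms.
Example 2.5 (Reflexive graphs).
The theory of reflexive graphs is
$$\left\{ E \underset{\mathrm{tgt}}{\overset{\mathrm{src}}{\rightrightarrows}} V, \quad \mathrm{refl}: V \to E \;\middle|\; \mathrm{src} \circ \mathrm{refl} = 1_V = \mathrm{tgt} \circ \mathrm{refl} \right\}$$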
A reflexive graph is a graph whose every vertex is endowed with a distinguished loop (Figure 2.3). As objects, reflexive graphs are the same as graphs, inasmuch as they are in one-to-one correspondence with each other. However, morphisms of reflexive graphs can “collapse” edges into vertices by mapping them onto distinguished loops, a possibility not permitted of a graph homomorphism. For this reason reflexive graph morphisms are sometimes called “degenerate maps.”
Symmetry and reflexivity combine straightforwardly in symmetric reflexive graphs, in which the distinguished loops are fixed by the edge involution. Bipartite graphs form another important class of graphs.
Example 2.6 (Bipartite graphs).
The theory of bipartite graphs is
$$\left\{ U \xleftarrow{\;\mathrm{src}\;} E \xrightarrow{\;\mathrm{tgt}\;} V \right\}$$
A bipartite graph $X$ consists of two vertex sets, $X(U)$ and $X(V)$, and a set $X(E)$ of edges with sources in $X(U)$ and targets in $X(V)$ (Figure 2.4). A morphism $\phi$ of bipartite graphs has two vertex maps, $\phi_U$ and $\phi_V$, and an edge map $\phi_E$ that preserves the source and target vertices.
We have not exhausted the list of graph-like structures that can be defined as $\mathcal{C}$-sets. For example, hypergraphs, which generalize graphs by allowing edges with multiple sources and multiple targets, are $\mathcal{C}$-sets [18, 46]. So are simplicial sets, the higher-dimensional analogue of graphs and combinatorial analogue of simplicial complexes.
Example 2.7 (Semi-simplicial sets).
The semi-simplicial category, truncated to two dimensions, is
[TABLE]
A set-valued functor on this category, or two-dimensional semi-simplicial set, is a collection of triangles, edges, and vertices. Each triangle has three edges, in a definite order, and each edge has two vertices, also in a definite order, in such a way that the induced assignment of vertices to triangles is consistent, according to the simplicial identities (Figure 2.5).
Semi-simplicial sets up to any dimension $n$, or in all dimensions $n \geq 0$, can be defined as $\mathcal{C}$-sets, as can several other kinds of simplicial sets [17, 19, 46]. We will not present the simplicial categories here, but we summarize the idea that graphs are one-dimensional simplicial sets in the table below.
[TABLE]
We take this list of examples to establish that many graph-like structures can be represented as $\mathcal{C}$-sets. The next example is rather trivial, but later we use its attributed variant to recover the classical Hausdorff and Wasserstein distances as special cases of metrics on $\mathcal{C}$-sets.
Example 2.8 (Sets).
If $\mathcal{C}$ is the discrete category on one object, then $\mathcal{C}$-sets are sets and morphisms of $\mathcal{C}$-sets are functions.
Example 2.9 (Bisets).
If $\mathcal{C}$ is the discrete category on two objects, then $\mathcal{C}$-sets are pairs of sets and morphisms of $\mathcal{C}$-sets are pairs of functions.
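Example 2.10 (Dynamical systems).
The theory of discrete dynamical systems is
$$\left\{\, T: * \to * \,\right\}$$
that is, the category freely generated by a single object $*$ and a single endomorphism $T$. A discrete dynamical system is a set $X$ together with a function $T: X \to X$. The set $X$ is the state space of the system and the transformation $T$ defines the transitions between states.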
In applications, graphs and other structures often bear additional data in the form of discrete labels or continuous measurements. Data attributes are easily attached to $\mathcal{C}$-sets by extending the theory $\mathcal{C}$.
Example 2.11 (Attributed sets).
The theory of attributed sets is
$$\left\{ X \xrightarrow{\;\mathrm{attr}\;} A \right\}$$
An attributed set is thus a set $X$ equipped with a map $\mathrm{attr}: X \to A$ that assigns to each element $x \in X$ an attribute value $\mathrm{attr}(x) \in A$.
Example 2.12 (Vertex-attributed graphs).
The theory of vertex-attributed graphs is
$$\left\{ E \underset{\mathrm{tgt}}{\overset{\mathrm{src}}{\rightrightarrows}} V \xrightarrow{\;\mathrm{attr}\;} A \right\}$$
A vertex-attributed graph is a graph equipped with a map $\mathrm{attr}: V \to A$ that assigns an attribute to each vertex. Such graphs are usually called “vertex-labeled” when the attribute set is discrete.
Edge-attributed, or vertex- and edge-attributed, graphs can be defined similarly. Indeed, any number of attributes can be attached to any substructure of a $\mathcal{C}$-set, making the class of $\mathcal{C}$-sets closed under attachment of extra data.
Often all the $\mathcal{C}$-sets under consideration take attributes in a common space $A$, such as a fixed set of labels or a Euclidean space. In this case, we restrict the $\mathcal{C}$-set morphisms to those whose attribute maps are the identity $1_A$. We will note when this restriction is in force by describing the attribute space as fixed.
At this juncture, the reader may wonder what is gained by the formalism of $\mathcal{C}$-sets over, say, an equational fragment of first-order logic or even ordinary, informal mathematics. This question has many valid answers, but the most pertinent is that viewing theories as categories in their own right makes it extremely simple to define models with extra structure, be it topological, metric, measure-theoretic, or otherwise. We simply replace the category $\mathbf{Set}$ of sets and functions with a category having that extra structure.
Definition 2.13 (Functorial semantics).
Let $\mathcal{C}$ be a small category and let $\mathcal{S}$ be any category. The functor category $[\mathcal{C}, \mathcal{S}]$ has functors $\mathcal{C} \to \mathcal{S}$ as objects and natural transformations between them as morphisms. We call the objects of this category $\mathcal{S}$-valued $\mathcal{C}$-sets or $\mathcal{C}$-sets in $\mathcal{S}$.
Functorial semantics goes back to Lawvere’s pioneering thesis [30]. When $\mathcal{S} = \mathbf{Set}$, we recover the original definitions of $\mathcal{C}$-sets and $\mathcal{C}$-set morphisms. So, in our main example, the category of graphs and graph homomorphisms is the category of functors from the theory of graphs $\mathrm{Th}(\mathrm{Graph})$ of Example 2.2 to $\mathbf{Set}$, that is,
$$\mathbf{Graph} := [\mathrm{Th}(\mathrm{Graph}), \mathbf{Set}].$$
The starting point for much subsequent development is the category $\mathbf{Meas}$ of measurable spaces and measurable functions, defined more carefully below. A measurable $\mathcal{C}$-set, or $\mathcal{C}$-set in $\mathbf{Meas}$, is a $\mathcal{C}$-set whose internal sets are equipped with $\sigma$-algebras and whose internal maps are measurable with respect to these $\sigma$-algebras. We will introduce other categories as we need them. Throughout the paper we explore the consequences of enriching graphs and other $\mathcal{C}$-sets with metrics, measures, and Markov morphisms.
3. Markov morphisms of measurable $\mathcal{C}$-spaces
On the topic of matching of $\mathcal{C}$-sets, the seemingly most elementary question one can ask is:
Problem 3.1 ($\mathcal{C}$-set homomorphism).
Given $\mathcal{C}$-sets $X$ and $Y$, does there exist a $\mathcal{C}$-set morphism $\phi: X \to Y$?
When $\mathcal{C}$ is the discrete category on one object, so that $\mathcal{C}$-sets are just sets, the problem is trivial. For nonempty $X$, a function $X \to Y$ exists if and only if the codomain $Y$ is nonempty. But for other categories $\mathcal{C}$ the problem is computationally hard. The graph homomorphism problem, occurring when $\mathcal{C}$ is the theory of graphs, is a famous NP-complete problem. (Footnote 5: The graph homomorphism problem usually refers to undirected graphs, but it is no easier for directed graphs [22, Proposition 5.10].) In the case of reflexive graphs, the homomorphism problem becomes trivial again, because for nonempty $X$ there is a reflexive graph morphism $X \to Y$ if and only if the codomain $Y$ is nonempty (contains a vertex). Yet the same cannot be said of similar matching problems, such as:
Problem 3.2 ($\mathcal{C}$-set isomorphism).
Given $\mathcal{C}$-sets $X$ and $Y$, does there exist a $\mathcal{C}$-set isomorphism $X \cong Y$?
The isomorphism problem for reflexive graphs is equivalent to the graph isomorphism problem, so is once again computationally hard. In summary, while the complexity depends on the category $\mathcal{C}$ and on the specific $\mathcal{C}$-sets under consideration, it is generally computationally intractable to find $\mathcal{C}$-set morphisms, to say nothing of enumerating them or optimizing over them.
A popular strategy for solving hard combinatorial problems, especially when inexact solutions are acceptable, is to relax the problem to a continuous one that is easier to solve. Functorial semantics offer a simple way to implement this strategy: replace the category $\mathbf{Set}$ with a category having better computational properties. In what will be a recurring theme, we replace categories of functions with categories of Markov kernels, which are the probabilistic analogue of functions. The reader unfamiliar with Markov kernels will find references and a short review in Appendix A.
Definition 3.3 (Category of Markov kernels).
The category $\mathbf{Markov}$ has Polish measurable spaces as objects and Markov kernels between them as morphisms. (Footnote 6: In other words, the objects are topological spaces homeomorphic to a complete, separable metric space, equipped with their Borel $\sigma$-algebras. Many results hold under weaker or no assumptions on the measurable spaces, but for simplicity we assume this regularity condition everywhere.)
Example 3.4 (Markov chains).
A discrete dynamical system in $\mathbf{Markov}$ is a Markov chain. Morphisms of Markov chains, as stipulated by this definition, are known to probabilists as intertwinings [55, 14].
Functions are Markov kernels that contain no randomness. To be more formal, a Markov kernel $M: X \to Y$ is deterministic if for every $x \in X$, the distribution $M(x)$ is a Dirac delta measure. Given any measurable function $f: X \to Y$, a deterministic Markov kernel $\mathcal{M}(f): X \to Y$ is defined by $\mathcal{M}(f)(x) := \delta_{f(x)}$, and every deterministic Markov kernel arises uniquely in this way. Measurable functions can therefore be identified with deterministic Markov kernels. Moreover, the identification is functorial. Given composable measurable functions $f: X \to Y$ and $g: Y \to Z$, one easily checks that $\mathcal{M}(g \circ f) = \mathcal{M}(g) \circ \mathcal{M}(f)$. Also, $\mathcal{M}(1_X) = 1_X$. We summarize these statements by saying that $\mathcal{M}: \mathbf{Meas} \to \mathbf{Markov}$ is an identity-on-objects embedding functor. In what follows we will not always distinguish notationally between a function $f$ and its corresponding Markov kernel $\mathcal{M}(f)$.
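On finite sets the identification of functions with deterministic kernels can be checked by hand. In the sketch below, a kernel is a row-stochastic matrix stored as a list of rows, a function is a list of output indices, and composing kernels is matrix multiplication with the first kernel on the left. The helper names `kernel_of` and `compose` are illustrative.

```python
def kernel_of(f, n_out):
    """Matrix of the deterministic kernel of f: {0..len(f)-1} -> {0..n_out-1}.

    Row x is the point mass at f[x], so every row is one-hot.
    """
    return [[1 if f[x] == y else 0 for y in range(n_out)] for x in range(len(f))]


def compose(m, n):
    """Kernel "m, then n", computed as the matrix product m . n."""
    return [
        [sum(m[i][k] * n[k][j] for k in range(len(n))) for j in range(len(n[0]))]
        for i in range(len(m))
    ]


f = [1, 1, 0]          # f: {0, 1, 2} -> {0, 1}
g = [2, 0]             # g: {0, 1} -> {0, 1, 2}
g_after_f = [g[f[x]] for x in range(len(f))]

# Functoriality: the kernel of the composite is the composite of the kernels.
lhs = kernel_of(g_after_f, 3)
rhs = compose(kernel_of(f, 2), kernel_of(g, 3))
```

Here `lhs` and `rhs` coincide, which is the finite case of the functoriality statement above.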
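Let us now make precise the relaxation of the $\mathcal{C}$-set homomorphism problem. Strictly speaking, the relaxation is not from $\mathcal{C}$-sets, but from measurable $\mathcal{C}$-sets.
Definition 3.5 (Measurable $\mathcal{C}$-spaces).
The category $\mathbf{Meas}$ has Polish measurable spaces as objects and measurable functions as morphisms.
A measurable $\mathcal{C}$-space is a $\mathcal{C}$-set in $\mathbf{Meas}$.
The sought-after relaxation is a nearly immediate consequence of the embedding of $\mathbf{Meas}$ in $\mathbf{Markov}$.
Definition 3.6 (Markov morphisms).
A Markov morphism $\Phi: X \to Y$ of measurable $\mathcal{C}$-spaces $X$ and $Y$ consists of a Markov kernel $\Phi_c: X(c) \to Y(c)$ for each object $c \in \mathcal{C}$, such that for every morphism $f: c \to c'$ in $\mathcal{C}$, the diagram
$$
\begin{array}{ccc}
X(c) & \xrightarrow{\;X(f)\;} & X(c') \\
{\scriptstyle \Phi_c} \downarrow & & \downarrow {\scriptstyle \Phi_{c'}} \\
Y(c) & \xrightarrow{\;Y(f)\;} & Y(c')
\end{array}
$$
in $\mathbf{Markov}$ commutes.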
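Proposition 3.7 (Relaxation of $\mathcal{C}$-set homomorphism).
Given measurable $\mathcal{C}$-spaces $X$ and $Y$, the problem of finding a Markov morphism $\Phi: X \to Y$ is a convex relaxation of the problem of finding a (measurable) morphism $\phi: X \to Y$.
Proof.
The proposition makes two assertions, concerning relaxation and convexity.
To establish the relaxation, observe that if $\phi: X \to Y$ is a measurable morphism, then its image $\mathcal{M}(\phi) := (\mathcal{M}(\phi_c))_{c \in \mathcal{C}}$ under the embedding $\mathcal{M}: \mathbf{Meas} \to \mathbf{Markov}$ is a Markov morphism, because, by functoriality, the embedding preserves naturality squares:
$$\phi_{c'} \circ X(f) = Y(f) \circ \phi_c \quad \Longrightarrow \quad \mathcal{M}(\phi_{c'}) \circ \mathcal{M}(X(f)) = \mathcal{M}(Y(f)) \circ \mathcal{M}(\phi_c).$$
Thus, if the measurable morphism problem has a solution, so does the Markov morphism problem. To state the argument more pithily, the functor $\mathcal{M}: \mathbf{Meas} \to \mathbf{Markov}$ induces a functor $[\mathcal{C}, \mathbf{Meas}] \to [\mathcal{C}, \mathbf{Markov}]$ by post-composition.
As for the convexity, the Markov morphism problem,
$$
\begin{aligned}
\text{find} \quad & \Phi_c \in \mathbf{Markov}(X(c), Y(c)), \quad c \in \mathcal{C} \\
\text{s.t.} \quad & \Phi_{c'} \circ X(f) = Y(f) \circ \Phi_c, \quad f: c \to c' \text{ in } \mathcal{C},
\end{aligned}
$$
is a convex feasibility problem, possibly in infinite dimensions. The variables, namely Markov kernels indexed by $c \in \mathcal{C}$, form a convex space, and the constraints are linear in the variables. ∎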
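One way to think about this result is that the constraints defining a $\mathcal{C}$-set morphism, which are only formally linear, become actually linear upon relaxation. In the case of greatest practical interest, when the $\mathcal{C}$-sets are finite, the result is a linear program.
To make this as transparent as possible, let us write out the linear program. Given finite $\mathcal{C}$-sets $X$ and $Y$, identify a function $X(f): X(c) \to X(c')$ with a binary matrix $X(f) \in \{0,1\}^{|X(c)| \times |X(c')|}$ whose rows sum to 1, and likewise for $Y(f)$, and identify a Markov kernel $\Phi_c: X(c) \to Y(c)$ with a right stochastic matrix $\Phi_c \in \mathbb{R}^{|X(c)| \times |Y(c)|}$. Under this identification the kernel applied first is written on the left of a product. The Markov morphism problem is then
$$
\begin{aligned}
\text{find} \quad & \Phi_c \in \mathbb{R}^{|X(c)| \times |Y(c)|}, \quad c \in \mathcal{C} \\
\text{s.t.} \quad & X(f) \cdot \Phi_{c'} = \Phi_c \cdot Y(f), \quad f: c \to c' \text{ in } \mathcal{C} \\
& \Phi_c \cdot \mathbb{1} = \mathbb{1}, \quad \Phi_c \geq 0, \quad c \in \mathcal{C},
\end{aligned}
$$
where $\cdot$ denotes the usual matrix multiplication and $\mathbb{1}$ denotes the column vector of all 1’s (whose dimensionality is left implicit in the notation). This feasibility problem is a linear program with linear equality constraints and nonnegativity constraints. (Footnote 7: The category $\mathcal{C}$ may contain infinitely many morphisms, but it suffices to enforce naturality on a generating set of morphisms. Thus, assuming $\mathcal{C}$ is finitely presented, we can always write the linear program with finitely many constraints.)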
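For graphs, the feasibility problem can be handed directly to an off-the-shelf LP solver. The following is a minimal sketch, not the paper's implementation: it assumes NumPy and SciPy are available, encodes a graph as a hypothetical triple `(n_vertices, src, tgt)`, and uses a zero objective so that `scipy.optimize.linprog` only searches for a feasible point. The unknowns are a row-stochastic vertex matrix `P_V` and edge matrix `P_E`, constrained so that composing with source and target maps on either side agrees.

```python
import numpy as np
from scipy.optimize import linprog


def markov_graph_morphism(x, y):
    """Find a Markov morphism between finite graphs, or return None.

    Each graph is a triple (n_vertices, src, tgt), with src[e] and tgt[e]
    the endpoints of edge e. The entries of P_V and P_E are flattened into
    one vector of LP variables.
    """
    (nvx, sx, tx), (nvy, sy, ty) = x, y
    nex, ney = len(sx), len(sy)
    n_v = nvx * nvy                       # number of entries of P_V
    n_vars = n_v + nex * ney

    def v(i, j):                          # position of P_V[i, j]
        return i * nvy + j

    def e(a, b):                          # position of P_E[a, b]
        return n_v + a * ney + b

    rows, rhs = [], []

    def add(coeffs, value):               # record one equality constraint
        row = np.zeros(n_vars)
        for index, coeff in coeffs:
            row[index] += coeff
        rows.append(row)
        rhs.append(value)

    # Naturality, entry by entry: (X(f) . P_V)[a, j] == (P_E . Y(f))[a, j]
    # for f in {src, tgt}, every edge a of x and every vertex j of y.
    for fx, fy in ((sx, sy), (tx, ty)):
        for a in range(nex):
            for j in range(nvy):
                pushed = [(e(a, b), -1.0) for b in range(ney) if fy[b] == j]
                add([(v(fx[a], j), 1.0)] + pushed, 0.0)
    # Every row of P_V and of P_E sums to one.
    for i in range(nvx):
        add([(v(i, j), 1.0) for j in range(nvy)], 1.0)
    for a in range(nex):
        add([(e(a, b), 1.0) for b in range(ney)], 1.0)

    # Zero objective: only feasibility matters. The bounds enforce P >= 0.
    result = linprog(np.zeros(n_vars), A_eq=np.array(rows), b_eq=np.array(rhs),
                     bounds=(0, None), method="highs")
    if result.status != 0:
        return None
    return result.x[:n_v].reshape(nvx, nvy), result.x[n_v:].reshape(nex, ney)


loop = (1, [0], [0])
directed_cycle = (3, [0, 1, 2], [1, 2, 0])
parallel_pair = (2, [0, 0], [1, 1])       # two edges 0 -> 1: a cycle, not directed

p_v, p_e = markov_graph_morphism(loop, directed_cycle)
none_found = markov_graph_morphism(loop, parallel_pair)
```

For the loop mapped into the directed 3-cycle, the constraints force both kernels to be uniform, so the solver has only one feasible point to return. For the pair of parallel edges the constraints are contradictory and the function returns `None`. The dense constraint matrix is chosen for readability; a sparse encoding would scale better.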
It will be helpful to see how Markov morphisms behave in a concrete situation.
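Example 3.8 (Markov morphisms of graphs).
Let $X$ and $Y$ be finite graphs. A Markov morphism $\Phi: X \to Y$ is a Markov kernel on vertices, $\Phi_V: X(V) \to Y(V)$, and a Markov kernel on edges, $\Phi_E: X(E) \to Y(E)$, such that the two diagrams
$$
\begin{array}{ccc}
X(E) & \xrightarrow{\;\Phi_E\;} & Y(E) \\
{\scriptstyle \mathrm{src}} \downarrow & & \downarrow {\scriptstyle \mathrm{src}} \\
X(V) & \xrightarrow{\;\Phi_V\;} & Y(V)
\end{array}
\qquad\qquad
\begin{array}{ccc}
X(E) & \xrightarrow{\;\Phi_E\;} & Y(E) \\
{\scriptstyle \mathrm{tgt}} \downarrow & & \downarrow {\scriptstyle \mathrm{tgt}} \\
X(V) & \xrightarrow{\;\Phi_V\;} & Y(V)
\end{array}
$$
in $\mathbf{Markov}$ commute. Since the vertex and edge maps are nondeterministic, it does not make sense to ask that the source and target vertices be preserved exactly, as in a graph homomorphism. The naturality squares assert the next best thing, that for every edge $e$ in $X$, the distribution of $e$’s source vertex under $\Phi_V$ is equal to that of the source vertices in the edge distribution of $e$ under $\Phi_E$, and similarly for target vertices.
We bring out the difference between graph homomorphisms and Markov morphisms in a series of examples. Between the graphs $X$ and $Y$ of Figure 3.1, there are two graph homomorphisms $\phi_1, \phi_2: X \to Y$, corresponding to the two directed paths in $Y$. Both are, of course, Markov morphisms, as is any mixture $t \phi_1 + (1-t) \phi_2$, where $0 \leq t \leq 1$. In this case, every Markov morphism is a mixture of graph homomorphisms, so little is lost (or gained) by the relaxation. Figure 3.2 presents a similar picture. The graph $X$ is a loop and the graph $Y$ is a cycle, though not a directed one. There are no Markov graph morphisms from $X$ to $Y$, deterministic or otherwise.
Figure 3.3 looks superficially similar, with the graph $Y$ now a directed cycle, but the outcome is more interesting. As before, there is no graph homomorphism from $X$ to $Y$, but there is a Markov morphism. In fact, if $Y$ is the directed cycle of length $n$, then for any $n \geq 1$, a Markov morphism $\Phi: X \to Y$ is given by assigning the uniform distributions on all vertices and edges:
$$\Phi_V(v) = \frac{1}{n} \sum_{v' \in Y(V)} \delta_{v'} \quad (v \in X(V)), \qquad \Phi_E(e) = \frac{1}{n} \sum_{e' \in Y(E)} \delta_{e'} \quad (e \in X(E)).$$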
In particular, it is sometimes possible to find a Markov graph morphism that is not a mixture of deterministic morphisms, proving that the notion of Markov homomorphism is genuinely weaker than graph homomorphism. That should not be surprising, given that the graph homomorphism problem is NP-hard, while the Markov graph morphism problem is a linear program, hence solvable in polynomial time.
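The gap between the two notions can be checked exactly on a small instance. The sketch below assumes the source graph is a single loop and the target is the directed cycle of length 4; it enumerates graph homomorphisms by brute force and separately verifies, in exact rational arithmetic, that the uniform kernels satisfy both naturality conditions. Graphs are again hypothetical triples `(n_vertices, src, tgt)`.

```python
from fractions import Fraction
from itertools import product


def homomorphisms(x, y):
    """Enumerate all graph homomorphisms x -> y by brute force (exponential)."""
    (nvx, sx, tx), (nvy, sy, ty) = x, y
    found = []
    for pv in product(range(nvy), repeat=nvx):
        for pe in product(range(len(sy)), repeat=len(sx)):
            if all(pv[sx[a]] == sy[pe[a]] and pv[tx[a]] == ty[pe[a]]
                   for a in range(len(sx))):
                found.append((pv, pe))
    return found


def is_markov_morphism(x, y, p_v, p_e):
    """Check both naturality conditions for row-stochastic p_v and p_e."""
    (nvx, sx, tx), (nvy, sy, ty) = x, y
    for fx, fy in ((sx, sy), (tx, ty)):
        for a in range(len(sx)):
            for j in range(nvy):
                # Mass that the edge distribution of a puts on endpoint j.
                pushed = sum(p_e[a][b] for b in range(len(sy)) if fy[b] == j)
                if p_v[fx[a]][j] != pushed:
                    return False
    return True


n = 4
loop = (1, [0], [0])
cycle = (n, list(range(n)), [(i + 1) % n for i in range(n)])
uniform = [[Fraction(1, n)] * n]      # one row: the loop has one vertex, one edge

no_homs = homomorphisms(loop, cycle)
ok = is_markov_morphism(loop, cycle, uniform, uniform)
```

The enumeration comes back empty while the uniform kernels pass the check, so on this instance the Markov morphism is not a mixture of deterministic morphisms.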
Finally, Figure 3.4 shows the terminal graph for both deterministic and Markov morphisms. Any graph has a unique graph homomorphism, indeed a unique Markov morphism, into the loop.
One might suppose that the strategy for relaxing $\mathcal{C}$-set homomorphism carries over directly to $\mathcal{C}$-set isomorphism, but that is not so. The constraints imposed by isomorphism are bilinear, not linear, so convexity would be lost in a direct translation. The problem is not just computational, though, as the following result shows.
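Proposition 3.9 (Isomorphism in $\mathbf{Markov}$ [3]).
All isomorphisms in $\mathbf{Markov}$ are deterministic. That is, any Markov kernels $M: X \to Y$ and $N: Y \to X$ between Polish measurable spaces satisfying $N \circ M = 1_X$ and $M \circ N = 1_Y$ have the form $M = \mathcal{M}(f)$ and $N = \mathcal{M}(g)$ for measurable functions $f: X \to Y$ and $g: Y \to X$, where $\mathcal{M}(f)$ denotes the deterministic kernel $x \mapsto \delta_{f(x)}$.
Proof.
Under the given assumptions, the extreme points of the convex set $\mathrm{Prob}(X)$ of probability measures on $X$ are exactly the point masses $\delta_x$, $x \in X$ [45, Example 8.16]. If $M: X \to Y$ is an isomorphism in $\mathbf{Markov}$, then it acts as a linear isomorphism from $\mathrm{Prob}(X)$ to $\mathrm{Prob}(Y)$ and hence carries the extreme points of $\mathrm{Prob}(X)$ onto the extreme points of $\mathrm{Prob}(Y)$. Thus, for every $x \in X$, there exists $y \in Y$ such that $M(x) = \delta_y$, which proves that $M$ is deterministic. ∎
As there is nothing to be gained, computationally or mathematically, by looking at the isomorphisms in $\mathbf{Markov}$, we will formulate the isomorphism problem in a different way. Let $X$ and $Y$ be finite $\mathcal{C}$-sets and equip every set $X(c)$ and $Y(c)$ with the counting measure. A $\mathcal{C}$-set morphism $\phi: X \to Y$ is an isomorphism if and only if each component $\phi_c: X(c) \to Y(c)$ is an isomorphism of sets (a bijection), and this happens if and only if each component preserves the counting measure, that is, for every subset $B$ of $Y(c)$, the set $B$ and its preimage $\phi_c^{-1}(B)$ are of the same size. With this motivation, we define: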
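The equivalence between bijections and maps preserving the counting measure is easy to test on finite sets. Because the counting measure is additive, it is enough to compare sizes on singletons, that is, to check that every point of the codomain has exactly one preimage. The function name below is illustrative.

```python
from collections import Counter


def preserves_counting_measure(f, n_out):
    """True when f: {0..len(f)-1} -> {0..n_out-1} preserves counting measure.

    It suffices to check singletons: each point must have exactly one
    preimage, since the size of any preimage is the sum of the fiber sizes.
    """
    fibers = Counter(f)
    return all(fibers[y] == 1 for y in range(n_out))


bijection = preserves_counting_measure([2, 0, 1], 3)       # a permutation
collapsing = preserves_counting_measure([0, 0, 1], 3)      # not injective
not_onto = preserves_counting_measure([0, 1], 3)           # not surjective
```

Only the permutation passes, matching the claim that measure preservation singles out the bijections.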
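Definition 3.10 (Measure $\mathcal{C}$-spaces).
The category of measure spaces has $\sigma$-finite measures on Polish measurable spaces as objects and measurable maps as morphisms. The category of measure spaces and Markov kernels has the same objects and Markov kernels as morphisms.
A measure $\mathcal{C}$-space is a $\mathcal{C}$-set in the category of measure spaces.
Recall that a measurable map $f: (X, \mu_X) \to (Y, \mu_Y)$ of measure spaces is measure-preserving if
$$\mu_X(f^{-1}(B)) = \mu_Y(B) \quad \text{for every measurable set } B \subseteq Y.$$
Similarly, a Markov kernel $M: X \to Y$ between measure spaces is measure-preserving if $\mu_X M = \mu_Y$, as expressed in the operator notation of Definition A.3. When the measure spaces coincide, that is, $X = Y$ and $\mu_X = \mu_Y$, the measure is also called an invariant measure of $f$ or $M$. We also consider the subcategories of these two categories whose morphisms preserve measure.
Example 3.11 (Invariant dynamics).
A discrete dynamical system in the category of measure spaces and measure-preserving maps is a measure-preserving dynamical system, the basic object of study in ergodic theory. A discrete dynamical system in the category of measure spaces and measure-preserving Markov kernels is a Markov chain together with an invariant measure.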
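Problem 3.12 (Measure-preserving $\mathcal{C}$-space homomorphism).
Given measure $\mathcal{C}$-spaces $X$ and $Y$, does there exist a measure-preserving morphism $\phi: X \to Y$, i.e., a morphism whose components are all measure-preserving?
Due to the motivating case of finite sets and counting measures, the problem of measure $\mathcal{C}$-space homomorphism is no easier to solve than $\mathcal{C}$-set isomorphism; however, its relaxation to measure-preserving Markov morphism is easier to solve, being a convex feasibility problem.
Proposition 3.13 (Relaxation of measure-preserving $\mathcal{C}$-set morphism).
Given measure $\mathcal{C}$-spaces $X$ and $Y$, the problem of finding a measure-preserving Markov morphism $\Phi: X \to Y$ is a convex relaxation of the problem of finding a measure-preserving morphism $\phi: X \to Y$.
Proof.
For any measurable function $f: X \to Y$ on a measure space $(X, \mu_X)$, the deterministic kernel $\mathcal{M}(f): x \mapsto \delta_{f(x)}$ satisfies $\mu_X \mathcal{M}(f) = \mu_X \circ f^{-1}$. Thus, measure-preserving functions correspond to measure-preserving deterministic kernels, and the embedding of measurable functions into Markov kernels restricts to an embedding of the measure-preserving subcategories. The relaxation follows by the argument of Proposition 3.7. To prove the convexity, observe that the measure-preserving Markov morphism problem,
$$
\begin{aligned}
\text{find} \quad & \Phi_c \in \mathbf{Markov}(X(c), Y(c)), \quad c \in \mathcal{C} \\
\text{s.t.} \quad & \Phi_{c'} \circ X(f) = Y(f) \circ \Phi_c, \quad f: c \to c' \text{ in } \mathcal{C} \\
& \mu_{X(c)} \Phi_c = \mu_{Y(c)}, \quad c \in \mathcal{C},
\end{aligned}
$$
merely adds linear equality constraints to the Markov morphism problem. ∎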
As before, when the $\mathcal{C}$-sets are finite, the feasibility problem is a linear program.
Insisting that Markov kernels preserve measure brings us closer to classical optimal transport, formulated in terms of couplings. For any Markov kernel $M: X \to Y$ preserving finite measures $\mu_X$ and $\mu_Y$, the product measure $\mu_X \otimes M$ on $X \times Y$ has marginals $\mu_X$ and $\mu_Y$ and hence is a coupling of $\mu_X$ and $\mu_Y$. Conversely, if $\pi$ is a measure on the product $X \times Y$ with marginals $\mu_X$ and $\mu_Y$, then by the disintegration theorem (Theorem A.4), there exists a Markov kernel $M: X \to Y$, unique up to sets of $\mu_X$-measure zero, such that $\pi = \mu_X \otimes M$, and any such kernel satisfies $\mu_X M = \mu_Y$. So, up to null sets, couplings and measure-preserving Markov kernels are the same. They are even the same as morphisms of measure spaces. Couplings have a standard composition law, known in optimal transport as the gluing lemma [51, Lemma 7.6], and the composition laws for couplings and kernels are compatible.
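On finite spaces the correspondence between couplings and measure-preserving kernels amounts to scaling and rescaling rows. The sketch below uses exact fractions; the function names are illustrative, and the handling of rows with zero mass is an arbitrary choice, reflecting that the kernel is only determined up to null sets.

```python
from fractions import Fraction as F


def coupling_from_kernel(mu, kernel):
    """Joint measure pi[x][y] = mu[x] * kernel[x][y]."""
    return [[mu[x] * p for p in kernel[x]] for x in range(len(mu))]


def kernel_from_coupling(pi):
    """Disintegrate pi along its first marginal by normalizing each row.

    Rows of zero mass are filled with the uniform distribution; any other
    distribution would do, since such rows form a null set.
    """
    kernel = []
    for row in pi:
        mass = sum(row)
        kernel.append([p / mass for p in row] if mass else [F(1, len(row))] * len(row))
    return kernel


def marginals(pi):
    """Row sums (first marginal) and column sums (second marginal)."""
    return [sum(row) for row in pi], [sum(col) for col in zip(*pi)]


mu = [F(1, 2), F(1, 2)]
kernel = [[F(1, 2), F(1, 2), F(0)],
          [F(0), F(1, 2), F(1, 2)]]

pi = coupling_from_kernel(mu, kernel)
first, second = marginals(pi)          # second is the measure obtained by pushing mu through the kernel
recovered = kernel_from_coupling(pi)
```

The first marginal returns `mu`, the second is the measure that the kernel preserves by construction, and disintegrating the coupling recovers the original kernel because `mu` has no null points here.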
Despite this equivalence, we formulate the content of this paper entirely using Markov kernels, not couplings. In order to compose couplings, one must first compute disintegrations, and, while disintegration is a linear operation, introducing it complicates the optimization problem. Also, and more importantly, we routinely use Markov kernels that are not measure-preserving. There is no correspondence between general Markov kernels and couplings. The difference, roughly speaking, is that Markov kernels are the probabilistic analogue of functions, while couplings are an analogue of bijections, or of correspondences [35].
Incidentally, workers in optimal transport have long observed that the measure-preserving property of a coupling, which in particular requires that the coupled measures have equal mass, is burdensome in applications. In response, various notions of unbalanced optimal transport have been proposed [37, Section 10.2]. Markov kernels offer another alternative to couplings.
4. Hausdorff metric on metric $\mathcal{C}$-spaces
The $\mathcal{C}$-set homomorphism problem is too stringent for practical matching of graphs and other structures. Morphisms of $\mathcal{C}$-sets, even Markov morphisms, are all-or-nothing: either they exist or they do not, and when they do exist, they are distinguished only by coarse qualitative distinctions, like that of homomorphism versus isomorphism. This is problematic in scientific and statistical applications, in which the data, be it structural or numerical, is generally subject to randomness and measurement error. To be tolerant to noise, we should use an inexact, quantitative measure of structural similarity or dissimilarity. One approach to dissimilarity, possibly the most important, is via the ubiquitous mathematical concept of metric.
Throughout the rest of this paper, we develop the metric approach to matching $\mathcal{C}$-sets. We will eventually, in Section 6, construct a computationally tractable, Wasserstein-style metric on $\mathcal{C}$-sets. In this section, we focus on the purely metric aspects of the problem. The Hausdorff-style metric on $\mathcal{C}$-sets that we propose is generally hard to compute, but is helpful in isolating the metric concepts from the probabilistic. It may also be interesting in its own right, as a theoretical tool.
The central idea is to weaken the constraints defining a $\mathcal{C}$-set homomorphism from exact equality to approximate equality, with the quality of approximation determined by a metric on morphisms. Schematically, if $X$ and $Y$ are $\mathcal{C}$-sets and $\phi: X \to Y$ is a transformation, not necessarily natural, then for each morphism $f: c \to c'$ in $\mathcal{C}$, we have a “lax” naturality square [Footnote 8: The mapping $\phi$ can be seen as an enriched lax natural transformation. In this section we occasionally use the language of enriched category theory [31, 27], but always in such a way that the meaning of the terms is clear from context. We assume no knowledge of this subject.]
[TABLE]
where the double arrow represents the value of some metric defined on functions $X(c) \to Y(c')$. We aggregate these values over all morphisms, or all morphism generators, in $\mathcal{C}$ to obtain a total nonnegative weight for the transformation. The Hausdorff distance is the weight attained by optimizing over all transformations $\phi: X \to Y$.
As a first step in rendering this idea precise, we recall the basic concepts of metric spaces and metrics on function spaces. It is convenient to work with a definition of metric that is more general than the classical notion.
Definition 4.1 (Metric spaces).
A Lawvere metric space, which we call simply a metric space, is a set $X$ together with a function $d_X: X \times X \to [0,\infty]$, taking values in the extended nonnegative real numbers, that satisfies the identity law, $d_X(x,x) = 0$ for all $x \in X$, and the triangle inequality,
$$d_X(x,x'') \le d_X(x,x') + d_X(x',x''), \qquad \forall x, x', x'' \in X.$$
A metric space is classical if three further axioms are satisfied:
- (i) Finiteness: $d_X(x,x') < \infty$ for all $x, x' \in X$;
- (ii) Positive definiteness: $x = x'$ whenever $d_X(x,x') = 0$;
- (iii) Symmetry: $d_X(x,x') = d_X(x',x)$ for all $x, x' \in X$.
[Footnote 9: Metrics that fail to satisfy one or more of these axioms occur often in metric geometry [5, 7], under names like “extended metrics,” “pseudometrics,” and “quasimetrics.” The term “Lawvere metric space” derives from Lawvere’s study [31] of metric spaces as categories enriched in $[0,\infty]$.]
The category $\mathbf{Met}$ has metric spaces as objects and functions between them as morphisms.
Unless otherwise noted, all metrics in this paper are in the above generalized sense.
Example 4.2 (Shortest path distance).
To reprise an example from Section 1, if $X$ is a graph, then a metric is defined on its vertices by letting $d(v,v')$ be the length of the shortest directed path from $v$ to $v'$. The metric is finite if $X$ is finite and strongly connected, it is always positive definite, and it is symmetric if $X$ is a symmetric graph. Generalizing slightly, if $X$ is a weighted graph, where each edge carries a nonnegative weight, a metric on its vertices is defined by the shortest weighted path. This metric is positive definite if the edge weights are strictly positive.
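The shortest path distance is easy to compute on small graphs, which makes it a convenient test case for the generalized metric axioms. The sketch below is illustrative only: it assumes an unweighted directed graph given as a vertex list and a list of (source, target) pairs, uses breadth-first search, and represents unreachable pairs by `math.inf`. The function names are hypothetical.

```python
import math
from collections import deque

def shortest_path_metric(vertices, edges):
    """Directed shortest path distance; math.inf when no path exists."""
    adjacency = {v: [] for v in vertices}
    for source, target in edges:
        adjacency[source].append(target)
    d = {}
    for start in vertices:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            u = queue.popleft()
            for w in adjacency[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for v in vertices:
            d[start, v] = dist.get(v, math.inf)
    return d

def is_lawvere_metric(vertices, d):
    """Check the identity law and the triangle inequality."""
    identity = all(d[x, x] == 0 for x in vertices)
    triangle = all(d[x, z] <= d[x, y] + d[y, z]
                   for x in vertices for y in vertices for z in vertices)
    return identity and triangle

vertices = [0, 1, 2]
path = shortest_path_metric(vertices, [(0, 1), (1, 2)])            # 0 -> 1 -> 2
cycle = shortest_path_metric(vertices, [(0, 1), (1, 2), (2, 0)])   # directed 3-cycle
```

On the directed path the metric is infinite in the backward direction, and on the directed cycle it is finite but not symmetric, so neither is a classical metric even though both satisfy the generalized axioms.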
As outlined above, we will work with -sets in categories admitting a measure of distance between morphisms.
Definition 4.3 (Metric categories).
A metric category is a category enriched in $\mathbf{Met}$, i.e., a category whose hom-sets each have the structure of a metric space.
Under the supremum metric, $\mathbf{Met}$ is itself a metric category. Recall that for any set $X$ and metric space $Y$, the supremum metric on functions $f, g: X \to Y$ is defined by
$$d_\infty(f,g) := \sup_{x \in X} d_Y(f(x), g(x)).$$
When $Y$ is a classical metric space, the supremum metric is also classical when restricted to bounded functions, i.e., to functions $f: X \to Y$ such that
$$\sup_{x \in X} d_Y(f(x), y_0) < \infty$$
for some (and hence any) $y_0 \in Y$.
Another prominent example of a metric category comes from metric measure spaces. Later, in Section 6, we will see other examples.
Definition 4.4 (Metric measure spaces).
A metric measure space, or mm space, is a Polish measurable space $X$ together with a metric $d_X$ and a $\sigma$-finite measure $\mu_X$. We do not assume that $d_X$ is a classical metric or that it metrizes the topology of $X$, although we do require that $d_X$ be lower-semicontinuous with respect to the topology of $X \times X$ and so, in particular, be Borel measurable. [Footnote 10: Our definition of an mm space is weaker than usual. Most authors assume that $d_X$ metrizes the topology of $X$ [20, 52, 44], and some also require that $\mu_X$ has full support [35]. Here, the Polish topology of $X$ serves only as a regularity condition to exclude pathological $\sigma$-algebras.]
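The category $\mathbf{MM}$ has metric measure spaces as objects and measurable functions as morphisms.
Under any of the $L^p$ metrics, $\mathbf{MM}$ is a metric category. Recall that for any measure space $X$ and metric space $Y$, the $L^p$ metric on measurable functions $f, g: X \to Y$ is
$$d_{L^p}(f,g) := \begin{cases} \left( \int_X d_Y(f(x), g(x))^p \, \mu_X(dx) \right)^{1/p} & \text{when } 1 \le p < \infty, \\ \operatorname{ess\,sup}_{x \in X} d_Y(f(x), g(x)) & \text{when } p = \infty. \end{cases}$$
The essential supremum metric differs from the supremum metric only in being insensitive to sets of $\mu_X$-measure zero.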
When $Y$ is a classical metric space, the $L^p$ metrics are also classical when restricted to the $L^p$ spaces. In this context, the space $L^p(X,Y)$ consists of equivalence classes of functions $f: X \to Y$ that have finite moments of order $p$,
$$\int_X d_Y(f(x), y_0)^p \, \mu_X(dx) < \infty \quad \text{for some } y_0 \in Y,$$
when $1 \le p < \infty$, or that are essentially bounded, when $p = \infty$. The equivalence relation is that of equality $\mu_X$-almost everywhere.
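On a finite measure space, the metrics on functions reduce to weighted sums and maxima. The following is a minimal sketch under stated assumptions: functions are dicts from points to values, the measure is a dict of nonnegative masses, and the names `lp_distance` and `sup_distance` are hypothetical.

```python
import math

def lp_distance(f, g, mu, d, p):
    """L^p distance between functions f, g on a finite measure space.

    mu maps each point to its mass, d is the metric on the codomain, and
    p is a number >= 1 or math.inf (essential supremum).
    """
    if p == math.inf:
        # Points of measure zero are ignored by the essential supremum.
        return max((d(f[x], g[x]) for x in mu if mu[x] > 0), default=0.0)
    return sum(d(f[x], g[x]) ** p * mu[x] for x in mu) ** (1.0 / p)

def sup_distance(f, g, points, d):
    """Supremum distance, which sees every point regardless of measure."""
    return max((d(f[x], g[x]) for x in points), default=0.0)

def absolute_difference(a, b):
    return abs(a - b)

# Illustrative data: the point "c" has measure zero.
mu = {"a": 1, "b": 1, "c": 0}
f = {"a": 0, "b": 0, "c": 0}
g = {"a": 3, "b": 4, "c": 10}

l1 = lp_distance(f, g, mu, absolute_difference, 1)
l2 = lp_distance(f, g, mu, absolute_difference, 2)
linf = lp_distance(f, g, mu, absolute_difference, math.inf)
sup = sup_distance(f, g, mu.keys(), absolute_difference)
```

The large discrepancy at the null point `"c"` affects only the supremum distance, illustrating the one way in which the essential supremum and supremum metrics differ.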
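Definition 4.5 (Metric $\mathcal{C}$-spaces).
A metric $\mathcal{C}$-space is a $\mathcal{C}$-set in $\mathbf{Met}$. Likewise, a metric measure $\mathcal{C}$-space, or mm $\mathcal{C}$-space, is a $\mathcal{C}$-set in $\mathbf{MM}$.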
As $\mathbf{Met}$ and $\mathbf{MM}$ are metric categories, we will be able to define quantitative measures of dissimilarity between metric $\mathcal{C}$-spaces and between metric measure $\mathcal{C}$-spaces. However, in order that the dissimilarity measures be metrics, we must restrict the $\mathcal{C}$-set transformations to those whose components do not increase distances. We now formulate this requirement abstractly, for a general metric category $\mathcal{S}$.
Definition 4.6 (Short morphisms).
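A morphism $f: X \to Y$ in a metric category $\mathcal{S}$ is short if it does not increase distances upon pre-composition or post-composition. In other words, we have $d(g \circ f, g' \circ f) \le d(g, g')$ for all morphisms $g, g': Y \to Z$ in $\mathcal{S}$, and we have $d(f \circ h, f \circ h') \le d(h, h')$ for all morphisms $h, h': W \to X$ in $\mathcal{S}$.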
The class of morphisms in $\mathcal{S}$ that do not increase distances upon pre-composition is clearly closed under composition and includes the identities, and likewise for post-composition. The short morphisms in $\mathcal{S}$ therefore form a subcategory of $\mathcal{S}$, which we denote by $\mathrm{Short}(\mathcal{S})$.
Characterizing the short morphisms of metric spaces and metric measure spaces is straightforward. The first example shows that our terminology is consistent with standard usage.
Proposition 4.7 (Short morphisms of metric spaces).
The short morphisms of metric spaces are short maps, namely functions $f: X \to Y$ of metric spaces such that
$$d_Y(f(x), f(x')) \le d_X(x, x'), \qquad \forall x, x' \in X.$$
Consequently, $\mathrm{Short}(\mathbf{Met})$ is the category of metric spaces and short maps.
[Footnote 11: Other names for short maps are “contractions,” “distance-decreasing maps,” “metric maps,” and “nonexpansive maps.” Alternatively, short maps are Lipschitz functions with Lipschitz constant 1. The category of metric spaces and short maps was first studied by Isbell [23], in the case of classical metric spaces, and by Lawvere [31], in the case of generalized metric spaces.]
Proof.
For any functions and , we have
[TABLE]
If, moreover, is a short map, then for any functions ,
[TABLE]
Thus short maps are short morphisms of . Conversely, if is a short morphism, let denote the unique map from the terminal space onto . Then for any ,
[TABLE]
proving that is a short map. ∎
Example 4.8 (Short maps of graphs [22, Corollary 1.2]).
Let $X$ and $Y$ be graphs with the shortest path distance making the vertex sets into metric spaces. For any graph homomorphism $\phi: X \to Y$, the vertex map is a short map of metric spaces, since $\phi$ transforms any path in $X$ into a path in $Y$ of the same length. On the other hand, not every short map can be extended to a graph homomorphism (consider a constant map). Thus, short maps of graphs are a weakening of graph homomorphisms.
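Both halves of this example can be verified on a small instance. The sketch below is illustrative and rests on assumptions not in the text: vertices are numbered from 0, graphs are given as edge lists, distances are computed by the Floyd-Warshall recurrence, and the helper names are hypothetical.

```python
import math

def shortest_path_metric(n, edges):
    """All-pairs directed shortest path distances on vertices 0..n-1."""
    d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for source, target in edges:
        d[source][target] = min(d[source][target], 1)
    for k in range(n):                    # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def is_short(f, d_x, d_y):
    """f is a list sending vertex x of X to vertex f[x] of Y."""
    n = len(d_x)
    return all(d_y[f[x]][f[x2]] <= d_x[x][x2]
               for x in range(n) for x2 in range(n))

def extends_to_homomorphism(f, edges_x, edges_y):
    """A vertex map extends to a homomorphism iff it sends edges to edges."""
    targets = set(edges_y)
    return all((f[s], f[t]) in targets for s, t in edges_x)

edges_x = [(0, 1)]                        # a single directed edge
edges_y = [(0, 1), (1, 2), (2, 0)]        # directed 3-cycle, no self-loops
d_x = shortest_path_metric(2, edges_x)
d_y = shortest_path_metric(3, edges_y)

inclusion = [0, 1]                        # vertex map of a homomorphism
constant = [0, 0]                         # constant vertex map
```

The inclusion is a homomorphism and hence short. The constant map is short as well, because the distance from a vertex to itself is zero, but it cannot be extended to a homomorphism since the 3-cycle has no self-loop.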
Proposition 4.9 (Short morphisms of mm spaces).
The short morphisms between metric measure spaces $X$ and $Y$ are measurable maps $f: X \to Y$ that are both measure-decreasing,
$$\mu_X(f^{-1}(B)) \le \mu_Y(B) \quad \text{for all measurable } B \subseteq Y,$$
and distance-decreasing,
$$d_Y(f(x), f(x')) \le d_X(x, x'), \qquad \forall x, x' \in X.$$
Consequently, $\mathrm{Short}(\mathbf{MM})$ is the category of mm spaces and distance- and measure-decreasing maps.
[Footnote 12: A contravariance is involved in describing $f$ as being measure-decreasing. In fact, it is the induced set map $f^{-1}$ that decreases measure; the map $f$ does not act on measurable sets.]
Proof.
We prove only the case .
First, we show that is measure-decreasing if and only if pre-composing with decreases distances. If is measure-decreasing, then for measurable functions ,
[TABLE]
Conversely, if is not measure-decreasing, then there exists a set such that . Choose a classical metric space with at least two points and , let be the constant function at , and let be the function equal to on and equal to outside of . Then, by construction, .
Now we show that is distance-decreasing (a short map of metric spaces) if and only if post-composing with decreases distances. If is distance-decreasing, then for any measurable functions ,
[TABLE]
For the converse direction, let be the singleton probability space and let denote the generalized elements, as in the proof of Proposition 4.7. Then for any ,
[TABLE]
proving that is distance-decreasing. ∎
We saw in Section 3 that a measure-preserving function is a surrogate for a bijection. Likewise, a measure-decreasing function is a surrogate for an injection, since a function on finite sets is measure-decreasing with respect to counting measure if and only if it is injective. For functions on finite measure spaces of equal mass, particularly probability spaces, being measure-decreasing is the same as being measure-preserving. The elementary version of this fact is that on finite sets of equal size, injections are the same as bijections.
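The elementary facts in this paragraph can be confirmed by exhaustive enumeration on small sets. The following sketch is illustrative only: functions are dicts, measures are dicts of masses, and all helper names are hypothetical. It uses the fact that, on a finite set, a pushforward measure is dominated on every subset as soon as it is dominated on every point, by additivity.

```python
from itertools import product

def pushforward(f, mu_x, mu_y):
    """Mass that f transports onto each point of the codomain."""
    out = {y: 0 for y in mu_y}
    for x, mass in mu_x.items():
        out[f[x]] += mass
    return out

def is_measure_decreasing(f, mu_x, mu_y):
    push = pushforward(f, mu_x, mu_y)
    return all(push[y] <= mu_y[y] for y in mu_y)

def is_measure_preserving(f, mu_x, mu_y):
    return pushforward(f, mu_x, mu_y) == mu_y

def is_injective(f):
    return len(set(f.values())) == len(f)

def all_functions(xs, ys):
    """Enumerate every function from xs to ys as a dict."""
    for values in product(ys, repeat=len(xs)):
        yield dict(zip(xs, values))

def counting(points):
    return {p: 1 for p in points}

small, large = [0, 1], ["a", "b", "c"]

# Measure-decreasing for counting measure coincides with injectivity.
agrees_with_injective = all(
    is_measure_decreasing(f, counting(small), counting(large)) == is_injective(f)
    for f in all_functions(small, large)
)

# On sets of equal size, measure-decreasing coincides with measure-preserving.
agrees_with_preserving = all(
    is_measure_decreasing(f, counting(large), counting(large))
    == is_measure_preserving(f, counting(large), counting(large))
    for f in all_functions(large, large)
)
```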
By definition, a metric category is a category enriched in $\mathbf{Met}$. The enrichment extends to short morphisms, in that for any category $\mathcal{S}$ enriched in $\mathbf{Met}$, the subcategory $\mathrm{Short}(\mathcal{S})$ is enriched in $\mathrm{Short}(\mathbf{Met})$. We digress slightly to clarify this statement.
Proposition 4.10.
Let $\mathcal{S}$ be a metric category. The following are equivalent:
- (i) The category $\mathcal{S}$ is enriched in $\mathrm{Short}(\mathbf{Met})$;
- (ii) For any morphisms $f, g: X \to Y$ and $h, k: Y \to Z$ in $\mathcal{S}$,
$$d(h \circ f, k \circ g) \le d(f, g) + d(h, k);$$
- (iii) For any morphisms $f: X \to Y$ and $h, k: Y \to Z$ in $\mathcal{S}$, we have $d(h \circ f, k \circ f) \le d(h, k)$, and for any morphisms $f, g: X \to Y$ and $h: Y \to Z$, we have $d(h \circ f, h \circ g) \le d(f, g)$.
Proof.
Conditions (i) and (ii) are equivalent by the definition of an enriched category. We prove that (ii) and (iii) are equivalent. If (ii) holds, then taking gives . Similarly, taking gives . Conversely, if (iii) holds, then by the triangle inequality,
[TABLE]
We now define a Hausdorff-style metric on -sets in a general metric category . We instantiate it with and below and with other metric categories in Section 6. In the definition, fix and write the norm on numbers in as the binary operator , meaning that when and when .
Definition 4.11 (Hausdorff metric on $\mathcal{C}$-sets).
Let be a finitely presented category and let be a metric category. The Hausdorff metric on -sets in is given by, for ,
[TABLE]
where the sum is over a fixed, finite generating set of morphisms in and the infimum is over all transformations , not necessarily natural, whose components in are short morphisms (so belong to ).
The Hausdorff metric is indeed a metric, albeit not a classical one.
Theorem 4.12.
As defined above, the Hausdorff metric on -sets in is a metric.
Proof.
For any -set in , taking the identity transformation gives
[TABLE]
proving that .
So we need only verify the triangle inequality. Let , , and be -sets in . Fixing a morphism in , define the weight of a transformation at to be the number . We will show that, for any composable transformations with components in , we have a triangle inequality
[TABLE]
The proof then follows readily. By the triangle inequalities for the weights and for the sum,
[TABLE]
Take the infimum over the transformations and to conclude that .
To prove the triangle inequality for the weights, let be composable transformations and consider a “pasting diagram” of form
[TABLE]
Here is the argument encoded informally by this diagram, which is easy to understand if cumbersome to write down. Since the components of and are short morphisms, we have
[TABLE]
and
[TABLE]
Therefore, by the triangle inequality in the hom-space ,
[TABLE]
This completes the proof of the triangle inequality for and hence also for . ∎
The assumptions of the theorem can be weakened or strengthened with concomitant effects on the conclusion.
Remark 4.13 (General cost functions).
As the proof shows, the restriction to short morphisms is what guarantees the triangle inequality. Thus, in situations where the triangle inequality is inessential, we are free to take the infimum over arbitrary transformations with components in . We may as well also allow the hom-sets of to carry general cost functions, not necessarily metrics.
Remark 4.14 (Generators and composites).
From the coordinate-free perspective of categorical logic, the restriction to a finite generating set of morphisms in is open to criticism, since it makes the Hausdorff metric depend on how the category is presented. Can anything be said about the deviation from naturality for a generic morphism in ? In general, no. If, however, we strengthen the assumptions to include and being -sets in , not merely in , then for any composable morphisms in , an argument by pasting diagram of form
[TABLE]
shows that for all transformations ,
[TABLE]
Thus, for -sets and in , bounds on the weights of at generators yield bounds at composites.
We emphasize that the Hausdorff metric on -valued -sets is not a classical metric. Even when the underlying metrics on are symmetric, the Hausdorff metric is not, although it can be symmetrized in one of the usual ways, such as
[TABLE]
As a more fundamental matter, the Hausdorff metric is not positive definite, since whenever there exists a -set homomorphism with components in . Similar statements apply to the symmetrized metrics. Indeed, under any reasonable definition, the distance between isomorphic -sets should be zero, so we cannot expect to get positive definiteness without passing to equivalence classes of isomorphic -sets. This is not a matter we will pursue.
We turn now to concrete examples. We frequently need to make a metric space out of a set with no metric structure. There are several generic ways to do this, but the most useful is the discrete metric, defined on a set $X$ as
$$d_X(x, x') := \begin{cases} 0 & \text{if } x = x', \\ \infty & \text{otherwise}. \end{cases}$$
A map out of a discrete metric space is always short. [Footnote 13: The discrete metric is usually taken to be 1, not $\infty$, off the diagonal. Both metrics generate the same topology, namely the discrete topology, but only the infinite discrete metric satisfies a universal property in $\mathrm{Short}(\mathbf{Met})$.]
We can make a -set into a metric -space by equipping every set , , with the discrete metric. On such discrete metric -spaces, the Hausdorff metric reduces to the -set homomorphism problem:
[TABLE]
It can therefore be at least as hard to compute the Hausdorff metric on metric -spaces as it is to solve the -set homomorphism problem, which is generally NP-hard. Overcoming such computational difficulties is a major motivation for Sections 5 and 6.
More interesting things happen when the underlying metrics are not all discrete. As a first example, we show that the classical Hausdorff metric on subsets of a metric space is a special case of the Hausdorff metric on -sets, justifying our terminology.
Example 4.15 (Classical Hausdorff metric).
Let be the theory of attributed sets, as in Example 2.11. Let and be attributed sets under the discrete metric, with attributes valued in a fixed metric space . Then and are metric -spaces and the Hausdorff distance between them is
[TABLE]
where the first infimum is over all functions . If and are injective, we can identify and with subsets of the metric space . The Hausdorff distance then simplifies to
[TABLE]
which is the classical Hausdorff metric in non-symmetric form. Assuming is a symmetric metric space, we recover the standard Hausdorff metric upon symmetrization:
[TABLE]
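For finite subsets of a metric space, the quantities in this example are straightforward to compute. The sketch below is an illustration under assumptions of our own: the subsets are lists of real numbers with the absolute-difference metric, and the function names are hypothetical. It computes the non-symmetric distance both directly, as a maximum of minima, and by brute-force search over all functions between the two sets, and then symmetrizes by taking the maximum over both directions.

```python
from itertools import product

def directed_hausdorff(xs, ys, d):
    """Non-symmetric form: how far the worst point of xs is from ys."""
    return max(min(d(x, y) for y in ys) for x in xs)

def directed_hausdorff_via_maps(xs, ys, d):
    """Same quantity as an infimum over all functions from xs to ys."""
    best = float("inf")
    for images in product(ys, repeat=len(xs)):
        # `images` encodes one function xs -> ys; its weight is a sup metric.
        best = min(best, max(d(x, y) for x, y in zip(xs, images)))
    return best

def hausdorff(xs, ys, d):
    """Symmetrization by taking the larger of the two directions."""
    return max(directed_hausdorff(xs, ys, d), directed_hausdorff(ys, xs, d))

def absolute_difference(a, b):
    return abs(a - b)

xs = [0, 1]
ys = [0, 1, 5]

forward = directed_hausdorff(xs, ys, absolute_difference)
backward = directed_hausdorff(ys, xs, absolute_difference)
symmetric = hausdorff(xs, ys, absolute_difference)
```

Since `xs` is contained in `ys`, the forward distance vanishes, while the backward distance is the gap between the point 5 and its nearest neighbor in `xs`. The brute-force search agrees with the closed form because the best function sends each point to a nearest point.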
In the next two examples, we define several possible Hausdorff metrics on graphs.
Example 4.16 (Hausdorff metric on attributed graphs).
Let and be vertex-attributed graphs, as in Example 2.12, with discrete metrics on the vertex and edge sets and arbitrary metrics on the attribute sets. Then and are attributed graphs in and the Hausdorff distance between them is
[TABLE]
where the infimum is over graph homomorphisms and short maps .
This metric optimizes both the graph homomorphism and the matching of the attribute spaces and . If instead we fix a metric space for the attributes, the Hausdorff distance is
[TABLE]
where the infimum is now only over graph homomorphisms .
Example 4.17 (Weak Hausdorff metric on graphs).
Let and be graphs, with discrete metrics on the edge sets, the shortest path distances of Example 4.2 on the vertex sets, and counting measures on both vertex and edge sets. Then and are graphs in and, for , the Hausdorff distance between them is
[TABLE]
where the infimum is over injective maps of the edges and injective short maps of the vertices. If $X$ is monomorphic to $Y$, that is, there is an injective graph homomorphism from $X$ to $Y$, then $d_H(X,Y) = 0$. But, unlike the previous example, we can still have $d_H(X,Y) < \infty$ even when $X$ is not homomorphic to $Y$, because the edge map is allowed to violate the source and target constraints.
A concrete example is shown in Figure 4.1. The 2-cycle is plainly not homomorphic to the 4-cycle , but, due to the pictured transformation, the Hausdorff distance from to is . In general, if is the directed cycle of length , then it can be shown that for any . This quantity is the length of the shortest path between the endpoints of an -path on the -cycle. In the other direction, when since there is no longer any injection from to .
A stream of further examples can be generated by combining the features above, namely data attributes and metric weakening of the homomorphism constraints, in graphs or in other -sets, such as symmetric graphs, reflexive graphs, and their higher-dimensional generalizations.
5. Wasserstein metric on Markov kernels
Our goal is now to define a Wasserstein-style metric on metric measure $\mathcal{C}$-spaces, thus bringing together the threads of the two preceding sections. As a first step, we define a metric on Markov kernels, to serve the same role for the Wasserstein metric as the supremum or $L^p$ metrics do for the Hausdorff metric. Defining a metric on Markov kernels is more subtle than defining a metric on functions, and will be the subject of this section.
The Wasserstein metric on Markov kernels generalizes the Wasserstein metric on probability distributions. Our development parallels that of the classical metric theory for optimal transport, to be found, for instance, in [51, Chapter 7] or [52, Chapter 6]. In this spirit, we begin with two notions of coupling for Markov kernels. [Footnote 14: What we call “products” of Markov kernels are called “couplings” in the literature on Markov chains [15, Section 19.1.2], while our notion of “coupling” seems to have no established name.]
Definition 5.1 (Couplings and products).
A coupling of Markov kernels and is any Markov kernel with marginal along and marginal along . That is, and , where and are the canonical projection maps.
A product of Markov kernels and is any Markov kernel with marginal along and and marginal along and . That is, and , where and are the evident projections.
The set of all couplings of Markov kernels and is denoted by and the set of all products by .
To phrase it differently, a Markov kernel is a coupling of and if for every , the probability distribution is a coupling of and . Similarly, a Markov kernel is a product of and if for every and , the distribution is a coupling of and . In the special case where and are singleton sets, couplings and products of Markov kernels coincide and amount to couplings of probability measures.
The set of products of Markov kernels and is never empty, because one can always take the independent product,
[TABLE]
given by the pointwise independent product of probability measures. When the kernels and share a common domain , products are extensions of couplings, because any product gives rise to a coupling by pre-composing with the diagonal map . In particular, the set of couplings is never empty either.
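For finite spaces, the independent product and the passage from products to couplings can be written down directly. The sketch below is illustrative only: kernels are row-stochastic lists of lists indexed from 0, a kernel into a product space is a dict of joint distributions, and every function name is hypothetical.

```python
from fractions import Fraction as F

def independent_product(M, N):
    """Kernel sending (x, w) to the product distribution M(x) x N(w)."""
    return {
        (x, w): {(y, z): M[x][y] * N[w][z]
                 for y in range(len(M[x])) for z in range(len(N[w]))}
        for x in range(len(M)) for w in range(len(N))
    }

def restrict_to_diagonal(P, n):
    """Pre-compose a product with the diagonal map x -> (x, x)."""
    return {x: P[(x, x)] for x in range(n)}

def marginal(joint, axis, size):
    """Marginal of a joint distribution on pairs along one coordinate."""
    out = [F(0)] * size
    for pair, p in joint.items():
        out[pair[axis]] += p
    return out

# Two kernels with a common two-point domain (illustrative data).
M = [[F(1, 2), F(1, 2)], [F(0), F(1)]]
N = [[F(1), F(0), F(0)], [F(1, 3), F(1, 3), F(1, 3)]]

P = independent_product(M, N)
coupling = restrict_to_diagonal(P, len(M))

product_ok = all(
    marginal(P[(x, w)], 0, 2) == M[x] and marginal(P[(x, w)], 1, 3) == N[w]
    for x in range(2) for w in range(2)
)
coupling_ok = all(
    marginal(coupling[x], 0, 2) == M[x] and marginal(coupling[x], 1, 3) == N[x]
    for x in range(2)
)
```

The first check confirms the marginal conditions for a product at every pair of inputs, and the second confirms that restricting to the diagonal yields a coupling.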
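Definition 5.2 (Wasserstein metrics on Markov kernels).
Let $X$ and $Y$ be metric measure spaces. For any exponent $1 \le p < \infty$, the Wasserstein metric of order $p$ on Markov kernels $M, N: X \to Y$ is
$$W_p(M, N) := \inf_{\Pi \in \mathrm{Coup}(M,N)} \left( \int_X \int_{Y \times Y} d_Y(y, y')^p \, \Pi(dy, dy' \mid x) \, \mu_X(dx) \right)^{1/p}.$$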
This metric generalizes two famous constructions in analysis. When the kernels are deterministic, we recover the $L^p$ metric on functions between metric measure spaces, reviewed in Section 4. When $X$ is a singleton set, the kernels $M$ and $N$ can be identified with probability measures $\mu$ and $\nu$, and we recover the classical Wasserstein metric on probability measures,
$$W_p(\mu, \nu) = \inf_{\pi \in \mathrm{Coup}(\mu,\nu)} \left( \int_{Y \times Y} d_Y(y, y')^p \, \pi(dy, dy') \right)^{1/p}.$$
The relationships between the base metric and its derived metrics are summarized in the diagram of Figure 5.1.
It is possible to define a Wasserstein metric on Markov kernels in the case $p = \infty$, generalizing the $L^\infty$ metric on functions and the $W_\infty$ metric on probability measures [9], [41, Section 3.2]. We do not pursue this case here, as the optimization problem ceases to be linear in the coupling.
We need to verify that the proposed metric on Markov kernels is actually a metric. As in the proof for the classical Wasserstein metric, the main property to verify is the triangle inequality, and a crucial step in doing so is establishing a gluing lemma. Loosely speaking, the gluing lemma says that Markov kernels into $Y_1 \times Y_2$ and $Y_2 \times Y_3$ that share a common marginal along $Y_2$ can be glued along $Y_2$ to form a Markov kernel into $Y_1 \times Y_2 \times Y_3$.
Lemma 5.3 (Gluing lemma for Markov kernels).
Let be a measurable space and be Polish measurable spaces. Suppose , , and are Markov kernels, and and are couplings thereof. Then there exists a Markov kernel with marginals along and along .
Proof.
By the disintegration theorem for Markov kernels (Theorem A.5), there exist Markov kernels and such that, using a mild abuse of notation,
[TABLE]
and
[TABLE]
Then the Markov kernel defined by
[TABLE]
satisfies the desired properties. ∎
By a variant of the usual gluing argument, we show that the Wasserstein metric on Markov kernels is indeed a metric.
Theorem 5.4.
Let and be metric measure spaces. For any , the Wasserstein metric of order on Markov kernels is a metric.
Moreover, if is a classical metric space, then the Wasserstein metric is also classical when restricted to equivalence classes of Markov kernels with finite moments of order , i.e., to Markov kernels such that
[TABLE]
for some (and hence any) , where we regard Markov kernels and as equivalent if for -almost every .
Proof.
We prove the triangle inequality first. Let be Markov kernels and let and be couplings thereof. By the gluing lemma, there exists a Markov kernel such that and , where , , are the evident projections. Forming the coupling , we estimate
[TABLE]
where we apply the triangle inequality in the second inequality and Minkowski’s inequality on in the third inequality. Since the resulting inequality holds for any couplings and , we conclude that
[TABLE]
Moreover, for any Markov kernel , the deterministic coupling , where is the diagonal map, yields
[TABLE]
proving that $W_p(M, M) = 0$. Thus $W_p$ is a metric in the generalized sense.
For the second part of the theorem, suppose is a classical metric space. If Markov kernels and satisfy , then, assuming for the moment that the infimum is achieved, there exists a coupling such that
[TABLE]
Thus, for -almost every ,
[TABLE]
For each such , since the metric is positive definite, is concentrated on the diagonal. Thus for some probability measure on and hence . But, by assumption, , and so . Thus for -a.e. , and we conclude that is positive definite. Next, is symmetric since the base metric is. Finally, by taking the independent coupling and using Minkowski’s inequality again, it is easy to show that is finite on Markov kernels with finite moments of order . ∎
In the second half of the proof, we assumed the following result.
Proposition 5.5.
The infimum in the Wasserstein metric on Markov kernels is attained. That is, for any Markov kernels on mm spaces , there exists a coupling of and such that
[TABLE]
Proof.
According to a known existence theorem for optimal couplings (Theorem A.6), the infimum can even be achieved simultaneously at every point . That is, there exists a coupling such that for every ,
[TABLE]
The proof even shows that the Wasserstein metric on Markov kernels can be written in terms of the Wasserstein metric on probability measures as
[TABLE]
Thus, if we view a Markov kernel as a function of metric spaces, the Wasserstein metric on Markov kernels reduces to the familiar metric. However, this result depends on the existence theorem for optimal couplings (Theorem A.6), the proof of which is non-trivial. Wherever possible we prefer to work with the original Definition 5.2 in terms of couplings of Markov kernels.
When at least one of the kernels is deterministic, the Wasserstein metric has a simple, closed-form expression.
Proposition 5.6 (Wasserstein metric on deterministic kernels).
Let be metric measure spaces. For any Markov kernel and measurable functions and ,
[TABLE]
In particular, for any measurable functions ,
[TABLE]
Proof.
To prove the first statement, notice that and have a single coupling, at once deterministic and independent. By Tonelli’s theorem,
[TABLE]
The second statement follows from the first by taking and . ∎
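In the finite case, the closed form needs no optimization at all, because a point mass admits only one coupling with any other distribution. The sketch below is a hedged illustration, assuming the closed form is the order-$p$ moment of the distance between the two attribute values under the joint law of the input point and its image under the kernel. Kernels are row-stochastic lists, functions are lists of real values, and all names are hypothetical.

```python
def wasserstein_to_deterministic(M, f, g, mu, d, p):
    """Distance between the composite (M followed by f) and the function g.

    M: kernel X -> Y as a row-stochastic list of lists,
    f: function Y -> Z, g: function X -> Z, both as lists of values,
    mu: masses of the points of X, d: metric on Z, p: exponent >= 1.
    """
    total = 0.0
    for x, mass in enumerate(mu):
        for y, prob in enumerate(M[x]):
            total += mass * prob * d(f[y], g[x]) ** p
    return total ** (1.0 / p)

def lp_distance(f, g, mu, d, p):
    """L^p distance between two functions on the same finite measure space."""
    return sum(m * d(f[x], g[x]) ** p for x, m in enumerate(mu)) ** (1.0 / p)

def absolute_difference(a, b):
    return abs(a - b)

mu = [0.5, 0.5]
M = [[0.5, 0.5], [0.0, 1.0]]
f = [0.0, 2.0]
g = [0.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 1.0]

mixed = wasserstein_to_deterministic(M, f, g, mu, absolute_difference, 1)
deterministic = wasserstein_to_deterministic(identity, f, h, mu, absolute_difference, 2)
```

Taking the kernel to be the identity reduces the formula to the distance between two functions, which is the second statement of the proposition.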
6. Wasserstein metric on metric measure $\mathcal{C}$-spaces
We are at last ready to construct a Wasserstein-style metric on metric measure -spaces, combining the general metric theory for -sets with the Wasserstein metric on Markov kernels.
Let $\mathbf{MMarkov}$ be the category with metric measure spaces as objects and Markov kernels as morphisms. In Section 3, we identified measurable functions with deterministic Markov kernels, obtaining an embedding functor $\mathbf{Meas} \to \mathbf{Markov}$ and thus a relaxation of the $\mathcal{C}$-set homomorphism problem. In exactly the same way, the category $\mathbf{MM}$ of metric measure spaces is functorially embedded inside $\mathbf{MMarkov}$. We denote this embedding by the same symbol as before. Just as we relaxed the $\mathcal{C}$-set homomorphism problem, so will we relax the Hausdorff metric on mm $\mathcal{C}$-spaces.
As a first step, we make into a metric category compatible with the metrics. By Theorem 5.4, is a metric category under the Wasserstein metric of order , for any . Furthermore, by Proposition 5.6, this metric agrees with the corresponding metric on deterministic Markov kernels. Thus, the embedding is an isometry of metric categories.
In Section 4, we characterized the short morphisms of under its metric. The next proposition extends this characterization to , formally reducing to Proposition 4.9 when all the morphisms are deterministic Markov kernels. Consequently, the isometric embedding functor restricts to an embedding of short morphisms.
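Proposition 6.1 (Short Markov kernels).
Let $X$ and $Y$ be metric measure spaces and let $M: X \to Y$ be a Markov kernel.
- (a) Pre-composition with $M$ does not increase the Wasserstein distance between any pair of Markov kernels out of $Y$ if and only if $M$ is measure-decreasing,
$$\mu_X M \le \mu_Y, \quad \text{that is,} \quad \int_X M(B \mid x) \, \mu_X(dx) \le \mu_Y(B) \text{ for all measurable } B \subseteq Y.$$
- (b) Post-composition with $M$ does not increase the Wasserstein distance between any pair of Markov kernels into $X$ if and only if $M$ is distance-decreasing of order $p$, i.e., there exists a product $\Pi$ of $M$ with itself such that
$$\left( \int_{Y \times Y} d_Y(y, y')^p \, \Pi(dy, dy' \mid x, x') \right)^{1/p} \le d_X(x, x'), \qquad \forall x, x' \in X.$$
Consequently, a Markov kernel is a short morphism if and only if it is distance-decreasing and measure-decreasing. Thus $\mathrm{Short}(\mathbf{MMarkov})$ is the category of mm spaces and distance- and measure-decreasing Markov kernels.
[Footnote 15: When the measure spaces coincide, that is, $X = Y$ and $\mu_X = \mu_Y$, the measure $\mu_X$ is also called a subinvariant measure of $M$ [15, Definition 1.4.1].]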
Proof.
Towards proving part (a), suppose is measure-decreasing. For any coupling of and , the composite is a coupling of and . Therefore,
[TABLE]
Since is arbitrary, we have .
For the converse direction, suppose that is not measure-decreasing. Choose a set such that . Let be a classical metric space with at least two points and , let be the constant function at , and let be the function equal to on and equal to outside of . The composite is also constant, so by Proposition 5.6 we have
[TABLE]
To prove part (b), suppose is distance-decreasing, and let be a product of with itself that attains the bound. For any coupling of and , the composite is a coupling of and . Therefore,
[TABLE]
Since is arbitrary, we obtain .
Conversely, suppose this inequality holds for all Markov kernels . As in the proofs of Propositions 4.7 and 4.9, let be the singleton probability space and let denote the generalized element at . For any , take and to obtain . Using Theorem A.6, we can construct such that
[TABLE]
Conclude that is distance-decreasing. ∎
The Wasserstein metric on mm -spaces is the Hausdorff metric (Definition 4.11) on -sets in the metric category . In concrete terms, the definition is:
Definition 6.2 (Wasserstein metric on mm $\mathcal{C}$-spaces).
Let be a finitely presented category. For any , the Wasserstein metric of order on metric measure -spaces is given by, for ,
[TABLE]
where the sum is over a fixed, finite generating set of morphisms in and the infimum is over all Markov transformations , not necessarily natural, whose components are short morphisms (so belong to ).
In view of the concepts combined, this metric should perhaps be called the “Hausdorff-Wasserstein metric” or even, if we are taking the history seriously [50], the “Hausdorff-Kantorovich-Rubinstein metric.” However, we find these names too cumbersome to contemplate.
The proposed metric is indeed a metric, by Theorem 4.12. Like any Hausdorff metric on -sets, it is not generally finite, positive definite, or symmetric, but the failure of positive definiteness is more severe than before, since whenever there exists a Markov morphism with components in . More generally, the Wasserstein metric is a relaxation of the Hausdorff- metric from Section 4, as clarified by the following inequality.
Proposition 6.3 (Wasserstein metric as convex relaxation).
For any , the Wasserstein metric of order on mm -spaces is a convex relaxation of the Hausdorff- metric on mm -spaces. That is, for all mm -spaces and ,
[TABLE]
where
[TABLE]
is the Hausdorff metric on -sets in the metric category with its metric. Moreover, the problem of computing can be formulated as a convex optimization problem, possibly in infinite dimensions.
Proof.
We first prove the relaxation. As in the proof of Theorem 4.12, we write for the weight of a transformation at a morphism , and, similarly, we write for the weight of a Markov transformation. By the discussion preceding Proposition 6.1, for any transformation with components in , there is a Markov transformation with components in and of the same weight, . Consequently,
[TABLE]
As for the convexity, since the infimum in the Wasserstein metric on Markov kernels is attained (Proposition 5.5), the value is the infimum of
[TABLE]
taken over the Markov kernels
[TABLE]
indexed by objects and morphisms generating , and subject to the constraints, for all ,
[TABLE]
In this optimization problem, the optimization variables belong to convex spaces, the objective is linear, and all the constraints, including those for couplings and products, are linear equalities or inequalities. The problem is therefore convex. ∎
As in Section 3, the optimization problem is a linear program provided the $C$-sets are finite. Let $X$ and $Y$ be finite metric measure $C$-spaces. Identifying Markov kernels with right stochastic matrices, measures with row vectors, and exponentiated metrics with column vectors, the Wasserstein distance from $X$ to $Y$ is the value of the linear program:
[TABLE]
We leave implicit in the notation the dimensionalities of the projection operators and of column vectors consisting of all 1’s.
While it appears forbidding, this linear program simplifies in certain common situations. When has the discrete metric, any Markov kernel is distance-decreasing, so the product and associated constraints can be eliminated from the program. When and is fixed to be the identity, as happens for fixed attribute sets, then for any morphism , the Wasserstein distance between and has a closed form (Proposition 5.6). The coupling and associated constraints can thus be eliminated.
Both kinds of simplification occur in the next two examples.
Example 6.4 (Classical Wasserstein metric).
Continuing Examples 2.11 and 4.15, let and be attributed sets, equipped with discrete metrics and any probability measures, and taking attributes in a fixed mm space . Then and are attributed sets in and the Wasserstein metric of order between them is
[TABLE]
where the second equality follows by the disintegration theorem (Theorem A.4). In particular, is the value of an optimal transport problem. When and are injective, and can be identified with subsets of and we recover the classical Wasserstein metric, namely .
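The reduction to a classical optimal transport problem can be made concrete in the simplest finite case. The following is a minimal sketch, not taken from the paper: it assumes real-valued points, uniform measures, and equally many points on both sides, and the function name `wasserstein_uniform` is hypothetical. Under these assumptions an optimal coupling can be chosen to be a permutation (by the Birkhoff-von Neumann theorem), so the linear program and the combinatorial matching problem have the same value, and brute force over permutations is exact for tiny inputs.

```python
from itertools import permutations


def wasserstein_uniform(xs, ys, p=1):
    """Order-p Wasserstein distance between uniform measures on xs and ys.

    Assumes real-valued points and len(xs) == len(ys). For uniform measures
    on equally many points, some optimal coupling is a permutation, so
    searching over permutations gives the exact value (tiny inputs only).
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("this sketch needs equally many points on both sides")
    best = min(
        sum(abs(xs[i] - ys[sigma[i]]) ** p for i in range(n)) / n
        for sigma in permutations(range(n))
    )
    return best ** (1.0 / p)


print(wasserstein_uniform([0.0, 1.0, 3.0], [1.0, 2.0, 5.0], p=1))
```

For measures that are not uniform, or for point sets of different sizes, optimal couplings genuinely split mass and a linear programming solver is needed instead.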
Example 6.5 (Wasserstein metric on attributed graphs).
Continuing Examples 2.12 and 4.16, let and be vertex-attributed graphs taking attributes in a fixed mm space . Equip the vertex and edge sets with discrete metrics and with any fully supported measures. Then and are attributed graphs in and the Wasserstein metric of order between them is
[TABLE]
where the infimum is over Markov graph morphisms with measure-decreasing components and .
This metric takes an infinite value whenever no Markov graph morphism exists. The next example features a weaker metric, presented, for simplicity, in the case of unattributed graphs.
Example 6.6 (Weak Wasserstein metric on graphs).
Let and be graphs. Continuing Example 4.17, equip the edge sets with discrete metrics, the vertex sets with shortest path distances, and the vertex and edge sets with counting measures. The Wasserstein metric of order 1 is then
[TABLE]
where the infimum is over measure-decreasing Markov kernels and distance- and measure-decreasing Markov kernels . By Proposition 6.3, this metric relaxes the Hausdorff metric of Example 4.17. It is genuinely weaker: on directed cycles, we have when , as witnessed by the uniform Markov graph morphisms of Example 3.8. (They do not increase distances, like any Markov kernels equal everywhere to a constant distribution.) We still have when due to the measure-decreasing constraint.
7. Conclusion
We have introduced Hausdorff and Wasserstein metrics on graphs and other $C$-sets and illustrated them through a variety of examples. That being said, we have established only the most basic properties of the concepts involved. Possibilities abound for extending this work, in both theoretical and practical directions. Let us mention a few of them.
Although encompassing graphs, simplicial sets, dynamical systems, and other structures, the formalism of $C$-sets remains possibly the simplest equational logic, sitting at the bottom of a hierarchy of increasingly expressive systems [29, 12]. By admitting categories $C$ with extra structure, such as sums, products, or exponentials, more complicated structures can be realized as structure-preserving functors from $C$ into $\mathbf{Set}$ or some other category. For example, categories with finite products describe monoids, groups, rings, modules, and other familiar algebraic structures. This is the original setting of categorical logic [30].
Many of the ideas developed here extend to categories $C$ with extra structure. The pertinent questions are whether the extra structure can be accommodated in the category of Markov kernels and its variants and how this affects the computation. Sums (coproducts) and units (terminal objects) are easily handled. The category of Markov kernels has finite sums and a unit, and since the direct sum of Markov kernels is a linear operation, it preserves the class of linear optimization problems. Products are more important and more delicate. The category of Markov kernels does not have categorical products, and its natural tensor product, the independent product, is not a linear operation. In keeping with the spirit of this paper, products in $C$ should be translated into optimal couplings of Markov kernels, resulting in a larger optimization problem. We leave a proper development of this idea to future work.
A linear program is solvable in polynomial time and therefore improves dramatically in tractability over an NP-hard combinatorial problem. Nevertheless, solving a generic linear program is not always practical. Indeed, the recent surge in popularity of optimal transport is due partly to the introduction, by Cuturi and others, of specialized algorithms for solving the optimal transport program, which far outperform generic interior-point solvers [13, 37]. It would be useful to know whether and how these fast algorithms for optimal transport can be adapted to the linear programs in this paper.
In the new algorithms for optimal transport, the optimization objective is augmented by a term proportional to the negative entropy of the coupling, a technique known as entropic regularization. With this addition, the optimal transport problem improves from merely convex to strongly convex and, in particular, has a unique solution. Besides being useful for optimization, entropic regularization has a statistical interpretation as Gaussian deconvolution [40].
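To make entropic regularization concrete, here is a minimal sketch of the Sinkhorn matrix scaling iteration for the classical optimal transport problem between two finite measures. It is an illustration under stated assumptions rather than an algorithm from this paper: the function name `sinkhorn`, the regularization strength `eps`, and the fixed iteration count are all choices made for the example, and no adaptation to the linear programs of Section 6 is attempted.

```python
import math


def sinkhorn(mu, nu, cost, eps=0.5, iters=500):
    """Entropically regularized optimal transport between finite measures.

    mu, nu: lists of positive weights with equal total mass.
    cost:   cost[i][j] is the cost of moving mass from point i to point j.
    eps:    regularization strength (an assumption of this sketch).
    Returns the regularized coupling as a nested list.
    """
    m, n = len(mu), len(nu)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(n)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(m)]


plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]])
for row in plan:
    print(row)
```

Every entry of the returned coupling is strictly positive, which reflects the smoothing effect of the entropy term. Smaller values of `eps` bring the coupling closer to an unregularized optimum but typically slow convergence.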
For the Markov morphism feasibility problem of Section 3, adding entropic regularization yields an optimization problem whose solution is the Markov morphism of maximum entropy. For instance, in Figure 3.1, the maximum entropy Markov morphism is the uniform mixture of the two graph homomorphisms shown there. In Figure 3.3, the unique Markov morphism has the maximum possible entropy, with all vertex and edge distributions being uniform. As these examples show, entropic regularization is antithetical to the recovery of deterministic solutions. Entropic regularization should be investigated more systematically in this context, for algorithmic reasons and for its intrinsic interest.
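The tension between entropy and determinism can be seen in a toy computation. The sketch below is not the example of Figure 3.1: the two deterministic maps `f` and `g` are hypothetical, and measuring the entropy of a finite Markov kernel as the sum of the Shannon entropies of its rows is an assumption made for illustration. Along the segment of mixtures between the two maps, the entropy is zero at the deterministic endpoints and largest at the uniform mixture.

```python
import math


def kernel_entropy(kernel):
    """Sum of the Shannon entropies of the rows of a finite Markov kernel."""
    return -sum(p * math.log(p) for row in kernel for p in row if p > 0)


def mix(t, k1, k2):
    """Convex combination t*k1 + (1-t)*k2 of two kernels of the same shape."""
    return [[t * a + (1 - t) * b for a, b in zip(r1, r2)] for r1, r2 in zip(k1, k2)]


# Two hypothetical deterministic maps between 2-point sets,
# written as 0/1 right stochastic matrices.
f = [[1.0, 0.0], [0.0, 1.0]]
g = [[0.0, 1.0], [1.0, 0.0]]

entropies = {t: kernel_entropy(mix(t, f, g)) for t in (0.0, 0.25, 0.5, 0.75, 1.0)}
for t, h in entropies.items():
    print(t, h)
```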
Appendix A Markov kernels
A Markov kernel is the probabilistic analogue of a function, assigning to every point in its domain not a single point in its codomain but a whole probability distribution over its codomain. Markov kernels are fundamental objects in probability theory and statistics [8, 24, 25, 28]. In this appendix, we recall the definition and basic properties of Markov kernels, as well as a few more obscure results from the literature.
Definition A.1 (Markov kernels).
Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be measurable spaces. A Markov kernel from $X$ to $Y$, also known as a probability kernel or stochastic kernel, is a function $M: X \times \Sigma_Y \to [0,1]$ such that
- (i) for all points $x \in X$, the map $M(x, -): \Sigma_Y \to [0,1]$ is a probability measure on $Y$, and
- (ii) for all sets $B \in \Sigma_Y$, the map $M(-, B): X \to [0,1]$ is measurable.
We usually write $M(B \mid x)$ instead of $M(x, B)$, in agreement with the standard notation for conditional probability.
Equivalently, a Markov kernel from $X$ to $Y$ is a measurable map $X \to \mathrm{Prob}(Y)$, where $\mathrm{Prob}(Y)$ is the space of all probability measures on $Y$ under the $\sigma$-algebra generated by the evaluation maps $\mu \mapsto \mu(B)$, $B \in \Sigma_Y$ [24, Lemma 1.40]. With this perspective, it is natural to denote a Markov kernel simply as $M: X \to Y$ and to write $M(x)$ for the distribution of $x$ under $M$.
If Markov kernels are probabilistic functions, then they ought to be composable. They are, according to a standard definition [28, Definition 14.25].
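Definition A.2 (Composition of Markov kernels).
The composition of a Markov kernel $M: X \to Y$ with another Markov kernel $N: Y \to Z$ is the Markov kernel $M \cdot N: X \to Z$ defined by
$$(M \cdot N)(C \mid x) := \int_Y N(C \mid y)\, M(dy \mid x), \qquad x \in X,\ C \in \Sigma_Z.$$
The identity with respect to this composition law is the usual identity function regarded as a Markov kernel, $x \mapsto \delta_x$.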
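In the finite case, where Section 6 identifies Markov kernels with right stochastic matrices, composition reduces to matrix multiplication and the identity kernel to the identity matrix. The following sketch illustrates this under that identification; the helper names `compose` and `identity` are hypothetical.

```python
def compose(m, n):
    """Compose finite Markov kernels given as right stochastic matrices.

    m has shape |X| x |Y| and n has shape |Y| x |Z|. The composite sends x
    to the distribution z -> sum_y m[x][y] * n[y][z], the matrix product.
    """
    return [
        [sum(m[x][y] * n[y][z] for y in range(len(n))) for z in range(len(n[0]))]
        for x in range(len(m))
    ]


def identity(size):
    """Identity kernel: each point is sent to the point mass at itself."""
    return [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]


m = [[0.5, 0.5], [0.2, 0.8]]
n = [[1.0, 0.0], [0.3, 0.7]]
print(compose(m, n))
```

Each row of the composite is again a probability distribution, since it is a convex combination of the rows of `n`.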
A third perspective on Markov kernels is that they are linear operators on spaces of measures or probability measures (see, for instance, [8, Theorem 5.2 and Lemma 5.10] or [54, Section 3.3]).
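Definition A.3 (Markov operators).
Let $M: X \to Y$ be a Markov kernel. Given a measure $\mu$ on $X$, define the measure $\mu M$ on $Y$ by
$$(\mu M)(B) := \int_X M(B \mid x)\, \mu(dx), \qquad B \in \Sigma_Y.$$
With this definition, $M$ is a Markov operator: writing $\mathrm{Meas}(X)$ for the space of all finite nonnegative measures on $X$, the Markov kernel $M: X \to Y$ acts as a linear map $\mathrm{Meas}(X) \to \mathrm{Meas}(Y)$ that preserves the total mass, $(\mu M)(Y) = \mu(X)$. In particular, $M$ acts as a linear map on spaces of probability measures.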
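Let $\mu$ be a measure on $X$ and let $M: X \to Y$ be a Markov kernel. Besides applying $M$ to $\mu$, yielding the measure $\mu M$ on $Y$, we can also take the product of $\mu$ and $M$, yielding a measure $\mu \otimes M$ on $X \times Y$ defined on measurable rectangles by
$$(\mu \otimes M)(A \times B) := \int_A M(B \mid x)\, \mu(dx),$$
where $A \in \Sigma_X$ and $B \in \Sigma_Y$. The product measure $\mu \otimes M$ has marginal $\mu$ along $X$ and marginal $\mu M$ along $Y$. In the special case where $M(x)$ equals a fixed measure $\nu$ for all $x \in X$, we recover the usual product measure $\mu \otimes \nu$.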
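For finite spaces, both operations become elementary matrix computations. The sketch below assumes the conventions of Section 6, with measures as row vectors and Markov kernels as right stochastic matrices; the helper names `apply_kernel` and `product_measure` are hypothetical. It checks numerically that the product has the two marginals just described.

```python
def apply_kernel(mu, m):
    """Push a measure mu (row vector) through a kernel m (right stochastic matrix)."""
    return [sum(mu[x] * m[x][y] for x in range(len(mu))) for y in range(len(m[0]))]


def product_measure(mu, m):
    """Joint measure on pairs (x, y) assigning mass mu[x] * m[x][y]."""
    return [[mu[x] * m[x][y] for y in range(len(m[0]))] for x in range(len(mu))]


mu = [0.25, 0.75]
m = [[0.5, 0.5], [0.2, 0.8]]

joint = product_measure(mu, m)
first_marginal = [sum(row) for row in joint]
second_marginal = [sum(joint[x][y] for x in range(len(mu))) for y in range(len(m[0]))]
print(first_marginal, second_marginal, apply_kernel(mu, m))
```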
It is often useful to know that the product operation is invertible. The inverse operation is called disintegration.
Theorem A.4 (Disintegration of measures [25, Theorem 1.23]).
Let $X$ be a measurable space and $Y$ be a Polish space. For any $\sigma$-finite measure $\pi$ on $X \times Y$, with $\sigma$-finite marginal $\mu$ along $X$, there exists a Markov kernel $M: X \to Y$ such that $\pi = \mu \otimes M$. Moreover, if $M'$ is any other Markov kernel with this property, then $M(x) = M'(x)$ for $\mu$-almost every $x \in X$.
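On finite spaces, disintegration amounts to normalizing the rows of a joint probability table. The sketch below is only an illustration of the finite case, with a hypothetical helper name `disintegrate`. Rows of zero mass are given the uniform distribution, an arbitrary choice that mirrors the almost-everywhere uniqueness in the theorem.

```python
def disintegrate(joint):
    """Split a finite joint measure into its first marginal and a Markov kernel.

    joint[x][y] is the mass of the pair (x, y). Rows of zero mass receive a
    uniform distribution; any other choice would serve equally well there.
    """
    marginal = [sum(row) for row in joint]
    n = len(joint[0])
    kernel = [
        [p / mass for p in row] if mass > 0 else [1.0 / n] * n
        for row, mass in zip(joint, marginal)
    ]
    return marginal, kernel


joint = [[0.1, 0.3], [0.0, 0.0], [0.2, 0.4]]
marginal, kernel = disintegrate(joint)
print(marginal)
print(kernel)
```

Multiplying each row of the kernel by the corresponding marginal weight recovers the original joint table, which is the finite form of the product operation being inverted.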
What is less well known is that Markov kernels into product spaces can also be disintegrated. We use this result in Section 5 to prove a gluing lemma for Markov kernels.
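Theorem A.5 (Disintegration of Markov kernels [25, Theorem 1.25]).
Let $X$ be a measurable space and let $Y$ and $Z$ be Polish spaces. For any Markov kernel $\Pi: X \to Y \times Z$ with marginal $M: X \to Y$ along $Y$, there exists a Markov kernel $N: X \times Y \to Z$ such that $\Pi = M \otimes N$, where the product is defined on measurable rectangles by
$$(M \otimes N)(B \times C \mid x) := \int_B N(C \mid x, y)\, M(dy \mid x),$$
where $x \in X$, $B \in \Sigma_Y$, and $C \in \Sigma_Z$.
In the special case where $X$ is a singleton, a Markov kernel $X \to Y \times Z$ is a probability distribution on $Y \times Z$ and we recover a version of the previous theorem.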
Under mild assumptions, optimal couplings of probability measures exist, according to a standard result in optimal transport [52, Theorem 4.1]. Markov kernels have optimal couplings under the same assumptions, though proving it is more involved. Versions of the following theorem appear in the literature as [15, Theorem 20.1.3], [52, Corollary 5.22], and [56, Theorem 1.1].
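Theorem A.6 (Optimal coupling of Markov kernels).
Let $X$ be a measurable space, $Y$ be a Polish space, and $c: Y \times Y \to \mathbb{R}$ be a nonnegative, lower semi-continuous cost function. For any Markov kernels $M, N: X \to Y$, there exists a Markov kernel $\Pi: X \to Y \times Y$ such that, for every $x \in X$,
- (i) $\Pi(x)$ is a coupling of $M(x)$ and $N(x)$, and
- (ii) $\Pi(x)$ is optimal with respect to the cost function $c$, i.e.,
$$\int_{Y \times Y} c \, d\Pi(x) = \inf_{\pi} \int_{Y \times Y} c \, d\pi,$$
where the infimum is over all couplings $\pi$ of $M(x)$ and $N(x)$.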
We invoke this theorem in Section 5 to show that the infimum defining the Wasserstein metric on Markov kernels is attained.
References
- [1] Yonathan Aflalo, Alexander Bronstein and Ron Kimmel “On convex relaxation of graph isomorphism” In Proceedings of the National Academy of Sciences 112.10, 2015, pp. 2942–2947
- [2] David Alvarez-Melis, Tommi Jaakkola and Stefanie Jegelka “Structured optimal transport” In International Conference on Artificial Intelligence and Statistics, 2018, pp. 1771–1780 arXiv:1712.06199
- [3] Roman V. Belavkin “Optimal measures and Markov transition kernels” In Journal of Global Optimization 55.2, 2013, pp. 387–416
- [4] Karsten M. Borgwardt and Hans-Peter Kriegel “Shortest-path kernels on graphs” In Fifth IEEE International Conference on Data Mining (ICDM’05), 2005, 8 pp., IEEE
- [5] Martin R. Bridson and André Haefliger “Metric spaces of non-positive curvature” Springer, 1999
- [6] Horst Bunke “On a relation between graph edit distance and maximum common subgraph” In Pattern Recognition Letters 18.8, 1997, pp. 689–694
- [7] Dmitri Burago, Yuri Burago and Sergei Ivanov “A course in metric geometry” American Mathematical Society, 2001
- [8] N. N. Čencov “Statistical decision rules and optimal inference”, Translations of Mathematical Monographs 53, American Mathematical Society, 1982
