Scalable Unbalanced Sobolev Transport for Measures on a Graph
Tam Le, Truyen Nguyen, Kenji Fukumizu

TL;DR
This paper introduces a scalable unbalanced Sobolev transport method for measures on a graph, addressing computational complexity and mass imbalance issues in optimal transport, with theoretical and empirical validation.
Contribution
It extends Sobolev transport to unbalanced measures on graphs, providing a closed-form formula, geometric insights, and kernel design for efficient comparison of measures.
Findings
Fast computation of the proposed UST method
Comparable performance to existing transport baselines
Effective kernel design for unbalanced measures
Scalable Unbalanced Sobolev Transport for Measures on a Graph
Tam Le ∗,†,‡
Truyen Nguyen ∗,⋄
Kenji Fukumizu †
The Institute of Statistical Mathematics †
The University of Akron ⋄
RIKEN AIP ‡
Abstract
Optimal transport (OT) is a popular and powerful tool for comparing probability measures. However, OT suffers from a few drawbacks: (i) input measures are required to have the same mass, (ii) a high computational complexity, and (iii) indefiniteness, which limits its applications in kernel-dependent algorithmic approaches. To tackle issues (ii)–(iii), Le et al. (2022) recently proposed Sobolev transport for measures on a graph having the same total mass by leveraging the graph structure over supports. In this work, we consider measures that may have different total mass and are supported on a graph metric space. To alleviate the disadvantages (i)–(iii) of OT, we propose a novel and scalable approach to extend Sobolev transport to this unbalanced setting. We show that the proposed unbalanced Sobolev transport (UST) admits a closed-form formula for fast computation, and that it is also negative definite. Additionally, we derive geometric structures for the UST and establish relations between our UST and other transport distances. We further exploit the negative definiteness to design positive definite kernels and evaluate them on various simulations to illustrate their fast computation and comparable performance against other transport baselines for unbalanced measures on a graph.
1 INTRODUCTION
Optimal transport (OT) has become a popular approach and its theory lays out a compelling toolkit for data analysis on probability distributions. OT has been leveraged in several research areas such as machine learning (Peyré and Cuturi, 2019; Nadjahi et al., 2019; Titouan et al., 2019; Bunne et al., 2019, 2022; Janati et al., 2020; Muzellec et al., 2020; Paty et al., 2020; Mukherjee et al., 2021; Altschuler et al., 2021; Fatras et al., 2021; Le et al., 2021a; Le et al., 2021b; Liu et al., 2021; Nguyen et al., 2021b; Scetbon et al., 2021; Si et al., 2021; Takezawa et al., 2022; Fan et al., 2022), computer vision (Nguyen et al., 2021a; Saleh et al., 2022; Wang et al., 2022b), and statistics (Mena and Niles-Weed, 2019; Weed and Berthet, 2019; Liu et al., 2022; Nguyen et al., 2022; Nietert et al., 2022; Wang et al., 2022a), to name a few. Nevertheless, it has some fundamental disadvantages.
∗: Two authors contributed equally.
One drawback of OT is that it requires input measures to have the same mass for the transportation. To address this problem, several proposals have been developed in the recent literature. For example, the partial optimal transport (POT) (Caffarelli and McCann, 2010; Figalli, 2010) constrains a fixed amount of mass for transportation; the optimal entropy transport (OET) (Liero et al., 2018; Chizat et al., 2018b; Kondratyev et al., 2016) optimizes a sum of a transport functional and two convex entropy functionals. Additionally, there are various other approaches, e.g., the Kantorovich-Rubinstein discrepancy (Hanin, 1992; Guittet, 2002; Lellmann et al., 2014; Sato et al., 2020), the unbalanced mass transport (Benamou, 2003), the generalized Wasserstein distance (Piccoli and Rossi, 2014, 2016), the unnormalized optimal transport (Gangbo et al., 2019), and the entropy partial transport (Le and Nguyen, 2021). These approaches are either special cases of the OET (e.g., by using some specific instances of entropy functional such as the total variation distance, distance), or a variant of OET (e.g., by using the distance, partial transport in place of the entropy functional, transport functional respectively). It is worth pointing out that the unbalanced setting for measures with unequal mass has been applied in several application domains and learning problems, e.g., color transfer and shape matching (Bonneel et al., 2015); multi-label learning (Frogner et al., 2015); positive-unlabeled learning (Chapel et al., 2020); natural language processing and topological data analysis (Le and Nguyen, 2021). In particular, the unbalanced approach becomes essential when supports of input measures are subject to noise or have outliers, since such supports are not desirably aligned in the matching problem (Frogner et al., 2015; Balaji et al., 2020; Mukherjee et al., 2021).
Another drawback of standard OT is its high computational complexity. This disadvantage also exists in the unbalanced optimal transport (UOT), which hinders its applications, especially in large-scale settings. For example, consider the OET with Kullback-Leibler divergence for the entropy functional, which is widely used in applications. For this, one can leverage entropic regularization to derive an efficient Sinkhorn-based algorithmic approach (Frogner et al., 2015; Chizat et al., 2018a; Séjourné et al., 2019), which has a quadratic complexity (Pham et al., 2020). Another popular approach to scale up UOT is to exploit geometric structures of supports, e.g., one-dimensional structure (Bonneel and Coeurjolly, 2019; Séjourné et al., 2022) or tree structure (Le and Nguyen, 2021; Sato et al., 2020). More concretely, Bonneel and Coeurjolly (2019) proposed the sliced partial optimal transport (SPOT) by projecting supports into a random one-dimensional space. By assuming a unit mass on each support, they developed an efficient algorithmic approach with a quadratic complexity in the worst case. Nonetheless, SPOT suffers from a curse of dimensionality, since using one-dimensional projections for supports limits its ability to capture topological structures of distributions, especially in a high-dimensional space. Le and Nguyen (2021) proposed the entropy partial transport (EPT) by exploiting a tree structure to remedy the curse of dimensionality for SPOT. Moreover, EPT yields the first closed-form solution among various variants of UOT (i.e., its complexity is linear in the number of edges in a tree), giving a fast computation that is applicable to large-scale settings. However, the tree structure may be a restrictive condition which narrows down its practical usage in applications.
The aforementioned circumstances motivate us to consider measures with unequal mass supported on a graph metric space, which has more degrees of freedom (i.e., graph structure rather than tree structure) and appears more commonly in applications. Inspired by the Sobolev transport (Le et al., 2022) for probability measures on a graph, we propose a novel and scalable approach to leverage graph structure and extend Sobolev transport to the unbalanced setting. At a high level, our contributions are three-fold as follows:
- we propose a novel $p$-order unbalanced Sobolev transport (UST) for measures with unequal mass supported on a graph metric space. We prove that UST admits a closed-form formula for fast computation and that it is negative definite;
- we derive geometric structures for the UST and propose positive definite kernels built upon the UST. Additionally, we establish relations between UST and the EPT on a graph;
- we empirically illustrate that UST is fast to compute (owing to its closed-form solution). Various simulations also demonstrate that the performances of the proposed kernels for UST compare favorably with other unbalanced transport baselines for measures with unequal mass on a graph.
The paper is organized as follows: we introduce notations and the problem setup in §2. In §3, we extend and derive the EPT for unbalanced measures on a graph. We then present our main contribution, the UST for measures with unequal mass on a graph, in §4 and derive its properties in §5. In §6, we evaluate the proposed kernel for UST against other unbalanced transport baselines for measures with unequal mass on a graph on various simulations. We conclude our work in §7. The detailed proofs for our theoretical results are placed in Appendix §A.2. Furthermore, we have released code for our proposals (https://github.com/lttam/UnbalancedSobolevTransport).
2 PRELIMINARIES
In this section, we introduce our problem setting, notations, and review relevant definitions.
We consider the same graph setting where are sets of nodes and edges respectively as in (Le et al.,, 2022) for Sobolev transport. More precisely, is an undirected, connected and physical graph in the sense that and each edge is the standard line segment in connecting the two corresponding end-points of . Graph has positive edge lengths and is imposed a graph metric which equals to the length of the shortest path on . Following a convention in (Le et al.,, 2022), by graph , we mean the set of all nodes in and all points forming the edges in , i.e., the continuous setting for graph . We also assume that there exists a fixed root node such that for every , is attained by the unique shortest path connecting and , i.e., the uniqueness property of the shortest paths (Le et al.,, 2022).
Given a point (resp. an edge in ), we denote (resp. ) as the collection of all points such that the unique shortest path in connecting the root node and contains the point (resp. the edge ). That is,
[TABLE]
[TABLE]
where we write for the shortest path in connecting the root node and .
We denote (resp. ) as the set of all nonnegative Borel measures on (resp. ) with a finite mass. By continuous function on , we mean that is continuous w.r.t. the topology on induced by the Euclidean distance. Similar adoption is also applied for continuous functions on . We denote as the collection of all continuous functions on .
Given a scalar , a function is called -Lipschitz w.r.t. the graph metric if
[TABLE]
For , we denote as its conjugate, i.e., s.t., . For a nonnegative Borel measure on , let denote the space of all Borel measurable functions satisfying . When , we assume that is bounded -a.e. instead. Functions are considered to be the same if for -a.e. . Then, is a normed space with the norm defined by
[TABLE]
[TABLE]
Recall that Sobolev transport for probability measures on a graph is an instance of integral probability metrics (IPM) (Müller, 1997). Intuitively, the definition of Sobolev transport is based on the dual form of the $1$-order Wasserstein distance, but its Lipschitz constraint for the critic function is considered in the graph-based Sobolev space (see (Le et al., 2022, §3) for the details). As a consequence, it may not be possible to directly leverage approaches for standard OT (e.g., partial OT, entropy (partial) transport) to extend Sobolev transport to unbalanced measures on a graph.
In this paper, we propose a detour to develop unbalanced Sobolev transport for measures with unequal mass on a graph. We first take a step back to leverage the EPT (for unbalanced measures on a tree) (Le and Nguyen,, 2021) and extend it for unbalanced measures on a graph (§3). Although it is still a great challenge to efficiently compute the EPT for unbalanced measures on a graph, this novel extension (especially its dual form) plays a cornerstone in deriving a scalable approach for the proposed unbalanced Sobolev transport (UST) (§4).
3 ENTROPY PARTIAL TRANSPORT ON A GRAPH
The entropy partial transport (EPT) (Le and Nguyen,, 2021) is developed for unbalanced measures on a tree. In this section, we propose an extension of EPT for unbalanced measures on a graph. Intuitively, EPT optimizes a sum of a transport function and two convex entropy functions in a similar spirit to the OET (Liero et al.,, 2018; Chizat et al., 2018b, ). We first consider the primal formulation of EPT on a graph. We then derive its dual formulation which is the main result of this section. This novel dual formulation paves the way for our development of the UST (§4).
Given two measures which may have different total mass, consider the set
[TABLE]
where and respectively denote the first and second marginals of ; by , we mean that for every Borel set . Similar convention is used when we write .
For , let and respectively be the Radon-Nikodym derivatives of w.r.t. and of w.r.t. , i.e., and . Then, we have -a.e., and -a.e. The weighted relative entropies of w.r.t. and of w.r.t. are defined by
[TABLE]
[TABLE]
where are convex and lower semicontinuous entropy functions; and are given nonnegative weight functions.
Given a continuous cost function with , a constant and a fixed scalar where , we consider the primal formulation of EPT problem on a graph:
[TABLE]
Following (Le and Nguyen,, 2021), we consider
[TABLE]
for the entropy functions in (3) and form a Lagrange multiplier conjugate to the constraint . As a result, we instead study the problem
[TABLE]
where is defined as
[TABLE]
The connection between problem (3) with mass constraint and problem (4) with Lagrange multiplier is given in Theorem A.1 (Appendix §A.1). Also, from Theorem A.1, we see that solving the auxiliary problem (4) gives us a solution to the original problem (3). We now derive a novel dual formulation for problem (4) which paves the way for our proposed UST (§4).
Theorem 3.1 (Dual formula for general cost).
For , nonnegative weights , and two input measures , we have
[TABLE]
where $\mathbb{K} \triangleq \Big\{(u,v) :\, u \leq w_{1},\ -b\lambda + \inf_{x\in\mathbb{G}} [b\,c(x,y) - w_{1}(x)] \leq v(y) \leq w_{2}(y),\ u(x) + v(y) \leq b[c(x,y) - \lambda]\Big\}$.
The main idea of proving this result is to attach to the graph a new point , and then suitably and carefully extend the cost and the input distributions to the set inspired by an observation in (Caffarelli and McCann,, 2010). The key point of this extension is to ensure that the extended input distributions on have the same total mass and the value of the new balanced OT between extended input distributions on is equal to that of the original EPT on graph (i.e., the unbalanced setting). We then exploit the dual theory for the new balanced OT problem on to establish the dual formulation for our EPT problem on graph (see Appendix §A.2 for detailed proof). When the ground cost is the graph metric , the dual formula in Theorem 3.1 can be rewritten in a simpler and more symmetric form as follows.
Corollary 3.2 (Dual formula for graph metric).
Assume that and the nonnegative weight functions are -Lipschitz w.r.t. . For simplicity, let . Then, we have
[TABLE]
where $\mathbb{U} \triangleq \big\{f \in C(\mathbb{G}) :\, -w_{2} - \frac{b\lambda}{2} \leq f \leq w_{1} + \frac{b\lambda}{2},\ |f(x) - f(y)| \leq b\, d_{\mathbb{G}}(x,y)\big\}$.
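As a rough illustration of the two constraints defining the set $\mathbb{U}$, the following sketch checks them for a function given only by its node values. This is a simplification made here, not part of the paper: the set $\mathbb{U}$ consists of continuous functions on the whole graph, including points on edges. The name `in_dual_set_U` and the dictionary-based inputs are hypothetical. Because $d_{\mathbb{G}}$ is a shortest-path length, checking the Lipschitz bound on each edge is enough to obtain it between any two nodes.

```python
def in_dual_set_U(f, w1, w2, edges, b, lam, tol=1e-12):
    """Check the two constraints of the set U on node values only.

    f, w1, w2 : dict node -> value (critic and the two weight functions)
    edges     : dict (x, y) -> positive edge length
    b, lam    : the scalars b and lambda appearing in the definition of U
    NOTE: node-only check is an assumption of this sketch; U itself is a
    set of continuous functions on the whole graph.
    """
    half = 0.5 * b * lam
    # Box constraint: -w2 - b*lam/2 <= f <= w1 + b*lam/2 at every node.
    for x, fx in f.items():
        if fx < -w2[x] - half - tol or fx > w1[x] + half + tol:
            return False
    # Lipschitz constraint on each edge; summing along a shortest path
    # then gives |f(x) - f(y)| <= b * d_G(x, y) for all pairs of nodes.
    for (x, y), length in edges.items():
        if abs(f[x] - f[y]) > b * length + tol:
            return False
    return True


# Toy path graph a - b - c with constant weights equal to 1.
edges = {("a", "b"): 1.0, ("b", "c"): 2.0}
w1 = {"a": 1.0, "b": 1.0, "c": 1.0}
w2 = {"a": 1.0, "b": 1.0, "c": 1.0}
ok = in_dual_set_U({"a": 0.0, "b": 1.0, "c": 2.0}, w1, w2, edges, b=1.0, lam=2.0)
too_steep = in_dual_set_U({"a": 0.0, "b": 1.5, "c": 2.0}, w1, w2, edges, b=1.0, lam=2.0)
too_large = in_dual_set_U({"a": 0.0, "b": 1.0, "c": 3.0}, w1, w2, edges, b=1.0, lam=2.0)
```

In this toy setting the admissible range at every node is $[-2, 2]$, so the first candidate is accepted, the second violates the Lipschitz bound on edge (a, b), and the third exceeds the upper bound at node c.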
Remark 3.3.
We remark that one cannot directly use the dual formulation in (Le and Nguyen, 2021), or that of (Piccoli and Rossi, 2014, 2016), for unbalanced measures on a graph since the considered problem does not satisfy the conditions imposed in these approaches for duality.
In principle, for input unbalanced measures on a graph, it is simpler to learn the optimal in dual form (6) than to learn the optimal in primal form (4). This is due to the fact that the critic is a function on the lower dimensional space compared to . Moreover, the Lipschitz constraint for is easier to handle than the constraint for . Nevertheless, it is still a challenge to effectively compute using (6).
As illustrated in (Le et al., 2019; Le and Nguyen, 2021) for transport problems on a tree, the Lipschitz constraint for the critic can be effectively optimized by leveraging the tree structure of the supports. Furthermore, the Lipschitz constraint is linked with the $1$-order Wasserstein distance via the Kantorovich duality formulation. Due to the different nature of duality for the $p$-order Wasserstein distance when $p > 1$, it is however unknown whether one can extend the fast computational results in (Le et al., 2019; Le and Nguyen, 2021) to the $p$-order Wasserstein distance with $p > 1$, even for measures on a tree.
To alleviate this, we propose in the next section an efficient -order unbalanced Sobolev transport for measures with unequal mass on a graph for any .
4 UNBALANCED SOBOLEV TRANSPORT
As pointed out in §3, it is a great challenge to efficiently compute (i.e., the EPT problem) for unbalanced measures on a graph using either the primal form (4) or the dual form (6). To overcome this issue, we propose in this section an efficient variant called unbalanced Sobolev transport (UST) distance. We further derive a novel closed-form formula which allows a fast computation for the proposed transport distance, especially for large-scale settings.
Our strategy in defining the UST is based on the dual formulation (6) (in Corollary 3.2), but simultaneously relaxes the two constraints for the critic function $f$ in the set $\mathbb{U}$. This approach is partially adopted in (Le and Nguyen, 2021) for the EPT problem for measures on a tree, but they only relax the first corresponding constraint for $f$ in the set $\mathbb{U}$ (i.e., the bounded constraint for the critic function $f$). However, keeping the Lipschitz constraint for $f$ prevents the approach in (Le and Nguyen, 2021) from being extended to structures more general than a tree (e.g., graph structure). We note that the Lipschitz constraint is about bounding the derivative of $f$ and hence it is more fundamental and relevant than the first constraint. In this paper, we propose to also relax the Lipschitz constraint by leveraging a notion of Sobolev functions. This approach relies on the following concept of derivatives for functions on graphs introduced by Le et al. (2022), which can be viewed as a generalized version of the fundamental theorem of calculus for a graph.
Definition 4.1 (Graph-based Sobolev space (Le et al., 2022)).
Let be a nonnegative Borel measure on , and let . A continuous function is in the Sobolev space if there exists a function satisfying
[TABLE]
Such function is unique in and is called the graph derivative of w.r.t. the measure . Hereafter, this graph derivative of is denoted by .
From Definition 4.1 and the property of space, we have
[TABLE]
whenever . In particular, is the smallest space and is the largest space. Additionally, we prove that contains the space of all Lipschitz continuous functions, and both spaces coincide when is a tree (see Lemma A.2 in Appendix §A.1 for the detail). Hereafter, let denote the length measure on as defined in (Le et al.,, 2022, §4.1) (see Appendix §B.1 for a review). We propose to regularize the transport in (6) by relaxing the constraint set for critic function in two ways:
Firstly, we replace the Lipschitz condition for the critic function in the set (in Corollary 3.2) by instead considering this constraint in the graph-based Sobolev space, i.e., with . This has the following advantages: (i) we can enlarge the constraint set on the Sobolev space by decreasing the value of parameter ; (ii) we can vary the constraint set by choosing a suitable measure on . The measure can be interpreted as a cost of moving a unit mass from one location to another, and this cost is the same as the graph metric when is chosen as the length measure of . Even when and , this relaxation viewpoint still has the fundamental benefit: it allows us to extend most of the main results in (Le and Nguyen,, 2021) for tree structure to graph structure.
We emphasize that extending the approach in (Le and Nguyen, 2021) (i.e., the EPT problem for measures on a tree) to the EPT problem for measures on a graph is problematic. In this special case, we know from Lemma A.2 (Appendix §A.1) that our corresponding Sobolev constraint is equivalent to the Lipschitz constraint when $\mathbb{G}$ is a tree. However, Lemma A.2 also implies that the Sobolev constraint set is possibly larger for a general graph $\mathbb{G}$. This flexibility of Sobolev functions enables us to overcome the limitation of the approach in (Le and Nguyen, 2021) (i.e., for a tree structure) and gives us an effective way to exploit the graph structure by working with critic functions of a specific form in Sobolev space (see Definition 4.1). Our results in this section reveal that a critic of Sobolev type in the sense of Definition 4.1 is more suitable for the EPT problem for measures on a graph than a critic of the Lipschitz type.
Secondly, we relax the first condition for in the set (i.e., the bounded constraint for the critic function ) by using the following observation. According to Definition 4.1, any function can be represented as
[TABLE]
If in addition , then by Hölder's inequality, the second term on the right-hand side is controlled by $b\,\omega\big([z_{0},x]\big)^{\frac{1}{p}}$. Thus, instead of requiring
[TABLE]
as in the definition of , we suggest to constrain only the first term .
Putting these two ways of regularization together, we propose to consider the following constraint set as a relaxation of the constraint set for the critic function in Corollary 3.2. Note that the choice of corresponds to our above discussion. Here, we generalize our theoretical development for a more general to allow an extra degree of freedom which might be potentially useful in practical applications, e.g., by tuning for further improvement.
Definition 4.2 (The regularized set for critic function).
For and , let be the collection of all functions satisfying
[TABLE]
and
[TABLE]
Equivalently, is the collection of all functions of the form
[TABLE]
with and with being some function satisfying
[TABLE]
It is clear from Definition 4.2 that (see Corollary 3.2 for set ). The requirement is to ensure that the interval is nonempty. By constraining critic to the relaxed set and noting that the last term in (6) is simply a constant depending on the total masses of and , we propose the following regularization of the transport in Corollary 3.2, namely unbalanced Sobolev transport (UST).
Definition 4.3 (Unbalanced Sobolev transport).
Let be a nonnegative Borel measure on graph . Given and . For , the unbalanced Sobolev transport is defined as follows
[TABLE]
The measure used for representing critic in (see (7)) acts as the ground cost of moving masses on graph from one location to another. Especially, when is chosen as the length measure of graph , we have (see Lemma B.2 in Appendix §B.1).
We then show the connection between -order UST and the dual formulation of EPT on graph with the Lipschitz constraint, but the bounded constraint only applied on the critic function at root node . Precisely, we obtain:
Lemma 4.4.
Recall that is the length measure of graph . For , we have
[TABLE]
where $\mathbb{U}_{0} \triangleq \Big\{f \in C(\mathbb{G}) :\, -w_{2}(z_{0}) - \frac{b\lambda}{2} \leq f(z_{0}) \leq w_{1}(z_{0}) + \frac{b\lambda}{2},\ |f(x) - f(y)| \leq b\, d_{\mathbb{G}}(x,y)\Big\}$. Moreover, the inequality in (8) becomes an equality if $\mathbb{G}$ is a tree.
We next state our fundamental result, which demonstrates that the proposed UST (Definition 4.3) for measures with unequal mass on a graph is computationally efficient. We in fact obtain a closed-form formula for UST in terms of an integral explicitly depending on the input measures. This yields a substantial computational advantage in comparison with the EPT approach for unbalanced measures on a graph (i.e., ), which requires solving sophisticated optimization problems either in the primal form (4) or in its dual (6). To our knowledge, the proposed UST is the first approach which yields a closed-form solution among available variants of unbalanced OT for measures with unequal mass on a graph.
Proposition 4.5.
Let be a nonnegative measure on graph . Let and . Then, for two input measures , we have
[TABLE]
where is defined by (1) and
[TABLE]
The constant depends on and unless or . The integral in the above expression can be computed explicitly and efficiently as in the following corollary when the two input distributions are supported on nodes of the graph (i.e., the node set of graph ).
Corollary 4.6.
Under the same assumptions as in Proposition 4.5 and assume in addition that for every . Suppose that are supported on nodes in of graph (an extension for measures supported in is discussed in Appendix §B.2). Then, we have
[TABLE]
Remark 4.7 (UST for non-physical graph).
We have assumed that is a physical graph as in §2. However, Corollary 4.6 shows that the -order unbalanced Sobolev transport does not depend on this physical assumption when input measures are supported on nodes. Precisely, it only depends on the graph structure and edge weights . Thus, can be applied for non-physical graph .
We next describe a preprocessing step on graph and analyze the time complexity in computing .
Preprocessing step. To compute , we apply a preprocessing step to form the set for each edge in graph $\mathbb{G}$ by identifying shortest paths from the root node $z_{0}$ to other nodes (e.g., by Dijkstra's algorithm with a complexity where are the numbers of edges and nodes of graph $\mathbb{G}$ respectively). In particular, observe that any edge with does not contribute to the computation of . Therefore, one can remove such edges from the summation in Corollary 4.6. We emphasize that this preprocessing step only involves the graph structure itself and is independent of the input measures.
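The preprocessing described above can be sketched as follows. This is a minimal illustration, not the released implementation: the names `shortest_path_tree` and `gamma_sets` are hypothetical, the graph is given as an adjacency dictionary, the per-edge sets are restricted to nodes, and the uniqueness of shortest paths from the root (§2) is assumed to hold.

```python
import heapq
from collections import defaultdict


def shortest_path_tree(adj, root):
    """Dijkstra from `root`; returns (dist, parent) over reachable nodes.

    `adj` maps node -> list of (neighbor, positive edge length).
    """
    dist = {root: 0.0}
    parent = {root: None}
    done = set()
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def gamma_sets(adj, root):
    """Map each shortest-path-tree edge (parent, child) to the set of nodes
    whose shortest path from `root` passes through that edge."""
    dist, parent = shortest_path_tree(adj, root)
    gamma = defaultdict(set)
    for node in dist:
        # Walk up to the root, registering `node` on every edge of its path.
        child = node
        while parent[child] is not None:
            gamma[(parent[child], child)].add(node)
            child = parent[child]
    return dict(gamma)


# Toy graph: a 4-cycle a-b-c-d-a, rooted at "a".
adj = {
    "a": [("b", 1.0), ("d", 2.5)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("b", 1.0), ("d", 1.0)],
    "d": [("c", 1.0), ("a", 2.5)],
}
gamma = gamma_sets(adj, "a")
```

In this toy graph the edge between c and d lies on no shortest path from the root, so it receives no entry and would never contribute to the summation. The result depends only on the graph and the root, so it can be computed once and reused for every pair of input measures.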
Computational complexity. Let $E_{\mu,\nu} \triangleq \left\{e \in E \mid e \subset [z_{0},z] \text{ for some } z \in \text{supp}(\mu) \cup \text{supp}(\nu)\right\}$, where $\text{supp}(\mu)$ and $\text{supp}(\nu)$ are respectively the supports of measures $\mu$ and $\nu$. Then, the computational complexity of computing the UST is linear in the number of edges in $E_{\mu,\nu}$.
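To make the complexity claim concrete, here is a hedged sketch of the edge summation for node-supported measures. It assumes the sum has the Sobolev-transport shape $\big(\sum_{e} w_e\,|\mu(\gamma_e) - \nu(\gamma_e)|^p\big)^{1/p}$ over shortest-path-tree edges, with $\gamma_e$ denoting the per-edge node set from the preprocessing step and $w_e$ the edge length. The full closed form of Corollary 4.6 also involves constants and the total masses, which are deliberately not reproduced here. The function name `edge_term` and the `parent`/`length` dictionaries are hypothetical, and only finite $p \geq 1$ is handled.

```python
from collections import defaultdict


def edge_term(mu, nu, parent, length, p=1.0):
    """Edge-sum part of the closed form for node-supported measures.

    mu, nu : dict node -> nonnegative mass (total masses may differ)
    parent : dict node -> parent in the shortest-path tree (root -> None)
    length : dict child -> length of the tree edge (parent[child], child)
    ASSUMPTION: only the Sobolev-type edge sum is computed; the additional
    mass-dependent term of the UST is left out.
    """
    # diff[child] accumulates mu(gamma_e) - nu(gamma_e) for e = (parent, child).
    diff = defaultdict(float)
    for measure, sign in ((mu, 1.0), (nu, -1.0)):
        for node, mass in measure.items():
            child = node
            while parent[child] is not None:
                diff[child] += sign * mass
                child = parent[child]
    total = sum(length[c] * abs(d) ** p for c, d in diff.items())
    return total ** (1.0 / p)


# Shortest-path tree rooted at "a": a-b-c is a chain, d hangs off a.
parent = {"a": None, "b": "a", "c": "b", "d": "a"}
length = {"b": 1.0, "c": 1.0, "d": 2.5}
mu = {"c": 2.0, "d": 1.0}   # total mass 3
nu = {"b": 1.0}             # total mass 1
value_p1 = edge_term(mu, nu, parent, length, p=1.0)
value_p2 = edge_term(mu, nu, parent, length, p=2.0)
```

Only edges on root-to-support paths are touched, which matches the edge set $E_{\mu,\nu}$. This naive walk may revisit an edge once per support node; a bottom-up accumulation over the tree would visit each such edge once.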
Related work. Beyond the pure graph of supports, the metric structure inherited from the graph metric space plays an important role in our work. More precisely, an edge weight is considered as the cost of moving a unit mass from one node of an edge to the other (i.e., the graph metric distance between the two end nodes). Therefore, one should distinguish our approach from the unbalanced diffusion earth mover’s distance (Tong et al., 2022), which uses an affinity between two edge nodes in its graph.
Relation with Sobolev transport (ST) (Le et al., 2022). We emphasize that ST is only valid for measures with equal mass on a graph. It cannot be applied to our considered problem where input measures may have different total mass. Even though both ST and the proposed UST are instances of integral probability metrics (IPM), it is nontrivial to effectively extend ST to unbalanced measures on a graph by defining a function set for the critic. The theoretical results of EPT on a graph in §3 play a fundamental role in developing our proposed UST.
Remark 4.8 (The special case of balanced mass).
When input measures have the same mass, from Lemma A.6 of §A.1.5, the proposed unbalanced Sobolev transport (with ) coincides with the balanced Sobolev transport (Le et al., 2022, Definition 3.2).
Relation with EPT on a tree (Le and Nguyen, 2021). As we discussed previously, extending the approach in (Le and Nguyen, 2021) for EPT on a tree to our considered problem (i.e., EPT on a graph) is problematic. We see from Lemma A.2 (Appendix §A.1) and Lemma 4.4 that the Sobolev constraint set in our approach is possibly larger than the Lipschitz constraint set for a general graph $\mathbb{G}$, but these two constraint sets coincide when $\mathbb{G}$ is a tree. Our results illustrate that it is more efficient to exploit graph structure with a critic of Sobolev type (as in our approach) than with a critic of the Lipschitz type (as in EPT on a tree).
5 PROPERTIES OF UNBALANCED SOBOLEV TRANSPORT
In this section, we derive geometric structures together with bounds for UST and prove its negative definiteness. Consequently, we develop positive definite kernels upon UST, required in many kernel-dependent frameworks.
We first show that possesses the metric property. Moreover, it makes the space of measures a geodesic space. Thus, inherits all geometric properties of the geodesic space.
Proposition 5.1 (Geometric structures of ).
Let be a nonnegative Borel measure on . Assume that . For and , we have
- i) , .
- ii) is a divergence (i.e., , and if and only if ) and satisfies the triangle inequality:
[TABLE]
- iii) If in addition , then is a metric and is a complete metric space. Moreover, it is a geodesic space in the sense that for every two points and in there exists a path with such that , , and
[TABLE]
In Proposition A.4 (Appendix §A.1), we also establish a comparison between for different exponent . We next derive a lower bound for in terms of . In fact, a more general estimate holds true for every and is given in Proposition A.5 (Appendix §A.1). As a consequence of Corollary 3.2 and Lemma 4.4 and since , we obtain:
Proposition 5.2 (Lower bound for ).
Recall that is the length measure on . Assume that are -Lipschitz w.r.t. . For , , we have
[TABLE]
We emphasize that when $\mathbb{G}$ is a tree, our EPT on a graph (i.e., and ) coincide with the ones defined in (Le and Nguyen, 2021). Furthermore, we have:
Proposition 5.3 (Lower bounds).
Assume that $\mathbb{G}$ is a tree and . The following hold true:
- i) . Also for , we have
[TABLE]
where is defined in (Le and Nguyen, 2021, Eq. (9)).
- ii) If , then for , we have
[TABLE]
where is the -order Wasserstein distance (its definition is recalled in Appendix §B.1) with cost . Moreover, the equality is attained when .
We next prove the negative definiteness for UST. This important property allows us to build positive definite kernels upon UST, required for kernel-dependent machine learning algorithmic approaches.
Proposition 5.4.
Under the same assumptions as in Corollary 4.6 and . Then, is negative definite on for any .
From Proposition 5.4 and by using (Berg et al., 1984, Theorem 3.2.2), we obtain that the kernel
[TABLE]
is positive definite on for any given and .
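The cited result of Berg et al. states that a function $\psi$ is negative definite if and only if $\exp(-t\psi)$ is positive definite for all $t > 0$. The sketch below builds a Gram matrix of this exponential form from a precomputed matrix of pairwise values. It is an illustration under assumptions: the exact kernel is given by the display above, the helper name `exp_kernel_gram` is hypothetical, and the toy input uses $|x - y|$ on the real line as a stand-in for a negative definite transport distance.

```python
import math


def exp_kernel_gram(dist, t):
    """Gram matrix K[i][j] = exp(-t * dist[i][j]) for a square matrix `dist`.

    ASSUMPTION: `dist` holds pairwise values of a negative definite function
    (e.g., precomputed transport distances); `t` > 0 is the kernel bandwidth.
    """
    n = len(dist)
    return [[math.exp(-t * dist[i][j]) for j in range(n)] for i in range(n)]


# Stand-in negative definite distance: |x - y| on the real line.
points = [0.0, 1.0, 3.0]
dist = [[abs(x - y) for y in points] for x in points]
K = exp_kernel_gram(dist, t=1.0)

# Quadratic form c^T K c for one test vector; it is nonnegative for a
# positive definite kernel.
c = [1.0, -1.0, 1.0]
quad = sum(c[i] * K[i][j] * c[j] for i in range(3) for j in range(3))
```

A Gram matrix of this kind can be passed to any kernel method that accepts a precomputed kernel, with $t$ chosen by cross-validation.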
6 EXPERIMENTS
In this section, we illustrate the fast computation (i.e., closed-form solution) of the proposed UST and comparable performances of the proposed positive definite kernel associated to UST against other popular unbalanced transport baselines and their corresponding kernels. More concretely, we evaluate for measures with unequal mass on a given graph under two simulations: document classification and topological data analysis (TDA).
Document classification. We consider four traditional document datasets: TWITTER, RECIPE, CLASSIC, and AMAZON. Their characteristics are summarized in Figure 1. We represent each document as a measure by considering each word in the document as its support with a unit mass. Therefore, documents with different lengths have different total mass. We employ the same word embedding procedure as in (Le and Nguyen,, 2021) to embed words into vectors in .
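The document representation described above can be sketched as follows. The helper name `document_to_measure` is hypothetical, and the sketch makes one assumption not spelled out in the text: repeated occurrences of a word accumulate their unit masses on the same support point.

```python
from collections import Counter


def document_to_measure(tokens):
    """Represent a tokenized document as a measure: word -> mass.

    Each word occurrence carries a unit mass, so the total mass equals the
    document length. ASSUMPTION: repeated words share one support point.
    """
    return dict(Counter(tokens))


short_doc = document_to_measure(["graph", "transport"])
long_doc = document_to_measure(["graph", "measure", "graph", "kernel", "mass"])
short_mass = sum(short_doc.values())
long_mass = sum(long_doc.values())
```

The two toy documents yield measures with total masses 2 and 5, which is the unequal-mass situation the UST is designed for.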
TDA. We carry out two tasks: orbit recognition on the Orbit dataset and object shape recognition on the MPEG7 dataset. The Orbit dataset is synthesized as in (Adams et al., 2017) from the linked twist map, a discrete dynamical system used to model flows in DNA microarrays (Hertzsch et al., 2007). There are five classes of orbits in the dataset. For each class, we generated orbits where each orbit contains points. For the MPEG7 dataset (Latecki et al., 2000), we consider its 10-class subset where each class has samples as in (Le and Yamada, 2018). The characteristics of the considered Orbit and MPEG7 datasets are summarized in Figure 2. We use the same procedure as in (Le and Nguyen, 2021) to extract persistence diagrams (PDs) for orbits and object shapes. PDs are multisets of points in . Each point in a PD summarizes the lifespan (i.e., birth and death time) of a topological feature (e.g., connected component, ring, cavity). We represent each PD as a measure by regarding each -dimensional point in the PD as its support with a unit mass. Consequently, persistence diagrams with different numbers of topological features are represented as measures with different total mass.
Notice that supports in document classification simulations are in high-dimensional spaces (i.e., in ) while supports in TDA simulations are in low-dimensional spaces (i.e., in ). Therefore, we can observe the effects of dimensions to the proposed UST and other unbalanced transport baselines from these simulations. We next describe various graph settings (i.e., the assumed graph metric spaces for measures) considered in our experiments.
Graph settings. We use the same graph settings (i.e., and ) employed in (Le et al.,, 2022, §5) for our simulations on document classification and TDA. For these graphs, we consider the number of nodes: . We note that these graphs satisfy the assumptions in §2. Similar to the observations in (Le et al.,, 2022), each node in these graphs has a high probability to satisfy the root node condition, i.e., the uniqueness property of the shortest path (see Appendix §B.2 for a further discussion).
Root node for UST. The UST is defined over graph with a root node . From Definition 4.1, the root node imposes its own geometry by characterizing the graph derivative of functions on . To alleviate this dependency, we follow the sliced approach in (Le et al.,, 2022) for Sobolev transport by averaging over different choices of the root node in graph , which can be viewed as a sliced variant for UST.
Baselines and experimental setup. We consider two typical UOT approaches for measures with unequal mass supported on a graph metric space as baselines: (i) the Sinkhorn-based UOT (Frogner et al., 2015; Chizat et al., 2018a) () with a graph metric ground cost, and (ii) the distance of EPT on a tree (Le et al., 2022, Eq. (9)) (see Proposition 5.3 for its relation with ) where the tree structures are randomly sampled from graph . (Séjourné et al. (2019) derived a debiased version for which may be helpful in applications; the debiased version is also empirically indefinite and has the same complexity as .) From the results in Lemma 4.4 and Proposition 5.2 and for simplicity, we consider and (and for EPT on a tree as in (Le and Nguyen, 2021)); one may tune these parameters for further improvements. We further note that there are different approaches for simulations on document classification and TDA. However, comparing them is not the purpose of our empirical simulations, which compare different unbalanced transports for measures with unequal mass on a graph in the same settings.
We apply the kernel approach in the form , where is a discrepancy for unbalanced measures on a graph and , with support vector machines (SVM) for the simulations on document classification and TDA. Note that kernels for and are positive definite, but kernels for are empirically indefinite (see (Peyré and Cuturi, 2019, §8.3)). Similar to (Le and Nguyen, 2021), we regularized the Gram matrices for kernels with by adding a sufficiently large diagonal term.
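One way to pick a "sufficiently large" diagonal term for an indefinite Gram matrix is sketched below. The text does not say how the term was chosen, so the rule used here is an assumption: a Gershgorin-circle bound, which makes a symmetric matrix diagonally dominant with a nonnegative diagonal and hence positive semidefinite. The function names are hypothetical, and the shift is typically larger than the minimal one given by the smallest eigenvalue.

```python
def gershgorin_shift(K):
    """Smallest s >= 0 such that K + s*I is diagonally dominant with nonnegative diagonal."""
    n = len(K)
    lower = min(
        K[i][i] - sum(abs(K[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )  # Gershgorin lower bound on the eigenvalues of a symmetric matrix
    return max(0.0, -lower)

def add_diagonal(K, shift):
    """Return K + shift * I without modifying K."""
    n = len(K)
    return [[K[i][j] + (shift if i == j else 0.0) for j in range(n)] for i in range(n)]

def quadratic_form(K, x):
    """Compute x^T K x."""
    n = len(K)
    return sum(x[i] * K[i][j] * x[j] for i in range(n) for j in range(n))

# A symmetric but indefinite toy "Gram" matrix.
K = [[1.0, 0.9, -0.9],
     [0.9, 1.0, 0.9],
     [-0.9, 0.9, 1.0]]
K_reg = add_diagonal(K, gershgorin_shift(K))
```

On this toy matrix the vector `[1, -1, 1]` gives a negative quadratic form before the shift and a nonnegative one afterwards.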
For simplicity, we employ the same setup for the EPT problem in (Le and Nguyen,, 2021), i.e., using for the EPT. From Corollary 4.6 and Proposition 5.4, we consider the weight functions where and .
For kernel SVM, we use the same setting as in (Le and Nguyen, 2021). We randomly split each dataset into for training and test with repeats. We use the 1-vs-1 strategy for SVM with multiclass data. Hyperparameters are typically chosen by cross validation. For the kernel hyperparameter, we choose from with where is the quantile of a random subset of corresponding distances on training data. For the SVM regularization hyperparameter, we choose it from . For , we choose the entropic regularization from . The reported time consumption for each kernel matrix also includes the corresponding preprocessing, e.g., computing shortest paths on graph for and , or sampling random tree structures from for of EPT on a tree.
Results of SVM, time consumption and discussions. We illustrate the SVM results and time consumption for kernel matrices for document classification and TDA in Figure 1 and Figure 2 with for document datasets, for Orbit and for MPEG7 for graph . The performances of kernels for our proposed UST compare favorably with other approaches (except on RECIPE). Additionally, the computation of and is several orders of magnitude faster than that of . Recall that kernels for are indefinite, which may affect performances of in some datasets (e.g., Orbit, TWITTER). In Figure 3, we illustrate the effects of the number of slices (i.e., the number of root nodes used for averaging) for and for TDA. Generally, performances of those approaches improve with more slices, at the cost of more computation time. We observe that slices give a good trade-off in applications. Further empirical results can be found in Appendix §B.3, e.g., for various graph structures, graph sizes , and different orders of UST.
7 CONCLUSION
In this work, we proposed unbalanced Sobolev transport (UST) for measures with unequal mass on a graph. UST is the first variant of UOT having a closed-form formula for fast computation. Additionally, UST is negative definite, which allows one to build positive definite kernels, required for kernel-dependent frameworks. Since UST exploits the graph metric structure of supports, it may be restricted to applications with prior graph structures, or applications where one can build graphs from supports. On the other hand, we have not foreseen any negative societal impacts of our work.
Acknowledgements
We thank anonymous reviewers and area chairs for their comments. KF has been supported in part by Grant-in-Aid for Transformative Research Areas (A) 22H05106. The research of TN is supported in part by a grant from the Simons Foundation (). TL gratefully acknowledges the support of JSPS KAKENHI Grant number 20K19873. Finally, this research was enabled in part by computational support provided by Makoto Yamada.
Appendix A PROOFS AND ADDITIONAL THEORETICAL RESULTS
In this section, we give detailed proofs for the theoretical results in the main manuscript. We also provide some additional results for the unbalanced Sobolev transport (UST).
A.1 Further Theoretical Results
We include here some additional results for the transport problems and the unbalanced Sobolev transport .
A.1.1 The Connection between Problem (3) and Problem (4)
We show the connection between problem (3) and problem (4) for EPT on a graph by following a similar reasoning as EPT on a tree (Le and Nguyen,, 2021). It is a direct extension of results in (Le and Nguyen,, 2021).
Theorem A.1.
Let for , and denote
[TABLE]
for the set of all subgradients of at . Also, set . Then, we have
- i)
* is a convex function on , and*
[TABLE]
where we write for the set of all optimal plans . Also if , then for every and .
- ii)
* is differentiable at if and only if every optimal plan in has the same mass. When this happens, we also have*
[TABLE]
for any .
- iii)
If there exists a constant such that
[TABLE]
for all , then . Moreover,
[TABLE]
when , and for .
The proof is placed in §A.2.1.
For any , part iii) of Theorem A.1 implies that there exists such that . It then follows from part i) of this theorem that for some . It is also clear that this is an optimal plan for , and
[TABLE]
Thus solving the auxiliary problem (4) gives us a solution to the original problem (3). When is differentiable, the relation between and is given explicitly as
[TABLE]
Note that the above selection of is unique only if the function is strictly convex. Nevertheless, it enjoys the following monotonicity regardless of the uniqueness: if , then . Indeed, we have and for some and . Since , one has by i) of Theorem A.1.
A.1.2 versus Lipschitz space
We describe the connection between the Sobolev space and the space of Lipschitz continuous functions. The definition of the length measure is reviewed in §B.1.1.
Lemma A.2.
Let be the length measure on graph , and let be a function. We have:
- i)
If , , then with .
- ii)
Assume in addition that is a tree. Then, with implies that for every .
The proof is placed in §A.2.2.
Remark A.3.
Our proof for Lemma A.2 (in §A.2.2) also shows that the result in part ii) of Lemma A.2 in fact holds for every measure . Precisely, let be a nonnegative Borel measure on a tree . Then, with implies that for every .
A.1.3 Comparison between Sobolev Spaces with Different Exponents
We derive a comparison between UST with different exponents , and its proof is a direct consequence of our closed-form formula given in Proposition 4.5.
Proposition A.4 (Relation for different ).
Assume that is a nonnegative Borel measure on . Then for any and , we have
[TABLE]
where is the constant defined by (11).
Proof of Proposition A.4.
The case is trivial, so let us consider . Then by using Proposition 4.5 and Hölder’s inequality, we obtain
[TABLE]
∎
A.1.4 Lower Bound for
We derive a lower bound for which is a generalization of the result for in Proposition 5.2.
Proposition A.5 (Lower bound for ).
Let be the length measure on , and assume that and are -Lipschitz w.r.t. . Then by taking , we have for every that
[TABLE]
for every . Here is the constant defined by (11).
Proof.
This is a consequence of Corollary 3.2, Lemma 4.4, and Proposition A.4. ∎
A.1.5 The Special Case of Balanced Mass
Observe that for the case , the constraint in the definition of is redundant. Indeed, we have:
Lemma A.6.
Let be a nonnegative Borel measure on . Assume that satisfy . Then,
[TABLE]
In particular, is independent of the parameters , and the weights , .
Proof.
This follows from the fact that Definition 4.3 is unchanged in the case when the critic function is translated by a constant. ∎
From Lemma A.6, we see that for the case , our proposed unbalanced Sobolev transport with coincides with the balanced Sobolev transport (defined in (Le et al.,, 2022, Definition 3.2)).
A.1.6 Infinite Divisibility for Unbalanced Sobolev Transport Kernel
Recall that given and , the unbalanced Sobolev transport kernel is positive definite (see §5 and Proposition 5.4).
For , the kernel is positive definite. Additionally, . Therefore, is infinitely divisible following (Berg et al., 1984, Definition 2.6 in §3).
Hence, one does not need to recompute the Gram matrix for unbalanced Sobolev transport kernel for different values of . Indeed, it suffices to compute the Gram matrix of once for some fixed and leverage its infinite divisibility for other values of .
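The reuse of a single Gram matrix can be illustrated as follows. This is a sketch under the assumption that the kernel has the exponential form exp(-t d) given by Theorem 3.2.2 of Berg et al. (1984), where d stands for a precomputed matrix of pairwise discrepancies. The function names and the toy matrix `D` are hypothetical.

```python
import math

def gram_from_distances(D, t):
    """Gram matrix with entries exp(-t * d) for a matrix D of pairwise discrepancies."""
    return [[math.exp(-t * d) for d in row] for row in D]

def rescale_gram(K, t_old, t_new):
    """Reuse a Gram matrix: exp(-t_new * d) == exp(-t_old * d) ** (t_new / t_old), entrywise."""
    ratio = t_new / t_old
    return [[k ** ratio for k in row] for row in K]

# Toy symmetric discrepancy matrix with zero diagonal (assumed, for illustration only).
D = [[0.0, 1.5, 2.0],
     [1.5, 0.0, 0.7],
     [2.0, 0.7, 0.0]]
K_base = gram_from_distances(D, 1.0)        # computed once
K_fast = rescale_gram(K_base, 1.0, 3.0)     # obtained by an entrywise power
K_direct = gram_from_distances(D, 3.0)      # recomputed from scratch, for comparison
```

The entrywise power only touches the stored Gram matrix, so the discrepancies never need to be recomputed when sweeping the kernel hyperparameter during cross validation.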
A.2 Detailed Proofs
In this section, we give detailed proofs for our theoretical results.
A.2.1 Proof of Theorem A.1
Proof of Theorem A.1.
We employ a similar reasoning for EPT on a tree (Le and Nguyen, 2021) to prove the relation between problem (3) and problem (4) for EPT on a graph as follows:
i) Note that is a concave function since it is the infimum of a family of concave functions in . Therefore, is convex on . In particular, is differentiable almost everywhere on .
Let , recall the definition of in Equation (3). Then for any , we have
[TABLE]
This implies that
[TABLE]
We next show that the opposite inclusion is also true, i.e., $\big\{b\,\gamma(\mathbb{G}\times\mathbb{G}):\gamma\in\Gamma^{0}(\lambda)\big\}=\partial H(\lambda)$. This obviously holds if is a singleton, which holds for example when is differentiable at . Hence we only need to consider for which the convex set has more than one element.
Let , then can be expressed as a convex combination of extreme points of , i.e., with and . As is an extreme point of , there exists a sequence such that is a differentiable point of and .
Let , then . By compactness, there exists a subsequence and such that weakly. It follows that , and hence we must have . We have
[TABLE]
and for any , there holds
[TABLE]
We thus deduce that . These together with the lower semicontinuity of give
[TABLE]
Therefore, with mass . Due to the convexity of , we have with . That is,
[TABLE]
and we thus infer that $\big\{b\,\gamma(\mathbb{G}\times\mathbb{G}):\gamma\in\Gamma^{0}(\lambda)\big\}=\partial H(\lambda)$ for all .
In order to prove the second part of i), let and be arbitrary. We have
[TABLE]
Hence by combining with (13), we deduce that
[TABLE]
which yields . This together with the above characterization of implies the second part of i).
ii) If is differentiable at , then is a singleton set. However, as $\partial H(\lambda)=\big\{b\,\gamma(\mathbb{G}\times\mathbb{G}):\gamma\in\Gamma^{0}(\lambda)\big\}$ by i), we thus infer that the mass must be the same for every .
Next assume that every element in has the same mass, say . For , let and . Then, we claim that
[TABLE]
Assume the claim for the moment, and let . Then, as in (13)–(A.2.1), we have
[TABLE]
It follows that
[TABLE]
This together with claim (15) gives . By the same argument, we also have . Thus, we infer that is differentiable at with . Therefore, it remains to prove claim (15).
Indeed, by compactness there exists a subsequence, still labeled by , and such that weakly as . As in i), we can show that . Then, as the mass functional is weakly continuous, we obtain . We in fact have shown that any subsequence of has a further subsequence converging to the same number . Therefore, the full sequence must converge to , and hence (15) is proved.
iii) For any , we have by i) that $\partial H(\lambda)=\big\{b\,\gamma(\mathbb{G}\times\mathbb{G}):\gamma\in\Gamma^{0}(\lambda)\big\}\subset[0,b\,\bar{m}]$. Thus, we only need to prove . First, note that as is a compact and convex set, it must be a finite and closed interval. Therefore, if we let
[TABLE]
then it follows from ii) that $\partial H(\lambda)=\big[b\,\gamma^{\lambda}_{\min}(\mathbb{G}\times\mathbb{G}),\,b\,\gamma^{\lambda}_{\max}(\mathbb{G}\times\mathbb{G})\big]$ for every . From Equation (3), it is clear that for negative enough. Indeed, if we take , then as , we have for all . Then, we obtain from Equation (3) that for every and the strict inequality holds if . Thus, which gives and .
We next show that for positive enough. Since is bounded due to its continuity on , we can choose such that for all . Let . We claim that either or . Indeed, since otherwise we have and for some Borel sets . Let . Then, for any Borel set we have
[TABLE]
Likewise, for any Borel set . Thus . On the other hand, it is clear from (3) and the facts , , and that . This is impossible and so the claim is proved. That is, either or . It follows that for every , and hence due to i). This also means that is differentiable at with .
Therefore, it remains to show that
[TABLE]
Assume by contradiction that there exists such that for every . For convenience, we adopt the following notation: for sets and , we write if for every , and if for every and . Let us consider the following two sets
[TABLE]
Then if is negative enough, and if is positive enough. For any and , we have , and hence by the monotonicity in i). That is, and so we obtain
[TABLE]
If , then for any we have and . Therefore, and . Hence, we can find such that and . Thus, due to the convexity of the set . This contradicts our hypothesis, and we conclude that .
We next select sequences and such that and . For each , let
[TABLE]
By compactness, there exist subsequences, still labeled as and , and such that weakly and weakly. By arguing exactly as in i), we then obtain , , and . As due to , we must have . Likewise, we have as for all . Hence, . Since , we infer that . This is a contradiction and the proof is complete. We note that since , we have from the monotonicity in i) that
[TABLE]
for every . By sending to infinity, it follows that for every . That is, and . ∎
A.2.2 Proof of Lemma A.2
Proof of Lemma A.2.
Let us define
[TABLE]
and
[TABLE]
i) The statement of this part is equivalent to showing that . Let . Then is continuous on , and
[TABLE]
On each edge and similar to the real line, the Lipschitz condition (17) implies that there exists a function with the following properties: for -a.e. , and
[TABLE]
where we recall that denotes the line segment in connecting and (noting that for general graph, might not be the same as the shortest path ). Let us glue them together by taking if is an interior point of an edge . Then is a function satisfying: for -a.e. . That is, with . Moreover, for every edge in we have
[TABLE]
Now let be arbitrary. Let us break the unique shortest path connecting and into sub line segments such that each of them is contained in exactly one edge. Then by applying (18) to each of these sub line segments, we obtain
[TABLE]
Thus, we have proved that
[TABLE]
Therefore, according to Definition 4.1 we conclude that with . It then follows that , and hence as desired.
ii) Assume that is a tree. We can and will assume that is the root of this tree. We need to show that . For this, let . Then by Definition 4.1, we have and
[TABLE]
Thus for any two points , we obtain
[TABLE]
Let be the deepest node on the tree that belongs to both path and path . Due to the tree structure, the joining of path and path constitutes the shortest path connecting the points and . These together with (19) imply that
[TABLE]
By the property of the length measure given in Lemma B.2, we then infer that for every . It follows that . Therefore, we have proved that as desired. ∎
A.2.3 Proof of Theorem 3.1
The proof of Theorem 3.1 is based on two auxiliary lemmas. Before stating these lemmas, let us describe the setting and associated problem.
First, in order to investigate problem (4), we recast it as the standard complete OT problem by using an observation in (Caffarelli and McCann, 2010). More precisely, let be a point outside graph and consider the set . We next extend the cost function to as follows
[TABLE]
The measures are extended accordingly by adding a Dirac mass at the isolated point : and . As have the same total mass on , we can consider the standard complete OT problem between as follows
[TABLE]
where
[TABLE]
This reformulation under an observation in (Caffarelli and McCann,, 2010) helps us to transform an unbalanced optimal transport (EPT) on a graph into a corresponding standard complete OT. Therefore, we can not only bypass all the issues coming from the unbalanced setting, but also rely on many results in the standard setting for OT.
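The mass-balancing step of this reformulation can be sketched for discrete measures. The exact masses placed at the isolated point are not legible in the text above, so the choice below is an assumption: each measure receives the other measure's total mass, which is one natural way to make the two extended measures have equal total mass. The function name `augment` and the label `"s_hat"` for the isolated point are hypothetical.

```python
def augment(mu, nu, extra="s_hat"):
    """Extend two discrete measures with a Dirac mass at an isolated point.

    Measures are dicts {support point: mass}. Assumption: mu receives nu's total
    mass at the extra point and vice versa, so both extensions have total mass
    mu(G) + nu(G).
    """
    mass_mu, mass_nu = sum(mu.values()), sum(nu.values())
    mu_hat, nu_hat = dict(mu), dict(nu)
    mu_hat[extra] = mass_nu
    nu_hat[extra] = mass_mu
    return mu_hat, nu_hat

mu = {"a": 2.0, "b": 1.0}   # total mass 3.0
nu = {"b": 0.5}             # total mass 0.5
mu_hat, nu_hat = augment(mu, nu)
```

After the extension both measures have total mass 3.5, so a standard balanced OT solver could in principle be applied on the enlarged space once the cost has been extended accordingly.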
We then adapt the procedure in (Caffarelli and McCann,, 2010) to derive the dual formulation for the EPT on a graph.
Additionally, we have a one-to-one correspondence between and as follows
[TABLE]
Indeed, if , then it is clear that defined by (21) satisfies . The converse is guaranteed by the next technical result.
Lemma A.7.
For , let be the restriction of to . Then, relation (21) holds and .
Proof.
We first observe for any Borel set that
[TABLE]
For the same reason, we have for any Borel set . Also,
[TABLE]
Since (21) is obviously true for sets of the form with being Borel sets, we only need to verify it for sets of the following three forms: , , for Borel sets . We check it case by case as follows.
(i) For : Using the above observation, we have
[TABLE]
Therefore, (21) holds in this case.
(ii) For : (21) is also true for this case because
[TABLE]
(iii) For : (21) is true as well since
[TABLE]
Now as (21) holds, we obviously have for any Borel set . Likewise, for any Borel set . Therefore, . ∎
These observations in particular display the following connection between the EPT problem on a graph (4) and the corresponding standard complete OT problem (20).
Lemma A.8 (EPT on a graph versus its corresponding complete OT).
For every , we have . Moreover, relation (21) gives a one-to-one correspondence between optimal solution for EPT problem (4) and optimal solution for standard complete OT problem (20).
Proof.
We derive two parts as follows:
(i) We show that :
For any , let be given by (21). Then, and
[TABLE]
It follows that .
(ii) We show that :
To see this, for any we let be the restriction of to . Then by Lemma A.7, we have and (21) holds. Consequently,
[TABLE]
By taking the infimum over , we infer that .
Thus, from the above two parts, we obtain
[TABLE]
The relation about the optimal solutions also follows from the above arguments. ∎
Given the above two lemmas, we are ready to present the proof of Theorem 3.1.
Proof of Theorem 3.1 .
From Lemma A.8 and the dual formulation for proved in (Caffarelli and McCann,, 2010, Corollary 2.6), we have
[TABLE]
Therefore, it is enough to prove that where
[TABLE]
For satisfying , and , we extend it to by taking and . Then, it is clear that for , and
[TABLE]
It follows that . In order to prove the converse, let be a maximizer for . Then, by considering , we can assume that . Also, if we let , then is still in the admissible class for and . This implies that is also a maximizer for . For these reasons, we can assume w.l.o.g. that the maximizer has the following additional properties: and
[TABLE]
In particular, . For convenience, define and consider the following two possibilities.
(i) For :
Since and , we have .
Also, for all . For each , by using the facts and we get
[TABLE]
Thus and
[TABLE]
(ii) For :
By arguing as in the above case (i), we have and
[TABLE]
Let . Then, it is obvious that and . Since , there exists such that . Thus, and hence . As , we infer further that . We also have
[TABLE]
This together with (22) gives
[TABLE]
Now let for . Then, for . For each , by using the facts and we also get
[TABLE]
It follows that and
[TABLE]
Thus we conclude that and the theorem follows. ∎
A.2.4 Proof of Corollary 3.2
Proof of Corollary 3.2.
Notice that as () is -Lipschitz w.r.t. , we have for every that
[TABLE]
Let be the set defined in the statement of Theorem 3.1. Then for each , let
[TABLE]
By using and (23), we obtain for every that
[TABLE]
We also have is -Lipschitz, i.e., . Indeed, let . Then for any , there exists such that
[TABLE]
It follows that
[TABLE]
Since this holds for every , we get
[TABLE]
By interchanging the role of and , we also obtain . Thus,
[TABLE]
Hence, we have shown that with
[TABLE]
We next claim . For this, it is clear from the definition that . On the other hand, from the Lipschitz property of we obtain
[TABLE]
which gives . Thus, we conclude that as claimed.
From these, we obtain that
[TABLE]
This together with Theorem 3.1 in the main text implies that
[TABLE]
To prove the converse, let . Define and . Then, we have
[TABLE]
[TABLE]
and
[TABLE]
Also, the Lipschitz property of gives
[TABLE]
Thus , and hence we obtain from Theorem 3.1 in the main text that
[TABLE]
As this holds for every , we get
[TABLE]
Thus, we have shown that
[TABLE]
Now consider . Then, if and only if . Moreover,
[TABLE]
Therefore, the conclusion of the corollary follows from (24). ∎
A.2.5 Proof of Lemma 4.4
Proof of Lemma 4.4.
By using part i) of Lemma A.2, we see that
[TABLE]
As a consequence, we obtain
[TABLE]
Thus the first statement of the lemma is proved. Now if is a tree, then Lemma A.2 implies that the inclusion in (25) is actually an equality. That is, . Therefore, we get the desired identity
[TABLE]
∎
A.2.6 Proof of Proposition 4.5
Proof of Proposition 4.5.
It follows from Definition 4.3 and the representation (7) for that
[TABLE]
The first supremum equals if and equals if .
On the other hand, by the same arguments as in the proof of (Le et al., 2022, Proposition 3.5) we see that the second supremum equals . Putting them together, we obtain the desired formula for . ∎
A.2.7 Proof of Corollary 4.6
Proof of Corollary 4.6.
We first recall that denotes the line segment in connecting two points , while means the same line segment but without its two end-points. Then as for every , we have
[TABLE]
Since and are supported on nodes, we can rewrite the above identity as
[TABLE]
For and , we observe that belongs to if and only if . It follows that , and thus we deduce from the above identity that
[TABLE]
This together with Proposition 4.5 yields the postulated result. ∎
A.2.8 Proof of Proposition 5.1
We begin with the following auxiliary result.
Lemma A.9.
Let . Then, if and only if for every in .
Proof.
It is obvious that implies that for every in . Now assume that for every in . We first claim that for any . Let be arbitrary. Then there are two possibilities for : either is a node or is an interior point of an edge. We consider these two cases separately.
(i) is an interior point of an edge (i.e. is not a node):
Let be a sequence of distinct points on the same edge as such that for every and as . It follows that and as . As a consequence, we have
[TABLE]
But as for every in , we thus obtain
[TABLE]
as claimed.
(ii) is a node:
We can assume that is a common node for edges . Then for each , let be a sequence of distinct points on edge such that for every and as . These choices yield and as . Using this and the assumption for every in , we obtain
[TABLE]
Thus, we have proved the claim that for every .
On the other hand, for any points belonging to the same edge
[TABLE]
where denotes the line segment in connecting two points but without its right end-point (while includes both end-points).
Thus, by combining them, we infer further that for any and for any edge . It follows that , and the proof is complete. ∎
Proof of Proposition 5.1.
We note first that the quantity depends only on the values of the weights at the root of the graph. This comes from the fact that only and are used in the definition of .
i) This follows immediately from Proposition 4.5 in the main text.
ii) It follows from Definition 4.3 that and satisfies the triangle inequality. As the constant function belongs to the constraint set , we also have . Next, assume that . Then by Proposition 4.5 in the main text, we get
[TABLE]
As by our assumption of , we must have
[TABLE]
Therefore, for every . By using Lemma A.9, we then conclude that .
iii) Due to the assumption we have if and only if . Hence we obtain from Definition 4.3 that . This together with ii) implies that is a metric space. Its completeness follows from (Piccoli and Rossi,, 2014, Proposition 4). As a complete metric space, it is well known that is a geodesic space if and only if for every there exists such that
[TABLE]
To verify the latter, take . Then using Definition 4.3 in the main text, we obtain
[TABLE]
and
[TABLE]
∎
A.2.9 Proof of Proposition 5.3
Proof of Proposition 5.3.
i) From its definition, we have with being the set defined in (Le and Nguyen,, 2021, Section 3.2). As a consequence, we obtain . On the other hand, Proposition A.4 yields for any that
[TABLE]
Therefore, we conclude that
[TABLE]
By moving and combining terms we arrive at
[TABLE]
ii) Let . From the definition of the -Wasserstein distance, we have
[TABLE]
where
[TABLE]
Therefore, the first statement will follow if we can show that
[TABLE]
Since , we have from Lemma A.6 that
[TABLE]
Hence by taking , we can rewrite this identity as
[TABLE]
where is the balanced Sobolev transport distance defined in (Le et al.,, 2022, Definition 3.2). On the other hand, we have by (Le et al.,, 2022, Lemma 4.3). Therefore, we obtain (26) as desired.
Alternatively, we can derive (26) as follows. By using as in the proof of part i) and the observation about the translation invariance in the proof of Lemma A.6, we see that
[TABLE]
Then due to Lemma A.2, we can further rewrite as
[TABLE]
On the other hand, part i) above gives
[TABLE]
Therefore, we obtain
[TABLE]
for every .
For , the equality happens since and
[TABLE]
Thus, the second statement follows.
∎
A.2.10 Proof of Proposition 5.4
Proof of Proposition 5.4.
We first prove that distance is negative definite for , where
[TABLE]
It is easy to see that the function is negative definite for . Using this and by applying (Berg et al.,, 1984, Corollary 2.10), the function is negative definite for .
Therefore, for , the function is negative definite since it is a sum of negative definite functions. Using this and by applying (Berg et al.,, 1984, Corollary 2.10), we have is negative definite for .
We are now ready to prove Proposition 5.4. From Proposition 4.5, we have
[TABLE]
Let . Then, $\mu\mapsto\big\{w_{e}^{\frac{1}{p}}\mu(\gamma_{e})\big\}_{e\in E}$ can be regarded as a feature map for measure onto . Therefore, the first term of is equivalent to times the distance between two feature maps of measures on respectively. Recall that . Thus, the first term of is negative definite for .
Additionally, the second term of is times the distance between and . Since and , we also have from (11) that . Therefore, the second term of is also negative definite.
Hence, is negative definite for any . ∎
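The two-term structure used in this proof (a weighted distance between edge-wise feature maps plus a multiple of the total-mass difference) can be sketched on a small rooted tree, where shortest paths to the root are automatically unique. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, `theta` stands in for the constant defined in (11) and must be supplied by the caller, and each edge is keyed by its child node, with the set of points whose path to the root crosses that edge taken to be the child's subtree.

```python
def path_masses(parent, order, measure):
    """For each node v, total mass of the nodes whose path to the root passes through v."""
    acc = {v: measure.get(v, 0.0) for v in order}
    for v in reversed(order):              # `order` is root-first, so children come first here
        if parent[v] is not None:
            acc[parent[v]] += acc[v]
    return acc

def ust_on_tree(parent, weight, order, mu, nu, p=1.0, theta=1.0):
    """Edge term (sum_e w_e |mu(gamma_e) - nu(gamma_e)|^p)^(1/p) plus theta * |mass difference|."""
    a = path_masses(parent, order, mu)
    b = path_masses(parent, order, nu)
    edges = [v for v in order if parent[v] is not None]   # edge (parent[v], v) is keyed by v
    edge_term = sum(weight[v] * abs(a[v] - b[v]) ** p for v in edges) ** (1.0 / p)
    root = order[0]
    return edge_term + theta * abs(a[root] - b[root])     # a[root], b[root] are total masses

# Toy tree: r -> a (weight 1), r -> b (weight 2), a -> c (weight 3).
parent = {"r": None, "a": "r", "b": "r", "c": "a"}
weight = {"a": 1.0, "b": 2.0, "c": 3.0}
order = ["r", "a", "b", "c"]               # parents listed before children
mu = {"a": 1.0, "c": 1.0}                  # total mass 2
nu = {"b": 1.0}                            # total mass 1
```

Both passes are linear in the number of nodes, which illustrates why a closed form of this kind is fast to evaluate compared with iterative solvers.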
Appendix B FURTHER RESULTS AND DISCUSSIONS
B.1 Brief Reviews
We give brief reviews for some definitions used in our work.
B.1.1 Length Measure on Graphs
We recall the definition and properties in (Le et al.,, 2022, §4.1) about the length measure on graphs.
Definition B.1 (Length measure).
Let be the unique Borel measure on such that the restriction of on any edge is the length measure of that edge. That is, satisfies:
- i)
For any edge connecting two nodes and , we have whenever and for with . Here, is the line segment in connecting and .
- ii)
For any Borel set , we have
[TABLE]
The next lemma asserts that is closely connected to the graph metric , and thus justifies the terminology of a length measure.
Lemma B.2 ( is the length measure on graph).
Suppose that has no short cuts, namely, any edge is a shortest path connecting its two end-points. Then, is a length measure in the sense that
[TABLE]
for any shortest path connecting and . In particular, has no atom in the sense that for every in .
B.1.2 Wasserstein distances
We recall here the definition of the -Wasserstein distances with graph metric ground cost on .
Definition B.3.
Let . Suppose that and are two nonnegative Borel measures on satisfying . Then the -Wasserstein distance between and is defined by
[TABLE]
where
[TABLE]
with .
B.1.3 Kernels
We review some important definitions and theorems/corollaries about kernels that are used in our work.
- •
Positive Definite Kernels (Berg et al., 1984, pp. 66–67). A kernel function is called positive definite if for every positive integer and all points , we have
[TABLE]
- •
Negative Definite Kernels (Berg et al., 1984, pp. 66–67). A kernel function is called negative definite if for every integer and all points , we have
[TABLE]
- •
Theorem 3.2.2 in (Berg et al., 1984, p. 74). Let be a negative definite kernel. Then for every , the kernel
[TABLE]
is positive definite.
- •
Definition 2.6 in (Berg et al., 1984, p. 76). A positive definite kernel is called infinitely divisible if for each , there exists a positive definite kernel such that
[TABLE]
- •
Corollary 2.10 in (Berg et al., 1984, p. 78). Let be a negative definite kernel. Then for , the kernel
[TABLE]
is negative definite.
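For reference, the standard statements behind the items above can be written compactly as follows. The notation is ours and may differ from the paper's: $k$ is a real-valued symmetric kernel on a set $\mathcal{X}$, $x_1,\dots,x_m\in\mathcal{X}$ are arbitrary points, and the nonnegativity condition in the last line is stated explicitly as an assumption (it holds for distances).

```latex
\begin{align*}
&\text{Positive definite:} && \sum_{i,j=1}^{m} c_i c_j\, k(x_i, x_j) \ge 0
  \quad \text{for all } c_1,\dots,c_m \in \mathbb{R}, \\
&\text{Negative definite:} && \sum_{i,j=1}^{m} c_i c_j\, k(x_i, x_j) \le 0
  \quad \text{for all } c_1,\dots,c_m \in \mathbb{R} \text{ with } \sum_{i=1}^{m} c_i = 0, \\
&\text{Theorem 3.2.2:} && k \text{ negative definite}
  \;\Longrightarrow\; e^{-t k} \text{ positive definite for every } t > 0, \\
&\text{Definition 2.6:} && k \text{ infinitely divisible}
  \;\Longleftrightarrow\; \text{for each } n \ge 1,\ k = (k_n)^{n} \text{ with } k_n \text{ positive definite}, \\
&\text{Corollary 2.10:} && k \text{ negative definite},\ k \ge 0
  \;\Longrightarrow\; k^{\alpha} \text{ negative definite for } 0 < \alpha < 1.
\end{align*}
```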
B.2 Further Discussions
In this subsection, we discuss some extensions of our work and describe more details for some parts in the main manuscript.
Path length for points in .
We can canonically measure a path length connecting any two points where need not be nodes in . Indeed, for two points belonging to the same edge which connects two nodes and in , we have
[TABLE]
for some numbers . Therefore, the length of the path connecting and along the edge (i.e., the line segment ) is defined by . Hence, the length of an arbitrary path in can be defined similarly by breaking the path into pieces over edges and summing their corresponding lengths (Le et al., 2022).
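The construction above can be sketched in a few lines. The sketch assumes that a point on an edge is described by a fraction in [0, 1] of the way from one end-node to the other, so that the length of a piece is the edge length scaled by the difference of the two fractions. The function names and the `(edge_length, s, t)` encoding are hypothetical, chosen only for illustration.

```python
def segment_length(edge_length, s, t):
    # Length of the piece of an edge between the points at fractions s and t,
    # where 0 is one end-node of the edge and 1 is the other (assumed encoding).
    if not (0.0 <= s <= 1.0 and 0.0 <= t <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    return abs(t - s) * edge_length


def path_length(pieces):
    # A path is broken into pieces, one per traversed edge; each piece is
    # (edge_length, s, t). The path length is the sum of the piece lengths.
    return sum(segment_length(w, s, t) for w, s, t in pieces)


# A path that starts mid-way along an edge of length 2, crosses a full edge of
# length 3, and stops a quarter of the way along an edge of length 4.
example_path = [(2.0, 0.5, 1.0), (3.0, 0.0, 1.0), (4.0, 0.0, 0.25)]
example_length = path_length(example_path)
```

When both end-points of a path are nodes, every piece covers a full edge and the sum reduces to the usual sum of edge lengths.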
Lipschitz nonnegative weight function on graph .
An example of a -Lipschitz nonnegative weight function on is
[TABLE]
for some constants and .
Extension to measures supported on .
The closed-form formula for in (4.6) can be extended to measures with finite supports on (i.e., measures which may have supports on edges) by using the same strategy used to measure the length of a path connecting and y for any (see §2). More precisely, we break the edges containing supports into pieces and sum over their corresponding values, instead of summing over edges for in (4.6).
About the assumption of uniqueness property of the shortest paths on .
As discussed in the supplementary of (Le et al., 2022), since for any edge of graph , it holds almost surely that every node in the graph can be regarded as a unique-path root node (with high probability, the lengths of paths connecting any two nodes in the graph are different). Additionally, for some special graphs, e.g., a grid of nodes, there is no unique-path root node. However, by perturbing each node of such a graph (or the lengths of edges in in case is a non-physical graph, i.e., ) with a small deviation , we can obtain a graph satisfying the unique-path root node assumption.
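The effect of such a perturbation can be illustrated on a small grid. The sketch below is an illustration only: the 3-by-3 grid, the perturbation size of 1e-3, and the helper names are assumptions, and the shortest-path counter is a standard Dijkstra variant rather than anything taken from the paper. On the unit grid, several shortest paths connect opposite corners, so the corner is not a unique-path root node. After a small random perturbation of the edge lengths, each node is reached from the root by exactly one shortest path.

```python
import heapq
import random


def count_shortest_paths(num_nodes, edges, source, tol=1e-12):
    # edges: dict {(u, v): positive length} of an undirected graph.
    # Returns (dist, count), where count[v] is the number of shortest paths
    # from source to v (two lengths within tol count as a tie).
    adj = [[] for _ in range(num_nodes)]
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = [float("inf")] * num_nodes
    count = [0] * num_nodes
    done = [False] * num_nodes
    dist[source], count[source] = 0.0, 1
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue
        done[u] = True
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v] - tol:
                dist[v], count[v] = nd, count[u]
                heapq.heappush(heap, (nd, v))
            elif abs(nd - dist[v]) <= tol:
                count[v] += count[u]
    return dist, count


def grid_edges(side, length_fn):
    # Edges of a side-by-side grid; length_fn() supplies each edge length.
    edges = {}
    for r in range(side):
        for c in range(side):
            u = r * side + c
            if c + 1 < side:
                edges[(u, u + 1)] = length_fn()
            if r + 1 < side:
                edges[(u, u + side)] = length_fn()
    return edges


unit_grid = grid_edges(3, lambda: 1.0)
noise = random.Random(0)
perturbed_grid = grid_edges(3, lambda: 1.0 + noise.uniform(-1e-3, 1e-3))

_, counts_unit = count_shortest_paths(9, unit_grid, source=0)
_, counts_perturbed = count_shortest_paths(9, perturbed_grid, source=0)
```

Here `counts_unit` shows multiple shortest paths to the far corner, while every entry of `counts_perturbed` equals one, so node 0 serves as a unique-path root node of the perturbed grid.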
About the unbalanced Sobolev transport.
Similar to the work (Le et al., 2022), we assume that we know the graph metric space (i.e., the graph structure) to which the supports of the measures belong. Given such a graph, we define the unbalanced Sobolev transport for measures which may have different total mass and are supported on that graph metric space. We leave the question of learning an optimal graph metric structure from data (i.e., supports of measures) for unbalanced Sobolev transport for future work.
About graphs and (Le et al., 2022).
First, we use a clustering method, e.g., the farthest-point clustering, to partition supports of measures into at most clusters. (Footnote 7: is the input number of clusters for the clustering method. Therefore, the result has at most clusters, depending on the input data.) Then, let denote the set of centroids of these clusters. For edges, in graph , we randomly choose edges, and edges for graph ; we denote the set of those sampled edges as .
For each edge , its corresponding weight is computed by the Euclidean distance between the two corresponding nodes of . Let be the number of connected components in the graph ; we then randomly add more edges between these connected components to construct a connected graph from . Let be the set of these added edges and denote the set ; then is the considered graph.
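The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the selected farthest-point centers are used directly as nodes, the number of random edges is left as a free parameter (`num_random_edges`), components are joined by chaining them in random order, and all function names are hypothetical.

```python
import numpy as np


def farthest_point_centers(points, max_clusters, rng):
    # Greedy farthest-point selection of at most max_clusters centers.
    first = int(rng.integers(len(points)))
    chosen = [first]
    dist = np.linalg.norm(points - points[first], axis=1)
    while len(chosen) < max_clusters and dist.max() > 0:
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]


def find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def build_connected_graph(nodes, num_random_edges, rng):
    # Sample random edges, then add one random edge between consecutive
    # connected components (in random order) so the graph becomes connected.
    n = len(nodes)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    k = min(num_random_edges, len(all_pairs))
    picked = rng.choice(len(all_pairs), size=k, replace=False)
    edges = {all_pairs[p] for p in picked}
    parent = list(range(n))
    for i, j in edges:
        parent[find(parent, i)] = find(parent, j)
    components = {}
    for i in range(n):
        components.setdefault(find(parent, i), []).append(i)
    groups = list(components.values())
    order = rng.permutation(len(groups))
    for a, b in zip(order[:-1], order[1:]):
        u = groups[a][int(rng.integers(len(groups[a])))]
        v = groups[b][int(rng.integers(len(groups[b])))]
        edges.add((min(u, v), max(u, v)))
    # Edge weight = Euclidean distance between the two end-nodes.
    return {(i, j): float(np.linalg.norm(nodes[i] - nodes[j])) for i, j in edges}


def is_connected(num_nodes, edges):
    # Depth-first search from node 0 over an undirected edge list.
    adj = {i: [] for i in range(num_nodes)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == num_nodes


rng = np.random.default_rng(0)
supports = rng.normal(size=(200, 3))
nodes = farthest_point_centers(supports, max_clusters=20, rng=rng)
graph = build_connected_graph(nodes, num_random_edges=15, rng=rng)
```

Joining the components in a chain adds exactly one fewer edge than the number of components, which matches the count of added edges in the description; other ways of choosing which components to link would work equally well.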
Datasets and Computational Devices.
For the document datasets (i.e., TWITTER, RECIPE, CLASSIC, AMAZON), the orbit dataset (Orbit), and a -class subset of the MPEG7 dataset, one can contact the authors of (Le et al., 2022) to access these datasets. For computational devices, we run all of our experiments on commodity hardware.
B.3 Further Empirical Results
In this subsection, we provide further empirical results for our work.
B.3.1 Extended Empirical Results for the Main Text
Similar to Figure 3 in the main text for TDA, we illustrate the effect of the number of slices for document classification with graph in Figure 4.
We also consider a graph with a different setting:. Recall that for Figure 1, Figure 2, and Figure 3 in the main text and for Figure 4, the results are for graph where for the document datasets, for the MPEG7 dataset, and for the Orbit dataset. (Footnote 8: There is a typo in the main text (§6): it should be for MPEG7 and for Orbit.) We illustrate the corresponding results for graph in Figure 5, Figure 6, Figure 7, and Figure 8, respectively.
B.3.2 Further Empirical Results
We also provide further results for document classification and TDA as follows:
For document classification.
- For , we illustrate the SVM results and time consumption for kernel matrices and the effect of the number of slices for graph in Figure 9 and Figure 10, respectively. The corresponding results for graph are in Figure 11 and Figure 12.
- For , we illustrate the SVM results and time consumption for kernel matrices and the effect of the number of slices for graph in Figure 13 and Figure 14, respectively. The corresponding results for graph are in Figure 15 and Figure 16.
- For , we illustrate the SVM results and time consumption for kernel matrices and the effect of the number of slices for graph in Figure 17 and Figure 18, respectively. The corresponding results for graph are in Figure 19 and Figure 20.
- For , we illustrate the SVM results and time consumption for kernel matrices and the effect of the number of slices for graph in Figure 21 and Figure 22, respectively. The corresponding results for graph are in Figure 23 and Figure 24.
For TDA.
- For , we illustrate the SVM results and time consumption for kernel matrices and the effect of the number of slices for graph in Figure 25 and Figure 26, respectively. The corresponding results for graph are in Figure 27 and Figure 28.
- For , we illustrate the SVM results and time consumption for kernel matrices and the effect of the number of slices for graph in Figure 29 and Figure 30, respectively. The corresponding results for graph are in Figure 31 and Figure 32.
- For on Orbit dataset and on MPEG7 dataset (due to the same size of MPEG7 dataset), we illustrate the SVM results and time consumption for kernel matrices and the effect of the number of slices for graph in Figure 33 and Figure 34, respectively. The corresponding results for graph are in Figure 35 and Figure 36.
With different exponent for UST.
We also carry out experiments for different in unbalanced Sobolev transport using the same setting for in the main text (i.e., for document datasets, for MPEG7 dataset and for Orbit dataset) on graph and graph . Figure 37 and Figure 38 illustrate performances on document classification and TDA, respectively, with graph . For graph , the corresponding results are shown in Figure 39 and Figure 40. (Footnote 9: We skip plots of time consumption since the time consumption of UST for and is almost identical. Please refer to the other figures, where we illustrate the time consumption of UST for .)
With Sinkhorn divergence-based approach for UOT (Séjourné et al., 2019) as an extra baseline.
Furthermore, we consider the Sinkhorn divergence-based approach for UOT () (Séjourné et al., 2019) as an extra baseline. As noted in the main manuscript, is the debiased version of the Sinkhorn-based approach for UOT (), which may be helpful for applications. Both and are empirically indefinite, and they have the same computational complexity.
We illustrate SVM results for document classification and TDA with the extra baseline for both graph and corresponding to Figure 1 (in the main text), Figure 2 (in the main text), Figure 5, and Figure 6 in Figure 41, Figure 42, Figure 43, Figure 44 respectively.
B.3.3 Further Discussions on Empirical Results
The unbalanced Sobolev transport (UST) versus of entropy partial transport (EPT) on a tree.
Overall, performances of the UST compare favorably with those of of EPT on a tree. Moreover, time consumption of UST is comparable to that of of EPT on trees. So, by exploiting the full graph structure, UST improves performances of of EPT on a tree and still keeps the advantage about the computational complexity.
The unbalanced Sobolev transport (UST) versus Sinkhorn-based unbalanced optimal transport (UOT).
The performance of UST is comparable to that of Sinkhorn-based UOT. Recall that kernels for UST are positive definite, while kernels for Sinkhorn-based UOT are empirically indefinite. This indefiniteness may affect the performance of Sinkhorn-based UOT in some settings (e.g., datasets or graph structure). It is worth noting that UST is several orders of magnitude faster than Sinkhorn-based UOT. Therefore, applying Sinkhorn-based UOT is prohibitive in large-scale settings, while our proposed approach (UST) scales to such settings.
The effects of the number of slices (i.e., the number of root nodes used for averaging).
In general, when one increases the number of slices for the UST (and of EPT on a tree), their corresponding performances also improve, but this comes at the cost of time consumption, which grows linearly with the number of slices. We observe that 10 slices seems to be a good trade-off between performance and time consumption, similar to the observations in (Le and Nguyen, 2021).
Unbalanced Sobolev transport with different .
In our experiments on document classification and TDA, we observe that for UST consistently gives better performances than for UST. (Footnote 10: Recall that UST with has a stronger connection to EPT on graphs than UST with , as illustrated in Lemma A.2.) Generally, one may tune the parameter to improve the performance of UST in applications.
The extra baseline: Sinkhorn divergence-based approach for UOT.
In our experiments, the performances of the extra baseline are close to those of when compared with the performances of (EPT on a tree) and our proposed UST. The debiasing property of improves the performances of on some datasets, especially the datasets in TDA tasks (Orbit and MPEG7). For document datasets, the performances of and are comparable (the role of the debiasing property is not clear).
References
- Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(1):218–252.
- Altschuler, J. M., Chewi, S., Gerber, P., and Stromme, A. J. (2021). Averaging on the Bures-Wasserstein manifold: Dimension-free convergence of gradient descent. Advances in Neural Information Processing Systems.
- Balaji, Y., Chellappa, R., and Feizi, S. (2020). Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems, 33:12934–12944.
- Benamou, J.-D. (2003). Numerical resolution of an “unbalanced” mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 37(5):851–868.
- Berg, C., Christensen, J. P. R., and Ressel, P., editors (1984). Harmonic analysis on semigroups. Springer-Verlag, New York.
- Bonneel, N. and Coeurjolly, D. (2019). Spot: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1–13.
- Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. (2015). Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45.
- Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. (2019). Learning Generative Models across Incomparable Spaces. In International Conference on Machine Learning (ICML), volume 97.
