Scalable Unbalanced Sobolev Transport for Measures on a Graph

Tam Le; Truyen Nguyen; Kenji Fukumizu

arXiv:2302.12498·cs.LG·February 27, 2023

TL;DR

This paper introduces a scalable unbalanced Sobolev transport method for measures on a graph, addressing computational complexity and mass imbalance issues in optimal transport, with theoretical and empirical validation.

Contribution

It extends Sobolev transport to unbalanced measures on graphs, providing a closed-form formula, geometric insights, and kernel design for efficient comparison of measures.

Findings

01 Fast computation of the proposed UST method

02 Comparable performance to existing transport baselines

03 Effective kernel design for unbalanced measures

Equations

\Lambda(x)\triangleq\big\{y\in{\mathbb{G}}:\,x\in[z_{0},y]\big\},

\gamma_{e}\triangleq\big\{y\in{\mathbb{G}}:\,e\subset[z_{0},y]\big\},

|w(x)-w(y)|\leq b\,d_{\mathbb{G}}(x,y),\quad\forall x,y\in{\mathbb{G}}.

\|f\|_{L^{p}({\mathbb{G}},\omega)}\triangleq\Big(\int_{\mathbb{G}}|f(y)|^{p}\,\omega(\mathrm{d}y)\Big)^{\frac{1}{p}}\mbox{ for }1\leq p<\infty,\mbox{ and}

\|f\|_{L^{\infty}({\mathbb{G}},\omega)}\triangleq\inf\left\{t\in\mathbb{R}:\,|f(x)|\leq t\mbox{ for $\omega$-a.e. }x\in{\mathbb{G}}\right\}.

\Pi_{\leq}(\mu,\nu)\triangleq\big\{\gamma\in{\mathcal{M}}({\mathbb{G}}\times{\mathbb{G}}):\,\gamma_{1}\leq\mu,\,\gamma_{2}\leq\nu\big\}

{\mathcal{F}}_{1}(\gamma_{1}|\mu)\triangleq\int_{\mathbb{G}}w_{1}(x)F_{1}(f_{1}(x))\,\mu(\mathrm{d}x),

{\mathcal{F}}_{2}(\gamma_{2}|\nu)\triangleq\int_{\mathbb{G}}w_{2}(x)F_{2}(f_{2}(x))\,\nu(\mathrm{d}x),

\mathrm{W}_{c,m}(\mu,\nu)\triangleq\inf_{\gamma\in\Pi_{\leq}(\mu,\nu),\,\gamma({\mathbb{G}}\times{\mathbb{G}})=m}\Big[{\mathcal{F}}_{1}(\gamma_{1}|\mu)+{\mathcal{F}}_{2}(\gamma_{2}|\nu)+b\int_{{\mathbb{G}}\times{\mathbb{G}}}c(x,y)\,\gamma(\mathrm{d}x,\mathrm{d}y)\Big].

F_{1}(s)=F_{2}(s)=|s-1|

\mathrm{ET}_{c,\lambda}(\mu,\nu)=\inf_{\gamma\in\Pi_{\leq}(\mu,\nu)}\mathcal{C}_{\lambda}(\gamma),

\mathcal{C}_{\lambda}(\gamma)\triangleq\int_{\mathbb{G}}w_{1}\,\mu(\mathrm{d}x)+\int_{\mathbb{G}}w_{2}\,\nu(\mathrm{d}x)-\int_{\mathbb{G}}w_{1}\,\gamma_{1}(\mathrm{d}x)-\int_{\mathbb{G}}w_{2}\,\gamma_{2}(\mathrm{d}x)+b\int_{{\mathbb{G}}\times{\mathbb{G}}}[c(x,y)-\lambda]\,\gamma(\mathrm{d}x,\mathrm{d}y).

\mathrm{ET}_{c,\lambda}(\mu,\nu)=\sup_{(u,v)\in{\mathbb{K}}}\Big[\int_{{\mathbb{G}}}u(x)\,\mu(\mathrm{d}x)+\int_{{\mathbb{G}}}v(x)\,\nu(\mathrm{d}x)\Big],

\mathrm{ET}_{\lambda}(\mu,\nu)=\sup_{f\in\mathbb{U}}\int_{\mathbb{G}}f(\mu-\nu)-\frac{b\lambda}{2}\big[\mu({\mathbb{G}})+\nu({\mathbb{G}})\big],

f(x)-f(z_{0})=\int_{[z_{0},x]}h(y)\,\omega(\mathrm{d}y),\quad\forall x\in{\mathbb{G}}.

W^{1,p_{2}}({\mathbb{G}},\omega)\subset W^{1,p_{1}}({\mathbb{G}},\omega),

f(x)=f(z_{0})+\int_{[z_{0},x]}f^{\prime}(y)\,\omega(\mathrm{d}y).

-w_{2}(x)-\frac{b\lambda}{2}\leq f(x)\leq w_{1}(x)+\frac{b\lambda}{2},\quad\forall x\in{\mathbb{G}}

f(z_{0})\in I_{\alpha}\triangleq\Big[-w_{2}(z_{0})-\frac{b\lambda}{2}+\alpha,\,w_{1}(z_{0})+\frac{b\lambda}{2}-\alpha\Big]

\|f^{\prime}\|_{L^{p^{\prime}}({\mathbb{G}},\omega)}\leq b.

f(x)=s+\int_{[z_{0},x]}h(y)\,\omega(\mathrm{d}y)

\|h\|_{L^{p^{\prime}}({\mathbb{G}},\omega)}\leq b.

\mathrm{US}_{p}^{\alpha}(\mu,\nu)\triangleq\sup_{f\in\mathbb{U}_{p^{\prime}}^{\alpha}}\Big[\int_{\mathbb{G}}f(x)\,\mu(\mathrm{d}x)-\int_{\mathbb{G}}f(x)\,\nu(\mathrm{d}x)\Big].

\mathrm{US}_{1}^{0}(\mu,\nu)\geq\sup\Big[\int_{\mathbb{G}}f(\mu-\nu):\,f\in\mathbb{U}_{0}\Big]

\mathrm{US}_{p}^{\alpha}(\mu,\nu)=b\,\Big[\int_{{\mathbb{G}}}|\mu(\Lambda(x))-\nu(\Lambda(x))|^{p}\,\omega(\mathrm{d}x)\Big]^{\frac{1}{p}}+\Theta\,|\mu({\mathbb{G}})-\nu({\mathbb{G}})|,

\Theta\triangleq\left\{\begin{array}{lr}w_{1}(z_{0})+\frac{b\lambda}{2}-\alpha&\mbox{if}\quad\mu({\mathbb{G}})\geq\nu({\mathbb{G}}),\\ w_{2}(z_{0})+\frac{b\lambda}{2}-\alpha&\mbox{if}\quad\mu({\mathbb{G}})<\nu({\mathbb{G}}).\end{array}\right.

\mathrm{US}_{p}^{\alpha}(\mu,\nu)=b\,\Big(\sum_{e\in E}w_{e}\left|\mu(\gamma_{e})-\nu(\gamma_{e})\right|^{p}\Big)^{\frac{1}{p}}+\Theta\,|\mu({\mathbb{G}})-\nu({\mathbb{G}})|.

\mathrm{US}_{p}^{\alpha}(\mu,\nu)\leq\mathrm{US}_{p}^{\alpha}(\mu,\sigma)+\mathrm{US}_{p}^{\alpha}(\sigma,\nu),\quad\forall\mu,\nu,\sigma\in{\mathcal{M}}({\mathbb{G}}).

Code & Models

Repositories

lttam/unbalancedsobolevtransport (official)

Full text

Scalable Unbalanced Sobolev Transport for Measures on a Graph

Tam Le ∗,†,‡

Truyen Nguyen ∗,⋄

Kenji Fukumizu †

The Institute of Statistical Mathematics †

The University of Akron ⋄

RIKEN AIP ‡

Abstract

Optimal transport (OT) is a popular and powerful tool for comparing probability measures. However, OT suffers from a few drawbacks: (i) input measures are required to have the same mass, (ii) a high computational complexity, and (iii) indefiniteness, which limits its applications in kernel-dependent algorithmic approaches. To tackle issues (ii)–(iii), Le et al. (2022) recently proposed Sobolev transport for measures on a graph having the same total mass by leveraging the graph structure over supports. In this work, we consider measures that may have different total mass and are supported on a graph metric space. To alleviate the disadvantages (i)–(iii) of OT, we propose a novel and scalable approach to extend Sobolev transport to this unbalanced setting. We show that the proposed unbalanced Sobolev transport (UST) admits a closed-form formula for fast computation, and that it is also negative definite. Additionally, we derive geometric structures for the UST and establish relations between our UST and other transport distances. We further exploit the negative definiteness to design positive definite kernels and evaluate them on various simulations to illustrate their fast computation and comparable performance against other transport baselines for unbalanced measures on a graph.

1 INTRODUCTION

Optimal transport (OT) has become a popular approach, and its theory lays out a compelling toolkit for data analysis on probability distributions. OT has been leveraged in several research areas such as machine learning (Peyré and Cuturi, 2019; Nadjahi et al., 2019; Titouan et al., 2019; Bunne et al., 2019, 2022; Janati et al., 2020; Muzellec et al., 2020; Paty et al., 2020; Mukherjee et al., 2021; Altschuler et al., 2021; Fatras et al., 2021; Le et al., 2021a; Le et al., 2021b; Liu et al., 2021; Nguyen et al., 2021b; Scetbon et al., 2021; Si et al., 2021; Takezawa et al., 2022; Fan et al., 2022), computer vision (Nguyen et al., 2021a; Saleh et al., 2022; Wang et al., 2022b), and statistics (Mena and Niles-Weed, 2019; Weed and Berthet, 2019; Liu et al., 2022; Nguyen et al., 2022; Nietert et al., 2022; Wang et al., 2022a), to name a few. Nevertheless, it has some fundamental disadvantages.

∗: Two authors contributed equally.

One drawback of OT is that it requires input measures to have the same mass for the transportation. To address this problem, several proposals have been developed in the recent literature. For example, the partial optimal transport (POT) (Caffarelli and McCann, 2010; Figalli, 2010) constrains a fixed amount of mass for transportation; the optimal entropy transport (OET) (Liero et al., 2018; Chizat et al., 2018b; Kondratyev et al., 2016) optimizes a sum of a transport functional and two convex entropy functionals. Additionally, there are various other approaches, e.g., the Kantorovich-Rubinstein discrepancy (Hanin, 1992; Guittet, 2002; Lellmann et al., 2014; Sato et al., 2020), the unbalanced mass transport (Benamou, 2003), the generalized Wasserstein distance (Piccoli and Rossi, 2014, 2016), the unnormalized optimal transport (Gangbo et al., 2019), and the entropy partial transport (Le and Nguyen, 2021). These approaches are either special cases of the OET (e.g., by using specific instances of the entropy functional such as the total variation distance or the $\ell^{2}$ distance), or variants of the OET (e.g., by using the $\ell^{p}$ distance in place of the entropy functional, or partial transport in place of the transport functional). It is worth pointing out that the unbalanced setting for measures with unequal mass has been applied in several application domains and learning problems, e.g., color transfer and shape matching (Bonneel et al., 2015); multi-label learning (Frogner et al., 2015); positive-unlabeled learning (Chapel et al., 2020); natural language processing and topological data analysis (Le and Nguyen, 2021). In particular, the unbalanced approach becomes essential when supports of input measures are subject to noise or have outliers, since such supports are not desirably aligned in the matching problem (Frogner et al., 2015; Balaji et al., 2020; Mukherjee et al., 2021).

Another drawback of standard OT is that it has a high computational complexity. This disadvantage also exists in the unbalanced optimal transport (UOT), which hinders its applications, especially for large-scale settings. For example, consider the OET with the Kullback-Leibler divergence as the entropy functional, which is widely used in applications. For this, one can leverage the entropic regularization to derive an efficient Sinkhorn-based algorithmic approach (Frogner et al., 2015; Chizat et al., 2018a; Séjourné et al., 2019), which has a quadratic complexity (Pham et al., 2020). Another popular approach to scale up UOT is to exploit geometric structures of supports, e.g., one-dimensional structure (Bonneel and Coeurjolly, 2019; Séjourné et al., 2022) or tree structure (Le and Nguyen, 2021; Sato et al., 2020). More concretely, Bonneel and Coeurjolly (2019) proposed the sliced partial optimal transport (SPOT) by projecting supports into a random one-dimensional space. By assuming a unit mass on each support, they developed an efficient algorithmic approach with a quadratic complexity in the worst case. Nonetheless, SPOT suffers from a curse of dimensionality, since using one-dimensional projections for supports limits its ability to capture topological structures of distributions, especially in a high-dimensional space. Le and Nguyen (2021) proposed the entropy partial transport (EPT) by exploiting a tree structure to remedy the curse of dimensionality for SPOT. Moreover, EPT yields the first closed-form solution among various variants of UOT (i.e., its complexity is linear in the number of edges in a tree) for fast computation, which is applicable to large-scale settings. However, a tree structure may be a restrictive condition, which narrows its practical usage in applications.

The aforementioned circumstances motivate us to consider measures with unequal mass supported on a graph metric space, which has more degrees of freedom (i.e., graph structure rather than tree structure) and appears more commonly in applications. Inspired by the Sobolev transport (Le et al., 2022) for probability measures on a graph, we propose a novel and scalable approach to leverage graph structure and extend Sobolev transport to the unbalanced setting. At a high level, our contributions are threefold, as follows:

• We propose a novel $p$-order unbalanced Sobolev transport (UST) ($p\geq 1$) for measures with unequal mass supported on a graph metric space. We prove that UST admits a closed-form formula for fast computation and that it is negative definite.

• We derive geometric structures for the UST and propose positive definite kernels built upon the UST. Additionally, we establish relations between UST and the EPT on a graph.

• We empirically illustrate that UST is fast to compute (owing to its closed-form solution). Various simulations also demonstrate that the performance of the proposed kernels for UST compares favorably with other unbalanced transport baselines for measures with unequal mass on a graph.

The paper is organized as follows: we introduce notations and the problem setup in §2. In §3, we extend and derive the EPT for unbalanced measures on a graph. We then present our main contribution, the UST for measures with unequal mass on a graph, in §4 and derive its properties in §5. In §6, we evaluate the proposed kernel for UST against other unbalanced transport baselines for measures with unequal mass on a graph on various simulations. We conclude our work in §7. The detailed proofs for our theoretical results are placed in Appendix §A.2. Furthermore, we have released code for our proposals at https://github.com/lttam/UnbalancedSobolevTransport.

2 PRELIMINARIES

In this section, we introduce our problem setting, notations, and review relevant definitions.

We consider the same graph setting ${\mathbb{G}}=(V,E)$, where $V,E$ are the sets of nodes and edges respectively, as in (Le et al., 2022) for Sobolev transport. More precisely, ${\mathbb{G}}$ is an undirected, connected and physical graph in the sense that $V\subset\mathbb{R}^{n}$ and each edge $e\in E$ is the standard line segment in $\mathbb{R}^{n}$ connecting the two corresponding end-points of $e$. Graph ${\mathbb{G}}$ has positive edge lengths $\{w_{e}\}_{e\in E}$ and is equipped with a graph metric $d_{{\mathbb{G}}}(\cdot,\cdot)$, which equals the length of the shortest path on ${\mathbb{G}}$. Following a convention in (Le et al., 2022), by graph ${\mathbb{G}}$ we mean the set of all nodes in $V$ and all points forming the edges in $E$, i.e., the continuous setting for graph ${\mathbb{G}}$. We also assume that there exists a fixed root node $z_{0}\in V$ such that for every $x\in{\mathbb{G}}$, $d_{\mathbb{G}}(x,z_{0})$ is attained by a unique shortest path connecting $x$ and $z_{0}$, i.e., the uniqueness property of the shortest paths (Le et al., 2022).

Given a point $x\in{\mathbb{G}}$ (resp. an edge $e\in E$ in ${\mathbb{G}}$), we denote $\Lambda(x)$ (resp. $\gamma_{e}$) as the collection of all points $y\in{\mathbb{G}}$ such that the unique shortest path in ${\mathbb{G}}$ connecting the root node $z_{0}$ and $y$ contains the point $x$ (resp. the edge $e$). That is,

$$\Lambda(x)\triangleq\big\{y\in{\mathbb{G}}:\,x\in[z_{0},y]\big\},\qquad\gamma_{e}\triangleq\big\{y\in{\mathbb{G}}:\,e\subset[z_{0},y]\big\},$$

where we write $[z_{0},y]$ for the shortest path in ${\mathbb{G}}$ connecting the root node $z_{0}$ and $y$.
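As an illustration of the sets $\gamma_{e}$ restricted to nodes, the following Python sketch builds the shortest-path tree from the root with Dijkstra's algorithm and collects, for each tree edge, the nodes whose shortest path from $z_{0}$ passes through that edge. The function names `shortest_path_tree` and `gamma_sets` and the adjacency-list format are assumptions made for this example and are not taken from the paper or its released code. The sketch also assumes unique shortest paths from the root, as in the setting above.

```python
import heapq


def shortest_path_tree(adj, root):
    """Dijkstra from `root` on an undirected graph with positive edge lengths.

    adj: {node: [(neighbor, length), ...]} (hypothetical input format).
    Returns (dist, parent), where parent[v] is the predecessor of v on the
    shortest path from root to v (None for the root itself).
    """
    dist = {root: 0.0}
    parent = {root: None}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, length in adj[u]:
            nd = d + length
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def gamma_sets(adj, root):
    """Node-restricted gamma_e: for each edge (parent, child) of the
    shortest-path tree, the set of nodes y whose path [root, y] contains it."""
    _, parent = shortest_path_tree(adj, root)
    gamma = {}
    for y in parent:
        v = y
        # Walk from y back to the root, registering y on every edge crossed.
        while parent[v] is not None:
            gamma.setdefault((parent[v], v), set()).add(y)
            v = parent[v]
    return gamma
```

Edges that lie on no shortest path from the root do not appear in the returned dictionary, since no node is reached through them.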

We denote ${\mathcal{M}}({\mathbb{G}})$ (resp. ${\mathcal{M}}({\mathbb{G}}\times{\mathbb{G}})$) as the set of all nonnegative Borel measures on ${\mathbb{G}}$ (resp. ${\mathbb{G}}\times{\mathbb{G}}$) with a finite mass. By a continuous function $f$ on ${\mathbb{G}}$, we mean that $f:{\mathbb{G}}\to\mathbb{R}$ is continuous w.r.t. the topology on ${\mathbb{G}}$ induced by the Euclidean distance. A similar convention applies to continuous functions on ${\mathbb{G}}\times{\mathbb{G}}$. We denote $C({\mathbb{G}})$ as the collection of all continuous functions on ${\mathbb{G}}$.

Given a scalar $b>0$, a function $w:{\mathbb{G}}\to\mathbb{R}$ is called $b$-Lipschitz w.r.t. the graph metric $d_{\mathbb{G}}$ if

$$|w(x)-w(y)|\leq b\,d_{\mathbb{G}}(x,y),\quad\forall x,y\in{\mathbb{G}}.$$

For $1\leq p\leq\infty$, we denote $p^{\prime}$ as its conjugate, i.e., $p^{\prime}\in[1,\infty]$ s.t. $\frac{1}{p}+\frac{1}{p^{\prime}}=1$. For a nonnegative Borel measure $\omega$ on ${\mathbb{G}}$, let $L^{p}({\mathbb{G}},\omega)$ denote the space of all Borel measurable functions $f:{\mathbb{G}}\to\mathbb{R}$ satisfying $\int_{\mathbb{G}}|f(y)|^{p}\omega(\mathrm{d}y)<\infty$. When $p=\infty$, we assume that $f$ is bounded $\omega$-a.e. instead. Functions $f_{1},f_{2}\in L^{p}({\mathbb{G}},\omega)$ are considered to be the same if $f_{1}(x)=f_{2}(x)$ for $\omega$-a.e. $x\in{\mathbb{G}}$. Then, $L^{p}({\mathbb{G}},\omega)$ is a normed space with the norm defined by

$$\|f\|_{L^{p}({\mathbb{G}},\omega)}\triangleq\Big(\int_{\mathbb{G}}|f(y)|^{p}\,\omega(\mathrm{d}y)\Big)^{\frac{1}{p}}\quad\mbox{for }1\leq p<\infty,$$

and

$$\|f\|_{L^{\infty}({\mathbb{G}},\omega)}\triangleq\inf\left\{t\in\mathbb{R}:\,|f(x)|\leq t\mbox{ for $\omega$-a.e. }x\in{\mathbb{G}}\right\}.$$

Recall that Sobolev transport for probability measures on a graph is an instance of integral probability metrics (IPM) (Müller, 1997). Intuitively, the definition of Sobolev transport is based on the dual form of the $1$-order Wasserstein distance, but its Lipschitz constraint for the critic function is considered in the graph-based Sobolev space (see (Le et al., 2022, §3) for details). As a consequence, it may not be possible to directly leverage approaches for standard OT (e.g., partial OT, entropy (partial) transport) to extend Sobolev transport to unbalanced measures on a graph.

In this paper, we propose a detour to develop unbalanced Sobolev transport for measures with unequal mass on a graph. We first take a step back to leverage the EPT (for unbalanced measures on a tree) (Le and Nguyen, 2021) and extend it to unbalanced measures on a graph (§3). Although it is still a great challenge to efficiently compute the EPT for unbalanced measures on a graph, this novel extension (especially its dual form) serves as a cornerstone in deriving a scalable approach for the proposed unbalanced Sobolev transport (UST) (§4).

3 ENTROPY PARTIAL TRANSPORT ON A GRAPH

The entropy partial transport (EPT) (Le and Nguyen, 2021) is developed for unbalanced measures on a tree. In this section, we propose an extension of EPT for unbalanced measures on a graph. Intuitively, EPT optimizes a sum of a transport function and two convex entropy functions in a similar spirit to the OET (Liero et al., 2018; Chizat et al., 2018b). We first consider the primal formulation of EPT on a graph. We then derive its dual formulation, which is the main result of this section. This novel dual formulation paves the way for our development of the UST (§4).

Given two measures $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ which may have different total mass, consider the set

$$\Pi_{\leq}(\mu,\nu)\triangleq\big\{\gamma\in{\mathcal{M}}({\mathbb{G}}\times{\mathbb{G}}):\,\gamma_{1}\leq\mu,\,\gamma_{2}\leq\nu\big\},$$

where $\gamma_{1}$ and $\gamma_{2}$ respectively denote the first and second marginals of $\gamma$; by $\gamma_{1}\leq\mu$, we mean that $\gamma_{1}(B)\leq\mu(B)$ for every Borel set $B\subset{\mathbb{G}}$. A similar convention is used when we write $\gamma_{2}\leq\nu$.

For $\gamma\in\Pi_{\leq}(\mu,\nu)$, let $f_{1}$ and $f_{2}$ respectively be the Radon-Nikodym derivatives of $\gamma_{1}$ w.r.t. $\mu$ and of $\gamma_{2}$ w.r.t. $\nu$, i.e., $\gamma_{1}=f_{1}\mu$ and $\gamma_{2}=f_{2}\nu$. Then, we have $0\leq f_{1}\leq 1$ $\mu$-a.e., and $0\leq f_{2}\leq 1$ $\nu$-a.e. The weighted relative entropies of $\gamma_{1}$ w.r.t. $\mu$ and of $\gamma_{2}$ w.r.t. $\nu$ are defined by

$${\mathcal{F}}_{1}(\gamma_{1}|\mu)\triangleq\int_{\mathbb{G}}w_{1}(x)F_{1}(f_{1}(x))\,\mu(\mathrm{d}x),\qquad{\mathcal{F}}_{2}(\gamma_{2}|\nu)\triangleq\int_{\mathbb{G}}w_{2}(x)F_{2}(f_{2}(x))\,\nu(\mathrm{d}x),$$

where $F_{1},\,F_{2}:[0,1]\to(0,\infty)$ are convex and lower semicontinuous entropy functions; and $w_{1},w_{2}:{\mathbb{G}}\to[0,\infty)$ are given nonnegative weight functions.

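Given a continuous cost function $c:{\mathbb{G}}\times{\mathbb{G}}\to\mathbb{R}$ with $c(x,x)=0$, a constant $b\geq 0$ and a fixed scalar $m\in[0,\bar{m}]$ where $\bar{m}\triangleq\min\{\mu({\mathbb{G}}),\nu({\mathbb{G}})\}$, we consider the primal formulation of the EPT problem on a graph:

$$\mathrm{W}_{c,m}(\mu,\nu)\triangleq\inf_{\gamma\in\Pi_{\leq}(\mu,\nu),\,\gamma({\mathbb{G}}\times{\mathbb{G}})=m}\Big[{\mathcal{F}}_{1}(\gamma_{1}|\mu)+{\mathcal{F}}_{2}(\gamma_{2}|\nu)+b\int_{{\mathbb{G}}\times{\mathbb{G}}}c(x,y)\,\gamma(\mathrm{d}x,\mathrm{d}y)\Big].\tag{3}$$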

Following (Le and Nguyen, 2021), we consider

$$F_{1}(s)=F_{2}(s)=|s-1|$$

for the entropy functions in (3) and form a Lagrange multiplier $\lambda\in\mathbb{R}$ conjugate to the constraint $\gamma({\mathbb{G}}\times{\mathbb{G}})=m$. As a result, we instead study the problem

$$\mathrm{ET}_{c,\lambda}(\mu,\nu)=\inf_{\gamma\in\Pi_{\leq}(\mu,\nu)}\mathcal{C}_{\lambda}(\gamma),\tag{4}$$

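where $\mathcal{C}_{\lambda}(\gamma)$ is defined as

$$\mathcal{C}_{\lambda}(\gamma)\triangleq\int_{\mathbb{G}}w_{1}\,\mu(\mathrm{d}x)+\int_{\mathbb{G}}w_{2}\,\nu(\mathrm{d}x)-\int_{\mathbb{G}}w_{1}\,\gamma_{1}(\mathrm{d}x)-\int_{\mathbb{G}}w_{2}\,\gamma_{2}(\mathrm{d}x)+b\int_{{\mathbb{G}}\times{\mathbb{G}}}[c(x,y)-\lambda]\,\gamma(\mathrm{d}x,\mathrm{d}y).$$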

The connection between problem (3) with mass constraint $m$ and problem (4) with Lagrange multiplier $\lambda$ is given in Theorem A.1 (Appendix §A.1). Also, from Theorem A.1, we see that solving the auxiliary problem (4) gives us a solution to the original problem (3). We now derive a novel dual formulation for problem (4) which paves the way for our proposed UST (§4).
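The link between the objective of problem (3) and $\mathcal{C}_{\lambda}$ can be checked numerically for discrete measures. With $F_{1}(s)=F_{2}(s)=|s-1|$ and $0\leq f_{1}\leq 1$, one has ${\mathcal{F}}_{1}(\gamma_{1}|\mu)=\int w_{1}\,\mathrm{d}\mu-\int w_{1}\,\mathrm{d}\gamma_{1}$ (and similarly for ${\mathcal{F}}_{2}$), so $\mathcal{C}_{\lambda}(\gamma)$ equals the objective of (3) minus $b\lambda\,\gamma({\mathbb{G}}\times{\mathbb{G}})$. The sketch below evaluates both quantities for measures supported on finitely many points. The function names and the list-based representation are assumptions for illustration only, and the code evaluates the objectives for a given plan rather than solving the minimization.

```python
def marginals(gamma):
    """Row sums (first marginal) and column sums (second marginal) of a plan."""
    g1 = [sum(row) for row in gamma]
    g2 = [sum(col) for col in zip(*gamma)]
    return g1, g2


def primal_objective(mu, nu, gamma, w1, w2, c, b):
    """Objective of (3) with F_1(s) = F_2(s) = |s - 1| for discrete measures.

    w(x) * |f(x) - 1| * mu(x) with f = gamma_1 / mu is written as
    w(x) * |gamma_1(x) - mu(x)| to avoid dividing by zero masses.
    """
    g1, g2 = marginals(gamma)
    ent1 = sum(w * abs(g - m) for w, g, m in zip(w1, g1, mu))
    ent2 = sum(w * abs(g - m) for w, g, m in zip(w2, g2, nu))
    transport = sum(c[i][j] * gamma[i][j]
                    for i in range(len(mu)) for j in range(len(nu)))
    return ent1 + ent2 + b * transport


def c_lambda(mu, nu, gamma, w1, w2, c, b, lam):
    """C_lambda(gamma) written term by term as in its definition."""
    g1, g2 = marginals(gamma)
    return (sum(w * m for w, m in zip(w1, mu))
            + sum(w * m for w, m in zip(w2, nu))
            - sum(w * g for w, g in zip(w1, g1))
            - sum(w * g for w, g in zip(w2, g2))
            + b * sum((c[i][j] - lam) * gamma[i][j]
                      for i in range(len(mu)) for j in range(len(nu))))
```

For any plan whose marginals are dominated by `mu` and `nu`, `c_lambda(...)` should agree with `primal_objective(...) - b * lam * m`, where `m` is the total mass of the plan.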

Theorem 3.1 (Dual formula for general cost).

For $\lambda\geq 0$, nonnegative weights $w_{1},w_{2}$, and two input measures $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$, we have

$$\mathrm{ET}_{c,\lambda}(\mu,\nu)=\sup_{(u,v)\in{\mathbb{K}}}\Big[\int_{{\mathbb{G}}}u(x)\,\mu(\mathrm{d}x)+\int_{{\mathbb{G}}}v(x)\,\nu(\mathrm{d}x)\Big],$$

where ${\mathbb{K}}\triangleq\Big\{(u,v):\,u\leq w_{1},\,-b\lambda+\inf_{x\in{\mathbb{G}}}[b\,c(x,y)-w_{1}(x)]\leq v(y)\leq w_{2}(y),\,u(x)+v(y)\leq b[c(x,y)-\lambda]\Big\}$.

The main idea of proving this result is to attach to the graph ${\mathbb{G}}$ a new point $\hat{s}$, and then suitably and carefully extend the cost $c$ and the input distributions $\mu,\nu$ to the set $\hat{\mathbb{G}}\triangleq{\mathbb{G}}\cup\{\hat{s}\}$, inspired by an observation in (Caffarelli and McCann, 2010). The key point of this extension is to ensure that the extended input distributions on $\hat{\mathbb{G}}$ have the same total mass and that the value of the new balanced OT between the extended input distributions on $\hat{\mathbb{G}}$ is equal to that of the original EPT on graph ${\mathbb{G}}$ (i.e., the unbalanced setting). We then exploit the dual theory for the new balanced OT problem on $\hat{\mathbb{G}}$ to establish the dual formulation for our EPT problem on graph ${\mathbb{G}}$ (see Appendix §A.2 for the detailed proof). When the ground cost $c$ is the graph metric $d_{\mathbb{G}}$, the dual formula in Theorem 3.1 can be rewritten in a simpler and more symmetric form as follows.

Corollary 3.2 (Dual formula for graph metric).

Assume that $\lambda\geq 0$ and the nonnegative weight functions $w_{1},w_{2}$ are $b$-Lipschitz w.r.t. $d_{\mathbb{G}}$. For simplicity, let $\mathrm{ET}_{\lambda}\triangleq\mathrm{ET}_{d_{\mathbb{G}},\lambda}$. Then, we have

$$\mathrm{ET}_{\lambda}(\mu,\nu)=\sup_{f\in\mathbb{U}}\int_{\mathbb{G}}f(\mu-\nu)-\frac{b\lambda}{2}\big[\mu({\mathbb{G}})+\nu({\mathbb{G}})\big],\tag{6}$$

where $\mathbb{U}\triangleq\big\{f\in C({\mathbb{G}}):-w_{2}-\frac{b\lambda}{2}\leq f\leq w_{1}+\frac{b\lambda}{2},\,|f(x)-f(y)|\leq b\,d_{\mathbb{G}}(x,y)\big\}$.

Remark 3.3.

We remark that one cannot directly use the dual formulation in (Le and Nguyen, 2021), or that of (Piccoli and Rossi, 2014, 2016), for unbalanced measures on a graph, since the considered problem does not satisfy the conditions imposed in these approaches for duality.

In principle, for input unbalanced measures on a graph, it is simpler to learn the optimal $f^{*}$ in dual form (6) than to learn the optimal $\gamma^{*}$ in primal form (4). This is because the critic $f^{*}$ is a function on a lower-dimensional space than $\gamma^{*}$. Moreover, the Lipschitz constraint for $f^{*}$ is easier to handle than the constraint $\Pi_{\leq}(\mu,\nu)$ for $\gamma^{*}$. Nevertheless, it is still a challenge to effectively compute $\mathrm{ET}_{\lambda}$ using (6).

As illustrated in (Le et al., 2019; Le and Nguyen, 2021) for transport problems on a tree, the Lipschitz constraint for the critic $f$ can be effectively optimized by leveraging the tree structure of the supports. Furthermore, the Lipschitz constraint is linked with the $1$-order Wasserstein distance via the Kantorovich duality formulation. Due to the different nature of duality for the $p$-order Wasserstein distance when $p>1$, it is, however, unknown whether one can extend the fast computational results in (Le et al., 2019; Le and Nguyen, 2021) to the $p$-order Wasserstein distance with $p>1$, even for measures on a tree.

To alleviate this, we propose in the next section an efficient $p$-order unbalanced Sobolev transport for measures with unequal mass on a graph for any $p\geq 1$.

4 UNBALANCED SOBOLEV TRANSPORT

As pointed out in §3, it is a great challenge to efficiently compute $\mathrm{ET}_{\lambda}$ (i.e., the EPT problem) for unbalanced measures on a graph using either the primal form (4) or the dual form (6). To overcome this issue, we propose in this section an efficient variant called unbalanced Sobolev transport (UST) distance. We further derive a novel closed-form formula which allows a fast computation for the proposed transport distance, especially for large-scale settings.
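To give a feel for why a closed form makes the computation fast, the sketch below evaluates the discrete expression $b\,\big(\sum_{e\in E}w_{e}|\mu(\gamma_{e})-\nu(\gamma_{e})|^{p}\big)^{1/p}+\Theta\,|\mu({\mathbb{G}})-\nu({\mathbb{G}})|$ that appears among the equations listed for this paper, with $\Theta=w_{1}(z_{0})+\frac{b\lambda}{2}-\alpha$ if $\mu({\mathbb{G}})\geq\nu({\mathbb{G}})$ and $\Theta=w_{2}(z_{0})+\frac{b\lambda}{2}-\alpha$ otherwise. It is only a sketch under stated assumptions: measures are supported on nodes, $p$ is finite, each edge carries $\omega$-mass equal to its length $w_{e}$, and shortest paths from the root are unique. The function name `ust_closed_form`, its parameters, and the input format are hypothetical and are not the API of the released code.

```python
import heapq


def ust_closed_form(adj, root, mu, nu, p=1.0, b=1.0, lam=1.0,
                    alpha=0.0, w1_root=0.0, w2_root=0.0):
    """Evaluate the discrete closed-form expression for node-supported measures.

    adj: {node: [(neighbor, length), ...]}; mu, nu: {node: mass}.
    w1_root, w2_root stand for w_1(z_0) and w_2(z_0).
    """
    if not 0.0 <= alpha <= 0.5 * (b * lam + w1_root + w2_root):
        raise ValueError("alpha outside [0, (b*lam + w1(z0) + w2(z0)) / 2]")

    # Shortest-path tree from the root (Dijkstra); plen[v] is the length of
    # the tree edge entering v.
    dist, parent, plen = {root: 0.0}, {root: None}, {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, length in adj[u]:
            if v not in dist or d + length < dist[v]:
                dist[v], parent[v], plen[v] = d + length, u, length
                heapq.heappush(heap, (d + length, v))

    # diff[v] accumulates mu(gamma_e) - nu(gamma_e) for the tree edge e
    # entering v; farthest nodes are processed first so that children are
    # folded into their parents before the parents are used.
    diff = {v: mu.get(v, 0.0) - nu.get(v, 0.0) for v in dist}
    total = 0.0
    for v in sorted(dist, key=dist.get, reverse=True):
        if parent[v] is None:
            continue
        total += plen[v] * abs(diff[v]) ** p
        diff[parent[v]] += diff[v]

    mass_mu, mass_nu = sum(mu.values()), sum(nu.values())
    theta = (w1_root if mass_mu >= mass_nu else w2_root) + b * lam / 2 - alpha
    return b * total ** (1.0 / p) + theta * abs(mass_mu - mass_nu)
```

After the shortest-path tree is built, the evaluation is a single pass over the tree edges, which is the sense in which the expression is cheap to compute in this sketch.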

Our strategy in defining the UST is based on the dual formulation (6) (in Corollary 3.2), but simultaneously relaxes the two constraints for the critic function $f$ in the set $\mathbb{U}$. This approach is partially adopted in (Le and Nguyen, 2021) for the EPT problem for measures on a tree, but they only relax the first corresponding constraint for $f$ in the set $\mathbb{U}$ (i.e., the bounded constraint for the critic function $f$). However, keeping the Lipschitz constraint for $f$ prevents the approach in (Le and Nguyen, 2021) from being extended to structures more general than trees (e.g., graph structure). We note that the Lipschitz constraint is about bounding the derivative of $f$ and hence it is more fundamental and relevant than the first constraint. In this paper, we propose to also relax the Lipschitz constraint by leveraging a notion of Sobolev functions. This approach relies on the following concept of derivatives for functions on graphs introduced by Le et al. (2022), which can be viewed as a generalized version of the fundamental theorem of calculus for a graph.

Definition 4.1 (Graph-based Sobolev space (Le et al., 2022)).

Let $\omega$ be a nonnegative Borel measure on ${\mathbb{G}}$, and let $1\leq p\leq\infty$. A continuous function $f:{\mathbb{G}}\to\mathbb{R}$ is in the Sobolev space $W^{1,p}({\mathbb{G}},\omega)$ if there exists a function $h\in L^{p}({\mathbb{G}},\omega)$ satisfying

$$f(x)-f(z_{0})=\int_{[z_{0},x]}h(y)\,\omega(\mathrm{d}y),\quad\forall x\in{\mathbb{G}}.$$

Such a function $h$ is unique in $L^{p}({\mathbb{G}},\omega)$ and is called the graph derivative of $f$ w.r.t. the measure $\omega$. Hereafter, this graph derivative of $f$ is denoted by $f^{\prime}$.

From Definition 4.1 and the property of the $L^{p}({\mathbb{G}},\omega)$ space, we have

$$W^{1,p_{2}}({\mathbb{G}},\omega)\subset W^{1,p_{1}}({\mathbb{G}},\omega),$$

whenever $1\leq p_{1}\leq p_{2}\leq\infty$. In particular, $W^{1,\infty}({\mathbb{G}},\omega)$ is the smallest space and $W^{1,1}({\mathbb{G}},\omega)$ is the largest space. Additionally, we prove that $W^{1,\infty}({\mathbb{G}},\omega^{*})$ contains the space of all Lipschitz continuous functions, and both spaces coincide when ${\mathbb{G}}$ is a tree (see Lemma A.2 in Appendix §A.1 for details). Here and hereafter, $\omega^{*}$ denotes the length measure on ${\mathbb{G}}$ as defined in (Le et al., 2022, §4.1) (see Appendix §B.1 for a review). We propose to regularize the transport $\mathrm{ET}_{\lambda}$ in (6) by relaxing the constraint set $\mathbb{U}$ for the critic function $f$ in two ways:

$\bullet$ Firstly, we replace the Lipschitz condition for the critic function $f$ in the set $\mathbb{U}$ (in Corollary 3.2) by the corresponding constraint in the graph-based Sobolev space, i.e., $f\in W^{1,p^{\prime}}({\mathbb{G}},\omega)$ with $\|f^{\prime}\|_{L^{p^{\prime}}({\mathbb{G}},\omega)}\leq b$. This has the following advantages: (i) we can enlarge the constraint set on the Sobolev space $W^{1,p^{\prime}}({\mathbb{G}},\omega)$ by decreasing the value of the parameter $p^{\prime}$; (ii) we can vary the constraint set by choosing a suitable measure $\omega$ on ${\mathbb{G}}$. The measure $\omega$ can be interpreted as a cost of moving a unit mass from one location to another, and this cost is the same as the graph metric $d_{\mathbb{G}}$ when $\omega$ is chosen as the length measure $\omega^{*}$ of ${\mathbb{G}}$. Even when $p=1$ and $\omega=\omega^{*}$, this relaxation viewpoint still has a fundamental benefit: it allows us to extend most of the main results in (Le and Nguyen, 2021) from tree structure to graph structure.

We emphasize that extending the approach in (Le and Nguyen, 2021) (i.e., the EPT problem for measures on a tree) to the EPT problem for measures on a graph ${\mathbb{G}}$ is problematic. In this special case (i.e., $p=1$ and $\omega=\omega^{*}$), we know from Lemma A.2 (Appendix §A.1) that our corresponding Sobolev constraint is equivalent to the Lipschitz constraint when ${\mathbb{G}}$ is a tree. However, Lemma A.2 also implies that the Sobolev constraint set is possibly larger for a general graph ${\mathbb{G}}$. This flexibility of Sobolev functions enables us to overcome the limitation of the approach in (Le and Nguyen, 2021) (i.e., for a tree structure) and gives us an effective way to exploit the graph structure by working with a critic function $f$ of a specific form in Sobolev space (see Definition 4.1). The results obtained in this section reveal that a critic of Sobolev type in the sense of Definition 4.1 is more suitable for the EPT problem for measures on a graph than a critic of Lipschitz type.

$\bullet$ Secondly, we relax the first condition for $f$ in the set $\mathbb{U}$ (i.e., the bounded constraint for the critic function $f$ ) by using the following observation. According to Definition 4.1, any function $f\in W^{1,p^{\prime}}({\mathbb{G}},\omega)$ can be represented as

[TABLE]

If in addition $\|f^{\prime}\|_{L^{p^{\prime}}({\mathbb{G}},\omega)}\leq b$ , then by Hölder inequality, the second term on the right hand side is controlled by $b\,\omega\big{(}[z_{0},x]\big{)}^{\frac{1}{p}}$ . Thus, instead of requiring

[TABLE]

as in the definition of $\mathbb{U}$ , we propose to constrain only the first term $f(z_{0})$ .
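For completeness, the Hölder step behind this bound can be written out as follows. This is a sketch that assumes the representation of Definition 4.1 writes $f(x)$ as $f(z_{0})$ plus the integral of $f^{\prime}$ over $[z_{0},x]$ against $\omega$ , and that $p$ and $p^{\prime}$ are conjugate exponents, i.e., $\frac{1}{p}+\frac{1}{p^{\prime}}=1$ (with the usual conventions when $p=1$ or $p=\infty$ ):

```latex
\begin{aligned}
\Big|\int_{[z_{0},x]} f^{\prime}(y)\,\omega(\mathrm{d}y)\Big|
 &\leq \Big(\int_{[z_{0},x]} |f^{\prime}(y)|^{p^{\prime}}\,\omega(\mathrm{d}y)\Big)^{\frac{1}{p^{\prime}}}
       \Big(\int_{[z_{0},x]} 1\,\omega(\mathrm{d}y)\Big)^{\frac{1}{p}} \\
 &\leq \|f^{\prime}\|_{L^{p^{\prime}}({\mathbb{G}},\omega)}\;\omega\big([z_{0},x]\big)^{\frac{1}{p}}
 \;\leq\; b\,\omega\big([z_{0},x]\big)^{\frac{1}{p}}.
\end{aligned}
```

Under these assumptions, the deviation $|f(x)-f(z_{0})|$ is bounded as soon as the Sobolev constraint holds, so only the value $f(z_{0})$ needs a separate constraint.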

Putting these two ways of regularization together, we propose to consider the following constraint set $\mathbb{U}_{p^{\prime}}^{\alpha}$ as a relaxation of the constraint set $\mathbb{U}$ for the critic function $f$ in Corollary 3.2. Note that the choice $\alpha\!=\!0$ corresponds to the discussion above. We carry out the theoretical development for a general $\alpha$ to allow an extra degree of freedom, which might be useful in practical applications, e.g., by tuning $\alpha$ for further improvement.

Definition 4.2 (The regularized set $\mathbb{U}_{p^{\prime}}^{\alpha}$ for critic function).

For $1\leq p\leq\infty$ and $0\leq\alpha\leq\frac{1}{2}[b\lambda+w_{1}(z_{0})+w_{2}(z_{0})]$ , let $\mathbb{U}_{p^{\prime}}^{\alpha}$ be the collection of all functions $f\in W^{1,p^{\prime}}({\mathbb{G}},\omega)$ satisfying

[TABLE]

and

[TABLE]

Equivalently, $\mathbb{U}_{p^{\prime}}^{\alpha}$ is the collection of all functions $f$ of the form

[TABLE]

with $s\in I_{\alpha}$ and with $h:{\mathbb{G}}\to\mathbb{R}$ being some function satisfying

[TABLE]

It is clear from Definition 4.2 that $\mathbb{U}\subset\mathbb{U}_{p^{\prime}}^{0}$ (see Corollary 3.2 for set $\mathbb{U}$ ). The requirement $\alpha\leq\frac{1}{2}[b\lambda+w_{1}(z_{0})+w_{2}(z_{0})]$ is to ensure that the interval $I_{\alpha}$ is nonempty. By constraining critic $f$ to the relaxed set $\mathbb{U}_{p^{\prime}}^{\alpha}$ and noting that the last term in (6) is simply a constant depending on the total masses of $\mu$ and $\nu$ , we propose the following regularization of the transport $\mathrm{ET}_{\lambda}$ in Corollary 3.2, namely unbalanced Sobolev transport (UST).

Definition 4.3 (Unbalanced Sobolev transport).

Let $\omega$ be a nonnegative Borel measure on graph ${\mathbb{G}}$ , and let $1\leq p\leq\infty$ and $0\leq\alpha\leq\frac{1}{2}[b\lambda+w_{1}(z_{0})+w_{2}(z_{0})]$ . For $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ , the unbalanced Sobolev transport is defined as follows:

[TABLE]

The measure $\omega$ used for representing the critic $f$ in $\mathbb{U}_{p^{\prime}}^{\alpha}$ (see (7)) acts as the ground cost of moving masses on graph ${\mathbb{G}}$ from one location to another. In particular, when $\omega$ is chosen as the length measure $\omega^{*}$ of graph ${\mathbb{G}}$ , we have $\omega([x,y])=d_{\mathbb{G}}(x,y)$ (see Lemma B.2 in Appendix §B.1).

We then show the connection between the $1$ -order UST and the dual formulation of EPT on graph ${\mathbb{G}}$ with the Lipschitz constraint, but with the boundedness constraint applied only to the critic function at the root node $z_{0}$ . Precisely, we obtain:

Lemma 4.4.

Recall that $\omega^{*}$ is the length measure of graph ${\mathbb{G}}$ . For $\omega=\omega^{*}$ , we have

[TABLE]

where $\mathbb{U}_{0}\triangleq\Big{\{}f\in C({\mathbb{G}}):\,-w_{2}(z_{0})-\frac{b\lambda}{2}\leq f(z_{0})\leq w_{1}(z_{0})+\frac{b\lambda}{2},\,|f(x)-f(y)|\leq b\,d_{\mathbb{G}}(x,y)\Big{\}}$ . Moreover, the inequality in (8) becomes an equality if ${\mathbb{G}}$ is a tree.

We next state our fundamental result, which demonstrates that the proposed UST (Definition 4.3) for measures with unequal mass on a graph is computationally efficient. In fact, we obtain a closed-form formula for UST in terms of an integral depending explicitly on the input measures. This yields a substantial computational advantage over the EPT approach for unbalanced measures on a graph (i.e., $\mathrm{ET}_{\lambda}$ ), which requires solving sophisticated optimization problems either in the primal (4) or in its dual (6). To our knowledge, the proposed UST is the first approach that yields a closed-form solution among available variants of unbalanced OT for measures with unequal mass on a graph.

Proposition 4.5.

Let $\omega$ be a nonnegative measure on graph ${\mathbb{G}}$ . Let $1\leq p\leq\infty$ and $0\leq\alpha\leq\frac{1}{2}[b\lambda+w_{1}(z_{0})+w_{2}(z_{0})]$ . Then, for two input measures $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ , we have

[TABLE]

where $\Lambda(x)$ is defined by (1) and

[TABLE]

The constant $\Theta$ depends on $\mu$ and $\nu$ unless $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ or $w_{1}(z_{0})=w_{2}(z_{0})$ . The integral in the above expression can be computed explicitly and efficiently as in the following corollary when the two input distributions are supported on nodes of the graph (i.e., the node set $V$ of graph ${\mathbb{G}}$ ).

Corollary 4.6.

Under the same assumptions as in Proposition 4.5, assume in addition that $\omega(\{x\})=0$ for every $x\in{\mathbb{G}}$ . Suppose that $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ are supported on nodes in $V$ of graph ${\mathbb{G}}$ (an extension for measures supported in ${\mathbb{G}}$ is discussed in Appendix §B.2). Then, we have

[TABLE]

Remark 4.7 (UST for non-physical graph).

We have assumed that ${\mathbb{G}}$ is a physical graph as in §2. However, Corollary 4.6 shows that the $p$ -order unbalanced Sobolev transport $\mathrm{US}_{p}^{\alpha}$ does not depend on this physical assumption when input measures are supported on nodes. Precisely, it only depends on the graph structure $(V,E)$ and edge weights $w_{e}$ . Thus, $\mathrm{US}_{p}^{\alpha}$ can be applied to a non-physical graph ${\mathbb{G}}$ .

We next describe a preprocessing step on graph ${\mathbb{G}}$ and analyze the time complexity in computing $\mathrm{US}_{p}^{\alpha}$ .

Preprocessing step. To compute $\mathrm{US}_{p}^{\alpha}$ , we apply a preprocessing step to form the set $\gamma_{e}$ for each edge $e\in E$ in graph ${\mathbb{G}}$ by identifying shortest paths from the root node $z_{0}$ to the other nodes (e.g., by Dijkstra's algorithm with complexity $\mathcal{O}(|E|+|V|\log{|V|})$ , where $|E|,|V|$ are the numbers of edges and nodes of graph ${\mathbb{G}}$ respectively). In particular, observe that any edge $e$ with $\gamma_{e}=\varnothing$ does not contribute to the computation of $\mathrm{US}_{p}^{\alpha}$ . Therefore, one can remove such edges $e$ from the summation in Corollary 4.6. We emphasize that this preprocessing step only involves the graph structure itself and is independent of the input measures.
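The preprocessing step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `shortest_path_tree`, `edge_sets`, and the adjacency format `adj` are hypothetical, shortest paths from the root are assumed unique, and $\gamma_{e}$ is restricted to graph nodes. The binary heap used here gives $\mathcal{O}(|E|\log|V|)$ rather than the Fibonacci-heap bound quoted above.

```python
import heapq

def shortest_path_tree(adj, root):
    """Dijkstra from `root`; adj[u] is a list of (v, w_uv) pairs with w_uv > 0.

    Returns (dist, parent), where parent[v] is the predecessor of v on a
    shortest path from the root and parent[root] is None.
    """
    dist = {root: 0.0}
    parent = {root: None}
    heap = [(0.0, root)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent

def edge_sets(parent):
    """gamma[(parent[v], v)] = nodes y whose root path [z0, y] uses that edge.

    Edges that never appear as a key have an empty gamma_e (on nodes).
    """
    gamma = {}
    for y in parent:
        v = y
        while parent[v] is not None:
            gamma.setdefault((parent[v], v), set()).add(y)
            v = parent[v]
    return gamma
```

Storing every $\gamma_{e}$ explicitly is only for illustration; for node-supported measures it is enough to keep the `parent` pointers and accumulate masses along root paths.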

Computational complexity. Let $E_{\mu,\nu}\triangleq\left\{e\in E\mid e\subset[z_{0},z]\text{ for some }z\in\text{supp}(\mu)\cup\text{supp}(\nu)\right\}$ , where $\text{supp}(\mu),\text{supp}(\nu)$ are respectively the supports of measures $\mu,\nu$ . Then, the computational complexity of $\mathrm{US}_{p}^{\alpha}(\mu,\nu)$ is linear in the number of edges in $E_{\mu,\nu}$ .
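The linear cost can be illustrated with a short sketch. It rests on an assumption: that for node-supported measures the integral term reduces to a weighted edge sum of the form $\big(\sum_{e}w_{e}|\mu(\gamma_{e})-\nu(\gamma_{e})|^{p}\big)^{1/p}$ , as in Sobolev transport (Le et al., 2022). The factor $b$ and the mass-difference term involving $\Theta$ are deliberately left out, and the function name `edge_term` and its arguments are hypothetical. Only finite $p$ is handled.

```python
def edge_term(parent, weight, mu, nu, p=1.0):
    """Assumed edge-sum term (sum_e w_e |mu(gamma_e) - nu(gamma_e)|^p)^(1/p).

    parent: shortest-path tree, parent[root] is None.
    weight: weight[(parent[v], v)] is the length of the tree edge into v.
    mu, nu: dicts mapping node -> mass (total masses may differ).
    Only edges lying on root paths of support nodes are visited.
    """
    diff = {}
    for x, m in mu.items():
        diff[x] = diff.get(x, 0.0) + m
    for x, m in nu.items():
        diff[x] = diff.get(x, 0.0) - m

    # Nodes on a root path of some support node.
    on_path = set()
    for x in diff:
        v = x
        while v is not None and v not in on_path:
            on_path.add(v)
            v = parent[v]

    # Count children inside this set so that leaves are processed first.
    n_children = {v: 0 for v in on_path}
    for v in on_path:
        if parent[v] is not None:
            n_children[parent[v]] += 1

    stack = [v for v in on_path if n_children[v] == 0]
    total = 0.0
    while stack:
        v = stack.pop()
        u = parent[v]
        if u is None:
            continue  # reached the root
        d = diff.get(v, 0.0)  # mu(gamma_e) - nu(gamma_e) for e = (u, v)
        total += weight[(u, v)] * abs(d) ** p
        diff[u] = diff.get(u, 0.0) + d
        n_children[u] -= 1
        if n_children[u] == 0:
            stack.append(u)
    return total ** (1.0 / p)
```

Each edge of $E_{\mu,\nu}$ is touched once, which matches the stated complexity; edges outside $E_{\mu,\nu}$ are never visited.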

Related work. Beyond the pure graph of supports, the metric structure inherited from the graph metric space plays an important role in our work. More precisely, an edge weight $w_{e}$ is considered as a cost to move a unit mass from one node to the other node of edge $e$ (i.e., graph metric distance between two edge nodes). Therefore, one should distinguish our approach from the unbalanced diffusion earth mover’s distance (Tong et al., 2022), which uses an affinity between two edge nodes in their graph.

$\bullet$ Relation with Sobolev transport (ST) (Le et al., 2022). We emphasize that ST is only valid for measures with equal mass on a graph. It cannot be applied to the problem considered here, where input measures may have different total mass. Even though both ST and the proposed UST are instances of integral probability metrics (IPM), it is nontrivial to effectively extend ST to unbalanced measures on a graph by defining a function set for the critic. The theoretical results of EPT on a graph in §3 play a fundamental role in developing our proposed UST.

Remark 4.8 (The special case of balanced mass).

When input measures have the same mass, from Lemma A.6 of §A.1.5, the proposed unbalanced Sobolev transport (with $b=1$ ) coincides with the balanced Sobolev transport (Le et al., 2022, Definition 3.2).

$\bullet$ Relation with EPT on a tree (Le and Nguyen, 2021). As discussed previously, extending the approach in (Le and Nguyen, 2021) for EPT on a tree to the problem considered here (i.e., EPT on a graph) is problematic. We see from Lemma A.2 (Appendix §A.1) and Lemma 4.4 that the Sobolev constraint set in our approach is possibly larger than the Lipschitz constraint set for a general graph ${\mathbb{G}}$ , but these two constraint sets coincide when ${\mathbb{G}}$ is a tree. Our results illustrate that it is more efficient to exploit the graph structure with a critic of Sobolev type (as in our approach) than with a critic of Lipschitz type (as in EPT on a tree).

5 PROPERTIES OF UNBALANCED SOBOLEV TRANSPORT

In this section, we derive geometric structures together with bounds for UST and prove its negative definiteness. Consequently, we develop positive definite kernels upon UST, required in many kernel-dependent frameworks.

We first show that $\mathrm{US}_{p}^{\alpha}$ possesses the metric property. Moreover, it makes the space of measures ${\mathcal{M}}({\mathbb{G}})$ a geodesic space. Thus, $({\mathcal{M}}({\mathbb{G}}),\mathrm{US}_{p}^{\alpha})$ inherits all geometric properties of a geodesic space.

Proposition 5.1 (Geometric structures of $\mathrm{US}_{p}^{\alpha}$ ).

Let $\omega$ be a nonnegative Borel measure on ${\mathbb{G}}$ . Assume that $\lambda,w_{1}(z_{0}),w_{2}(z_{0})\geq 0$ . For $1\leq p\leq\infty$ and $0\leq\alpha<\frac{b\lambda}{2}+\min\{w_{1}(z_{0}),w_{2}(z_{0})\}$ , then we have

i) $\mathrm{US}_{p}^{\alpha}(\mu+\sigma,\nu+\sigma)=\mathrm{US}_{p}^{\alpha}(\mu,\nu)$ , $\forall\mu,\nu,\sigma\in{\mathcal{M}}({\mathbb{G}})$ .

ii) $\mathrm{US}_{p}^{\alpha}$ is a divergence (i.e., $\mathrm{US}_{p}^{\alpha}\geq 0$ , and $\mathrm{US}_{p}^{\alpha}(\mu,\nu)=0$ if and only if $\mu=\nu$ ) and satisfies the triangle inequality:

[TABLE]

iii) If in addition $w_{1}(z_{0})=w_{2}(z_{0})$ , then $\mathrm{US}_{p}^{\alpha}$ is a metric and $({\mathcal{M}}({\mathbb{G}}),\mathrm{US}_{p}^{\alpha})$ is a complete metric space. Moreover, it is a geodesic space in the sense that for every two points $\mu$ and $\nu$ in ${\mathcal{M}}({\mathbb{G}})$ there exists a path $\varphi:[0,a]\to{\mathcal{M}}({\mathbb{G}})$ with $a\triangleq\mathrm{US}_{p}^{\alpha}(\mu,\nu)$ such that $\varphi(0)=\mu$ , $\varphi(a)=\nu$ , and

[TABLE]

In Proposition A.4 (Appendix §A.1), we also establish a comparison between $\mathrm{US}_{p}^{\alpha}$ for different exponents $p$ . We next derive a lower bound for $\mathrm{US}_{1}^{0}$ in terms of $\mathrm{ET}_{\lambda}$ . In fact, a more general estimate holds true for every $p\geq 1$ and is given in Proposition A.5 (Appendix §A.1). As a consequence of Corollary 3.2 and Lemma 4.4 and since $\mathbb{U}\subset\mathbb{U}_{0}$ , we obtain:

Proposition 5.2 (Lower bound for $\mathrm{US}_{1}^{0}$ ).

Recall that $\omega^{*}$ is the length measure on ${\mathbb{G}}$ . Assume that $w_{1},w_{2}$ are $b$ -Lipschitz w.r.t. $d_{\mathbb{G}}$ . For $\omega=\omega^{*}$ , $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ , we have

[TABLE]

We emphasize that when ${\mathbb{G}}$ is a tree, our EPT problems on a graph (i.e., $\mathrm{ET}_{c,\lambda}$ and $\mathrm{ET}_{\lambda}$ ) coincide with the ones defined in (Le and Nguyen, 2021). Furthermore, we have:

Proposition 5.3 (Lower bounds).

Assume that ${\mathbb{G}}$ is a tree and $\omega=\omega^{*}$ . The following hold:

i) $\mathrm{US}_{1}^{\alpha}(\mu,\nu)=d_{\alpha}(\mu,\nu)$ . Also, for $1\leq p\leq\infty$ , we have

[TABLE]

where $d_{\alpha}$ is defined in (Le and Nguyen, 2021, Eq. (9)).

ii) If $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ , then for $1\leq p\leq\infty$ , we have

[TABLE]

where ${\mathcal{W}}_{p}$ is the $p$ -order Wasserstein distance with cost $d_{\mathbb{G}}^{p}$ (the definition of ${\mathcal{W}}_{p}$ is recalled in Appendix §B.1). Moreover, equality is attained when $p=1$ .

We next prove the negative definiteness for UST. This important property allows us to build positive definite kernels upon UST, required for kernel-dependent machine learning algorithmic approaches.

Proposition 5.4.

Under the same assumptions as in Corollary 4.6, assume in addition that $w_{1}(z_{0})=w_{2}(z_{0})$ . Then, $\mathrm{US}_{p}^{\alpha}$ is negative definite on ${\mathcal{M}}({\mathbb{G}})$ for any $1\leq p\leq 2$ .

From Proposition 5.4 and by using (Berg et al., 1984, Theorem 3.2.2), we obtain that the kernel

$$k_{\mathrm{US}_{p}^{\alpha}}(\mu,\nu)\triangleq\exp\big(-t\,\mathrm{US}_{p}^{\alpha}(\mu,\nu)\big)$$

is positive definite on ${\mathcal{M}}({\mathbb{G}})$ for any given $t>0$ and $1\leq p\leq 2$ .

6 EXPERIMENTS

In this section, we illustrate the fast computation (i.e., closed-form solution) of the proposed UST and comparable performances of the proposed positive definite kernel associated to UST against other popular unbalanced transport baselines and their corresponding kernels. More concretely, we evaluate for measures with unequal mass on a given graph under two simulations: document classification and topological data analysis (TDA).

Document classification. We consider four traditional document datasets: TWITTER, RECIPE, CLASSIC, and AMAZON. Their characteristics are summarized in Figure 1. We represent each document as a measure by considering each word in the document as its support with a unit mass. Therefore, documents with different lengths have different total mass. We employ the same word embedding procedure as in (Le and Nguyen, 2021) to embed words into vectors in $\mathbb{R}^{300}$ .

TDA. We carry out two tasks: orbit recognition on the Orbit dataset and object shape recognition on the MPEG7 dataset. The Orbit dataset is synthesized as in (Adams et al., 2017) for the linked twist map, a discrete dynamical system used to model flows in DNA microarrays (Hertzsch et al., 2007). There are five classes of orbits in the dataset. For each class, we generated $1000$ orbits where each orbit contains $1000$ points. For the MPEG7 dataset (Latecki et al., 2000), we consider its 10-class subset where each class has $20$ samples as in (Le and Yamada, 2018). The characteristics of the considered Orbit and MPEG7 datasets are summarized in Figure 2. We use the same procedure as in (Le and Nguyen, 2021) to extract persistence diagrams (PD) for orbits and object shapes. PDs are multisets of points in $\mathbb{R}^{2}$ . Each point in a PD summarizes the lifespan (i.e., birth and death time) of a topological feature (e.g., connected component, ring, cavity). We represent each PD as a measure by regarding each $2$ -dimensional point in the PD as its support with a unit mass. Consequently, persistence diagrams having a different number of topological features are represented as measures with different total mass.

Notice that supports in document classification simulations are in high-dimensional spaces (i.e., in $\mathbb{R}^{300}$ ) while supports in TDA simulations are in low-dimensional spaces (i.e., in $\mathbb{R}^{2}$ ). Therefore, these simulations let us observe the effect of the dimension on the proposed UST and on the other unbalanced transport baselines. We next describe the graph settings (i.e., the assumed graph metric spaces for measures) considered in our experiments.

Graph settings. We use the same graph settings (i.e., ${\mathbb{G}}_{\text{Log}}$ and ${\mathbb{G}}_{\text{Sqrt}}$ ) employed in (Le et al., 2022, §5) for our simulations on document classification and TDA. For these graphs, we consider the following numbers of nodes: $M\!=\!10^{2},10^{3},10^{4},4\!\times\!10^{4}$ . We note that these graphs satisfy the assumptions in §2. Similar to the observations in (Le et al., 2022), each node in these graphs satisfies the root node condition, i.e., the uniqueness property of the shortest path, with high probability (see Appendix §B.2 for a further discussion).

Root node $z_{0}$ for UST. The UST is defined over graph ${\mathbb{G}}$ with a root node $z_{0}$ . From Definition 4.1, the root node $z_{0}$ imposes its own geometry by characterizing the graph derivative of functions on ${\mathbb{G}}$ . To alleviate this dependency, we follow the sliced approach in (Le et al., 2022) for Sobolev transport and average over different choices of the root node $z_{0}$ in graph ${\mathbb{G}}$ , which can be viewed as a sliced variant of UST.
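The averaging over root nodes can be sketched as below. This is an illustration under assumptions: the text does not specify how the root nodes are chosen, so uniform sampling without replacement is used here, and the names `sample_roots`, `sliced_discrepancy`, and the callback `discrepancy(z0, mu, nu)` are hypothetical.

```python
import random

def sample_roots(nodes, n_slices, seed=0):
    """Pick `n_slices` distinct candidate root nodes uniformly at random."""
    return random.Random(seed).sample(list(nodes), n_slices)

def sliced_discrepancy(discrepancy, roots, mu, nu):
    """Average a root-dependent discrepancy(z0, mu, nu) over the given roots."""
    values = [discrepancy(z0, mu, nu) for z0 in roots]
    return sum(values) / len(values)
```

Each slice requires its own shortest-path preprocessing, so the cost grows linearly with the number of slices.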

Baselines and experimental setup. We consider two typical UOT approaches for measures with unequal mass supported on a graph metric space as baselines: (i) the Sinkhorn-based UOT (Frogner et al., 2015; Chizat et al., 2018a) ( $\mathrm{S}_{\mathrm{UOT}}$ ) with a graph metric ground cost (Séjourné et al. (2019) derived a debiased version of $\mathrm{S}_{\mathrm{UOT}}$ which may be helpful in applications; the debiased version is also empirically indefinite and has the same complexity as $\mathrm{S}_{\mathrm{UOT}}$ ), and (ii) the distance $d_{\alpha}$ of EPT on a tree (Le and Nguyen, 2021, Eq. (9)) (see Proposition 5.3 for its relation with $\mathrm{US}_{p}^{\alpha}$ ), where the tree structures are randomly sampled from graph ${\mathbb{G}}$ . Based on the results in Lemma 4.4 and Proposition 5.2, and for simplicity, we consider $\alpha=0$ and $p=1$ (and $d_{0}$ for EPT on a tree as in (Le and Nguyen, 2021)); one may tune these parameters for further improvements. We further note that there are other approaches to document classification and TDA. However, comparing against them is not the purpose of our empirical simulations, which compare different unbalanced transports for measures with unequal mass on a graph under the same settings.

We apply the kernel approach in the form $\exp(-t\bar{d})$ , where $\bar{d}$ is a discrepancy for unbalanced measures on a graph and $t>0$ , with support vector machines (SVM) for the simulations on document classification and TDA. Note that kernels for $\mathrm{US}_{p}^{\alpha}$ and $d_{\alpha}$ are positive definite, but kernels for $\mathrm{S}_{\mathrm{UOT}}$ are empirically indefinite (see (Peyré and Cuturi, 2019, §8.3)). Similar to (Le and Nguyen, 2021), we regularize the Gram matrices for kernels with $\mathrm{S}_{\mathrm{UOT}}$ by adding a sufficiently large diagonal term.

For simplicity, we employ the same setup for the EPT problem as in (Le and Nguyen, 2021), i.e., using $\lambda\!=\!b\!=\!1$ for the EPT. Following Corollary 4.6 and Proposition 5.4, we consider the weight functions $w_{1}(x)\!=\!w_{2}(x)\!=\!a_{1}d_{{\mathbb{G}}}(z_{0},x)+a_{0}$ where $a_{1}\!=\!b$ and $a_{0}\!=\!1$ .
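A short check, under the stated choice $a_{1}=b$ and $a_{0}=1$ , that these weights meet the hypotheses used earlier (the $b$ -Lipschitz condition in Proposition 5.2 and $w_{1}(z_{0})=w_{2}(z_{0})$ in Proposition 5.4). It uses only the triangle inequality for $d_{\mathbb{G}}$ :

```latex
\begin{aligned}
|w_{i}(x)-w_{i}(y)| &= a_{1}\,\big|d_{\mathbb{G}}(z_{0},x)-d_{\mathbb{G}}(z_{0},y)\big|
 \leq a_{1}\,d_{\mathbb{G}}(x,y) = b\,d_{\mathbb{G}}(x,y), \qquad i=1,2,\\
w_{1}(z_{0}) &= w_{2}(z_{0}) = a_{1}\,d_{\mathbb{G}}(z_{0},z_{0})+a_{0} = a_{0} = 1.
\end{aligned}
```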

For kernel SVM, we use the same setting as in (Le and Nguyen, 2021). We randomly split each dataset into $70\%/30\%$ for training and test, with $10$ repeats. We use the 1-vs-1 strategy for SVM with multiclass data. Hyperparameters are chosen by cross validation. For the kernel hyperparameter, we choose $1/t$ from $\{q_{s},2q_{s},5q_{s}\}$ with $s\!=\!10,20,\dotsc,90$ where $q_{s}$ is the $s\%$ quantile of a random subset of the corresponding distances on training data. For the SVM regularization hyperparameter, we choose it from $\left\{0.01,0.1,1,10,100\right\}$ . For $\mathrm{S}_{\mathrm{UOT}}$ , we choose the entropic regularization from $\left\{0.01,0.1,1,10\right\}$ . The reported time consumption for each kernel matrix also includes the corresponding preprocessing, e.g., computing shortest paths on graph ${\mathbb{G}}$ for $\mathrm{US}_{p}^{\alpha}$ and $\mathrm{S}_{\mathrm{UOT}}$ , or sampling random tree structures from ${\mathbb{G}}$ for $d_{\alpha}$ of EPT on a tree.
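The candidate grid for $1/t$ can be generated as in the following sketch. The function name `bandwidth_candidates` is hypothetical, and the quantile convention is an assumption: the text does not say which interpolation rule is used, so the default of Python's `statistics.quantiles` is taken here.

```python
import statistics

def bandwidth_candidates(distances):
    """Candidate values of 1/t: {q_s, 2*q_s, 5*q_s} for s = 10, 20, ..., 90.

    `distances` is a sample of pairwise distances from the training data;
    q_s is its s% quantile (default 'exclusive' interpolation, an assumption).
    """
    deciles = statistics.quantiles(distances, n=10)  # 9 cut points: 10%, ..., 90%
    return sorted({c * q for q in deciles for c in (1, 2, 5)})
```

The kernel parameter is then $t=1/v$ for each candidate value $v$ , selected by cross validation together with the SVM regularization parameter.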

Results of SVM, time consumption and discussions. We illustrate the SVM results and time consumption for kernel matrices for document classification and TDA in Figure 1 and Figure 2 with $M\!=\!10^{4}$ for document datasets, $M\!=\!10^{3}$ for Orbit and $M\!=\!10^{2}$ for MPEG7 for graph ${\mathbb{G}}_{\text{Sqrt}}$ . The performances of kernels for our proposed UST compare favorably with other approaches (except $\mathrm{S}_{\mathrm{UOT}}$ on RECIPE). Additionally, the time consumption of $\mathrm{US}_{1}^{0}$ and $d_{0}$ is several orders of magnitude lower than that of $\mathrm{S}_{\mathrm{UOT}}$ . Recall that kernels for $\mathrm{S}_{\mathrm{UOT}}$ are indefinite, which may affect performances of $\mathrm{S}_{\mathrm{UOT}}$ on some datasets (e.g., Orbit, TWITTER). In Figure 3, we illustrate the effects of the number of slices (i.e., the number of root nodes used for averaging) for $\mathrm{US}_{1}^{0}$ and $d_{0}$ for TDA. Generally, performances of those approaches improve with more slices, at the cost of a higher time consumption. We observe that $10$ slices give a good trade-off in applications. Further empirical results can be found in Appendix §B.3, e.g., for various graph structures, graph sizes $M$ , and different orders $p$ of UST.

7 CONCLUSION

In this work, we proposed unbalanced Sobolev transport (UST) for measures with unequal mass on a graph. UST is the first variant of UOT having a closed-form formula for a fast computation. Additionally, UST is negative definite, which allows building positive definite kernels, required for kernel-dependent frameworks. Since UST exploits the graph metric structure of supports, it may be restricted to applications with prior graph structures, or applications where one can build graphs from supports. On the other hand, we have not foreseen any negative societal impacts of our work.

Acknowledgements

We thank anonymous reviewers and area chairs for their comments. KF has been supported in part by Grant-in-Aid for Transformative Research Areas (A) 22H05106. The research of TN is supported in part by a grant from the Simons Foundation ( $\#318995$ ). TL gratefully acknowledges the support of JSPS KAKENHI Grant number 20K19873. Finally, this research was enabled in part by computational support provided by Makoto Yamada.

Appendix A PROOFS AND ADDITIONAL THEORETICAL RESULTS

In this section, we give detailed proofs for the theoretical results in the main manuscript. We also provide some additional results for the unbalanced Sobolev transport (UST).

A.1 Further Theoretical Results

We include here some additional results for the transport problems and the unbalanced Sobolev transport $\mathrm{US}_{p}^{\alpha}$ .

A.1.1 The Connection between Problem (3) and Problem (4)

We show the connection between problem (3) and problem (4) for EPT on a graph by following a reasoning similar to that for EPT on a tree (Le and Nguyen, 2021). It is a direct extension of the results in (Le and Nguyen, 2021).

Theorem A.1.

Let $H(\lambda)\triangleq-\mathrm{ET}_{c,\lambda}(\mu,\nu)$ for $\lambda\in\mathbb{R}$ , and denote

[TABLE]

for the set of all subgradients of $H$ at $\lambda$ . Also, set $\partial H(\mathbb{R})\triangleq\cup_{\lambda\in\mathbb{R}}\partial H(\lambda)$ . Then, we have

i) $H$ is a convex function on $\mathbb{R}$ , and

[TABLE]

where we write $\Gamma^{0}(\lambda)$ for the set of all optimal plans $\gamma$ for $\mathrm{ET}_{c,\lambda}(\mu,\nu)$ . Also, if $\lambda_{1}<\lambda_{2}$ , then $m_{1}\leq m_{2}$ for every $m_{1}\in\partial H(\lambda_{1})$ and $m_{2}\in\partial H(\lambda_{2})$ .

ii) $H$ is differentiable at $\lambda$ if and only if every optimal plan in $\Gamma^{0}(\lambda)$ has the same mass. When this happens, we also have

[TABLE]

for any $\gamma\in\Gamma^{0}(\lambda)$ .

iii)

If there exists a constant $M>0$ such that

[TABLE]

for all $x,y\in{\mathbb{G}}$ , then $\partial H(\mathbb{R})=[0,b\,\bar{m}]$ . Moreover,

[TABLE]

when $\lambda<-M$ , and $H^{\prime}(\lambda)=b\,\bar{m}$ for $\lambda>\|c\|_{L^{\infty}({\mathbb{G}}\times{\mathbb{G}})}$ .

The proof is placed in §A.2.1.

For any $m\in[0,\bar{m}]$ , part iii) of Theorem A.1 implies that there exists $\lambda\in\mathbb{R}$ such that $b\,m\in\partial H(\lambda)$ . It then follows from part i) of this theorem that $m=\gamma^{*}({\mathbb{G}}\times{\mathbb{G}})$ for some $\gamma^{*}\in\Gamma^{0}(\lambda)$ . It is also clear that this $\gamma^{*}$ is an optimal plan for $\mathrm{W}_{c,m}(\mu,\nu)$ , and

[TABLE]

Thus solving the auxiliary problem (4) gives us a solution to the original problem (3). When $H$ is differentiable, the relation between $m$ and $\lambda$ is given explicitly as

[TABLE]

Note that the above selection of $\lambda$ is unique only if the function $H$ is strictly convex. Nevertheless, it enjoys the following monotonicity regardless of the uniqueness: if $m_{1}<m_{2}$ , then $\lambda_{1}\leq\lambda_{2}$ . Indeed, we have $m_{1}=\gamma^{1}({\mathbb{G}}\times{\mathbb{G}})$ and $m_{2}=\gamma^{2}({\mathbb{G}}\times{\mathbb{G}})$ for some $\gamma^{1}\in\Gamma^{0}(\lambda_{1})$ and $\gamma^{2}\in\Gamma^{0}(\lambda_{2})$ . Since $\gamma^{1}({\mathbb{G}}\times{\mathbb{G}})<\gamma^{2}({\mathbb{G}}\times{\mathbb{G}})$ , one has $\lambda_{1}\leq\lambda_{2}$ by i) of Theorem A.1.

A.1.2 $W^{1,\infty}({\mathbb{G}},\omega^{*})$ versus Lipschitz space

We describe the connection between the Sobolev space $W^{1,\infty}({\mathbb{G}},\omega^{*})$ and the space of Lipschitz continuous functions (the definition of the length measure $\omega^{*}$ is reviewed in §B.1.1).

Lemma A.2.

Let $\omega^{*}$ be the length measure on graph ${\mathbb{G}}$ , and let $f:{\mathbb{G}}\to\mathbb{R}$ be a function. We have:

i) If $|f(x)-f(y)|\leq b\,d_{\mathbb{G}}(x,y)$ , $\forall x,\,y\in{\mathbb{G}}$ , then $f\in W^{1,\infty}({\mathbb{G}},\omega^{*})$ with $\|f^{\prime}\|_{L^{\infty}({\mathbb{G}},\omega^{*})}\leq b$ .

ii) Assume in addition that ${\mathbb{G}}$ is a tree. Then, $f\in W^{1,\infty}({\mathbb{G}},\omega^{*})$ with $\|f^{\prime}\|_{L^{\infty}({\mathbb{G}},\omega^{*})}\leq b$ implies that $|f(x)-f(y)|\leq b\,d_{\mathbb{G}}(x,y)$ for every $x,\,y\in{\mathbb{G}}$ .

The proof is placed in §A.2.2.

Remark A.3.

Our proof of Lemma A.2 (in §A.2.2) also shows that the result in part ii) of Lemma A.2 in fact holds for every measure $\omega$ . Precisely, let $\omega$ be a nonnegative Borel measure on a tree ${\mathbb{G}}$ . Then $f\in W^{1,\infty}({\mathbb{G}},\omega)$ with $\|f^{\prime}\|_{L^{\infty}({\mathbb{G}},\omega)}\leq b$ implies that $|f(x)-f(y)|\leq b\,\omega([x,y])$ for every $x,\,y\in{\mathbb{G}}$ .

A.1.3 Comparison between Sobolev Spaces with Different Exponents

We derive a comparison between UST with different exponents $p$ ; its proof is a direct consequence of our closed-form formula given in Proposition 4.5.

Proposition A.4 (Relation for different $p$ ).

Assume that $\omega$ is a nonnegative Borel measure on ${\mathbb{G}}$ . Then for any $1\leq p\leq q\leq\infty$ and $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ , we have

[TABLE]

where $\Theta$ is the constant defined by (11).

Proof of Proposition A.4.

The case $p=q$ is trivial, so let us consider $1\leq p<q\leq\infty$ . Then by using Proposition 4.5 and Hölder’s inequality, we obtain

[TABLE]

∎
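The Hölder step in this proof can be sketched in generic form. The display below is a standard fact, not a reproduction of the inequality stated in Proposition A.4: for a measurable function $g$ on ${\mathbb{G}}$ , a finite measure $\omega$ , and $1\leq p<q<\infty$ , applying Hölder's inequality to $|g|^{p}\cdot 1$ with exponents $q/p$ and $q/(q-p)$ and then taking the $p$ -th root gives

```latex
\Big(\int_{\mathbb{G}} |g|^{p}\,\mathrm{d}\omega\Big)^{\frac{1}{p}}
\leq \Big(\int_{\mathbb{G}} |g|^{q}\,\mathrm{d}\omega\Big)^{\frac{1}{q}}\,
     \omega({\mathbb{G}})^{\frac{1}{p}-\frac{1}{q}} .
```

In the present setting one would take $g(x)=\mu(\Lambda(x))-\nu(\Lambda(x))$ , assuming this is the integrand in the closed-form formula of Proposition 4.5; the case $q=\infty$ follows in the same way with the essential supremum.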

A.1.4 Lower Bound for $\mathrm{US}_{p}^{0}$

We derive a lower bound for $\mathrm{US}_{p}^{0}$ which is a generalization of the result for $p=1$ in Proposition 5.2.

Proposition A.5 (Lower bound for $\mathrm{US}_{p}^{0}$ ).

Let $\omega^{*}$ be the length measure on ${\mathbb{G}}$ , and assume that $w_{1}$ and $w_{2}$ are $b$ -Lipschitz w.r.t. $d_{\mathbb{G}}$ . Then by taking $\omega=\omega^{*}$ , we have for every $1\leq p\leq\infty$ that

[TABLE]

for every $\mu,\,\nu\in{\mathcal{M}}({\mathbb{G}})$ . Here $\Theta$ is the constant defined by (11).

Proof.

This is a consequence of Corollary 3.2, Lemma 4.4, and Proposition A.4. ∎

A.1.5 The Special Case of Balanced Mass

Observe that for the case $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ , the constraint $f(z_{0})\in I_{\alpha}$ in the definition of $\mathbb{U}_{p^{\prime}}^{\alpha}$ is redundant. Indeed, we have:

Lemma A.6.

Let $\omega$ be a nonnegative Borel measure on ${\mathbb{G}}$ . Assume that $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ satisfy $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ . Then,

[TABLE]

In particular, $\mathrm{US}_{p}^{\alpha}(\mu,\nu)$ is independent of the parameters $\alpha$ , $\lambda$ and the weights $w_{1}$ , $w_{2}$ .

Proof.

This follows from the fact that Definition 4.3 is unchanged in the case $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ when the critic function $f$ is translated by a constant. ∎

From Lemma A.6, we see that for the case $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ , our proposed unbalanced Sobolev transport $\mathrm{US}_{p}^{\alpha}$ with $b=1$ coincides with the balanced Sobolev transport ${\mathcal{S}}_{p}$ (defined in (Le et al., 2022, Definition 3.2)).

A.1.6 Infinite Divisibility for Unbalanced Sobolev Transport Kernel

Recall that given $t>0$ and $1\leq p\leq 2$ , the unbalanced Sobolev transport kernel $k_{\text{US}_{p}^{\alpha}}(\mu,\nu)\triangleq\exp(-t\text{US}_{p}^{\alpha}(\mu,\nu))$ is positive definite (see §5 and Proposition 5.4).

For $i\in\mathbb{N}^{*}$ , the kernel $k_{\text{US}_{pi}^{\alpha}}(\mu,\nu)\triangleq\exp(-\frac{t}{i}\text{US}_{p}^{\alpha}(\mu,\nu))$ is positive definite. Additionally, $k_{\text{US}_{p}^{\alpha}}(\mu,\nu)=\left[k_{\text{US}_{pi}^{\alpha}}(\mu,\nu)\right]^{i}$ . Therefore, $k_{\text{US}_{p}^{\alpha}}$ is infinitely divisible following (Berg et al., 1984, Definition 2.6 in §3).

Hence, one does not need to recompute the Gram matrix of the unbalanced Sobolev transport kernel $k_{\text{US}_{p}^{\alpha}}$ from scratch for different values of $t$ . Indeed, it suffices to compute the Gram matrix of $k_{\text{US}_{p}^{\alpha}}$ once for some fixed $t$ and leverage its infinite divisibility for other values of $t$ .
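A minimal sketch of this reuse, assuming the Gram matrix is stored as a list of lists and that `D` holds precomputed pairwise discrepancies (the names `gram_matrix` and `rescale_gram` are hypothetical): since $\exp(-t_{2}d)=\exp(-t_{1}d)^{t_{2}/t_{1}}$ , the Gram matrix for a new $t$ is an elementwise power of the old one.

```python
import math

def gram_matrix(D, t):
    """Gram matrix of the kernel exp(-t * d) from a distance matrix D."""
    return [[math.exp(-t * d) for d in row] for row in D]

def rescale_gram(K, t_old, t_new):
    """Gram matrix for t_new obtained from the one for t_old by an
    elementwise power, without touching the distances again."""
    r = t_new / t_old
    return [[k ** r for k in row] for row in K]
```

For example, `rescale_gram(gram_matrix(D, 0.5), 0.5, 2.0)` agrees with `gram_matrix(D, 2.0)` up to floating-point rounding.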

A.2 Detailed Proofs

In this section, we give detailed proofs for our theoretical results.

A.2.1 Proof of Theorem A.1

Proof of Theorem A.1.

We employ a reasoning similar to that for EPT on a tree (Le and Nguyen, 2021) to prove the relation between problem (3) and problem (4) for EPT on a graph as follows:

i) Note that $\lambda\mapsto\mathrm{ET}_{c,\lambda}(\mu,\nu)$ is a concave function since it is the infimum of a family of concave functions in $\lambda$ . Therefore, $H$ is convex on $\mathbb{R}$ . In particular, $H$ is differentiable almost everywhere on $\mathbb{R}$ .

Let $\lambda\in\mathbb{R}$ , recall the definition of $\mathcal{C}_{\lambda}(\gamma)$ in Equation (3). Then for any $\gamma\in\Gamma^{0}(\lambda)$ , we have

[TABLE]

This implies that

[TABLE]

We next show that the opposite inclusion is also true, i.e., $\big{\{}b\,\gamma({\mathbb{G}}\times{\mathbb{G}}):\gamma\in\Gamma^{0}(\lambda)\big{\}}=\partial H(\lambda)$ . This obviously holds if $\partial H(\lambda)$ is a singleton, which is the case, for example, when $H$ is differentiable at $\lambda$ . Hence we only need to consider $\lambda$ for which the convex set $\partial H(\lambda)$ has more than one element.

Let $m\in\partial H(\lambda)$ , then $m$ can be expressed as a convex combination of extreme points $m_{1},\dotsc,m_{N}$ of $\partial H(\lambda)$ , i.e., $m=\sum_{i=1}^{N}t_{i}m_{i}$ with $0\leq t_{i}\leq 1$ and $\sum_{i=1}^{N}t_{i}=1$ . As $m_{i}$ is an extreme point of $\partial H(\lambda)$ , there exists a sequence $\lambda_{n}\to\lambda$ such that $\lambda_{n}$ is a differentiable point of $H$ and $H^{\prime}(\lambda_{n})\to m_{i}$ .

Let $\gamma^{n}\in\Gamma^{0}(\lambda_{n})$ , then $b\,\gamma^{n}({\mathbb{G}}\times{\mathbb{G}})=H^{\prime}(\lambda_{n})\to m_{i}$ . By compactness, there exists a subsequence $\{\gamma^{n_{k}}\}$ and $\tilde{\gamma}^{i}\in\Pi_{\leq}(\mu,\nu)$ such that $\gamma^{n_{k}}\to\tilde{\gamma}^{i}$ weakly. It follows that $\gamma^{n_{k}}({\mathbb{G}}\times{\mathbb{G}})\to\tilde{\gamma}^{i}({\mathbb{G}}\times{\mathbb{G}})$ , and hence we must have $b\,\tilde{\gamma}^{i}({\mathbb{G}}\times{\mathbb{G}})=m_{i}$ . We have

[TABLE]

and for any $\gamma\in\Gamma^{0}(\lambda)$ , there holds

[TABLE]

We thus deduce that $\lim_{k\to\infty}\mathcal{C}_{\lambda_{n_{k}}}(\gamma^{n_{k}})=\mathrm{ET}_{c,\lambda}(\mu,\nu)$ . These together with the lower semicontinuity of $\mathcal{C}_{\lambda}$ give

[TABLE]

Therefore, $\tilde{\gamma}^{i}\in\Gamma^{0}(\lambda)$ with mass $b\,\tilde{\gamma}^{i}({\mathbb{G}}\times{\mathbb{G}})=m_{i}$ . Due to the convexity of $\Gamma^{0}(\lambda)$ , we have $\bar{\gamma}:=\sum_{i=1}^{N}t_{i}\tilde{\gamma}^{i}\in\Gamma^{0}(\lambda)$ with $b\,\bar{\gamma}({\mathbb{G}}\times{\mathbb{G}})=\sum_{i=1}^{N}t_{i}m_{i}=m$ . That is,

[TABLE]

and we thus infer that $\big{\{}b\,\gamma({\mathbb{G}}\times{\mathbb{G}}):\gamma\in\Gamma^{0}(\lambda)\big{\}}=\partial H(\lambda)$ for all $\lambda\in\mathbb{R}$ .

In order to prove the second part of i), let $\gamma\in\Gamma^{0}(\lambda_{1})$ and $\tilde{\gamma}\in\Gamma^{0}(\lambda_{2})$ be arbitrary. We have

[TABLE]

Hence by combining with (13), we deduce that

[TABLE]

which yields $\gamma({\mathbb{G}}\times{\mathbb{G}})\leq\tilde{\gamma}({\mathbb{G}}\times{\mathbb{G}})$ . This together with the above characterization of $\partial H(\lambda)$ implies the second part of i).

ii) If $H$ is differentiable at $\lambda$, then $\partial H(\lambda)$ is a singleton. Since $\partial H(\lambda)=\big{\{}b\,\gamma({\mathbb{G}}\times{\mathbb{G}}):\gamma\in\Gamma^{0}(\lambda)\big{\}}$ by i), we infer that the mass $\gamma({\mathbb{G}}\times{\mathbb{G}})$ must be the same for every $\gamma\in\Gamma^{0}(\lambda)$.

Next assume that every element in $\Gamma^{0}(\lambda)$ has the same mass, say $m$ . For $\delta\neq 0$ , let $\gamma^{\lambda+\delta}\in\Gamma^{0}(\lambda+\delta)$ and $m(\lambda+\delta)\triangleq\gamma^{\lambda+\delta}({\mathbb{G}}\times{\mathbb{G}})$ . Then, we claim that

[TABLE]

Assume the claim for the moment, and let $\delta>0$ . Then, as in (13)–(A.2.1), we have

[TABLE]

It follows that

[TABLE]

This together with claim (15) gives $\lim_{\delta\to 0^{+}}\frac{\mathrm{ET}_{c,\lambda+\delta}(\mu,\nu)-\mathrm{ET}_{c,\lambda}(\mu,\nu)}{\delta}=-bm$ . By the same argument, we also have $\lim_{\delta\to 0^{-}}\frac{\mathrm{ET}_{c,\lambda+\delta}(\mu,\nu)-\mathrm{ET}_{c,\lambda}(\mu,\nu)}{\delta}=-bm$ . Thus, we infer that $H$ is differentiable at $\lambda$ with $H^{\prime}(\lambda)=bm$ . Therefore, it remains to prove claim (15).

Indeed, by compactness there exists a subsequence, still labeled by $\gamma^{\lambda+\delta}$ , and $\gamma\in\Pi_{\leq}(\mu,\nu)$ such that $\gamma^{\lambda+\delta}\to\gamma$ weakly as $\delta\to 0$ . As in i), we can show that $\gamma\in\Gamma^{0}(\lambda)$ . Then, as the mass functional is weakly continuous, we obtain $m(\lambda+\delta)=\gamma^{\lambda+\delta}({\mathbb{G}}\times{\mathbb{G}})\to\gamma({\mathbb{G}}\times{\mathbb{G}})=m$ . We in fact have shown that any subsequence of $\{m(\lambda+\delta)\}_{\delta}$ has a further subsequence converging to the same number $m$ . Therefore, the full sequence $\{m(\lambda+\delta)\}_{\delta}$ must converge to $m$ , and hence (15) is proved.

iii) For any $\lambda\in\mathbb{R}$ , we have by i) that $\partial H(\lambda)=\big{\{}b\,\gamma({\mathbb{G}}\times{\mathbb{G}}):\gamma\in\Gamma^{0}(\lambda)\big{\}}\subset[0,b\,\bar{m}]$ . Thus, we only need to prove $[0,b\,\bar{m}]\subset\partial H(\mathbb{R})$ . First, note that as $\partial H(\lambda)\subset\mathbb{R}$ is a compact and convex set, it must be a finite and closed interval. Therefore, if we let

[TABLE]

then it follows from ii) that $\partial H(\lambda)=\big{[}b\,\gamma^{\lambda}_{min}({\mathbb{G}}\times{\mathbb{G}}),b\,\gamma^{\lambda}_{max}({\mathbb{G}}\times{\mathbb{G}})\big{]}$ for every $\lambda\in\mathbb{R}$ . From Equation (3), it is clear that $\partial H(\lambda)=\{0\}$ for $\lambda$ negative enough. Indeed, if we take $\lambda<-M$ , then as $w_{1}(x)+w_{2}(y)\leq b\,[c(x,y)+M]$ , we have $0<b\,[c(x,y)-\lambda]-w_{1}(x)-w_{2}(y)$ for all $x,y\in{\mathbb{G}}$ . Then, we obtain from Equation (3) that $\mathcal{C}_{\lambda}(0)\leq\mathcal{C}_{\lambda}(\gamma)$ for every $\gamma\in\Pi_{\leq}(\mu,\nu)$ and the strict inequality holds if $\gamma\neq 0$ . Thus, $\Gamma^{0}(\lambda)=\{0\}$ which gives $\partial H(\lambda)=\{0\}$ and $H(\lambda)=-\int_{\mathbb{G}}w_{1}\mu(\mathrm{d}x)-\int_{\mathbb{G}}w_{2}\nu(\mathrm{d}x)$ .

We next show that $\partial H(\lambda)=\{b\,\bar{m}\}$ for $\lambda$ positive enough. Since $c(x,y)$ is bounded due to its continuity on ${\mathbb{G}}\times{\mathbb{G}}$, we can choose $\lambda\in\mathbb{R}$ such that $c(x,y)-\lambda<0$ for all $x,y\in{\mathbb{G}}$. Let $\gamma\in\Gamma^{0}(\lambda)$. We claim that either $\gamma_{1}=\mu$ or $\gamma_{2}=\nu$. Indeed, suppose otherwise. Then $\gamma_{1}(A_{0})<\mu(A_{0})$ and $\gamma_{2}(B_{0})<\nu(B_{0})$ for some Borel sets $A_{0},B_{0}\subset{\mathbb{G}}$. Let $\tilde{\gamma}:=\gamma+[(\mu-\gamma_{1})\chi_{A_{0}}]\otimes[(\nu-\gamma_{2})\chi_{B_{0}}]$. Then, for any Borel set $A\subset{\mathbb{G}}$ we have

[TABLE]

Likewise, $\tilde{\gamma}_{2}(B)\leq\nu(B)$ for any Borel set $B\subset{\mathbb{G}}$ . Thus $\tilde{\gamma}\in\Pi_{\leq}(\mu,\nu)$ . On the other hand, it is clear from (3) and the facts $\gamma_{1}\leq\tilde{\gamma}_{1}$ , $\gamma_{2}\leq\tilde{\gamma}_{2}$ , and $c-\lambda<0$ that $\mathcal{C}_{\lambda}(\tilde{\gamma})<\mathcal{C}_{\lambda}(\gamma)$ . This is impossible and so the claim is proved. That is, either $\gamma_{1}=\mu$ or $\gamma_{2}=\nu$ . It follows that $\gamma({\mathbb{G}}\times{\mathbb{G}})=\bar{m}$ for every $\gamma\in\Gamma^{0}(\lambda)$ , and hence $\partial H(\lambda)=\{b\,\bar{m}\}$ due to i). This also means that $H$ is differentiable at $\lambda$ with $H^{\prime}(\lambda)=b\,\bar{m}$ .

Therefore, it remains to show that

[TABLE]

Assume by contradiction that there exists $m\in(0,b\,\bar{m})$ such that $m\not\in\partial H(\lambda)$ for every $\lambda\in\mathbb{R}$ . For convenience, we adopt the following notation: for sets $A,B\subset\mathbb{R}$ and $r\in\mathbb{R}$ , we write $A<r$ if $a<r$ for every $a\in A$ , and $A<B$ if $a<b$ for every $a\in A$ and $b\in B$ . Let us consider the following two sets

[TABLE]

Then $\lambda\in S_{1}$ if $\lambda$ is negative enough, and $\lambda\in S_{2}$ if $\lambda$ is positive enough. For any $\lambda_{1}\in S_{1}$ and $\lambda_{2}\in S_{2}$ , we have $\partial H(\lambda_{1})<m<\partial H(\lambda_{2})$ , and hence $\lambda_{1}<\lambda_{2}$ by the monotonicity in i). That is, $S_{1}<S_{2}$ and so we obtain

[TABLE]

If $\lambda^{*}<\lambda^{**}$ , then for any $\lambda\in(\lambda^{*},\lambda^{**})$ we have $\lambda\not\in S_{1}$ and $\lambda\not\in S_{2}$ . Therefore, $\partial H(\lambda)\not<m$ and $\partial H(\lambda)\not>m$ . Hence, we can find $m_{1},m_{2}\in\partial H(\lambda)$ such that $m_{1}\geq m$ and $m_{2}\leq m$ . Thus, $m\in[m_{2},m_{1}]\subset\partial H(\lambda)$ due to the convexity of the set $\partial H(\lambda)$ . This contradicts our hypothesis, and we conclude that $\lambda^{*}=\lambda^{**}$ .

We next select sequences $\{\lambda^{1}_{n}\}\subset S_{1}$ and $\{\lambda^{2}_{n}\}\subset S_{2}$ such that $\lambda^{1}_{n}\to\lambda^{*}$ and $\lambda^{2}_{n}\to\lambda^{**}=\lambda^{*}$ . For each $n$ , let

[TABLE]

By compactness, there exist subsequences, still labeled as $\{\gamma^{n}_{min}\}$ and $\{\gamma^{n}_{max}\}$ , and $\gamma^{*},\gamma^{**}\in\Pi_{\leq}(\mu,\nu)$ such that $\gamma^{n}_{min}\to\gamma^{*}$ weakly and $\gamma^{n}_{max}\to\gamma^{**}$ weakly. By arguing exactly as in i), we then obtain $\gamma^{*},\gamma^{**}\in\Gamma^{0}(\lambda^{*})$ , $\gamma^{n}_{min}({\mathbb{G}}\times{\mathbb{G}})\to\gamma^{*}({\mathbb{G}}\times{\mathbb{G}})$ , and $\gamma^{n}_{max}({\mathbb{G}}\times{\mathbb{G}})\to\gamma^{**}({\mathbb{G}}\times{\mathbb{G}})$ . As $b\,\gamma^{n}_{min}({\mathbb{G}}\times{\mathbb{G}})<m$ due to $\lambda^{1}_{n}\in S_{1}$ , we must have $b\,\gamma^{*}({\mathbb{G}}\times{\mathbb{G}})\leq m$ . Likewise, we have $b\,\gamma^{**}({\mathbb{G}}\times{\mathbb{G}})\geq m$ as $b\,\gamma^{n}_{max}({\mathbb{G}}\times{\mathbb{G}})>m$ for all $n$ . Hence, $m\in[b\,\gamma^{*}({\mathbb{G}}\times{\mathbb{G}}),b\,\gamma^{**}({\mathbb{G}}\times{\mathbb{G}})]$ . Since $\gamma^{*},\gamma^{**}\in\Gamma^{0}(\lambda^{*})$ , we infer that $m\in\partial H(\lambda^{*})$ . This is a contradiction and the proof is complete. We note that since $\lambda^{1}_{n}\leq\lambda^{*}\leq\lambda^{2}_{n}$ , we have from the monotonicity in i) that

[TABLE]

for every $\gamma\in\Gamma^{0}(\lambda^{*})$ . By sending $n$ to infinity, it follows that $\gamma^{*}({\mathbb{G}}\times{\mathbb{G}})\leq\gamma({\mathbb{G}}\times{\mathbb{G}})\leq\gamma^{**}({\mathbb{G}}\times{\mathbb{G}})$ for every $\gamma\in\Gamma^{0}(\lambda^{*})$ . That is, $\gamma^{*}=\gamma^{\lambda^{*}}_{min}$ and $\gamma^{**}=\gamma^{\lambda^{*}}_{max}$ . ∎

A.2.2 Proof of Lemma A.2

Proof of Lemma A.2.

Let us define

[TABLE]

and

[TABLE]

i) The statement of this part is equivalent to the inclusion $A\subset B$. Let $f\in A$. Then $f$ is continuous on ${\mathbb{G}}$, and

[TABLE]

On each edge $e$ and similar to the real line, the Lipschitz condition (17) implies that there exists a function $h_{e}:e\to\mathbb{R}$ with the following properties: $|h_{e}(z)|\leq b$ for $\omega^{*}$ -a.e. $z\in e$ , and

[TABLE]

where we recall that $\langle y,x\rangle$ denotes the line segment in $\mathbb{R}^{n}$ connecting $y$ and $x$ (noting that for a general graph, $\langle y,x\rangle$ might not be the same as the shortest path $[y,x]$). Let us glue them together by taking $h(z)=h_{e}(z)$ if $z$ is an interior point of an edge $e$. Then $h:{\mathbb{G}}\to\mathbb{R}$ is a function satisfying $|h(z)|\leq b$ for $\omega^{*}$-a.e. $z\in{\mathbb{G}}$. That is, $h\in L^{\infty}({\mathbb{G}},\omega^{*})$ with $\|h\|_{L^{\infty}({\mathbb{G}},\omega^{*})}\leq b$. Moreover, for every edge $e$ in ${\mathbb{G}}$ we have

[TABLE]

Now let $x\in{\mathbb{G}}$ be arbitrary. Let us break the unique shortest path $[z_{0},x]$ connecting $z_{0}$ and $x$ into sub line segments $\langle z_{0},y_{0}\rangle,\,\langle y_{0},y_{1}\rangle,...,\langle y_{m-1},y_{m}\rangle,\,\langle y_{m},x\rangle$ such that each of them is contained in exactly one edge. Then by applying (18) to each of these sub line segments, we obtain

[TABLE]

Thus, we have proved that

[TABLE]

Therefore, according to Definition 4.1 we conclude that $f\in W^{1,\infty}({\mathbb{G}},\omega^{*})$ with $\|f^{\prime}\|_{L^{\infty}({\mathbb{G}},\omega^{*})}\leq b$. It then follows that $f\in B$, and hence $A\subset B$ as desired.

ii) Assume that ${\mathbb{G}}$ is a tree. We can and will assume that $z_{0}$ is the root of this tree. We need to show that $B\subset A$ . For this, let $f\in B$ . Then by Definition 4.1, we have $\|f^{\prime}\|_{L^{\infty}({\mathbb{G}},\omega^{*})}\leq b$ and

[TABLE]

Thus for any two points $x,y\in{\mathbb{G}}$ , we obtain

[TABLE]

Let $\hat{z}$ be the deepest node on the tree that belongs to both the path $[z_{0},x]$ and the path $[z_{0},y]$. Due to the tree structure, joining the path $[x,\hat{z}]$ with the path $[\hat{z},y]$ gives the shortest path $[x,y]$ connecting the points $x$ and $y$. These together with (19) imply that

[TABLE]

By the property of the length measure given in Lemma B.2, we then infer that $|f(x)-f(y)|\leq b\,d_{\mathbb{G}}(x,y)$ for every $x,y\in{\mathbb{G}}$ . It follows that $f\in A$ . Therefore, we have proved that $B\subset A$ as desired. ∎
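The tree case of Lemma A.2 can be illustrated numerically. The following is a minimal sketch, assuming a small hypothetical rooted tree (the arrays `parent`, `length`, and the random slopes are illustrative and not taken from the paper): a function built by integrating an edge-wise derivative bounded by $b$ along the path from $z_{0}$ is checked to be $b$-Lipschitz with respect to the tree metric. Only node pairs are tested, which is a simplification of the statement for all points of ${\mathbb{G}}$.

```python
import itertools
import random

# Hypothetical rooted tree: parent[i] is the parent of node i; node 0 plays the role of z0.
parent = [-1, 0, 0, 1, 1, 2]
# length[i] is the length of the edge joining node i to parent[i] (unused for the root).
length = [0.0, 1.0, 2.5, 0.7, 1.2, 3.0]
b = 2.0

def edges_to_root(i):
    """Edges on the path [z0, i], each edge labelled by its child node."""
    edges = set()
    while parent[i] != -1:
        edges.add(i)
        i = parent[i]
    return edges

def tree_distance(x, y):
    """d_G(x, y): total length of the edges lying on exactly one of [z0, x] and [z0, y]."""
    return sum(length[e] for e in edges_to_root(x) ^ edges_to_root(y))

random.seed(0)
# A constant derivative h_e on each edge with |h_e| <= b, so f is piecewise linear.
slope = [0.0] + [random.uniform(-b, b) for _ in parent[1:]]

def f(x, f_root=0.3):
    """f(x) = f(z0) + integral of f' over [z0, x], evaluated at a node x."""
    return f_root + sum(slope[e] * length[e] for e in edges_to_root(x))

nodes = range(len(parent))
max_ratio = max(abs(f(x) - f(y)) / tree_distance(x, y)
                for x, y in itertools.combinations(nodes, 2))
print(max_ratio <= b)
```

The symmetric difference of the two root paths is exactly the path $[x,\hat{z}]\cup[\hat{z},y]$ used in the proof, which is why the shared edges above $\hat{z}$ cancel in $f(x)-f(y)$.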

A.2.3 Proof of Theorem 3.1

The proof of Theorem 3.1 is based on two auxiliary lemmas. Before stating these lemmas, let us describe the setting and the associated problem.

First, in order to investigate problem (4), we recast it as the standard complete OT problem by using an observation in (Caffarelli and McCann, 2010). More precisely, let $\hat{s}$ be a point outside the graph ${\mathbb{G}}$ and consider the set $\hat{\mathbb{G}}:={\mathbb{G}}\cup\{\hat{s}\}$. We next extend the cost function to $\hat{\mathbb{G}}\times\hat{\mathbb{G}}$ as follows

[TABLE]

The measures $\mu,\nu$ are extended accordingly by adding a Dirac mass at the isolated point $\hat{s}$: $\hat{\mu}=\mu+\nu({\mathbb{G}})\delta_{\hat{s}}$ and $\hat{\nu}=\nu+\mu({\mathbb{G}})\delta_{\hat{s}}$. As $\hat{\mu},\hat{\nu}$ have the same total mass on $\hat{\mathbb{G}}$, we can consider the standard complete OT problem between $\hat{\mu},\hat{\nu}$ as follows

[TABLE]

where

[TABLE]

This reformulation, based on the observation in (Caffarelli and McCann, 2010), transforms the unbalanced transport problem (EPT) on a graph into a corresponding standard complete OT problem. Therefore, we can not only bypass the issues coming from the unbalanced setting, but also rely on the many results available for OT in the standard setting.
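For measures supported on finitely many nodes, the extension can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `extend_problem` and the example data are hypothetical, and the entries of the extended cost are the ones used later in the proof of Theorem 3.1, namely $\hat{c}(x,y)=b[c(x,y)-\lambda]$ on ${\mathbb{G}}\times{\mathbb{G}}$, $\hat{c}(x,\hat{s})=w_{1}(x)$, $\hat{c}(\hat{s},y)=w_{2}(y)$, and $\hat{c}(\hat{s},\hat{s})=0$.

```python
def extend_problem(mu, nu, c, w1, w2, b, lam):
    """Append a dummy point s_hat (stored at the last index) to a discrete problem.

    mu, nu : node masses; c : ground-cost matrix; w1, w2 : node weights.
    Returns the extended measures and the extended cost matrix.
    """
    n = len(mu)
    mu_hat = list(mu) + [sum(nu)]  # mu_hat = mu + nu(G) * delta_{s_hat}
    nu_hat = list(nu) + [sum(mu)]  # nu_hat = nu + mu(G) * delta_{s_hat}
    # Rows for points of G: b*(c - lam) inside G, then w1(x) for the pair (x, s_hat).
    c_hat = [[b * (c[i][j] - lam) for j in range(n)] + [w1[i]] for i in range(n)]
    # Row for s_hat: w2(y) for the pairs (s_hat, y), and 0 for (s_hat, s_hat).
    c_hat.append(list(w2) + [0.0])
    return mu_hat, nu_hat, c_hat

# Hypothetical example: a path graph with three nodes and unit edge lengths.
mu = [0.5, 1.0, 0.0]
nu = [0.2, 0.0, 0.4]
c = [[abs(i - j) for j in range(3)] for i in range(3)]
w1 = [1.0, 1.0, 1.0]
w2 = [1.0, 1.0, 1.0]
mu_hat, nu_hat, c_hat = extend_problem(mu, nu, c, w1, w2, b=1.0, lam=0.5)
print(sum(mu_hat), sum(nu_hat))
```

Both extended measures carry total mass $\mu({\mathbb{G}})+\nu({\mathbb{G}})$, so any balanced OT solver could in principle be applied to the extended problem.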

We then adapt the procedure in (Caffarelli and McCann, 2010) to derive the dual formulation for the EPT on a graph.

Additionally, we have a one-to-one correspondence between $\gamma\in\Pi_{\leq}(\mu,\nu)$ and $\hat{\gamma}\in\Gamma(\hat{\mu},\hat{\nu})$ as follows

[TABLE]

Indeed, if $\gamma\in\Pi_{\leq}(\mu,\nu)$ , then it is clear that $\hat{\gamma}$ defined by (21) satisfies $\hat{\gamma}\in\Gamma(\hat{\mu},\hat{\nu})$ . The converse is guaranteed by the next technical result.

Lemma A.7.

For $\hat{\gamma}\in\Gamma(\hat{\mu},\hat{\nu})$ , let $\gamma$ be the restriction of $\hat{\gamma}$ to ${\mathbb{G}}$ . Then, relation (21) holds and $\gamma\in\Pi_{\leq}(\mu,\nu)$ .

Proof.

We first observe for any Borel set $A\subset{\mathbb{G}}$ that

[TABLE]

For the same reason, we have $\hat{\gamma}(\{\hat{s}\}\times B)=\int_{B}(1-f_{2})\nu(dx)$ for any Borel set $B\subset{\mathbb{G}}$. Also,

[TABLE]

Since (21) is obviously true for sets of the form $A\times B$ with $A,B\subset{\mathbb{G}}$ being Borel sets, we only need to verify it for sets of the following three forms: $(A\cup\{\hat{s}\})\times B$ , $A\times(B\cup\{\hat{s}\})$ , $(A\cup\{\hat{s}\})\times(B\cup\{\hat{s}\})$ for Borel sets $A,B\subset{\mathbb{G}}$ . We check it case by case as follows.

$\bullet$ (i) For $(A\cup\{\hat{s}\})\times B$ : Using the above observation, we have

[TABLE]

Therefore, (21) holds in this case.

$\bullet$ (ii) For $A\times(B\cup\{\hat{s}\})$: (21) is also true in this case because

[TABLE]

$\bullet$ (iii) For $(A\cup\{\hat{s}\})\times(B\cup\{\hat{s}\})$ : (21) is true as well since

[TABLE]

Now as (21) holds, we obviously have $\gamma(U\times{\mathbb{G}})\leq\hat{\gamma}(U\times{\mathbb{G}})\leq\hat{\gamma}(U\times\hat{\mathbb{G}})=\hat{\mu}(U)=\mu(U)$ for any Borel set $U\subset{\mathbb{G}}$ . Likewise, $\gamma({\mathbb{G}}\times U)\leq\nu(U)$ for any Borel set $U\subset{\mathbb{G}}$ . Therefore, $\gamma\in\Pi_{\leq}(\mu,\nu)$ . ∎

In particular, these observations yield the following connection between the EPT problem on a graph (4) and the corresponding standard complete OT problem (20).

Lemma A.8 (EPT on a graph versus its corresponding complete OT).

For every $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$, we have $\mathrm{ET}_{c,\lambda}(\mu,\nu)=\mathrm{KT}(\hat{\mu},\hat{\nu})$. Moreover, relation (21) gives a one-to-one correspondence between optimal solutions $\gamma$ of the EPT problem (4) and optimal solutions $\hat{\gamma}$ of the standard complete OT problem (20).

Proof.

We prove the two inequalities separately:

$\bullet$ (i) We show that $\mathrm{KT}(\hat{\mu},\hat{\nu})\leq\mathrm{ET}_{c,\lambda}(\mu,\nu)$ :

For any $\gamma\in\Pi_{\leq}(\mu,\nu)$ , let $\hat{\gamma}$ be given by (21). Then, $\hat{\gamma}\in\Gamma(\hat{\mu},\hat{\nu})$ and

[TABLE]

It follows that $\mathrm{KT}(\hat{\mu},\hat{\nu})\leq\mathrm{ET}_{c,\lambda}(\mu,\nu)$ .

$\bullet$ (ii) We show that $\mathrm{KT}(\hat{\mu},\hat{\nu})\geq\mathrm{ET}_{c,\lambda}(\mu,\nu)$ :

To see this, for any $\hat{\gamma}\in\Gamma(\hat{\mu},\hat{\nu})$ we let $\gamma$ be the restriction of $\hat{\gamma}$ to ${\mathbb{G}}$. Then by Lemma A.7, we have $\gamma\in\Pi_{\leq}(\mu,\nu)$ and (21) holds. Consequently,

[TABLE]

By taking the infimum over $\hat{\gamma}$ , we infer that $\mathrm{KT}(\hat{\mu},\hat{\nu})\geq\mathrm{ET}_{c,\lambda}(\mu,\nu)$ .

Thus, from the above two parts, we obtain

[TABLE]

The relation about the optimal solutions also follows from the above arguments. ∎
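The correspondence used in Lemmas A.7 and A.8 can be checked on a small discrete example. The sketch below is illustrative only: `lift_plan`, `partial_cost`, and `lifted_cost` are hypothetical names, the lift sends unmatched mass to the dummy point $\hat{s}$ and places mass $\gamma({\mathbb{G}}\times{\mathbb{G}})$ on $(\hat{s},\hat{s})$, and the partial-transport objective is written in the form suggested by how Equation (3) is used in the proofs above, i.e. $\int w_{1}\,\mathrm{d}\mu+\int w_{2}\,\mathrm{d}\nu+\int\big(b[c-\lambda]-w_{1}-w_{2}\big)\,\mathrm{d}\gamma$, which is an assumption of this sketch.

```python
def lift_plan(gamma, mu, nu):
    """Lift a sub-coupling gamma on G x G to a coupling on (G + s_hat) x (G + s_hat)."""
    n = len(mu)
    row = [sum(gamma[i]) for i in range(n)]                       # first marginal
    col = [sum(gamma[i][j] for i in range(n)) for j in range(n)]  # second marginal
    total = sum(row)                                              # gamma(G x G)
    gamma_hat = [list(gamma[i]) + [mu[i] - row[i]] for i in range(n)]
    gamma_hat.append([nu[j] - col[j] for j in range(n)] + [total])
    return gamma_hat

def partial_cost(gamma, mu, nu, c, w1, w2, b, lam):
    """Assumed form of the partial-transport objective C_lambda(gamma)."""
    n = len(mu)
    base = sum(w1[i] * mu[i] for i in range(n)) + sum(w2[j] * nu[j] for j in range(n))
    moved = sum((b * (c[i][j] - lam) - w1[i] - w2[j]) * gamma[i][j]
                for i in range(n) for j in range(n))
    return base + moved

def lifted_cost(gamma_hat, c, w1, w2, b, lam):
    """Integral of the extended cost c_hat against the lifted plan."""
    n = len(w1)
    total = 0.0
    for i in range(n + 1):
        for j in range(n + 1):
            if i < n and j < n:
                cij = b * (c[i][j] - lam)
            elif i < n:
                cij = w1[i]   # pair (x, s_hat)
            elif j < n:
                cij = w2[j]   # pair (s_hat, y)
            else:
                cij = 0.0     # pair (s_hat, s_hat)
            total += cij * gamma_hat[i][j]
    return total

# Hypothetical data on a three-node path graph.
mu, nu = [0.5, 1.0, 0.0], [0.2, 0.0, 0.4]
c = [[abs(i - j) for j in range(3)] for i in range(3)]
w1, w2, b, lam = [1.0, 0.5, 1.0], [0.8, 0.8, 0.8], 1.0, 0.5
gamma = [[0.2, 0.0, 0.1], [0.0, 0.0, 0.3], [0.0, 0.0, 0.0]]

gamma_hat = lift_plan(gamma, mu, nu)
rows = [sum(r) for r in gamma_hat]
cols = [sum(gamma_hat[i][j] for i in range(4)) for j in range(4)]
gap = abs(partial_cost(gamma, mu, nu, c, w1, w2, b, lam)
          - lifted_cost(gamma_hat, c, w1, w2, b, lam))
print(rows, cols, gap < 1e-12)
```

The marginals of the lifted plan are $\hat{\mu}$ and $\hat{\nu}$, and the two costs agree, which is the mechanism behind the inequality $\mathrm{KT}(\hat{\mu},\hat{\nu})\leq\mathrm{ET}_{c,\lambda}(\mu,\nu)$ in part (i).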

Given the above two lemmas, we are ready to present the proof of Theorem 3.1.

Proof of Theorem 3.1 .

From Lemma A.8 and the dual formulation for $\mathrm{KT}(\hat{\mu},\hat{\nu})$ proved in (Caffarelli and McCann, 2010, Corollary 2.6), we have

[TABLE]

Therefore, it is enough to prove that $I=J$ where

[TABLE]

For $(u,v)$ satisfying $u\leq w_{1}$ , $v\leq w_{2}$ and $u(x)+v(y)\leq b[c(x,y)-\lambda]$ , we extend it to $\hat{\mathbb{G}}$ by taking $\hat{u}(\hat{s})=0$ and $\hat{v}(\hat{s})=0$ . Then, it is clear that $\hat{u}(x)+\hat{v}(y)\leq\hat{c}(x,y)$ for $x,y\in\hat{\mathbb{G}}$ , and

[TABLE]

It follows that $I\geq J$. In order to prove the reverse inequality, let $(\hat{u},\hat{v})$ be a maximizer for $I$. Then, by considering $(\hat{u}-\hat{u}(\hat{s}),\hat{v}+\hat{u}(\hat{s}))$, we can assume that $\hat{u}(\hat{s})=0$. Also, if we let $v(y):=\inf_{x\in\hat{\mathbb{G}}}[\hat{c}(x,y)-\hat{u}(x)]$, then $(\hat{u},v)$ is still in the admissible class for $I$ and $\hat{v}(y)\leq v(y)$. This implies that $(\hat{u},v)$ is also a maximizer for $I$. For these reasons, we can assume without loss of generality that the maximizer $(\hat{u},\hat{v})$ has the following additional properties: $\hat{u}(\hat{s})=0$ and

[TABLE]

In particular, $\hat{v}(\hat{s})=\inf_{x\in\hat{\mathbb{G}}}[\hat{c}(x,\hat{s})-\hat{u}(x)]$ . For convenience, define $w_{1}(\hat{s})=0$ and consider the following two possibilities.

$\bullet$ (i) For $\inf_{x\in\hat{\mathbb{G}}}[w_{1}(x)-\hat{u}(x)]\geq 0$ :

Since $\hat{c}(\hat{s},\hat{s})-\hat{u}(\hat{s})=0$ and $\inf_{x\in{\mathbb{G}}}[\hat{c}(x,\hat{s})-\hat{u}(x)]=\inf_{x\in{\mathbb{G}}}[w_{1}(x)-\hat{u}(x)]\geq 0$ , we have $\hat{v}(\hat{s})=0$ .

Also, $\hat{v}(y)\leq\hat{c}(\hat{s},y)-\hat{u}(\hat{s})\leq w_{2}(y)$ for all $y\in\hat{\mathbb{G}}$ . For each $y\in{\mathbb{G}}$ , by using the facts $\hat{u}\leq w_{1}$ and $\hat{c}(\hat{s},y)-w_{1}(\hat{s})=w_{2}(y)\geq 0$ we get

[TABLE]

Thus $(\hat{u},\hat{v})\in{\mathbb{K}}$ and

[TABLE]

$\bullet$ (ii) For $\inf_{x\in\hat{\mathbb{G}}}[w_{1}(x)-\hat{u}(x)]<0$ :

By arguing as in the above case (i), we have $\hat{v}(\hat{s})=\inf_{x\in{\mathbb{G}}}[w_{1}(x)-\hat{u}(x)]<0$ and

[TABLE]

Let $\tilde{u}(x):=\min\{\hat{u}(x),w_{1}(x)\}$ . Then, it is obvious that $\tilde{u}(x)+\hat{v}(y)\leq\hat{c}(x,y)$ and $\tilde{u}(\hat{s})=0$ . Since $\inf_{x\in{\mathbb{G}}}[w_{1}(x)-\hat{u}(x)]<0$ , there exists $x_{0}\in{\mathbb{G}}$ such that $w_{1}(x_{0})<\hat{u}(x_{0})$ . Thus, $\tilde{u}(x_{0})=w_{1}(x_{0})$ and hence $\inf_{{\mathbb{G}}}[w_{1}-\tilde{u}]\leq 0$ . As $\tilde{u}\leq w_{1}$ , we infer further that $\inf_{{\mathbb{G}}}[w_{1}-\tilde{u}]=0$ . We also have

[TABLE]

This together with (22) gives

[TABLE]

Now let $\tilde{v}(y)=\inf_{x\in\hat{\mathbb{G}}}[\hat{c}(x,y)-\tilde{u}(x)]$ for $y\in{\mathbb{G}}$ . Then, $\hat{v}(y)\leq\tilde{v}(y)\leq\hat{c}(\hat{s},y)-\tilde{u}(\hat{s})=w_{2}(y)$ for $y\in{\mathbb{G}}$ . For each $y\in{\mathbb{G}}$ , by using the facts $\tilde{u}\leq w_{1}$ and $\hat{c}(\hat{s},y)-w_{1}(\hat{s})=w_{2}(y)\geq 0$ we also get

[TABLE]

It follows that $(\tilde{u},\tilde{v})\in{\mathbb{K}}$ and

[TABLE]

Thus we conclude that $I=J$ and the theorem follows. ∎

A.2.4 Proof of Corollary 3.2

Proof of Corollary 3.2.

Notice that as $w_{i}$ ( $i=1,2$ ) is $b$ -Lipschitz w.r.t. $d_{\mathbb{G}}$ , we have for every $x\in{\mathbb{G}}$ that

[TABLE]

Let ${\mathbb{K}}$ be the set defined in the statement of Theorem 3.1. Then for each $(u,v)\in{\mathbb{K}}$ , let

[TABLE]

By using $-b\lambda+\inf_{x\in{\mathbb{G}}}[b\,d_{\mathbb{G}}(x,y)-w_{1}(x)]\leq v(y)\leq w_{2}(y)$ and (23), we obtain for every $x\in{\mathbb{G}}$ that

[TABLE]

We also have $v^{*}$ is $b$ -Lipschitz, i.e., $|v^{*}(x_{1})-v^{*}(x_{2})|\leq b\,d_{\mathbb{G}}(x_{1},x_{2})$ . Indeed, let $x_{1},x_{2}\in{\mathbb{G}}$ . Then for any ${\epsilon}>0$ , there exists $y_{1}\in{\mathbb{G}}$ such that

[TABLE]

It follows that

[TABLE]

Since this holds for every ${\epsilon}>0$ , we get

[TABLE]

By interchanging the role of $x_{1}$ and $x_{2}$ , we also obtain $v^{*}(x_{1})-v^{*}(x_{2})\leq b\,d_{\mathbb{G}}(x_{1},x_{2})$ . Thus,

[TABLE]

Hence, we have shown that $v^{*}\in\mathbb{U^{*}}$ with

[TABLE]

We next claim $v^{**}=-b\lambda-v^{*}$ . For this, it is clear from the definition that $v^{**}(y)\leq-b\lambda-v^{*}(y)$ . On the other hand, from the Lipschitz property of $v^{*}$ we obtain

[TABLE]

which gives $-b\lambda-v^{*}(y)\leq v^{**}(y)$ . Thus, we conclude that $v^{**}=-b\lambda-v^{*}$ as claimed.

From these, we obtain that

[TABLE]

This together with Theorem 3.1 in the main text implies that

[TABLE]

To prove the converse, let $f\in\mathbb{U^{*}}$ . Define $u:=f$ and $v:=-b\lambda-f$ . Then, we have

[TABLE]

and

[TABLE]

Also, the Lipschitz property of $f$ gives

[TABLE]

Thus $(u,v)\in{\mathbb{K}}$ , and hence we obtain from Theorem 3.1 in the main text that

[TABLE]

As this holds for every $f\in\mathbb{U^{*}}$ , we get

[TABLE]

Thus, we have shown that

[TABLE]

Now consider $f=\tilde{f}-\frac{b\lambda}{2}$ . Then, $f\in\mathbb{U^{*}}$ if and only if $\tilde{f}\in\mathbb{U}$ . Moreover,

[TABLE]

Therefore, the conclusion of the corollary follows from (24). ∎
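Two steps of this proof, the Lipschitz property of $v^{*}$ and the identity $v^{**}=-b\lambda-v^{*}$, rest on a general fact about infimal convolutions with $b\,d_{\mathbb{G}}$. The sketch below checks that fact on a hypothetical finite metric space (points on a line). It drops the additive constant $b\lambda$, which only shifts values, and it does not reproduce the paper's exact definition of $v^{*}$; the name `transform` and the data are illustrative.

```python
import random

random.seed(1)
n, b = 6, 1.5
# Hypothetical finite metric space: n points on the real line.
pos = sorted(random.uniform(0.0, 10.0) for _ in range(n))
d = [[abs(pos[i] - pos[j]) for j in range(n)] for i in range(n)]
# An arbitrary function v on the points (not assumed Lipschitz).
v = [random.uniform(-3.0, 3.0) for _ in range(n)]

def transform(g):
    """x -> min_y [ b * d(x, y) - g(y) ]."""
    return [min(b * d[x][y] - g[y] for y in range(n)) for x in range(n)]

v_star = transform(v)
v_2star = transform(v_star)

# v_star is b-Lipschitz, being a minimum of b-Lipschitz functions of x.
is_lipschitz = all(abs(v_star[i] - v_star[j]) <= b * d[i][j] + 1e-12
                   for i in range(n) for j in range(n))
# Applying the transform to a b-Lipschitz function simply flips its sign.
max_gap = max(abs(v_2star[i] + v_star[i]) for i in range(n))
print(is_lipschitz, max_gap < 1e-9)
```

The second check mirrors the two-sided argument in the proof: taking $y=x$ gives one inequality, and the Lipschitz bound on $v^{*}$ gives the other.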

A.2.5 Proof of Lemma 4.4

Proof of Lemma 4.4.

By using part i) of Lemma A.2, we see that

[TABLE]

As a consequence, we obtain

[TABLE]

Thus the first statement of the lemma is proved. Now if ${\mathbb{G}}$ is a tree, then Lemma A.2 implies that the inclusion in (25) is actually an equality. That is, $\mathbb{U}_{0}=\mathbb{U}_{\infty}^{0}$. Therefore, we get the desired identity

[TABLE]

∎

A.2.6 Proof of Proposition 4.5

Proof of Proposition 4.5.

It follows from Definition 4.3 and the representation (7) for $f$ that

[TABLE]

The first supremum equals $[w_{1}(z_{0})+\frac{b\lambda}{2}-\alpha][\mu({\mathbb{G}})-\nu({\mathbb{G}})]$ if $\mu({\mathbb{G}})\geq\nu({\mathbb{G}})$ and equals $-[w_{2}(z_{0})+\frac{b\lambda}{2}-\alpha][\mu({\mathbb{G}})-\nu({\mathbb{G}})]$ if $\mu({\mathbb{G}})<\nu({\mathbb{G}})$.

On the other hand, by the same arguments as in the proof of (Le et al., 2022, Proposition 3.5) we see that the second supremum equals $b\left(\int_{{\mathbb{G}}}|\mu(\Lambda(x))-\nu(\Lambda(x))|^{p}\,\omega(\mathrm{d}x)\right)^{\frac{1}{p}}$. Putting them together, we obtain the desired formula for $\mathrm{US}_{p}^{\alpha}(\mu,\nu)$. ∎

A.2.7 Proof of Corollary 4.6

Proof of Corollary 4.6.

We first recall that $\langle u,v\rangle$ denotes the line segment in $\mathbb{R}^{n}$ connecting two points $u,v$ , while $(u,v)$ means the same line segment but without its two end-points. Then as $\omega(\{x\})=0$ for every $x\in{\mathbb{G}}$ , we have

[TABLE]

Since $\mu$ and $\nu$ are supported on nodes, we can rewrite the above identity as

[TABLE]

For $e=\langle u,v\rangle$ and $x\in(u,v)$ , we observe that $y\in{\mathbb{G}}\setminus(u,v)$ belongs to $\Lambda(x)$ if and only if $y\in\gamma_{e}$ . It follows that $\Lambda(x)\setminus(u,v)=\gamma_{e}$ , and thus we deduce from the above identity that

[TABLE]

This together with Proposition 4.5 yields the postulated result. ∎
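For node-supported measures, the resulting expression is a weighted edge sum plus a mass-difference term, which can be sketched as below. Several assumptions are made for illustration: the graph is a rooted tree given by a hypothetical `parent` array with children indexed after their parents, so that $\gamma_{e}$ is the subtree hanging below edge $e$; $\omega$ is the length measure; and `theta` stands for the coefficient of $|\mu({\mathbb{G}})-\nu({\mathbb{G}})|$ and is supplied by the caller. On a general graph one would first need the shortest paths from $z_{0}$, which this sketch does not compute.

```python
def subtree_mass(parent, mass):
    """Return, for each node i, the mass at i and below (mass of gamma_e for e = (parent[i], i))."""
    acc = list(mass)
    # Children are assumed to have larger indices than their parents,
    # so a reverse sweep pushes each subtree total up to its parent.
    for i in range(len(parent) - 1, 0, -1):
        acc[parent[i]] += acc[i]
    return acc

def ust_closed_form(parent, length, mu, nu, p, b, theta):
    """b * (sum_e w_e |mu(gamma_e) - nu(gamma_e)|^p)^(1/p) + theta * |mu(G) - nu(G)|."""
    m_mu = subtree_mass(parent, mu)
    m_nu = subtree_mass(parent, nu)
    edge_sum = sum(length[i] * abs(m_mu[i] - m_nu[i]) ** p
                   for i in range(1, len(parent)))
    return b * edge_sum ** (1.0 / p) + theta * abs(sum(mu) - sum(nu))

# Hypothetical tree with root 0; length[i] is the length of edge (parent[i], i).
parent = [-1, 0, 0, 1]
length = [0.0, 1.0, 2.0, 0.5]
mu = [0.0, 1.0, 0.0, 1.0]   # total mass 2.0
nu = [0.5, 0.0, 1.0, 0.0]   # total mass 1.5

value_p1 = ust_closed_form(parent, length, mu, nu, p=1, b=1.0, theta=0.5)
value_p2 = ust_closed_form(parent, length, mu, nu, p=2, b=1.0, theta=0.5)
print(value_p1, value_p2)
```

A single sweep over the edges suffices once the subtree masses are known, which is consistent with the fast computation claimed for the closed-form expression.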

A.2.8 Proof of Proposition 5.1

We begin with the following auxiliary result.

Lemma A.9.

Let $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ . Then, $\mu=\nu$ if and only if $\mu(\Lambda(x))=\nu(\Lambda(x))$ for every $x$ in ${\mathbb{G}}$ .

Proof.

It is obvious that $\mu=\nu$ implies that $\mu(\Lambda(x))=\nu(\Lambda(x))$ for every $x$ in ${\mathbb{G}}$. Now assume that $\mu(\Lambda(x))=\nu(\Lambda(x))$ for every $x$ in ${\mathbb{G}}$. We first claim that $\mu(\{a\})=\nu(\{a\})$ for any $a\in{\mathbb{G}}$. Let $a\in{\mathbb{G}}$ be arbitrary. Then there are two possibilities for $a$: either $a$ is a node or $a$ is an interior point of an edge. We consider these two cases separately.

$\bullet$ (i) $a$ is an interior point of an edge $e\in E$ (i.e. $a$ is not a node):

Let $\{a_{n}\}_{n=1}^{\infty}$ be a sequence of distinct points on the same edge $e$ as $a$ such that $d_{\mathbb{G}}(a_{n},z_{0})>d_{\mathbb{G}}(a,z_{0})$ for every $n\geq 1$ and $a_{n}\to a$ as $n\to\infty$ . It follows that $\Lambda(a_{n})\subset\Lambda(a)$ and $\Lambda(a)\setminus\Lambda(a_{n})\downarrow\{a\}$ as $n\to\infty$ . As a consequence, we have

[TABLE]

But as $\mu(\Lambda(x))=\nu(\Lambda(x))$ for every $x$ in ${\mathbb{G}}$ , we thus obtain

[TABLE]

as claimed.

$\bullet$ (ii) $a$ is a node:

We can assume that $a$ is a common node for edges $e_{1},...,e_{k}$ . Then for each $i\in\{1,...,k\}$ , let $\{a^{i}_{n}\}_{n=1}^{\infty}$ be a sequence of distinct points on edge $e_{i}$ such that $d_{\mathbb{G}}(a^{i}_{n},z_{0})>d_{\mathbb{G}}(a,z_{0})$ for every $n\geq 1$ and $a^{i}_{n}\to a$ as $n\to\infty$ . These choices yield $\Lambda(a^{i}_{n})\subset\Lambda(a)$ and $\Lambda(a)\setminus\cup_{i=1}^{k}\Lambda(a^{i}_{n})\downarrow\{a\}$ as $n\to\infty$ . Using this and the assumption $\mu(\Lambda(x))=\nu(\Lambda(x))$ for every $x$ in ${\mathbb{G}}$ , we obtain

[TABLE]

Thus, we have proved the claim that $\mu(\{a\})=\nu(\{a\})$ for every $a\in{\mathbb{G}}$ .

On the other hand, for any points $x,y$ belonging to the same edge

[TABLE]

where $\langle x,y)$ denotes the line segment in $\mathbb{R}^{n}$ connecting two points $x,y$ but without its right end-point $y$ (while $\langle x,y\rangle$ includes both end-points).

Thus, by combining them, we infer further that $\mu(\langle x,y\rangle)=\nu(\langle x,y\rangle)$ for any $x,y\in e$ and for any edge $e\in E$ . It follows that $\mu=\nu$ , and the proof is complete. ∎
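In the special case of node-supported measures on a rooted tree, Lemma A.9 reduces to an explicit inversion: the mass at a node is the mass of $\Lambda(a)$ minus the masses of $\Lambda$ at its children, which is the discrete counterpart of the limit $\Lambda(a)\setminus\cup_{i}\Lambda(a^{i}_{n})\downarrow\{a\}$ in case (ii). The sketch below uses hypothetical helper names and assumes children are indexed after their parents.

```python
def subtree_mass(parent, mass):
    """Forward map: node masses -> masses of Lambda(a) (node a together with its descendants)."""
    acc = list(mass)
    for i in range(len(parent) - 1, 0, -1):
        acc[parent[i]] += acc[i]
    return acc

def node_mass_from_subtree_mass(parent, sub):
    """Inverse map: mu({a}) = mu(Lambda(a)) - sum over children c of mu(Lambda(c))."""
    point = list(sub)
    for i in range(1, len(parent)):
        point[parent[i]] -= sub[i]
    return point

# Hypothetical tree with root 0 and a measure supported on its nodes.
parent = [-1, 0, 0, 1, 1, 2]
mu = [0.0, 1.0, 0.25, 2.0, 0.5, 0.0]

sub = subtree_mass(parent, mu)
recovered = node_mass_from_subtree_mass(parent, sub)
print(sub, recovered)
```

Since the forward map is invertible, two node-supported measures with the same values of $\mu(\Lambda(\cdot))$ at every node must coincide.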

Proof of Proposition 5.1.

We note first that the quantity $\mathrm{US}_{p}^{\alpha}$ depends only on the values of the weights at the root $z_{0}$ of the graph. This comes from the fact that only $w_{1}(z_{0})$ and $w_{2}(z_{0})$ are used in the definition of $\mathbb{U}_{p^{\prime}}^{\alpha}$ .

i) This follows immediately from Proposition 4.5 in the main text.

ii) It follows from Definition 4.3 that $\mathrm{US}_{p}^{\alpha}(\mu,\mu)=0$ and $\mathrm{US}_{p}^{\alpha}$ satisfies the triangle inequality. As the constant function $f=0$ belongs to the constraint set $\mathbb{U}_{p^{\prime}}^{\alpha}$ , we also have $\mathrm{US}_{p}^{\alpha}(\mu,\nu)\geq 0$ . Next, assume that $\mathrm{US}_{p}^{\alpha}(\mu,\nu)=0$ . Then by Proposition 4.5 in the main text, we get

[TABLE]

As $\Theta>0$ by our assumption on $\alpha$, we must have

[TABLE]

Therefore, $\mu(\Lambda(x))=\nu(\Lambda(x))$ for every $x\in{\mathbb{G}}$ . By using Lemma A.9, we then conclude that $\mu=\nu$ .

iii) Due to the assumption $w_{1}(z_{0})=w_{2}(z_{0})$ we have $f\in\mathbb{U}_{p^{\prime}}^{\alpha}$ if and only if $-f\in\mathbb{U}_{p^{\prime}}^{\alpha}$. Hence we obtain from Definition 4.3 that $\mathrm{US}_{p}^{\alpha}(\mu,\nu)=\mathrm{US}_{p}^{\alpha}(\nu,\mu)$. This together with ii) implies that $({\mathcal{M}}({\mathbb{G}}),\mathrm{US}_{p}^{\alpha})$ is a metric space. Its completeness follows from (Piccoli and Rossi, 2014, Proposition 4). Since this space is complete, it is well known that $({\mathcal{M}}({\mathbb{G}}),\mathrm{US}_{p}^{\alpha})$ is a geodesic space if and only if for every $\mu,\nu\in{\mathcal{M}}({\mathbb{G}})$ there exists $\sigma\in{\mathcal{M}}({\mathbb{G}})$ such that

[TABLE]

To verify the latter, take $\sigma:=\frac{\mu+\nu}{2}$ . Then using Definition 4.3 in the main text, we obtain

[TABLE]

and

[TABLE]

∎

A.2.9 Proof of Proposition 5.3

Proof of Proposition 5.3.

i) From its definition, we have $\mathbb{U}_{\infty}^{\alpha}=\mathbb{L}_{\alpha}$ with $\mathbb{L}_{\alpha}$ being the set defined in (Le and Nguyen, 2021, Section 3.2). As a consequence, we obtain $\mathrm{US}_{1}^{\alpha}(\mu,\nu)=d_{\alpha}(\mu,\nu)$. On the other hand, Proposition A.4 yields for any $1\leq p\leq\infty$ that

[TABLE]

Therefore, we conclude that

[TABLE]

By moving and combining terms we arrive at

[TABLE]

ii) Let $\bar{m}\triangleq\mu({\mathbb{G}})=\nu({\mathbb{G}})$ . From the definition of the $p$ -Wasserstein distance, we have

[TABLE]

where

[TABLE]

Therefore, the first statement will follow if we can show that

[TABLE]

Since $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ , we have from Lemma A.6 that

[TABLE]

Hence by taking $g\triangleq f/b$ , we can rewrite this identity as

[TABLE]

where ${\mathcal{S}}_{p}$ is the balanced Sobolev transport distance defined in (Le et al., 2022, Definition 3.2). On the other hand, we have ${\mathcal{S}}_{p}(\mu,\nu)\geq\omega^{*}({\mathbb{G}})^{-\frac{1}{p^{\prime}}}{\mathcal{W}}_{1}(\mu,\nu)$ by (Le et al., 2022, Lemma 4.3). Therefore, we obtain (26) as desired.

Alternatively, we can derive (26) as follows. By using $\mathbb{U}_{\infty}^{\alpha}=\mathbb{L}_{\alpha}$ as in the proof of part i) and the observation about translation invariance in the proof of Lemma A.6, we see that

[TABLE]

Then due to Lemma A.2, we can further rewrite as

[TABLE]

On the other hand, part i) above gives

[TABLE]

Therefore, we obtain

[TABLE]

for every $1\leq p\leq\infty$ .

For $p=1$, equality holds since $p^{\prime}=\infty$ and

[TABLE]

Thus, the second statement follows.

∎
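The comparison ${\mathcal{S}}_{p}\geq\omega^{*}({\mathbb{G}})^{-\frac{1}{p^{\prime}}}{\mathcal{W}}_{1}$ can be illustrated with a short numerical check. The sketch below makes simplifying assumptions that are not part of the proof: the graph is a tree, so that ${\mathcal{W}}_{1}$ has the edge-sum form $\sum_{e}w_{e}|\mu(\gamma_{e})-\nu(\gamma_{e})|$ and coincides with ${\mathcal{S}}_{1}$; the edge lengths `w` and the values `a` standing for $|\mu(\gamma_{e})-\nu(\gamma_{e})|$ are random placeholders. Under these assumptions the bound is an instance of Hölder's inequality.

```python
import random

random.seed(2)
num_edges = 8
# Hypothetical edge lengths w_e and edge-wise differences a_e = |mu(gamma_e) - nu(gamma_e)|.
w = [random.uniform(0.1, 2.0) for _ in range(num_edges)]
a = [random.uniform(0.0, 1.0) for _ in range(num_edges)]

def sobolev(p):
    """Edge-sum form (sum_e w_e a_e^p)^(1/p) of the balanced Sobolev transport."""
    return sum(we * ae ** p for we, ae in zip(w, a)) ** (1.0 / p)

total_length = sum(w)   # omega^*(G) for the length measure
w1_distance = sobolev(1)  # W_1 on a tree (edge-sum form, assumed here)

def lower_bound(p):
    """omega^*(G)^(-1/p') * W_1, where 1/p' = 1 - 1/p."""
    return total_length ** (-(1.0 - 1.0 / p)) * w1_distance

checks = {p: sobolev(p) >= lower_bound(p) - 1e-12 for p in (1, 1.5, 2, 4)}
print(checks)
```

For $p=1$ the exponent $1/p^{\prime}$ vanishes and both sides agree, matching the equality case stated above.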

A.2.10 Proof of Proposition 5.4

Proof of Proposition 5.4.

We first prove that the $\ell_{p}$ distance is negative definite for $1\leq p\leq 2$, where

[TABLE]

It is easy to see that the function $(u,v)\mapsto(u-v)^{2}$ is negative definite for $u,v\in\mathbb{R}$. Using this and applying (Berg et al., 1984, Corollary 2.10) with exponent $p/2$, the function $(u,v)\mapsto|u-v|^{p}$ is negative definite for $1\leq p\leq 2$ (for $p=2$ there is nothing to prove).

Therefore, for $1\leq p\leq 2$, the function $\ell_{p}^{p}$ is negative definite since it is a sum of negative definite functions. Applying (Berg et al., 1984, Corollary 2.10) once more, now with exponent $1/p$, we conclude that $\ell_{p}$ is negative definite for $1\leq p\leq 2$.

We are now ready to prove Proposition 5.4. From Proposition 4.5, we have

[TABLE]

Let $m=|E|$. Then, $\mu\mapsto\Big{\{}w_{e}^{\frac{1}{p}}\mu(\gamma_{e})\Big{\}}_{e\in E}$ can be regarded as a feature map sending the measure $\mu$ into $\mathbb{R}_{+}^{m}$. Therefore, the first term of $\mathrm{US}_{p}^{\alpha}$ equals $b$ times the $\ell_{p}$ distance between the feature maps of the measures $\mu$ and $\nu$ in $\mathbb{R}^{m}_{+}$. Recall that $b\geq 0$. Thus, the first term of $\mathrm{US}_{p}^{\alpha}$ is negative definite for $1\leq p\leq 2$.

Additionally, the second term of $\mathrm{US}_{p}^{\alpha}$ is $\Theta$ times the $\ell_{1}$ distance between $\mu({\mathbb{G}})$ and $\nu({\mathbb{G}})$ . Since $w_{1}(z_{0})=w_{2}(z_{0})$ and $\alpha\leq\frac{b\lambda}{2}+w_{1}(z_{0})$ , we also have from (11) that $\Theta=w_{1}(z_{0})+\frac{b\lambda}{2}-\alpha\geq 0$ . Therefore, the second term of $\mathrm{US}_{p}^{\alpha}$ is also negative definite.

Hence, $\mathrm{US}_{p}^{\alpha}$ is negative definite for any $1\leq p\leq 2$ . ∎
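A sampled sanity check of this statement is sketched below. It is not a proof and not the authors' experiment: random vectors in $\mathbb{R}^{m}_{+}$ stand in for the feature maps, the quadratic form of the $\ell_{p}$ distance matrix is evaluated on random coefficient vectors summing to zero, and the quadratic form of the kernel $\exp(-t\,\ell_{p})$ is evaluated on unconstrained random coefficients. All names and parameter values are illustrative.

```python
import math
import random

random.seed(3)
n, m, p, t = 7, 4, 1.5, 0.8
# Hypothetical feature vectors standing in for {w_e^(1/p) mu(gamma_e)}_e of n measures.
feats = [[random.uniform(0.0, 1.0) for _ in range(m)] for _ in range(n)]

def lp_distance(x, z):
    """l_p distance between two feature vectors."""
    return sum(abs(xi - zi) ** p for xi, zi in zip(x, z)) ** (1.0 / p)

D = [[lp_distance(x, z) for z in feats] for x in feats]
K = [[math.exp(-t * D[i][j]) for j in range(n)] for i in range(n)]

def quad_form(M, c):
    """sum_{i,j} c_i c_j M_ij."""
    return sum(c[i] * c[j] * M[i][j] for i in range(n) for j in range(n))

worst_nd, worst_pd = -math.inf, math.inf
for _ in range(500):
    c = [random.gauss(0.0, 1.0) for _ in range(n)]
    worst_pd = min(worst_pd, quad_form(K, c))   # should stay nonnegative
    mean = sum(c) / n
    c0 = [ci - mean for ci in c]                # coefficients summing to zero
    worst_nd = max(worst_nd, quad_form(D, c0))  # should stay nonpositive
print(worst_nd <= 1e-9, worst_pd >= -1e-9)
```

The second check reflects the reason negative definiteness matters here: by Theorem 3.2.2 in (Berg et al., 1984), it makes the exponential kernel built from the distance positive definite for every $t>0$.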

Appendix B FURTHER RESULTS AND DISCUSSIONS

B.1 Brief Reviews

We briefly review some definitions used in our work.

B.1.1 Length Measure on Graphs

We recall the definition and properties in (Le et al., 2022, §4.1) about the length measure on graphs.

Definition B.1 (Length measure).

Let $\omega^{*}$ be the unique Borel measure on ${\mathbb{G}}$ such that the restriction of $\omega^{*}$ on any edge is the length measure of that edge. That is, $\omega^{*}$ satisfies:

i) For any edge $e$ connecting two nodes $u$ and $v$, we have $\omega^{*}(\langle x,y\rangle)=(t-s)w_{e}$ whenever $x=(1-s)u+sv$ and $y=(1-t)u+tv$ for $s,t\in[0,1)$ with $s\leq t$. Here, $\langle x,y\rangle$ is the line segment in $e$ connecting $x$ and $y$.

ii) For any Borel set $F\subset{\mathbb{G}}$, we have

[TABLE]

The next lemma asserts that $\omega^{*}$ is closely connected to the graph metric $d_{\mathbb{G}}$ , and thus justifies the terminology of a length measure.

Lemma B.2 ( $\omega^{*}$ is the length measure on graph).

Suppose that ${\mathbb{G}}$ has no short cuts, namely, any edge $e$ is a shortest path connecting its two end-points. Then, $\omega^{*}$ is a length measure in the sense that

[TABLE]

for any shortest path $[x,y]$ connecting $x$ and $y$ . In particular, $\omega^{*}$ has no atom in the sense that $\omega^{*}(\{x\})=0$ for every $x$ in ${\mathbb{G}}$ .

B.1.2 Wasserstein distances

We recall here the definition of the $p$ -Wasserstein distances with graph metric ground cost on ${\mathbb{G}}$ .

Definition B.3.

Let $1\leq p<\infty$ . Suppose that $\mu$ and $\nu$ are two nonnegative Borel measures on ${\mathbb{G}}$ satisfying $\mu({\mathbb{G}})=\nu({\mathbb{G}})$ . Then the $p$ -Wasserstein distance between $\mu$ and $\nu$ is defined by

[TABLE]

where

[TABLE]

with $\bar{m}\triangleq\mu({\mathbb{G}})=\nu({\mathbb{G}})$ .

B.1.3 Kernels

We review some important definitions and theorems/corollaries about kernels that are used in our work.

• Positive Definite Kernels (Berg et al., 1984, pp. 66–67). A kernel function $k:\Omega\times\Omega\rightarrow\mathbb{R}$ is called positive definite if for every positive integer $m$ and all points $x_{1},x_{2},...,x_{m}\in\Omega$, we have

$$\sum_{i,j=1}^{m}c_{i}c_{j}k(x_{i},x_{j})\geq 0,\qquad\forall c_{1},...,c_{m}\in\mathbb{R}.$$

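• Negative Definite Kernels (Berg et al., 1984, pp. 66–67). A kernel function $k:\Omega\times\Omega\rightarrow\mathbb{R}$ is called negative definite if for every integer $m\geq 2$ and all points $x_{1},x_{2},...,x_{m}\in\Omega$, we have

$$\sum_{i,j=1}^{m}c_{i}c_{j}k(x_{i},x_{j})\leq 0,\qquad\forall c_{1},...,c_{m}\in\mathbb{R}\ \text{ such that }\ \sum_{i=1}^{m}c_{i}=0.$$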

• Theorem 3.2.2 in (Berg et al., 1984, p. 74). Let $\kappa$ be a negative definite kernel. Then for every $t>0$, the kernel

$$k_{t}(x,y)\triangleq\exp\big(-t\kappa(x,y)\big)$$

is positive definite.

• Definition 2.6 in (Berg et al., 1984, p. 76). A positive definite kernel $\kappa$ is called infinitely divisible if for each $n\in{\mathbb{N}}^{*}$, there exists a positive definite kernel $\kappa_{n}$ such that

$$\kappa=(\kappa_{n})^{n}.$$

• Corollary 2.10 in (Berg et al., 1984, p. 78). Let $\kappa$ be a negative definite kernel. Then for $0<t<1$, the kernel

$$\kappa_{t}(x,y)\triangleq[\kappa(x,y)]^{t}$$

is negative definite.
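The kernel facts reviewed above can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it takes $\kappa(x,y)=\|x-y\|_{1}$ on random points (a standard nonnegative negative definite kernel, chosen here for convenience and not taken from the paper), then inspects the Gram matrix of $\exp(-t\kappa)$ and of $\kappa^{1/2}$. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 3))                          # 8 random points in R^3

# Gram matrix of kappa(x, y) = ||x - y||_1 (assumed example kernel).
kappa = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix (>= 0 means PSD)."""
    return float(np.linalg.eigvalsh(K).min())

def max_eig_on_zero_sum(K):
    """Largest eigenvalue of K restricted to vectors c with sum(c) = 0.

    K is negative definite in the sense above iff this value is <= 0.
    """
    n = K.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n         # projector onto sum(c) = 0
    return float(np.linalg.eigvalsh(P @ K @ P).max())

pd_check = min_eig(np.exp(-0.7 * kappa))        # Theorem 3.2.2 with t = 0.7
nd_check = max_eig_on_zero_sum(kappa)           # kappa itself
nd_root_check = max_eig_on_zero_sum(kappa ** 0.5)   # Corollary 2.10, t = 1/2
```

Up to floating-point rounding, `pd_check` should be nonnegative and the two `nd_*` values should be nonpositive.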

B.2 Further Discussions

In this subsection, we discuss some extensions of our work and give more details on some parts of the main manuscript.

Path length for points in ${\mathbb{G}}$ .

We can canonically measure the length of a path connecting any two points $x,y\in{\mathbb{G}}$, where $x,y$ need not be nodes in $V$. Indeed, for two points $x,y\in\mathbb{R}^{n}$ belonging to the same edge $e=\langle u,v\rangle$, which connects two nodes $u$ and $v$ in $V$, we have

$$x=(1-t)u+tv,\qquad y=(1-s)u+sv$$

for some numbers $t,s\in[0,1]$ . Therefore, the length of the path connecting $x$ and $y$ along the edge $e$ (i.e., the line segment $\langle x,y\rangle$ ) is defined by $|t-s|w_{e}$ . Hence, the length for an arbitrary path in ${\mathbb{G}}$ can be similarly defined by breaking down into pieces over edges and summing over their corresponding lengths (Le et al.,, 2022).
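The piecewise definition above can be sketched in a few lines of Python. The representation of a path as a list of `(w_e, t, s)` pieces, one per traversed edge, and the function names are assumptions made for illustration only.

```python
def segment_length(w_e, t, s):
    """Length |t - s| * w_e of the segment between two points on one edge,
    given their barycentric parameters t and s in [0, 1]."""
    return abs(t - s) * w_e

def path_length(pieces):
    """Sum the lengths of the per-edge pieces of a path."""
    return sum(segment_length(w_e, t, s) for (w_e, t, s) in pieces)

# A path that starts a quarter of the way along an edge of length 2,
# runs to its end node, then covers half of a second edge of length 4.
pieces = [(2.0, 0.25, 1.0), (4.0, 0.0, 0.5)]
total = path_length(pieces)   # 1.5 + 2.0
```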

Lipschitz nonnegative weight function on graph ${\mathbb{G}}$ .

An example of a $b$-Lipschitz nonnegative weight function on ${\mathbb{G}}$ is

$$w(x)=a_{1}d_{\mathbb{G}}(z_{0},x)+a_{0}$$

for some constants $a_{1}\in[0,b]$ and $a_{0}\in[0,\infty)$ .

Extension to measures supported on ${\mathbb{G}}$ .

The closed-form formula for $\mathrm{US}_{p}^{\alpha}$ in (4.6) can be extended to measures with finite supports on ${\mathbb{G}}$ (i.e., measures which may have supports on edges) by using the same strategy as for measuring the length of a path connecting $z_{0}$ and $y$ for any $z_{0},y\in{\mathbb{G}}$ (see §2). More precisely, we break down the edges containing supports into pieces and sum over their corresponding values, instead of summing over edges as for $\mathrm{US}_{p}^{\alpha}$ in (4.6).

About the assumption of uniqueness property of the shortest paths on ${\mathbb{G}}$ .

As discussed in the supplementary of (Le et al., 2022), since $w_{e}\in\mathbb{R}$ for any edge $e\in E$ of graph ${\mathbb{G}}$, almost surely every node in the graph can be regarded as a unique-path root node (with high probability, the lengths of paths connecting any two nodes in graph ${\mathbb{G}}$ are different). For some special graphs, e.g., a grid of nodes, there is no unique-path root node. However, by perturbing each node of such a graph (or the edge lengths $w_{e}$, in case ${\mathbb{G}}$ is a non-physical graph) with a small deviation $\varepsilon$, we can obtain a graph satisfying the unique-path root node assumption.
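The tie-breaking effect of a small perturbation can be illustrated on a $3\times 3$ grid. The sketch below is an assumption-laden toy, not the authors' procedure: it perturbs edge lengths uniformly at random with a hypothetical scale `eps`, and counts shortest paths from a corner root with a standard Dijkstra variant (`grid_graph` and `count_shortest_paths` are illustrative names).

```python
import heapq
import random

def grid_graph(n, weight):
    """n x n grid; weight() returns the length of each new edge."""
    adj = {(i, j): [] for i in range(n) for j in range(n)}
    for i in range(n):
        for j in range(n):
            for di, dj in ((1, 0), (0, 1)):
                a, b = (i, j), (i + di, j + dj)
                if b in adj:
                    w = weight()
                    adj[a].append((b, w))
                    adj[b].append((a, w))
    return adj

def count_shortest_paths(adj, root):
    """Dijkstra from root; count[v] = number of shortest paths root -> v."""
    dist, count, done = {root: 0.0}, {root: 1}, set()
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v], count[v] = nd, count[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                count[v] += count[u]      # another path of the same length
    return count

rnd = random.Random(0)
eps = 1e-3                                # assumed perturbation scale
plain = count_shortest_paths(grid_graph(3, lambda: 1.0), (0, 0))
perturbed = count_shortest_paths(
    grid_graph(3, lambda: 1.0 + eps * rnd.uniform(-1.0, 1.0)), (0, 0))
```

On the unperturbed grid the opposite corner is reached by several shortest paths, whereas after the perturbation every node is expected to have a single shortest path from the root.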

About the unbalanced Sobolev transport.

Similar to the work (Le et al., 2022), we assume that we know the graph metric space (i.e., the graph structure) to which the supports of the measures belong. Given such a graph, we define the unbalanced Sobolev transport for measures which may have different total mass and are supported on that graph metric space. We leave the question of learning an optimal graph metric structure from data (i.e., supports of measures) for unbalanced Sobolev transport for future work.

About graphs ${\mathbb{G}}_{\text{Log}}$ and ${\mathbb{G}}_{\text{Sqrt}}$ (Le et al.,, 2022).

First, we use a clustering method, e.g., the farthest-point clustering, to partition supports of measures into at most $M$ clusters (here $M$ is the input number of clusters for the clustering method, so the result has at most $M$ clusters depending on the input data). Then, let $V$ denote the set of centroids of these clusters. For edges, we randomly choose $M\log(M)$ edges for graph ${\mathbb{G}}_{\text{Log}}$ and $M^{3/2}$ edges for graph ${\mathbb{G}}_{\text{Sqrt}}$; we denote the set of those sampled edges as $\tilde{E}$.

For each edge $e$, its corresponding weight $w_{e}$ is computed as the Euclidean distance between the two corresponding nodes of $e$. Let $n_{c}$ be the number of connected components in the graph $\tilde{{\mathbb{G}}}(V,\tilde{E})$. We then randomly add $(n_{c}-1)$ more edges between these $n_{c}$ connected components to construct a connected graph ${\mathbb{G}}$ from $\tilde{{\mathbb{G}}}$. Let $E_{c}$ be the set of these $(n_{c}-1)$ added edges and denote $E=\tilde{E}\cup E_{c}$; then ${\mathbb{G}}(V,E)$ is the considered graph.
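The construction described above can be sketched as follows. This is a minimal illustration under several assumptions: a greedy farthest-point selection stands in for the clustering method, the extra edges chain one randomly chosen node per connected component, and all names (`cluster_centroids`, `build_graph`, `kind`, `seed`) are hypothetical. It is not the authors' implementation.

```python
import math
import numpy as np

def cluster_centroids(points, M, rng):
    """Greedy farthest-point clustering; returns at most M centroids."""
    centers = [int(rng.integers(len(points)))]
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    while len(centers) < M and dist.max() > 0:
        nxt = int(dist.argmax())
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    C = points[centers]
    labels = np.linalg.norm(points[:, None] - C[None], axis=2).argmin(axis=1)
    return np.stack([points[labels == k].mean(axis=0) for k in range(len(C))])

def build_graph(points, M, kind="sqrt", seed=0):
    rng = np.random.default_rng(seed)
    V = cluster_centroids(points, M, rng)
    m = len(V)
    target = m * math.log(m) if kind == "log" else m ** 1.5
    iu, ju = np.triu_indices(m, k=1)            # all candidate node pairs
    n_edges = min(int(target), len(iu))
    pick = rng.choice(len(iu), size=n_edges, replace=False)
    edges = {(int(iu[k]), int(ju[k])) for k in pick}

    root = list(range(m))                       # union-find over nodes
    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a
    for a, b in edges:
        root[find(a)] = find(b)

    comps = {}
    for v in range(m):
        comps.setdefault(find(v), []).append(v)
    # Add n_c - 1 edges by chaining one random node from each component.
    reps = [int(rng.choice(nodes)) for nodes in comps.values()]
    for a, b in zip(reps[:-1], reps[1:]):
        edges.add((min(a, b), max(a, b)))
    # Edge weight = Euclidean distance between the two end nodes.
    weights = {e: float(np.linalg.norm(V[e[0]] - V[e[1]])) for e in edges}
    return V, weights

pts = np.random.default_rng(1).random((200, 2))
V, weights = build_graph(pts, M=20, kind="log")
```

Here `weights` maps each undirected edge `(i, j)` with `i < j` to its length; chaining the component representatives is just one simple way to add exactly $(n_{c}-1)$ connecting edges.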

Datasets and Computational Devices.

For the document datasets (i.e., TWITTER, RECIPE, CLASSIC, AMAZON), the orbit dataset (Orbit), and a $10$-class subset of the MPEG7 dataset, one can contact the authors of (Le et al., 2022) to access these datasets. For computational devices, we run all of our experiments on commodity hardware.

B.3 Further Empirical Results

In this subsection, we provide further empirical results for our work.

B.3.1 Extended Empirical Results for the Main Text

Similar to Figure 3 in the main text for TDA, we illustrate the effect of the number of slices for document classification with graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 4.

We also consider a graph ${\mathbb{G}}$ with a different setting: ${\mathbb{G}}_{\text{Log}}$. Recall that for Figure 1, Figure 2, Figure 3 in the main text and Figure 4, results are for graph ${\mathbb{G}}_{\text{Sqrt}}$ where $M=10^{4}$ for document datasets, $M=10^{3}$ for MPEG7 dataset and $M=10^{2}$ for Orbit dataset. (There is a typo in the main text (§6): it should read $M=10^{3}$ for MPEG7 and $M=10^{2}$ for Orbit.) We illustrate corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ in Figure 5, Figure 6, Figure 7, and Figure 8 respectively.

B.3.2 Further Empirical Results

We also provide further results for document classification and TDA as follows:

For document classification.

• For $M=10^{2}$, we illustrate the SVM results and time consumption for kernel matrices, and the effect of the number of slices, for graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 9 and Figure 10 respectively. The corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ are in Figure 11 and Figure 12.

• For $M=10^{3}$, we illustrate the SVM results and time consumption for kernel matrices, and the effect of the number of slices, for graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 13 and Figure 14 respectively. The corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ are in Figure 15 and Figure 16.

• For $M=10^{4}$, we illustrate the SVM results and time consumption for kernel matrices, and the effect of the number of slices, for graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 17 and Figure 18 respectively. The corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ are in Figure 19 and Figure 20.

• For $M=4\times 10^{4}$, we illustrate the SVM results and time consumption for kernel matrices, and the effect of the number of slices, for graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 21 and Figure 22 respectively. The corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ are in Figure 23 and Figure 24.

For TDA.

• For $M=10^{2}$, we illustrate the SVM results and time consumption for kernel matrices, and the effect of the number of slices, for graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 25 and Figure 26 respectively. The corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ are in Figure 27 and Figure 28.

• For $M=10^{3}$, we illustrate the SVM results and time consumption for kernel matrices, and the effect of the number of slices, for graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 29 and Figure 30 respectively. The corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ are in Figure 31 and Figure 32.

• For $M=10^{4}$ on Orbit dataset and $M=10^{3}$ on MPEG7 dataset (due to the same size of MPEG7 dataset), we illustrate the SVM results and time consumption for kernel matrices, and the effect of the number of slices, for graph ${\mathbb{G}}_{\text{Sqrt}}$ in Figure 33 and Figure 34 respectively. The corresponding results for graph ${\mathbb{G}}_{\text{Log}}$ are in Figure 35 and Figure 36.

With different exponent $p$ for UST.

We also carry out experiments for different $p$ in unbalanced Sobolev transport using the same setting for $M$ as in the main text (i.e., $M=10^{4}$ for document datasets, $M=10^{3}$ for MPEG7 dataset and $M=10^{2}$ for Orbit dataset) on graph ${\mathbb{G}}_{\text{Sqrt}}$ and graph ${\mathbb{G}}_{\text{Log}}$. Figure 37 and Figure 38 illustrate performances on document classification and TDA respectively with graph ${\mathbb{G}}_{\text{Sqrt}}$. For graph ${\mathbb{G}}_{\text{Log}}$, the corresponding results are shown in Figure 39 and Figure 40. (We skip plots of time consumption since the time consumption of UST for $p=1$ and $p=2$ is almost identical. Please refer to the other figures, where we illustrate the time consumption of UST for $p=1$.)

With Sinkhorn divergence-based approach for UOT (Séjourné et al.,, 2019) as an extra baseline.

Furthermore, we also consider Sinkhorn divergence-based approach for UOT ( $\text{SD}_{\text{UOT}}$ ) (Séjourné et al.,, 2019) as an extra baseline. As we noted in the main manuscript, $\text{SD}_{\text{UOT}}$ is the debiased version of Sinkhorn-based approach for UOT ( $\text{S}_{\text{UOT}}$ ) which may be helpful for applications. Both $\text{SD}_{\text{UOT}}$ and $\text{S}_{\text{UOT}}$ are empirically indefinite and they have the same computational complexity.

We illustrate SVM results for document classification and TDA with the extra baseline $\text{SD}_{\text{UOT}}$ for both graphs ${\mathbb{G}}_{\text{Sqrt}}$ and ${\mathbb{G}}_{\text{Log}}$ in Figure 41, Figure 42, Figure 43, and Figure 44, which correspond to Figure 1 (in the main text), Figure 2 (in the main text), Figure 5, and Figure 6 respectively.

B.3.3 Further Discussions on Empirical Results

The unbalanced Sobolev transport (UST) $\text{US}_{p}^{\alpha}$ versus $d_{\alpha}$ of entropy partial transport (EPT) on a tree.

Overall, the performances of UST compare favorably with those of $d_{\alpha}$ of EPT on a tree. Moreover, the time consumption of UST is comparable to that of $d_{\alpha}$ of EPT on a tree. Thus, by exploiting the full graph structure, UST improves on the performances of $d_{\alpha}$ of EPT on a tree while keeping its advantage in computational complexity.

The unbalanced Sobolev transport (UST) versus Sinkhorn-based unbalanced optimal transport (UOT).

The performances of UST are comparable to those of Sinkhorn-based UOT. Recall that kernels for UST are positive definite while kernels for Sinkhorn-based UOT are empirically indefinite. This indefiniteness may affect performances of Sinkhorn-based UOT in some settings (e.g., datasets or graph structure). It is worth noting that UST is several orders of magnitude faster than Sinkhorn-based UOT. Therefore, applying Sinkhorn-based UOT is prohibitive in large-scale settings, while our proposed approach (UST) scales to such settings.

The effects of the number of slices (i.e., the number of root nodes used for averaging).

In general, when one increases the number of slices for UST (and $d_{\alpha}$ of EPT on a tree), the corresponding performances also increase, but this comes with a trade-off in time consumption (which is linear in the number of slices). We observe that 10 slices seem to be a good trade-off between performances and time consumption, similar to observations in (Le and Nguyen, 2021).

Unbalanced Sobolev transport with different $p$ .

In our experiments on document classification and TDA, we observe that UST with $p=1$ consistently gives better performances than UST with $p=2$. (Recall that UST with $p=1$ has a stronger connection to EPT on graphs than UST with $p=2$, as illustrated in Lemma A.2.) Generally, one may tune the parameter $p$ to improve performances of UST in applications.

The extra baseline: Sinkhorn divergence-based approach for UOT.

In our experiments, the performances of the extra baseline $\text{SD}_{\text{UOT}}$ are similar to those of $\text{S}_{\text{UOT}}$ when compared with the performances of $d_{\alpha}$ (EPT on a tree) and our proposed UST. The debiasing property of $\text{SD}_{\text{UOT}}$ improves on the performances of $\text{S}_{\text{UOT}}$ on some datasets, especially the datasets in TDA tasks (Orbit and MPEG7). For document datasets, the performances of $\text{SD}_{\text{UOT}}$ and $\text{S}_{\text{UOT}}$ are comparable (the role of the debiasing property is not clear).

Bibliography

Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(1):218–252.
Altschuler, J. M., Chewi, S., Gerber, P., and Stromme, A. J. (2021). Averaging on the Bures-Wasserstein manifold: Dimension-free convergence of gradient descent. Advances in Neural Information Processing Systems.
Balaji, Y., Chellappa, R., and Feizi, S. (2020). Robust optimal transport with applications in generative modeling and domain adaptation. Advances in Neural Information Processing Systems, 33:12934–12944.
Benamou, J.-D. (2003). Numerical resolution of an “unbalanced” mass transport problem. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, 37(5):851–868.
Berg, C., Christensen, J. P. R., and Ressel, P., editors (1984). Harmonic analysis on semigroups. Springer-Verlag, New York.
Bonneel, N. and Coeurjolly, D. (2019). Spot: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1–13.
Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. (2015). Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45.
Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. (2019). Learning Generative Models across Incomparable Spaces. In International Conference on Machine Learning (ICML), volume 97.
