Inverse optimal transport

Andrew M. Stuart; Marie-Therese Wolfram

arXiv:1905.03950·math.OC·May 13, 2019·SIAM J. Appl. Math.

TL;DR

This paper introduces a Bayesian approach to infer unknown cost functions in optimal transport problems from noisy observations, demonstrated through international migration data, with a focus on estimating transition costs and their uncertainties.

Contribution

It presents a novel systematic method to recover unknown costs in optimal transport from noisy data, requiring only solving linear programs and random sampling, with a Bayesian interpretation.

Findings

1. Successfully estimated migration transition costs.

2. Quantified uncertainty in cost estimates.

3. Validated methodology on real-world migration data.

Abstract

Discrete optimal transportation problems arise in various contexts in engineering, the sciences and the social sciences. Often the underlying cost criterion is unknown, or only partly known, and the observed optimal solutions are corrupted by noise. In this paper we propose a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. The algorithm requires only the ability to solve the forward optimal transport problem, which is a linear program, and to generate random numbers. It has a Bayesian interpretation, and may also be viewed as a form of stochastic optimization. We illustrate the developed methodologies using the example of international migration flows. Reported migration flow data captures (noisily) the number of individuals moving from one country to another in a given period of time. It can be interpreted as a noisy observation of…

Tables

Table 1. Harmonized migration flow statistics for the period 2002-2007; see [8].

From		To
		CZ	DE	DK	LU	NL	PL
CZ	R	0	9,218	262	4	511	45
	S	0	560	24	3	81	583
DE	R	1,362	0	4,001	454	9,182	2,876
	S	8,104	0	3,095	1,686	9,293	100,827
DK	R	46	2,687	0	11	475	34
	S	179	2,612	0	1,387	602	833
LU	R	2	2,282	162	0	161	5
	S	13	911	99	0	97	23
NL	R	255	13,681	864	27	0	163
	S	298	10,493	533	191	0	1,020
PL	R	1,608	136,927	2,436	19	5,744	0
	S	63	14,417	111	23	577	0
Tot:	S	3,273	164,795	7,725	515	16,073	3,123
	R	8,657	28,993	3,862	2,041	10,650	103,286

Table 2. Graph-based cost: acceptance rates in % for different combinations of $\delta_{u}$, $\delta_{v}$ and $\delta_{f}$.

$δ_{u}^{2}$	$δ_{v}^{2}$	$δ_{f}^{2}$	$a$	$a_{u}$	$a_{v}$	$a_{f}$
0.02	0.02	0.04	65	80	52	62
0.04	0.04	0.04	51	65	26	62
0.04	0.02	0.04	60	65	52	62

Table 3. Toeplitz cost: acceptance rates in % for different combinations of $\delta_{u}$, $\delta_{v}$ and $\delta_{f}$.

$δ_{u}^{2}$	$δ_{v}^{2}$	$δ_{f}^{2}$	$a$	$a_{u}$	$a_{v}$	$a_{f}$
0.02	0.02	0.02	66.1	67.9	55.4	75
0.02	0.02	0.04	60.3	68.3	55.3	52.7
0.04	0.04	0.04	44.1	45.5	30.0	57

Equations

$$\mathcal{P}_{n\times n}=\Bigl\{B\in\mathbb{R}^{n\times n}:B_{ij}\geq 0,\ \sum_{i,j=1}^{n}B_{ij}=1\Bigr\},\qquad \mathcal{P}_{n}=\Bigl\{u\in\mathbb{R}^{n}:u_{j}\geq 0,\ \sum_{j=1}^{n}u_{j}=1\Bigr\},$$

$$\mathcal{S}_{\mathfrak{p},\mathfrak{q}}=\Bigl\{B\in\mathcal{P}_{n\times n}:B\boldsymbol{1}=\mathfrak{p},\ B^{T}\boldsymbol{1}=\mathfrak{q}\Bigr\}\quad\text{for }\mathfrak{p},\mathfrak{q}\in\mathcal{P}_{n},\text{ where }\boldsymbol{1}=(1,\cdots,1)^{T}\in\mathbb{R}^{n}.$$

$$T^{*}\in\operatorname*{argmin}_{T\in\mathcal{S}_{\mathfrak{p},\mathfrak{q}}}\langle C,T\rangle.$$

$$T^{*}=\mathcal{F}(\mathfrak{p},\mathfrak{q},C).$$

$$H(T)=-\langle T,\log(T)\rangle+\operatorname{Tr}(T)=-\sum_{i,j=1}^{n}T_{i,j}\bigl(\log T_{i,j}-1\bigr),$$

$$T^{*}_{\epsilon}=\operatorname*{argmin}_{T\in\mathcal{S}_{\mathfrak{p},\mathfrak{q}}}\Bigl(\langle C,T\rangle-\epsilon H(T)\Bigr).$$

$$T_{\epsilon}^{*}=\mathcal{F}_{\epsilon}(\mathfrak{p},\mathfrak{q},C).$$

$$D_{\mathrm{KL}}(T\,\|\,K):=\langle T,\log(T/K)\rangle-\operatorname{Tr}(T)+\operatorname{Tr}(K)=\sum_{i,j=1}^{n}\Bigl(T_{i,j}\log\frac{T_{i,j}}{K_{i,j}}-T_{i,j}+K_{i,j}\Bigr)$$

$$K_{i,j}=\exp\Bigl(-\frac{C_{i,j}}{\epsilon}\Bigr).$$

$$T_{\epsilon}^{*}=\operatorname*{argmin}_{T\in\mathcal{S}_{\mathfrak{p},\mathfrak{q}}}D_{\mathrm{KL}}(T\,\|\,K).$$

$$C_{ii}=\bar{C}\gg 1\quad\text{for all } i=1,\ldots,n.$$

$$\mathcal{M}_{n}(u)_{j}=\frac{u_{j}}{\sum_{\ell=1}^{n}u_{\ell}}$$

$$\mathcal{M}_{n\times n}(W)_{i,j}=\frac{W_{i,j}}{\sum_{k,\ell=1}^{n}W_{k,\ell}}.$$

$$T^{*}=\mathcal{G}(u,v,W):=\mathcal{F}\bigl(\mathcal{M}_{n}(u),\mathcal{M}_{n}(v),\mathcal{M}_{n\times n}(W)\bigr)$$

$$T^{*}=\mathcal{G}(u,v,f):=\mathcal{F}\bigl(\mathcal{M}_{n}(u),\mathcal{M}_{n}(v),\mathcal{M}_{n\times n}(\mathcal{E}(f))\bigr).$$

$$2n+m-3>n^{2}-1.$$

$$T=\mathcal{G}(u,v,W)+\eta$$

$$\Phi(u,v,W;T)=\frac{1}{2\sigma^{2}}\bigl|T-\mathcal{G}(u,v,W)\bigr|^{2}.$$

$$\mathbb{P}(u,v,W\mid T)=\frac{1}{\mathbb{P}(T)}\,\mathbb{P}(T\mid u,v,W)\,\mathbb{P}(u,v,W)$$

$$\mathbb{P}(T\mid u,v,W)\propto\exp\Bigl(-\frac{1}{2\sigma^{2}}|T-\mathcal{G}(u,v,W)|^{2}\Bigr)=\exp\Bigl(-\Phi(u,v,W;T)\Bigr).$$

$$\mathbb{P}(u,v,W\mid T)=\frac{1}{Z}\exp\Bigl(-\frac{1}{2\sigma^{2}}|T-\mathcal{G}(u,v,W)|^{2}\Bigr),\qquad (u,v,W)\in\mathsf{U},$$

$$Z=\int_{\mathsf{U}}\exp\Bigl(-\frac{1}{2\sigma^{2}}|T-\mathcal{G}(u,v,W)|^{2}\Bigr)du\,dv\,dW\,.$$

$$T_{i,j}=a_{i}K_{i,j}b_{j}.$$

$$\operatorname{diag}(a)\,K\,b=\mathfrak{p}\quad\text{and}\quad\operatorname{diag}(b)\,K^{T}a=\mathfrak{q}.$$

$$a^{(l+1)}=\frac{\mathfrak{p}}{K b^{(l)}}\quad\text{and}\quad b^{(l+1)}=\frac{\mathfrak{q}}{K^{T}a^{(l+1)}}.$$

$$(u^{k+1},v^{k+1},W^{k+1})=\begin{cases}(x,y,Z) & \text{with probability } a\bigl((u^{k},v^{k},W^{k}),(x,y,Z)\bigr),\\ (u^{k},v^{k},W^{k}) & \text{otherwise},\end{cases}$$

$$a\bigl((u,v,W),(x,y,Z)\bigr)=\min\Bigl\{1,\exp\Bigl(\frac{1}{2\sigma^{2}}\lvert T-\mathcal{G}(u,v,W)\rvert^{2}-\frac{1}{2\sigma^{2}}\lvert T-\mathcal{G}(x,y,Z)\rvert^{2}\Bigr)\Bigr\}.$$

Full text

Inverse Optimal Transport

Andrew M. Stuart

California Institute of Technology, 1200 E. California Blvd, Pasadena, CA 91125

and

Marie-Therese Wolfram

University of Warwick, Coventry CV4 7AL, UK and RICAM, Austrian Academy of Sciences, Altenbergerstr. 66, 4040 Linz, AT

Abstract.

Discrete optimal transportation problems arise in various contexts in engineering, the sciences and the social sciences. Often the underlying cost criterion is unknown, or only partly known, and the observed optimal solutions are corrupted by noise. In this paper we propose a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. The algorithm requires only the ability to solve the forward optimal transport problem, which is a linear program, and to generate random numbers. It has a Bayesian interpretation, and may also be viewed as a form of stochastic optimization.

We illustrate the developed methodologies using the example of international migration flows. Reported migration flow data captures (noisily) the number of individuals moving from one country to another in a given period of time. It can be interpreted as a noisy observation of an optimal transportation map, with costs related to the geographical position of countries. We use a graph-based formulation of the problem, with countries at the nodes of graphs and non-zero weighted adjacencies only on edges between countries which share a border. We use the proposed algorithm to estimate the weights, which represent cost of transition, and to quantify uncertainty in these weights.

1. Introduction

1.1. Background

There are many problems in engineering, the sciences and the social sciences in which an input is transformed into output in an optimal way according to a cost criterion. We are interested in problems where the transformation from input to output is known, and the objective is to infer the cost criterion which drives this transformation. Our primary motivation is optimal transport (OT) problems in which the transport plan is known but the cost is not. More generally linear programs in which the solution is known, but the cost function and constraints are to be determined, fall into the category of problems to which the methodology introduced in this paper applies. We illustrate the type of problem of interest by means of an example.

Example: International Migration. Quantifying migration flows between countries is essential to understand contemporary migration flow patterns. Typically two types of migration statistics are collected – flow and stock data. Migration stock data states the number of foreign born individuals present in a country at a given time and is usually based on population censuses. Stock data is available for almost all countries in the world. Migration flow data captures the number of migrants entering and leaving (inflow and outflow, respectively) a country over the course of a specific period, such as one year; see [1]. It is collected by most developed countries, but no international standards are defined. For example the time of residence after which a person counts as an international migrant varies from country to country. Because of the different definitions and data collection methods, these statistics can be hard to compare. International agencies, such as the United Nations Statistics Division or the Statistical Office of the European Union (Eurostat), publish annual migration flow estimates. These estimates are often based on Poisson or Bayesian linear regression. For more information about the estimation of migration flows using flow or stock statistics we refer to [18, 19, 2, 4]. For the purposes of this paper the main issue to appreciate is that migration data is available, but should be viewed as noisy.

Flow data is typically presented in an origin-destination matrix, in which the $(i,j)^{\rm{th}}$ off-diagonal entry contains the number of people moving from country $i$ to country $j$ in a given period of time. This origin-destination data can be reported by both the sending (S) and the receiving (R) countries. Hence two migration flow tables are available, often disaggregated by sex and age groups. Table 1 shows harmonized data, which was pre-processed to improve comparability, reported by 6 European countries for the period 2002-2007. The numbers reported by the sending and the receiving countries differ significantly. For example Germany reported that $136,927$ people immigrated from Poland, while Poland reported only $14,417$ individuals who left for Germany. These very different numbers naturally raise the question of the true migration flows. In many settings it is natural to place greater weight on receiving data rather than departure data. But even this data is not subject to uniform standards, and therefore providing reliable estimates and quantifying uncertainty is of great interest.

We interpret the reported origin-destination data maps (when appropriately normalized) as a noisy estimate of a transport plan arising from an OT problem with unknown cost. It is then natural to try and infer the transportation cost as it carries information about the migration process. $\diamond$

The preceding example serves as motivation, and we will come back to it throughout this paper. However we reemphasize that the proposed identification methodologies that we introduce in this paper can be used for general inverse OT and linear programming problems; further examples will serve to illustrate this fact.

1.2. Literature Review

Optimal transport originates with the French mathematician Gaspard Monge who, in 1781, investigated the problem of finding the most cost-effective way to move a pile of sand to fill a hole of the same volume. Kantorovich introduced the modern (relaxed) formulation of the problem, in which mass can be split, in 1942. In more mathematical terms Kantorovich considered the following setup: given two positive measures (of equal mass) and a cost function, find the transportation map that moves one measure to the other minimizing the transport cost. The corresponding infimum induces a distance between these two measures – the so-called Wasserstein distance. The Wasserstein distance plays an important role in probability theory, partial differential equations (PDE) and many other fields in applied mathematics [28, 25]. Furthermore the techniques and methodologies developed in OT have found application in a variety of scientific disciplines including data science, economics, imaging and meteorology [12].

With the spread and application of OT into different scientific disciplines the interest in computational methodologies has increased. Commonly used numerical methods broadly speaking fall into two categories: linear programming [7] and methods specific to the structure of OT. Linear programs are classic problems which have been extensively studied in the field of optimization and operations research. Many computational methodologies have been developed, such as the famous simplex algorithm (and its many variants), the Hungarian algorithm and the auction algorithm. All these methods work well for small to medium sized problems, but are too slow in modern applications such as imaging or supply chain management. Recently a significant speed-up over linear programming was achieved by considering a regularized OT problem, in which an additional entropic regularization term is added to the objective function, leading to the Sinkhorn algorithm (or variants thereof); this allows efficient computation of the corresponding minimizer and induces a trade-off between fidelity to the original problem and computational speed. This family of efficient algorithms resulted in the rapid advancement of computational OT in recent years, especially in the context of imaging and data science; see [17, 6, 20].

Inverse problems for linear programming have received considerable interest in the engineering literature. The paper [3], building on earlier work in [30], studies the problem by seeking a cost function nearest to a given one in $\ell^{p}$ for which the given solution is optimal for the linear program; this problem is itself a linear program in the case $p=1.$ The formulation of an inverse problem for linear programming in [9] took a slightly more general perspective, as it does not assume that the given data necessarily arises as the solution of a linear program, and rather seeks to minimize the distance to the solution set of a linear program. A recent application of the inverse problem for linear programming may be found in [24], for example. These papers on inverse linear programming are foundational and have opened up a great deal of subsequent research. However the methods in them do not account in a systematic way for noise in the data provided, or for the incorporation of prior information. We address these issues by adopting a Bayesian formulation of the inverse problem for linear programming, concentrating on OT in particular; the ideas are readily generalized to inverse linear programming in general. The Bayesian approach not only allows for the quantification of uncertainty, but also leads to new (stochastic) optimization methods. An overview of the computational state of the art for Bayesian inversion may be found in [14]. The specific methods that we introduce have the desirable feature that they require only solution of the forward OT problem and the ability to generate random numbers.

1.3. Our Contribution

Our contributions to the subject of inverse problems within linear programming are as follows.

• We formulate inverse OT problems in a Bayesian framework.

• We provide a computational framework for solving inverse OT problems in an efficient fashion.

• We introduce graph-based cost functions for OT, using graph-shortest paths in an integral way.

• Graph-based OT has considerable potential for application, and we introduce a new way of studying migration flow data using inverse OT in the graph-based setting.

We emphasize that, whilst the graph-based formulation of cost corresponds to a rather specific way of designing cost functions for discrete linear programs, the framework and algorithms developed in this paper apply quite generally to inverse linear programming, and hence to OT in general. We develop the methodology in general, using graph-based migration flow as a primary illustrative example. In section 2 we define OT as a linear program, describe the cost criteria considered, and formulate inverse OT in a Bayesian setting. Section 3 presents algorithms for the forward and inverse OT problem and section 4 contains numerical results.

We will use the following notation throughout this manuscript. Let $|\cdot|$ and $\langle\cdot,\cdot\rangle$ denote the Euclidean norm and inner-product on $\mathbb{R}^{n}$ and the Frobenius norm and inner-product on $\mathbb{R}^{n\times n}.$ The spaces of probability matrices, probability vectors and probability matrices with specified marginals are defined as

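$$\mathcal{P}_{n\times n}=\Bigl\{B\in\mathbb{R}^{n\times n}:B_{ij}\geq 0,\ \sum_{i,j=1}^{n}B_{ij}=1\Bigr\},\qquad \mathcal{P}_{n}=\Bigl\{u\in\mathbb{R}^{n}:u_{j}\geq 0,\ \sum_{j=1}^{n}u_{j}=1\Bigr\},$$

$$\mathcal{S}_{\mathfrak{p},\mathfrak{q}}=\Bigl\{B\in\mathcal{P}_{n\times n}:B\boldsymbol{1}=\mathfrak{p},\ B^{T}\boldsymbol{1}=\mathfrak{q}\Bigr\}\quad\text{for }\mathfrak{p},\mathfrak{q}\in\mathcal{P}_{n},\text{ where }\boldsymbol{1}=(1,\cdots,1)^{T}\in\mathbb{R}^{n}.$$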

2. Inverse Optimal Transport

In this section we introduce the forward OT problem and discuss specific cost criteria, before formulating the respective inverse OT problem in the Bayesian framework.

2.1. Forward Problem

We consider two discrete probability vectors $\mathfrak{q}\in{\mathcal{P}}_{n}$ and $\mathfrak{p}\in{\mathcal{P}}_{n}$ and a given cost $C\in{\mathcal{P}}_{n\times n}$ . Then the optimal transport problem corresponds to finding a map transporting $\mathfrak{p}$ to $\mathfrak{q}$ at minimal cost. Note that in OT the cost matrix has non-negative entries, which can be normalized to be an element of ${\mathcal{P}}_{n\times n}$ without loss of generality. The respective forward OT problem is to find

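$$T^{*}\in\operatorname*{argmin}_{T\in\mathcal{S}_{\mathfrak{p},\mathfrak{q}}}\langle C,T\rangle. \tag{1}$$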

Problem (1) falls into the more general class of linear programs. Linear programs (and their many variants) arise in various specific settings – such as the earth mover’s distance (EMD) [23] or cost network flows [5] – in different scientific communities. The problem (1) has, by virtue of being a specific class of linear programs, at least one solution; this solution lies on the boundary of the feasible set of solutions (defined by the equality constraints). If the solution is unique then we define the mapping $\mathcal{F}:{\mathcal{P}}_{n}\times{\mathcal{P}}_{n}\times{\mathcal{P}}_{n\times n}\rightarrow{\mathcal{P}}_{n\times n}$ by

$$T^{*}=\mathcal{F}(\mathfrak{p},\mathfrak{q},C). \tag{2}$$

In the non-unique setting we define $\mathcal{F}(\mathfrak{p},\mathfrak{q},C)$ to be a unique element determined by running a specific non-random algorithm for the linear program to termination, started at a specific initial guess.
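As a toy illustration of the forward map $\mathcal{F}$, the following sketch solves (1) exactly in the special case of uniform marginals. In that case the vertices of the feasible polytope are permutation matrices scaled by $1/n$ (Birkhoff-von Neumann), so a brute-force search over permutations is enough for small $n$. This is only an illustrative assumption: the function name `solve_uniform_ot`, the cost values and the diagonal penalty `c_bar` are hypothetical, and a general problem would be passed to a linear programming solver instead.

```python
from itertools import permutations

def solve_uniform_ot(C):
    """Solve min <C, T> over plans with uniform marginals p = q = (1/n, ..., 1/n).

    Assumption: uniform marginals, so an optimal plan can be found among
    the scaled permutation matrices; only sensible for small n.
    """
    n = len(C)
    best_cost, best_perm = None, None
    for perm in permutations(range(n)):
        # Each nonzero entry of the plan carries mass 1/n.
        cost = sum(C[i][perm[i]] for i in range(n)) / n
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    T = [[0.0] * n for _ in range(n)]
    for i, j in enumerate(best_perm):
        T[i][j] = 1.0 / n
    return T, best_cost

# Hypothetical non-symmetric cost with a large diagonal penalty ("staying" is expensive).
c_bar = 10.0
C = [[c_bar, 1.0, 3.0],
     [2.0, c_bar, 1.0],
     [1.0, 4.0, c_bar]]
T_star, cost_star = solve_uniform_ot(C)
```

The optimal plan is sparse and places no mass on the diagonal. Since rescaling $C$ does not change the minimizer, the cost is left unnormalized here.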

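We now consider (1) regularized by the discrete entropy, an approach popularized in [6, 17] which has led to considerable analytical and computational developments. The discrete entropy is

$$H(T)=-\langle T,\log(T)\rangle+\operatorname{Tr}(T)=-\sum_{i,j=1}^{n}T_{i,j}\bigl(\log T_{i,j}-1\bigr), \tag{3}$$

where the matrix logarithm operation is applied elementwise. The resulting regularized problem is

$$T^{*}_{\epsilon}=\operatorname*{argmin}_{T\in\mathcal{S}_{\mathfrak{p},\mathfrak{q}}}\Bigl(\langle C,T\rangle-\epsilon H(T)\Bigr). \tag{4}$$

This problem has a unique minimizer $T_{\epsilon}^{*}$ , since $-H(T)$ is strongly convex. Following our previous notation we define the corresponding mapping $\mathcal{F}_{\epsilon}:{\mathcal{P}}_{n}\times{\mathcal{P}}_{n}\times{\mathcal{P}}_{n\times n}\rightarrow{\mathcal{P}}_{n\times n}$ by

$$T_{\epsilon}^{*}=\mathcal{F}_{\epsilon}(\mathfrak{p},\mathfrak{q},C).$$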

The minimizer $T_{\epsilon}^{*}$ is, in contrast to the optimal solution of (1), not sparse. It is known that solutions to (4) converge to minimisers of (1) as $\epsilon\rightarrow 0$ . Determining the rate of convergence is still an open problem. The special structure of this regularized problem can be used to construct efficient splitting algorithms. These methods are based on the equivalent formulation of finding the projection of the joint coupling with respect to the Kullback-Leibler divergence

$$D_{\mathrm{KL}}(T\,\|\,K):=\langle T,\log(T/K)\rangle-\operatorname{Tr}(T)+\operatorname{Tr}(K)=\sum_{i,j=1}^{n}\Bigl(T_{i,j}\log\frac{T_{i,j}}{K_{i,j}}-T_{i,j}+K_{i,j}\Bigr),$$

where the matrix logarithm and division operations are applied elementwise and $K$ is the Gibbs kernel

$$K_{i,j}=\exp\Bigl(-\frac{C_{i,j}}{\epsilon}\Bigr). \tag{6}$$

In particular

$$T_{\epsilon}^{*}=\operatorname*{argmin}_{T\in\mathcal{S}_{\mathfrak{p},\mathfrak{q}}}D_{\mathrm{KL}}(T\,\|\,K). \tag{7}$$

The Kullback-Leibler projection can be computed extremely efficiently using proximal methods, yielding for example the celebrated Sinkhorn algorithm. We will briefly outline the underlying ideas in Section 3.1.

2.2. Cost Criteria

Problems (1) or (4) are formulated for general cost matrices $C$ - the specific structure of $C$ depends on the application considered. We will investigate the behavior of the proposed methodologies for $C$ being:

(i) Toeplitz;

(ii) non-symmetric;

(iii) determined by an underlying graph structure.

We assume that all individuals move, hence $T_{ii}=0$ for all $i=1,\ldots,n$ in all three cases. Therefore ’staying’ is penalized by setting

$$C_{ii}=\bar{C}\gg 1\quad\text{for all } i=1,\ldots,n.$$

If $C$ is Toeplitz the cost depends on the difference between indices and $C$ has $2n-3$ degrees of freedom. Case (ii) corresponds to general non-symmetric transportation cost, which in the context of migration flows could include factors such as sharing the same language, the ratio of the gross national income per capita or their EU membership. In case (iii) we assume that costs are related to an underlying discrete structure. In the context of migration flows the geographical position of countries defines an underlying graph with edges only between countries which share a border; see Figure 1. We assume that the total transportation cost corresponds to the sum of the individual costs of moving from one country to another along edges of the graph. In defining cost this way we are implicitly assuming that, between the European countries studied here, migration is primarily via land. This resulting discrete underlying structure, which relates the cost matrix to a directed graph representing the migration network between countries, is detailed in the following.

Let $(V,E)$ be a directed graph with $n=|V|$ vertices and a (possibly non-symmetric) weighted adjacency matrix $A\in\mathbb{R}^{n\times n}.$ We can then define a cost matrix $W\in\mathbb{R}^{n\times n}$ whose $(i,j)^{\rm th}$ entry $W_{i,j}$ is the shortest path cost of moving from vertex $i$ to $j$ according to the weighted adjacency matrix $A.$ Let $m$ be the number of non-zero entries of $A$ and $f\in\mathbb{R}^{m}$ the vector defining the non-zero entries. Then we may define a mapping $\mathcal{E}$ such that $W=\mathcal{E}(f).$ This $W\in\mathbb{R}^{n\times n}$ can then be normalized to give a $C\in{\mathcal{P}}_{n\times n}$ and we may define the solution of the resulting OT problem via (2). For this graph-based cost the solution of the OT problem may be viewed as a function of $\mathfrak{p},\mathfrak{q}$ and $f$ . The minimal cost of moving between vertices of a graph can be computed using Dijkstra’s algorithm, recalled in Section 3.1 below.

We define a similar mapping in the case of Toeplitz cost. Here the respective cost matrix $C$ has $2n-2$ free entries, before normalization to a probability matrix and recalling that we fix the diagonal to penalize not moving, and so we define a mapping $\mathcal{E}:f\in\mathbb{R}^{2n-2}\rightarrow\mathbb{R}^{n\times n}_{+}$ ; normalization then gives $C={\mathcal{M}}_{n\times n}(\mathcal{E}(f))$ .
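The Toeplitz mapping $\mathcal{E}$ can be sketched as follows. The ordering of the entries of $f$ by offset $j-i$, the function name `toeplitz_cost` and the penalty value `c_bar` are assumptions made for illustration; the text only fixes the number of free entries, $2n-2$, and the penalized diagonal.

```python
def toeplitz_cost(f, n, c_bar):
    """Build an n x n Toeplitz cost matrix from 2n - 2 off-diagonal values.

    Assumption: f is ordered by offset j - i = -(n-1), ..., -1, 1, ..., n-1.
    The diagonal (offset 0) is set to the penalty c_bar.
    """
    if len(f) != 2 * n - 2:
        raise ValueError("f must have 2n - 2 entries")
    offsets = [k for k in range(-(n - 1), n) if k != 0]
    value = dict(zip(offsets, f))
    return [[c_bar if i == j else value[j - i] for j in range(n)]
            for i in range(n)]

# Hypothetical example with n = 3, hence 4 free entries.
W = toeplitz_cost([0.4, 0.3, 0.1, 0.2], n=3, c_bar=10.0)
```

Every entry on a given off-diagonal is identical, so the cost of moving from $i$ to $j$ depends only on $j-i$, and need not be symmetric.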

2.3. Inverse Problem

The inverse OT problem is to find $\mathfrak{p},\mathfrak{q}$ and $C$ from the solution $T$ to the OT problem (1), or its regularized counterpart (4). We tackle this problem by introducing a space of componentwise positive and real-valued latent variables $u,v,W$ or $u,v,f$ which map to the unknowns $\mathfrak{p},\mathfrak{q}\in{\mathcal{P}}_{n}$ and $C\in{\mathcal{P}}_{n\times n}$ . It is easier, and more natural, to specify priors in terms of these real-valued latent variables. To this end we introduce mappings from $\mathbb{R}^{n}_{+}$ into ${\mathcal{P}}_{n}$ and from $\mathbb{R}^{n\times n}_{+}$ into ${\mathcal{P}}_{n\times n}$ as follows: ${\mathcal{M}}_{n}:\mathbb{R}^{n}_{+}\mapsto{\mathcal{P}}_{n}$ is defined by

$$\mathcal{M}_{n}(u)_{j}=\frac{u_{j}}{\sum_{\ell=1}^{n}u_{\ell}}$$

and ${\mathcal{M}}_{n\times n}:\mathbb{R}^{n\times n}_{+}\mapsto{\mathcal{P}}_{n\times n}$ is defined by

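$$\mathcal{M}_{n\times n}(W)_{i,j}=\frac{W_{i,j}}{\sum_{k,\ell=1}^{n}W_{k,\ell}}.$$

Note that ${\mathcal{M}}_{n}(\lambda u)={\mathcal{M}}_{n}(u)$ for all $\lambda>0$ ; the same holds for ${\mathcal{M}}_{n\times n}.$ Then the forward problem (2) can be written as

$$T^{*}=\mathcal{G}(u,v,W):=\mathcal{F}\bigl(\mathcal{M}_{n}(u),\mathcal{M}_{n}(v),\mathcal{M}_{n\times n}(W)\bigr) \tag{9}$$

or, in the case of graph-based cost or Toeplitz cost, we have

$$T^{*}=\mathcal{G}(u,v,f):=\mathcal{F}\bigl(\mathcal{M}_{n}(u),\mathcal{M}_{n}(v),\mathcal{M}_{n\times n}(\mathcal{E}(f))\bigr). \tag{10}$$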

This is readily generalized to the use of regularized optimal transport as the forward model, simply replacing $\mathcal{F}$ by $\mathcal{F}_{\epsilon}.$
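The normalization maps ${\mathcal{M}}_{n}$ and ${\mathcal{M}}_{n\times n}$ and their scale invariance can be illustrated with a few lines of code. This is a minimal sketch; the function names and the sample values are hypothetical.

```python
def normalize_vector(u):
    """M_n: map a componentwise-positive vector to a probability vector."""
    total = sum(u)
    return [x / total for x in u]

def normalize_matrix(W):
    """M_{n x n}: map a componentwise-positive matrix to a probability matrix."""
    total = sum(sum(row) for row in W)
    return [[x / total for x in row] for row in W]

u = [1.0, 2.0, 5.0]
p = normalize_vector(u)
# Rescaling the latent variable by a positive constant leaves p unchanged.
p_scaled = normalize_vector([3.0 * x for x in u])

W = [[1.0, 3.0], [2.0, 2.0]]
C = normalize_matrix(W)
```

Because of this invariance, the latent variables are only identifiable up to a positive scaling, which is why the normalization is counted when comparing unknowns and equations.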

We wish to invert the map $\mathcal{G}$ , given noisy observations of $T^{*}.$ Such problems are in general ill-posed, hence suitably regularized versions have to be considered. Different approaches can be found in the literature – we focus on the Bayesian framework, which allows us to estimate the posterior distribution of $u,v$ and $W$ (or $f$ ).

Depending on the structure of the cost matrix the inverse problem related to (9) or (10) can be over- or underdetermined. We recall that in case of Toeplitz cost the matrix $C$ has $2n-3$ degrees of freedom. Then we have $n^{2}-1$ equations for $4n-5$ unknowns (taking into account the normalization of $u$ , $v$ and $W$ ). Hence the inverse problem is overdetermined for $n>2$ . If $C$ is a general cost matrix with a set penalty on the diagonal, that is case (ii), the cost matrix has $n^{2}-n$ degrees of freedom. In total we have $n^{2}+n-3$ unknowns, and therefore the problem is underdetermined for $n>2$ . For graph-based cost (case (iii)) the matrix $C$ has $m$ degrees of freedom and therefore the problem is underdetermined if

$$2n+m-3>n^{2}-1.$$
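The counting argument can be checked for a concrete size, for instance $n=6$ as in Table 1. The sketch below simply encodes the counts stated in the text; the function names and the edge count `m=14` are hypothetical and chosen only for illustration.

```python
def count_equations(n):
    """Number of independent entries of an observed plan in P_{n x n}."""
    return n * n - 1

def count_unknowns(n, cost_type, m=None):
    """Unknowns after normalization: u and v contribute n - 1 each, plus the cost."""
    marginals = 2 * (n - 1)
    if cost_type == "toeplitz":
        cost = 2 * n - 3          # 2n - 2 free entries minus normalization
    elif cost_type == "general":
        cost = n * n - n - 1      # fixed diagonal, minus normalization
    elif cost_type == "graph":
        cost = m - 1              # m edge weights minus normalization
    else:
        raise ValueError("unknown cost type")
    return marginals + cost

n = 6
equations = count_equations(n)                  # 35
toeplitz = count_unknowns(n, "toeplitz")        # 4n - 5 = 19
general = count_unknowns(n, "general")          # n^2 + n - 3 = 39
graph = count_unknowns(n, "graph", m=14)        # 2n + m - 3 = 23
```

For $n=6$ the Toeplitz problem is overdetermined, the general problem is underdetermined, and the graph-based problem becomes underdetermined only once $m>26$.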

2.3.1. Likelihood

We define a Bayesian formulation of the inverse problem, working in the case where $u,v,W$ are the unknowns; the extension to $u,v,f$ as unknowns is similar. We assume that the observed transport maps are corrupted by noise, in particular

$$T=\mathcal{G}(u,v,W)+\eta,$$

where $\eta$ is a mean zero Gaussian random matrix with i.i.d. entries of variance $\sigma^{2}.$ In particular we want to find the conditional probability distribution of $(u,v,W)$ given the noisy observation $T$ , that is $(u,v,W)\mid T$ . The estimation is based on the model-data misfit function

$$\Phi(u,v,W;T)=\frac{1}{2\sigma^{2}}\bigl|T-\mathcal{G}(u,v,W)\bigr|^{2}.$$

Using Bayes’ formula

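$$\mathbb{P}(u,v,W\mid T)=\frac{1}{\mathbb{P}(T)}\,\mathbb{P}(T\mid u,v,W)\,\mathbb{P}(u,v,W), \tag{14}$$

the likelihood, that is the probability of observing $T$ given a realization of $u,v$ and $W$ , which enters the posterior distribution of $u,v$ and $W$ , is given by

$$\mathbb{P}(T\mid u,v,W)\propto\exp\Bigl(-\frac{1}{2\sigma^{2}}|T-\mathcal{G}(u,v,W)|^{2}\Bigr)=\exp\Bigl(-\Phi(u,v,W;T)\Bigr).$$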

In (14) $\mathbb{P}(u,v,W)$ corresponds to the prior information about $u,v$ and $W$ . We assume that $u,v$ and $W$ have i.i.d. entries uniformly distributed in $[0,1]$ and denote the set of vectors and matrices which satisfy this componentwise constraint by $\mathsf{U}.$ In view of the scale invariance of ${\mathcal{M}}_{n}(\cdot)$ the choice of unit interval $[0,1]$ is immaterial; any bounded interval $[0,\lambda]$ would deliver an identical posterior on $u,v,W$ .

Then the posterior distribution of $u,v$ and $W$ is given by

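$$\mathbb{P}(u,v,W\mid T)=\frac{1}{Z}\exp\Bigl(-\frac{1}{2\sigma^{2}}|T-\mathcal{G}(u,v,W)|^{2}\Bigr),\qquad (u,v,W)\in\mathsf{U}, \tag{16}$$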

with a normalization constant

$$Z=\int_{\mathsf{U}}\exp\Bigl(-\frac{1}{2\sigma^{2}}|T-\mathcal{G}(u,v,W)|^{2}\Bigr)du\,dv\,dW\,.$$

We can either sample from the posterior (16) (which corresponds to the full Bayesian approach) or maximize the posterior probability (16), which leads to the optimization problem of minimizing $\Phi(u,v,W;T)$ over $\mathsf{U}$ . The first approach allows us to quantify uncertainty in the estimates of $u,v$ and $W$ , the latter gives a single estimate. We discuss how to sample from the posterior, using a Random walk Metropolis (RwM) method, in Section 3.2.
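The misfit $\Phi$ and the resulting unnormalized log posterior under the uniform prior can be written down directly. In this sketch the model output `T_model` stands for $\mathcal{G}(u,v,W)$ and is supplied by the caller; the function names and the numerical values are hypothetical.

```python
import math

def misfit(T_obs, T_model, sigma):
    """Phi = |T_obs - T_model|^2 / (2 sigma^2), with the Frobenius norm."""
    sq = sum((a - b) ** 2
             for row_obs, row_mod in zip(T_obs, T_model)
             for a, b in zip(row_obs, row_mod))
    return sq / (2.0 * sigma ** 2)

def log_posterior(T_obs, T_model, latent, sigma):
    """Unnormalized log posterior: -Phi inside the box [0, 1], -inf outside.

    `latent` is a flat list of all entries of (u, v, W).
    """
    if any(x < 0.0 or x > 1.0 for x in latent):
        return -math.inf
    return -misfit(T_obs, T_model, sigma)

T_obs = [[0.1, 0.4], [0.3, 0.2]]
T_model = [[0.0, 0.5], [0.5, 0.0]]
phi = misfit(T_obs, T_model, sigma=0.1)
```

Maximizing this log posterior over the box is the same as minimizing $\Phi$ there, which is the optimization viewpoint mentioned above.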

3. Algorithms For Inversion

In the following we present the numerical methods used in the computational experiments in Section 4. Since the proposed Bayesian framework requires the solution of an OT problem (1) (or its regularized version (4)) in every iteration of the sampling algorithm, computational efficiency is essential. We start by presenting the solvers for the forward OT problem followed by the Markov-Chain-Monte-Carlo methods used to sample from the posterior.

3.1. Computational Optimal Transport

Numerical methods for linear programming go back to the seminal works of Dantzig on the simplex method, see [7]. Solutions to the linear program (1) lie on the boundary of the feasible polytope, which is defined by the constraints. The simplex method iterates over the vertices of this polytope to find the optimal solution, see [16]. The method works well in practice; however, examples in which the performance scales exponentially with the dimension of the problem can be constructed. Different approaches to speed up computations have been proposed: for example network simplex algorithms are based on the fact that specific linear programs can be formulated as minimization problems on graphs. The particular structure of the underlying graph can be used to speed up the simplex method significantly. Further information on computational methods for linear programming can be found in [8].

More recently computational techniques which are based on the regularized OT problem (4) have been proposed in the literature. These methods are extremely efficient, since they are based on the formulation of the OT problem in terms of the Kullback-Leibler divergence (7). Its minimiser is given by

$$T_{i,j}=a_{i}K_{i,j}b_{j}.$$

Here $K$ is the Gibbs kernel (6) and the vectors $a$ and $b$ satisfy the mass constraint

$$\operatorname{diag}(a)\,K\,b=\mathfrak{p}\quad\text{and}\quad\operatorname{diag}(b)\,K^{T}a=\mathfrak{q}.$$

This mass constraint can be enforced iteratively via

$$a^{(l+1)}=\frac{\mathfrak{p}}{K b^{(l)}}\quad\text{and}\quad b^{(l+1)}=\frac{\mathfrak{q}}{K^{T}a^{(l+1)}}, \tag{18}$$

where the divisions are applied elementwise.

This splitting, known as Sinkhorn’s algorithm, is very efficient as it involves matrix vector multiplications only. Since the entropic regularization term (3) introduces blurring in the otherwise sparse solution, one is interested in keeping $\epsilon$ as small as possible. Since the convergence of Sinkhorn’s algorithm (18) deteriorates as $\epsilon\rightarrow 0$ , it is important to keep a balance between regularization and computational stability. In practice small values of $\epsilon$ lead to diverging scaling factors in (18) and subsequent numerical instabilities. These problems can often be remedied using suitable scalings, see [26].
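The Sinkhorn scaling iteration can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper: it uses no stabilization, a fixed number of iterations, and a hypothetical toy cost without the diagonal penalty; the function name `sinkhorn` and all numerical values are assumptions.

```python
import math

def sinkhorn(p, q, C, eps, n_iter=500):
    """Entropic OT plan T = diag(a) K diag(b) with K = exp(-C / eps)."""
    n = len(p)
    K = [[math.exp(-C[i][j] / eps) for j in range(n)] for i in range(n)]
    a = [1.0] * n
    b = [1.0] * n
    for _ in range(n_iter):
        # Row scaling: a <- p / (K b), elementwise.
        a = [p[i] / sum(K[i][j] * b[j] for j in range(n)) for i in range(n)]
        # Column scaling: b <- q / (K^T a), elementwise.
        b = [q[j] / sum(K[i][j] * a[i] for i in range(n)) for j in range(n)]
    return [[a[i] * K[i][j] * b[j] for j in range(n)] for i in range(n)]

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
C = [[abs(i - j) for j in range(3)] for i in range(3)]
T = sinkhorn(p, q, C, eps=0.5)
row_sums = [sum(row) for row in T]
col_sums = [sum(T[i][j] for i in range(3)) for j in range(3)]
```

Every entry of the resulting plan is strictly positive, which is the blurring effect of the entropic term. For much smaller $\epsilon$ the entries of $K$ underflow and the scalings diverge, which is where stabilized variants become necessary.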

If the transportation costs depend on an underlying discrete structure, such as for our graph-based migration problem, then the computational burden of computing this cost must be taken into consideration. For our example the total transportation cost corresponds to the sum of the edge weights traversed on the shortest path between two vertices. Note that the transportation costs are not necessarily the same in both directions since we consider directed graphs. We use Dijkstra’s algorithm to compute the shortest path from one node to all others in the graph, see [10]. Dijkstra’s algorithm is based on continuous updates of the shortest distance to a starting point, and excludes longer distances in updates. It is the graph-based methodology that underpins the fast marching method to solve the eikonal equation [27].
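A minimal sketch of the graph-based cost map, running Dijkstra's algorithm from every vertex of a small directed graph, is given below. The graph, its weights and the function name `shortest_path_costs` are hypothetical; in the migration setting the vertices would be countries and the edges shared borders.

```python
import heapq

def shortest_path_costs(adjacency):
    """All-pairs shortest path costs via Dijkstra from each source.

    adjacency[i] is a dict {j: weight} of directed edges i -> j
    with strictly positive weights.
    """
    n = len(adjacency)
    W = []
    for source in range(n):
        dist = [float("inf")] * n
        dist[source] = 0.0
        heap = [(0.0, source)]
        while heap:
            d, i = heapq.heappop(heap)
            if d > dist[i]:
                continue  # stale heap entry, a shorter path was already found
            for j, w in adjacency[i].items():
                if d + w < dist[j]:
                    dist[j] = d + w
                    heapq.heappush(heap, (d + w, j))
        W.append(dist)
    return W

# Hypothetical directed graph on 4 vertices with non-symmetric weights.
adjacency = [
    {1: 1.0, 2: 4.0},
    {0: 2.0, 2: 1.5},
    {1: 1.0, 3: 0.5},
    {2: 3.0},
]
W = shortest_path_costs(adjacency)
```

The diagonal of `W` is zero here; in the setting of the text it would subsequently be overwritten by the penalty $\bar{C}$ before normalization.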

3.2. MCMC and Optimization

We propose the use of Markov Chain Monte-Carlo (MCMC) methods to sample from the posterior distribution (16). For the user interested simply in optimization, the algorithm we propose may be viewed as a stochastic optimization method to reduce the model-data misfit. MCMC methods originated with the seminal paper [15], in which what is now termed the Random walk Metropolis (RwM) algorithm was introduced for a specific high dimensional integral required in statistical physics. In our context the key desirable feature of the method is that it requires only the solution of the forward OT (or regularized OT) problem, together with the generation of random numbers. Given a current (approximate) sample from the posterior distribution, a new sample is proposed by adding a mean zero Gaussian to the current one. This is rejected if the resulting new state leaves $\mathsf{U}$, and otherwise accepted with a probability designed to preserve detailed balance with respect to the posterior. The covariance of the Gaussian is an important tuning parameter: intuitively it should be chosen such that the acceptance rate is close to neither $0$ nor $1$, as either of these limits leads to successive iterates which are highly correlated. The optimal scaling of RwM algorithms for different target densities has been investigated in [21, 22]; although the theory developed there applies in rather restricted scenarios, widespread experience and a variety of theories demonstrate that the work leads to useful rules of thumb for tuning acceptance probabilities within the RwM algorithm [29], arguably because it leads to average acceptance probabilities that stay away from $0$ or $1$.
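The propose-and-accept step described above can be sketched generically. The sketch assumes a target density proportional to $\exp(-\Phi)$ restricted to the admissible set $\mathsf{U}$ and an isotropic Gaussian proposal; the toy misfit and the unit-interval constraint are stand-ins, not the OT misfit of the paper, and `delta` is a proposal standard deviation.

```python
import math
import random


def rwm(phi, in_U, x0, delta, n_steps, seed=0):
    """Random walk Metropolis for a density proportional to exp(-phi) on U."""
    rng = random.Random(seed)
    x = list(x0)
    phi_x = phi(x)
    samples, accepted = [], 0
    for _ in range(n_steps):
        y = [xi + delta * rng.gauss(0.0, 1.0) for xi in x]
        if in_U(y):  # proposals that leave U are rejected outright
            phi_y = phi(y)
            # Symmetric proposal: accept with probability min(1, exp(phi_x - phi_y)).
            if math.log(1.0 - rng.random()) < phi_x - phi_y:
                x, phi_x = y, phi_y
                accepted += 1
        samples.append(list(x))
    return samples, accepted / n_steps


def toy_misfit(x):
    # Hypothetical quadratic misfit centred at 0.3 with scale 0.1.
    return 0.5 * sum(((xi - 0.3) / 0.1) ** 2 for xi in x)


def in_unit_box(x):
    return all(0.0 <= xi <= 1.0 for xi in x)


samples, rate = rwm(toy_misfit, in_unit_box, [0.5], delta=0.2, n_steps=20000)
```

Any proposal that stays in the admissible set and lowers the misfit has `phi_x - phi_y > 0` and is therefore always accepted, which is the sense in which the sampler also acts as a stochastic optimizer.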

In 1970 Hastings introduced a wide class of MCMC methods, now known as Metropolis-Hastings algorithms [13], and in principle this provides a wide range of variants on RwM that may be used for our Bayesian formulation of inverse OT. A popular variation of MCMC that we have found useful in the inverse OT setting is Gibbs sampling. In high dimensional spaces it can be hard to design proposals which are accepted with a reasonable acceptance probability, and the idea of fixing subsets of the variables, and proposing moves in the remainder, is natural. The Gibbs sampler allows this to be achieved in a statistically consistent fashion. At each iteration one (or several) components of the unknown parameter are updated by sampling from their full conditional probability distribution, cycling through all the variables. The method may be relaxed to allow a RwM step from the conditional probability distribution, rather than a full sample. The corresponding RwM-within-Gibbs method is outlined in Algorithm 1. In this algorithm we consecutively update $u$, $v$ and $W$ (or $f$). We generate proposals for each variable, which we accept or reject. Note that in general, for all the methods described here, any proposal which decreases the value of $\Phi$ and remains in $\mathsf{U}$ is accepted with probability one. Thus Algorithm 1 may be viewed as an optimization method which induces a stochastic gradient; the numerics will demonstrate that this acts to minimize the misfit.
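The block-wise update can be sketched as follows. Algorithm 1 itself is not reproduced in this section, so the code is only a generic RwM-within-Gibbs loop under the assumption that the target is proportional to $\exp(-\Phi)$ on $\mathsf{U}$; the block names, the toy misfit and the constraint are hypothetical, and `deltas` holds proposal standard deviations (the square roots of the proposal variances).

```python
import math
import random


def rwm_within_gibbs(phi, in_U, state, deltas, n_sweeps, seed=0):
    """Cycle through the blocks of `state`, doing one RwM step per block."""
    rng = random.Random(seed)
    state = {k: list(v) for k, v in state.items()}  # work on a copy
    names = list(state)
    phi_cur = phi(state)
    counts = {k: 0 for k in names}
    chain = []
    for _ in range(n_sweeps):
        for name in names:  # e.g. "u", then "v", then "f"
            proposal = dict(state)
            proposal[name] = [
                x + deltas[name] * rng.gauss(0.0, 1.0) for x in state[name]
            ]
            if not in_U(proposal):
                continue  # the proposal left U: reject, keep the current state
            phi_new = phi(proposal)
            if math.log(1.0 - rng.random()) < phi_cur - phi_new:
                state, phi_cur = proposal, phi_new
                counts[name] += 1
        chain.append({k: list(v) for k, v in state.items()})
    rates = {k: c / n_sweeps for k, c in counts.items()}
    return chain, rates


def toy_misfit(s):
    # Hypothetical misfit: every component is pulled towards 0.5.
    return sum(0.5 * ((x - 0.5) / 0.1) ** 2 for block in s.values() for x in block)


def toy_in_U(s):
    return all(0.0 <= x <= 1.0 for block in s.values() for x in block)


start = {"u": [0.9], "v": [0.9], "f": [0.9]}
chain, rates = rwm_within_gibbs(
    toy_misfit, toy_in_U, start, {"u": 0.2, "v": 0.2, "f": 0.2}, n_sweeps=2000
)
```

The per-block acceptance rates returned in `rates` correspond to the kind of component-wise rates reported in the numerical results; in the inverse OT setting each evaluation of `phi` would involve one forward OT solve.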

4. Numerical Results

In this section we demonstrate the behavior of MCMC methods for inverse OT, and Algorithm 1 in particular. We start by presenting results for the migration flow example introduced at the beginning and use it as a ‘proof-of-concept’ for the proposed framework. We then continue with a systematic numerical investigation to study the identifiability of the cost matrix in a variety of scenarios, and discuss the behavior of the proposed methodology. We focus on the three cost criteria discussed in Section 2.2: Toeplitz cost (i), non-symmetric cost (ii) and graph-based cost (iii). We use the following functions implemented in the POT library [11] to solve the linear program (1) as well as its regularized version (4):

• emd - this solver for linear programs is based on the respective network OT flow formulation of the problem and was introduced in [5].

• sinkhorn - implements the Sinkhorn-Knopp scaling algorithm to solve the regularized OT problem (4) as proposed in [6].

We test the proposed methodologies using simulated data as well as real migration data. To generate simulated data we compute the optimal transportation maps $T$ for a given set of vectors $\mathfrak{p}$, $\mathfrak{q}$ and $f$ and add i.i.d. Gaussian noise with mean $0$ and variance $\sigma^{2}$, see (12). Note that the resulting perturbed map $T^{*}$ may have negative entries and is then not an element of ${\mathcal{P}}_{n\times n}$. Therefore we set all negative entries to zero and normalize the map, to ensure that it is an admissible solution.
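The perturb, clip and renormalize procedure can be sketched as below. The sketch assumes that admissible maps are nonnegative with total mass one, and that `sigma` denotes the standard deviation of the noise; the function name is hypothetical.

```python
import random


def noisy_plan(T, sigma, seed=0):
    """Add i.i.d. N(0, sigma^2) noise to T, clip negatives, renormalize to mass one."""
    rng = random.Random(seed)
    noisy = [[max(t + rng.gauss(0.0, sigma), 0.0) for t in row] for row in T]
    total = sum(sum(row) for row in noisy)
    if total == 0.0:
        raise ValueError("all entries were clipped to zero; cannot normalize")
    return [[t / total for t in row] for row in noisy]


T_star = noisy_plan([[0.5, 0.0], [0.0, 0.5]], sigma=0.04)
```

Clipping makes the effective noise non-Gaussian on entries of `T` that are zero or close to zero, which is worth keeping in mind when interpreting the likelihood.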

We illustrate the performance of the methodologies with plots of the running means and the respective posterior distributions. All posterior distributions are calculated after $500,000$ RwM iterations with a burn-in of $300,000$ . The performed numerical experiments indicate that this number is sufficient for the convergence of MCMC. Note that we always plot the scaled vectors and matrices (unless stated otherwise). The penalty $\bar{C}$ in (8) is set to $10$ . Numerical simulations show that its absolute magnitude does not influence the posterior distributions significantly once above a certain level.
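For reference, the two diagnostics used throughout (running means and posteriors computed after a burn-in) amount to the following elementary operations on a stored chain. The helper names are illustrative, and the short `trace` stands in for one component of an MCMC chain.

```python
from itertools import accumulate


def running_mean(values):
    """Means of values[:1], values[:2], ..., values[:len(values)]."""
    return [total / (k + 1) for k, total in enumerate(accumulate(values))]


def discard_burn_in(chain, burn_in):
    """Keep only the samples generated after the first burn_in iterations."""
    return chain[burn_in:]


trace = [1.0, 2.0, 3.0, 6.0]
means = running_mean(trace)
kept = discard_burn_in(trace, 2)
```

With the settings above, posterior histograms would be built from the last $200,000$ of the $500,000$ iterates.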

4.1. European migration flows

We start by presenting estimates for the European network shown in Figure 1. We recall that vertices represent countries and that edges connect countries sharing a border. The weights of these edges correspond to the cost of moving from one country to another. The network shown in Figure 1 consists of $n=9$ countries, which are connected by $m=30$ edges. We use the estimated transportation map reported in [19] and assume that the noise level is $4\%$. The variance for the proposals is set to $\delta_{u}^{2}=\delta_{v}^{2}=\delta_{W}^{2}=0.04$. We perform two runs of the RwM-within-Gibbs algorithm, using the exact solver in the first and Sinkhorn’s algorithm with $\epsilon=0.04$ in the second. The acceptance rate with the exact solver is $50.8\%$ ($53.8\%$, $53.7\%$, $44.9\%$ for the different components $u$, $v$ and $f$); with Sinkhorn we have $82.9\%$ ($84.7\%$, $85.5\%$, $78.6\%$). The running averages of three components of $u$, $v$ and $f$ are shown in Figure 2 and the corresponding posterior distributions in Figure 3. We observe that both runs give comparable results; however, the misfit for Sinkhorn is smaller, see Figure 4. This difference might be explained by the fact that we underestimate the noise level $\sigma$ or that the actual transportation maps look more like solutions of regularized OT problems than of the OT problem itself.

4.2. Graph-Based Cost

Next we investigate the behavior of the proposed methodologies for graph-based cost more thoroughly. We will see that

• The identification of $u$, $v$ and $f$ is robust with respect to the sampling variances, see Figure 6.

• The posterior estimates are consistent using different solvers, see Figure 7.

These results are obtained using noisy transportation maps $T^{*}$ for a graph connecting $n=5$ nodes with $m=12$ edges. In doing so we solve problem (1) for given vectors $\mathfrak{p}$ , $\mathfrak{q}$ and $f$ and add $4\%$ noise. Note that this inverse problem is overdetermined since $2\cdot 5+12-3<5^{2}-1$ .
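The count behind this inequality can be spelled out as follows, under the assumption (not stated in this section) that the three subtracted degrees of freedom are the normalizations of $u$, $v$ and $f$, and that the normalized map $T^{*}$ supplies $n^{2}-1$ independent data:

$$\underbrace{2n+m-3}_{\text{unknowns}}=2\cdot 5+12-3=19\;<\;24=5^{2}-1=\underbrace{n^{2}-1}_{\text{data}}.$$

Under the same reading, the European network of Section 4.1 ($n=9$, $m=30$) gives $2\cdot 9+30-3=45<80=9^{2}-1$.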

Influence of the Sampling Variance $\delta^{2}$

We start by investigating the impact of the sampling variance $\delta^{2}$. We perform MCMC runs for different combinations of $\delta_{u}$, $\delta_{v}$ and $\delta_{f}$ (listed in Table 2) and compute the running averages and posterior distributions of some components. Note that these parameters affect the rate of convergence of the algorithm, but not the posterior distribution itself. The variance of the proposals determines how much new samples differ from the previous iterates: the larger the variance, the more adventurous the search, but the less likely a proposal is to be accepted, so that rejections lead to highly correlated samples. A smaller variance, on the other hand, gives a higher probability of acceptance but is not adventurous, and hence also leads to highly correlated samples. It is thus desirable to have an acceptance rate that is close to neither $0$ nor $1$. The running averages of three components of $u$, $v$ and $W$ are shown in Figure 5, the respective posteriors in Figure 6. We see that the results are consistent for all combinations of $\delta$’s. However, the respective convergence rates vary, see Table 2. We observe a generally higher acceptance rate when sampling from the marginal distribution of $\mathfrak{p}$, and a decreased rate when increasing the sampling variance.

Exact vs. Sinkhorn

Next we investigate the sensitivity of the results with respect to the forward solver used in Algorithm 1. We run two RwM simulations - the first one using the exact solver and the second one using the Sinkhorn algorithm. We observe that both runs give similar posterior distributions if we choose the regularization parameter $\epsilon$ in a sensible way, see Figure 7. Generally speaking it seems advisable to choose it similar to the noise level (as in the shown results). We will investigate the impact of the regularization parameter in the next subsection in more detail.

4.3. Toeplitz Cost

In the following we present more detailed numerical experiments for the case in which $C$ is Toeplitz. The findings of the numerical experiments performed in this subsection can be summarized as follows:

• The posterior distributions of $u$, $v$ and $f$ are consistent for varying ranges of proposal variances $\delta$, see Figure 8 and Figure 9.

• The exact solver and Sinkhorn’s algorithm converge to similar posteriors, if the entropic regularization parameter $\epsilon$ is chosen sensibly, see Figure 10 and Figure 11.

• The variance of the posteriors increases with the noise level in the data, as shown for example in Figure 13 and Figure 14.

• Sinkhorn’s algorithm gives a higher acceptance rate and a more monotone decrease of the data-misfit function, see Figure 15.

We underpin these statements with numerical simulations using generated noisy transportation maps. We recall that $C$ has $2n-3$ degrees of freedom in case of Toeplitz cost (i). This defines, as in the case of graph-based cost (iii), a mapping from the vector $f\in\mathbb{R}^{2n-3}$ to the cost matrix $C$ , that is $\mathcal{E}:\mathbb{R}^{2n-3}\rightarrow{\mathcal{P}}_{n\times n}$ with $C=\mathcal{E}(f)$ . Hence we generate proposals for the vector $f$ , which define the entries of $C$ .

We set $n=5$ and generate a noisy realization $T^{*}$ for a given set of vectors $\mathfrak{p}$, $\mathfrak{q}\in\mathcal{P}_{5}$ and $f\in\mathbb{R}^{7}$ (which is mapped to the respective Toeplitz cost matrix $C\in\mathcal{P}_{5\times 5}$). Then $T^{*}$ is obtained by adding noise $\eta$ with variance $\sigma^{2}=0.04$ (unless stated otherwise) and subsequent normalization of the distorted map. Note that this problem is overdetermined, since $C$ is Toeplitz and $n>2$.

Influence of the Sample Variance $\delta^{2}$

As in the case of graph-based cost, we investigate the performance of the RwM-within-Gibbs algorithm for different combinations of $\delta_{u}$, $\delta_{v}$ and $\delta_{W}$. Table 3 lists the considered $\delta$-combinations together with the acceptance rates. The running averages of $3$ or $4$ different components of the posteriors are shown in Figures 8 and 9. The figures show, as expected, that the posterior distributions are independent of the choice of the $\delta$ parameters. They also show that, in the ranges chosen, the rate of convergence does not vary in any considerable way; the method is fairly robust.

Exact vs. Sinkhorn:

Next we take a closer look at how the results change if we use Sinkhorn’s algorithm instead of the exact solver. In particular we investigate how the size of the regularization parameter $\epsilon$, as well as the way we generate data, affects the performance and results of the RwM algorithm.

We start by generating a noisy transportation map using the exact solver for (1). Then we compare the posterior distributions using the exact solver for the reconstruction in the first run and the Sinkhorn algorithm with $\epsilon=0.04$ and $\epsilon=0.1$ in the next two test runs. Figure 10 shows the running average of three components of the vectors $\mathfrak{p}$, $\mathfrak{q}$ and $f$ (left to right). The color coding relates to the solver used: red corresponds to the exact forward solver, blue and yellow to the runs in which the Sinkhorn algorithm was used. Figure 11 shows the posterior distribution of the second component of $\mathfrak{p}$ and $\mathfrak{q}$ as well as the fifth entry of the vector $f$. We observe that we obtain similar posteriors when using the exact solver (LP) and Sinkhorn with $\epsilon=0.04$. If the regularization parameter $\epsilon$ is chosen larger, which results in blurred (and therefore less sparse) transportation maps, the posterior distributions are less pronounced and close to uniform on the respective scaled intervals (due to the normalization constraint).

Next we generate the noisy transportation map using the Sinkhorn algorithm. We set the regularization parameter $\epsilon=0.04$ and we distort the computed map with $4\%$ and $10\%$ noise. In each case we perform two different RwM runs, first using the Sinkhorn algorithm and then the exact solver. The respective posterior distributions are shown in Figure 13 and Figure 14. We observe no significant difference in the quality of the posteriors. Figure 15 illustrates an interesting difference in the convergence behavior of the RwM algorithm. The data misfit term (13) shows multiple drops when using the exact solver. Such jumps have not been observed when using the Sinkhorn algorithm. We recall that the Sinkhorn algorithm solves the respective regularized (convex) optimization problem, which has a unique minimum. We believe that the non-uniqueness of the exact forward problem leads to several local minima in the inverse problem, in which the RwM algorithm gets stuck.

4.4. General Cost

So far we investigated overdetermined problems only. Hence we conclude by considering general non-symmetric costs, that is case (ii), for $n=5$ . This identification problem is underdetermined and we expect poorer identifiability and quality of posteriors. This presumption is confirmed by our numerical experiments, see for example Figure 17.

We investigate the identification from generated data in case of $4\%$ noise. We perform two RwM test runs using the exact solver to calculate the posterior distributions of $u$ , $v$ and $W$ . In the first run we set the sample variance to $\delta_{u}^{2}=\delta_{v}^{2}=\delta_{W}^{2}=\delta=0.02$ and in the second to $\delta_{u}^{2}=\delta_{v}^{2}=0.02$ and $\delta_{W}^{2}=0.04$ . Figure 16 shows the running averages for $3$ different components of $u$ , $v$ and $W$ for both runs. We observe that the components of $u$ and $v$ converge much faster than the ones of $W$ and that the convergence is consistent for both sets of $\delta$ ’s. The posterior distributions of $u$ and $v$ give reasonable results, while the posteriors of the cost matrix are close to uniform on the respective scaled intervals (due to the normalization constraint). This indicates that the components of the cost matrix $W$ are difficult to identify. We expect that the identifiability gets worse as the dimension $n$ increases.

5. Conclusions

This paper introduces a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. It is based on the Bayesian framework for inverse problems and allows the uncertainty in the obtained estimates to be quantified; however, the methodology may also be viewed as a stochastic optimization procedure in its own right, tuning the unknowns so that the optimal transport plan better fits the data. The performance of the developed methodologies is investigated using the example of international migration flows. In this context reported annual migration flow statistics can be interpreted as noisy observations of optimal transportation plans with cost related to the geographical position of countries. We formulate the graph-based problem, estimate the weights, which represent the costs of moving between neighbouring countries, and quantify uncertainty in the weights. Our numerical investigations show that the proposed methodologies are robust and consistent for different cost functions and parametrizations. We observed that the distributions as well as the costs can be accurately determined for a variety of settings, if the problem is overdetermined. The identifiability declines as the dimensionality increases or if the problem becomes underdetermined.

The proposed framework provides the basis for a multitude of future research directions in applied mathematics and other scientific disciplines. The next steps will focus on several questions related to the use of the Sinkhorn algorithm in the context of inverse optimal transport, such as the convergence rate of the regularized problem (4) as $\epsilon\rightarrow 0$ or the optimal choice of $\epsilon$ with respect to the noise level $\sigma$ ; furthermore hierarchical algorithms which learn parameters such as these from the data would also be of interest. In the context of migration flows, different modeling aspects, such as the coupling to age structured population models or the formulation of the OT problem on the continuous level, will be investigated. Furthermore the application of the developed methodologies for general linear programs, which play an important role in transportation research, manufacturing, economics and demography, will be of interest.

Acknowledgments. The authors are grateful to Venkat Chandrasekaran for helpful discussions about the literature in inverse linear programming. The work of AMS is funded by US National Science Foundation (NSF) grant DMS 1818977 and AFOSR Grant FA9550-17-1-0185. The work of MTW was partly supported by The Royal Society International Exchanges grant IE 161662.

Bibliography

[1] Handbook on Measuring International Migration through Population Censuses. UN, New York, 2017.
[2] G. J. Abel and N. Sander. Quantifying global international migration flows. Science, 343(6178):1520–1522, 2014.
[3] R. K. Ahuja and J. B. Orlin. Inverse optimization. Operations Research, 49(5):771–783, 2001.
[4] J. J. Azose and A. E. Raftery. Estimation of emigration, return migration, and transit migration between all pairs of countries. Proceedings of the National Academy of Sciences, 116(1):116–122, 2019.
[5] N. Bonneel, M. van de Panne, S. Paris, and W. Heidrich. Displacement interpolation using Lagrangian mass transport. ACM Trans. Graph., 30(6):158:1–158:12, December 2011.
[6] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[7] G. Dantzig. Linear programming and extensions. Princeton University Press, 2016.
[8] J. de Beer, J. Raymer, R. van der Erf, and L. van Wissen. Overcoming the problems of inconsistent international migration data: A new method applied to flows in Europe. European Journal of Population / Revue européenne de Démographie, 26(4):459–481, Nov 2010.
