Inverse optimal transport
Andrew M. Stuart, Marie-Therese Wolfram

TL;DR
This paper introduces a Bayesian approach to infer unknown cost functions in optimal transport problems from noisy observations, demonstrated through international migration data, with a focus on estimating transition costs and their uncertainties.
Contribution
It presents a novel systematic method to recover unknown costs in optimal transport from noisy data, requiring only solving linear programs and random sampling, with a Bayesian interpretation.
Findings
Successfully estimated migration transition costs.
Quantified uncertainty in cost estimates.
Validated methodology on real-world migration data.
Abstract
Discrete optimal transportation problems arise in various contexts in engineering, the sciences and the social sciences. Often the underlying cost criterion is unknown, or only partly known, and the observed optimal solutions are corrupted by noise. In this paper we propose a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. The algorithm requires only the ability to solve the forward optimal transport problem, which is a linear program, and to generate random numbers. It has a Bayesian interpretation, and may also be viewed as a form of stochastic optimization. We illustrate the developed methodologies using the example of international migration flows. Reported migration flow data captures (noisily) the number of individuals moving from one country to another in a given period of time. It can be interpreted as a noisy observation of…
| From | To | ||||||
|---|---|---|---|---|---|---|---|
| CZ | DE | DK | LU | NL | PL | ||
| CZ | R | 0 | 9,218 | 262 | 4 | 511 | 45 |
| S | 0 | 560 | 24 | 3 | 81 | 583 | |
| DE | R | 1,362 | 0 | 4,001 | 454 | 9,182 | 2,876 |
| S | 8,104 | 0 | 3,095 | 1,686 | 9,293 | 100,827 | |
| DK | R | 46 | 2,687 | 0 | 11 | 475 | 34 |
| S | 179 | 2,612 | 0 | 1,387 | 602 | 833 | |
| LU | R | 2 | 2,282 | 162 | 0 | 161 | 5 |
| S | 13 | 911 | 99 | 0 | 97 | 23 | |
| NL | R | 255 | 13,681 | 864 | 27 | 0 | 163 |
| S | 298 | 10,493 | 533 | 191 | 0 | 1,020 | |
| PL | R | 1,608 | 136,927 | 2,436 | 19 | 5,744 | 0 |
| S | 63 | 14,417 | 111 | 23 | 577 | 0 | |
| Tot: | S | 3,273 | 164,795 | 7,725 | 515 | 16,073 | 3,123 |
| R | 8,657 | 28,993 | 3,862 | 2,041 | 10,650 | 103,286 |
| 0.02 | 0.02 | 0.04 | 65 | 80 | 52 | 62 |
|---|---|---|---|---|---|---|
| 0.04 | 0.04 | 0.04 | 51 | 65 | 26 | 62 |
| 0.04 | 0.02 | 0.04 | 60 | 65 | 52 | 62 |
| 0.02 | 0.02 | 0.02 | 66.1 | 67.9 | 55.4 | 75 |
| 0.02 | 0.02 | 0.04 | 60.3 | 68.3 | 55.3 | 52.7 |
| 0.04 | 0.04 | 0.04 | 44.1 | 45.5 | 30.0 | 57 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Inverse Optimal Transport
Andrew M. Stuart
California Institute of Technology, 1200 E. California Blvd, Pasadena, CA 91125
and
Marie-Therese Wolfram
University of Warwick, Coventry CV4 7AL, UK and RICAM, Austrian Academy of Sciences, Altenbergerstr. 66, 4040 Linz, AT
Abstract.
Discrete optimal transportation problems arise in various contexts in engineering, the sciences and the social sciences. Often the underlying cost criterion is unknown, or only partly known, and the observed optimal solutions are corrupted by noise. In this paper we propose a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. The algorithm requires only the ability to solve the forward optimal transport problem, which is a linear program, and to generate random numbers. It has a Bayesian interpretation, and may also be viewed as a form of stochastic optimization.
We illustrate the developed methodologies using the example of international migration flows. Reported migration flow data captures (noisily) the number of individuals moving from one country to another in a given period of time. It can be interpreted as a noisy observation of an optimal transportation map, with costs related to the geographical position of countries. We use a graph-based formulation of the problem, with countries at the nodes of graphs and non-zero weighted adjacencies only on edges between countries which share a border. We use the proposed algorithm to estimate the weights, which represent cost of transition, and to quantify uncertainty in these weights.
1. Introduction
1.1. Background
There are many problems in engineering, the sciences and the social sciences in which an input is transformed into output in an optimal way according to a cost criterion. We are interested in problems where the transformation from input to output is known, and the objective is to infer the cost criterion which drives this transformation. Our primary motivation is optimal transport (OT) problems in which the transport plan is known but the cost is not. More generally linear programs in which the solution is known, but the cost function and constraints are to be determined, fall into the category of problems to which the methodology introduced in this paper applies. We illustrate the type of problem of interest by means of an example.
Example: International Migration. Quantifying migration flows between countries is essential to understand contemporary migration flow patterns. Typically two types of migration statistics are collected – flow and stock data. Migration stock data states the number of foreign born individuals present in a country at a given time and is usually based on population censuses. Stock data is available for almost all countries in the world. Migration flow data captures the number of migrants entering and leaving (inflow and outflow, respectively) a country over the course of a specific period, such as one year, see [1]. It is collected by most developed countries, but no international standards are defined. For example the time of residence after which a person counts as an international migrant varies from country to country. Because of the different definitions and data collection methods, these statistics can be hard to compare. International agencies, such as the United Nations Statistics Division or the Statistical Office of the European Union (Eurostat), publish annual migration flow estimates. These estimates are often based on Poisson or Bayesian linear regression. For more information about the estimation of migration flows using flow or stock statics we refer to [18, 19, 2, 4]. For the purposes of this paper the main issue to appreciate is that migration data is available, but should be viewed as noisy.
Flow data is typically presented in an origin-destination matrix, in which the off-diagonal entry contains the number of people moving from country to country in a given period of time. This origin-destination data can be reported by both the sending (S) and the receiving (R) countries. Hence two migration flow tables are available, often desegregated by sex and age groups. Table 1 shows harmonized data, which was pre-processed to improve comparability, reported by 6 European countries for the period 2002-2007. The numbers of the sending and receiving countries vary significantly. For example Germany reported that people immigrated from Poland, while Poland reported individuals who left for Germany. These very different numbers naturally raise the question of the true migration flows. In many settings it is natural to place greater weight on receiving data rather than departure data. But even this data is not subject to uniform standards, and therefore providing reliable estimates and quantifying uncertainty is of great interest.
We interpret the reported origin-destination data maps (when appropriately normalized) as a noisy estimate of a transport plan arising from an OT problem with unknown cost. It is then natural to try and infer the transportation cost as it carries information about the migration process.
The preceding example serves as motivation, and we will come back to it throughout this paper. However we reemphasize that the proposed identification methodologies that we introduce in this paper can be used for general inverse OT and linear programming problems; further examples will serve to illustrate this fact.
1.2. Literature Review
Optimal transport originates with the French mathematician Gaspard Monge who, in 1781, investigated the problem of finding the most cost-effective way to move a pile of sand to fill a hole of the same volume. Kantorovich introduced the modern (relaxed) formulation of the problem, in which mass can be split, in 1942. In more mathematical terms Kantorovich considered the following setup: given two positive measures (of equal mass) and a cost function, find the transportation map that moves one measure to the other minimizing the transport cost. The corresponding infimum induces a distance between these two measures – the so-called Wasserstein distance. The Wasserstein distance plays an important role in probability theory, partial differential equations (PDE) and many other fields in applied mathematics [28, 25]. Furthermore the techniques and methodologies developed in OT have found application in a variety of scientific disciplines including data science, economics, imaging and meteorology [12].
With the spread and application of OT into different scientific disciplines the interest in computational methodologies has increased. Commonly used numerical methods broadly speaking fall into two categories: linear programming [7] and methods specific to the structure of OT. Linear programs are classic problems which have been extensively studied in the field of optimization and operations research. Many computational methodologies have been developed, such as the famous simplex algorithm (and its many variants), the Hungarian algorithm and the auction algorithm. All these methods work well for small to medium sized problems, but are too slow in modern applications such as imaging or supply chain management. Recently a significant speed up, of linear programming, was achieved by considering a regularized OT problem, leading to the Sinkhorn algorithm (or variants thereof) in which an additional entropic regularization term is added to the objective function; this allows efficient computation of the corresponding minimizer and induces a trade-off between fidelity to the original problem, and computational speed. This family of efficient algorithms resulted in the rapid advancement of computational OT in recent years, especially in the context of imaging and data science; see [17, 6, 20].
Inverse problems for linear programming received considerable interest in the engineering literature. The paper [3], building on earlier work in [30], studies the problem by seeking a cost function nearest to a given one in for which the given solution is an optimal linear program; this problem is itself a linear program in the case The formulation of an inverse problem for linear programming in [9] took a slightly more general perspective, as it does not assume that the given data necessarily arises as the solution of a linear program, and rather seeks to minimize the distance to the solution set of a linear program. Recent application of the inverse problem for linear programming may be found in [24], for example. These papers on inverse linear programming are foundational and have opened up a great deal of subsequent research. However the methods in them do not account in a systematic way for noise in the data provided, and for the incorporation of prior information. We address these issues by adopting a Bayesian formulation of the inverse problem for linear programming, concentrating on OT in particular; the ideas are readily generalized to inverse linear programming in general. The Bayesian approach not only allows for the quantification of uncertainty, but also leads to new (stochastic) optimization methods. An overview of the computational state of the art for Bayesian inversion may be found in [14]. The specific methods that we introduce have the desirable feature that they require only solution of the forward OT problem and the ability to generate random numbers.
1.3. Our Contribution
Our contributions to the subject of inverse problems within linear programming are as follows.
- •
We formulate inverse OT problems in a Bayesian framework.
- •
We provide a computational framework for solving inverse OT problems in an efficient fashion.
- •
We introduce graph-based cost functions for OT, using graph-shortest paths in an integral way.
- •
Graph-based OT has considerable potential for application, and we introduce a new way of studying migration flow data using inverse OT in the graph-based setting.
We emphasize that, whilst the graph-based formulation of cost corresponds to a rather specific way of designing cost functions for discrete linear programs, the framework and algorithms developed in this paper apply quite generally to inverse linear programming, and hence to OT in general. We develop the methodology in general, using graph-based migration flow as a primary illustrative example. In section 2 we define OT as a linear program, describe the cost criteria considered, and formulate inverse OT in a Bayesian setting. Section 3 presents algorithms for the forward and inverse OT problem and section 4 contains numerical results.
We will use the following notation throughout this manuscript. Let and denote the Euclidean norm and inner-product on and the Frobenius norm and inner-product on The spaces of probability matrices, probability vectors and probability matrices with specified marginals are defined as
[TABLE]
2. Inverse Optimal Transport
In this section we introduce the forward OT problem and discuss specific cost criteria, before formulating the respective inverse OT problem in the Bayesian framework.
2.1. Forward Problem
We consider two discrete probability vectors and and a given cost . Then the optimal transport problem corresponds to finding a map transporting to at minimal cost. Note that in OT the cost matrix has non-negative entries, which can be normalized to be an element of without loss of generality. The respective forward OT problem is to find
[TABLE]
Problem (1) falls into the more general class of linear programs. Linear programs (and their many variants) arise in various specific settings – such as the earth mover’s distance (EMD)[23] or cost network flows [5] – in different scientific communities. The problem (1) has, by virtue of being a specific class of linear programs, at least one solution; this solution lies on the boundary of the feasible set of solutions (defined by the equality constraints). If the solution is unique then we define mapping by
[TABLE]
In the non-unique setting we define to be a unique element determined by running a specific non-random algorithm for the linear program to termination, started at a specific initial guess.
We now consider (1) regularized by the addition of the discrete entropy, an approach popularized in [6, 17] and which has led to considerable analytical and computational developments. The resulting problem is
[TABLE]
where the matrix logarithm operation is applied elementwise. Then
[TABLE]
This problem has a unique minimizer , since is strongly convex. Following our previous notation we define the corresponding mapping by
[TABLE]
It is, in contrast to the optimal solution of (1), not sparse. It is known that solutions to (4) converge to minimisers of (1) as . Determining the rate of convergence is still an open problem. The special structure of this regularized problem can be used to construct efficient splitting algorithms. These methods are based on the equivalent formulation of finding the projection of the joint coupling with respect to the Kullback-Leibler divergence
[TABLE]
where the matrix logarithm and division operations are applied elementwise and is the Gibb’s kernel
[TABLE]
In particular
[TABLE]
The Kullback-Leibler divergence can be computed extremely efficiently using proximal methods, yielding for example the celebrated Sinkhorn algorithm. We will briefly outline the underlying ideas in Section 3.1.
2.2. Cost Criteria
Problems (1) or (4) are formulated for general cost matrices - the specific structure of depends on the application considered. We will investigate the behavior of the proposed methodologies for being:
- (i)
Toeplitz; 2. (ii)
non-symmetric; 3. (iii)
determined by an underlying graph structure.
We assume that all individuals move, hence for all in all three cases, Therefore ’staying’ is penalized by setting
[TABLE]
If is Toeplitz the cost depends on the difference between indices and has degrees of freedom. Case (ii) corresponds to general non-symmetric transportation cost, which in the context of migration flows could include factors such as sharing the same language, the ratio of the gross national income per capita or their EU membership. In case (iii) we assume that costs are related to an underlying discrete structure. In the context of migration flows the geographical position of countries defines an underlying graph with edges only between countries which share a border; see Figure 1. We assume that the total transportation cost corresponds to the sum of the individual costs of moving from one country to another along edges of the graph. In defining cost this way we are implicitly assuming that, between the European countries studied here, migration is primarily via land. This resulting discrete underlying structure, which relates the cost matrix to a directed graph representing the migration network between countries, is detailed in the following.
Let be a directed graph with vertices and a (possibly non-symmetric) weighted adjacency matrix We can then define a cost matrix whose entry is the shortest path cost of moving from vertex to according to the weighted adjacency matrix Let be the number of non-zero entries of and the vector defining the non-zero entries. Then we may define a mapping such that This can then be normalized to give a and we may define the solution of the resulting OT problem via (2). For this graph-based cost the solution of the OT problem may be viewed as a function of and . The minimal cost of moving between vertices of a graph can be computed using Dijkstra’s algorithm, recalled in Section 3.1 below.
We define a similar mapping in the case of Toeplitz cost. Here the respective cost matrix has free entries, before normalization to a probability vector and recalling that we fix the diagonal to penalize not moving, and so we define a mapping ; normalization then gives .
2.3. Inverse Problem
The inverse OT problem is to find and from the solution to the OT problem (1), or its regularized counterpart (4). We tackle this problem by introducing a space of componentwise positive and real-valued latent variables or which map to the unknowns and . It is easier, and more natural, to specify priors in terms of these real-valued latent variables. To this end we introduce mappings from into and from into as follows: is defined by
[TABLE]
and is defined by
[TABLE]
Note that for all ; the same holds for Then the forward problem (2) can be written as
[TABLE]
or, in the case of graph-based cost or Toepliz cost, we have
[TABLE]
This is readily generalized to the use of regularized optimal transport as the forward model, simply replacing by
We wish to invert the map , given noisy observations of Such problems are in general ill-posed, hence suitably regularized versions have to be considered. Different approaches can be found in the literature – we focus on the Bayesian framework, which allows us to estimate the posterior distribution of and (or ).
Depending on the structure of the cost matrix the inverse problem related to (9) or (10) can be over- or underdetermined. We recall that in case of Toeplitz cost the matrix has degrees of freedom. Then we have equations for unknowns (taking into account the normalization of , and ). Hence the inverse problem is overdetermined for . If is a general cost matrix with a set penalty on the diagonal, that is case (ii), the cost matrix has degrees of freedom. In total we have unknowns, and therefore the problem is underdetermined for . For graph-based cost (case (iii)) the matrix has degrees of freedom and therefore the problem is underdetermined if
[TABLE]
2.3.1. Likelihood
We define a Bayesian formulation of the inverse problem, working in the case where are the unknowns; the extension to as unknowns is similar. We assume that the observed transport maps are corrupted by noise, in particular
[TABLE]
where is a mean zero Gaussian random matrix with i.i.d. entries of variance In particular we want to find the conditional probability distribution of given noisy observations , that is . The estimation is based on the model-data misfit function
[TABLE]
Using Bayes’ formula
[TABLE]
the probability of observing given a realization of and , which is exactly the posterior distribution of and , is given by
[TABLE]
In (14) corresponds to the prior information about and . We assume that and have i.i.d. entries uniformly distributed in and denote the set of vectors and matrices which satisfy this componentwise constraint by In view of the scale invariance of the choice of unit interval is immaterial; any bounded interval would deliver identical posterior on .
Then the posterior distribution of and is given by
[TABLE]
with a normalization constant
[TABLE]
We can either sample from the posterior (16) (which corresponds to the full Bayesian approach) or maximize the posterior probability (16), which leads to the optimization problem of minimizing over . The first approach allows us to quantify uncertainty in the estimates of and , the latter gives a single estimate. We discuss how to sample from the posterior, using a Random walk Metropolis (RwM) method in Section 3.2
3. Algorithms For Inversion
In the following we present the numerical methods used in the computational experiments in Section 4. Since the proposed Bayesian framework requires the solution of an OT problem (1) (or its regularized version (4)) in every iteration of the sampling algorithm, computational efficiency is essential. We start by presenting the solvers for the forward OT problem followed by the Markov-Chain-Monte-Carlo methods used to sample from the posterior.
3.1. Computational Optimal Transport
Numerical methods for linear programming go back to the seminal works of Dantzig on the simplex method, see [7]. Solutions to the linear program (1) lie on the boundary of the feasible polytope, which is defined by the constraints. The simplex method iterates over the vertices of this polytope to find the optimal solution, see [16]. The method works well in practice, however examples in which the performance scales exponentially with the dimension of the problem, can be constructed. Different approaches to speed up computations have been proposed: for example network simplex algorithms are based on the fact that specific linear programs can be formulated as minimization problems on graphs. The particular structure of the underlying graph can be used to speed up the simplex method significantly. Further information on computational methods for linear programming can be found in [8].
More recently computational techniques, which are based on the regularized OT problem (4) have been proposed in the literature. These methods are extremely efficient, since they are based on the formulation of the OT problem in terms of the Kullback Leibler divergence (7). Its minimiser is given by
[TABLE]
Here is the Gibb’s kernel (6) and the vectors and satisfy the mass constraint
[TABLE]
This mass constraint can be enforced iteratively via
[TABLE]
This splitting, known as Sinkhorn’s algorithm, is very efficient as it involves matrix vector multiplications only. Since the entropic regularization term (3) introduces blurring in the otherwise sparse solution, one is interested in keeping as small as possible. Since the convergence of Sinkhorn’s algorithm (18) deteriorates as , it is important to keep a balance between regularization and computational stability. In practice small values of lead to diverging scaling factors in (18) and subsequent numerical instabilities. These problems can often be remedied using suitable scalings, see [26].
If the transportation costs depend on an underlying discrete structure, such as for our graph-based migration problem, then the computational burden of computing this cost must be take into consideration. For our example the total transportation cost corresponds to the sum of edge weights when between vertices traversed on the shortest path. Note that the transportation costs are not necessarily the same in both directions since we consider directed graphs. We use Dijkstra’s algorithm to compute the shortest path from one node to all others in the graph, see [10]. Dijkstra’s algorithm is based on continuous updates of the shortest distance to a starting point, and excludes longer distances in updates. It is the graph-based methodology that underpins the fast marching method to solve the eikonal equation [27].
3.2. MCMC and Optimization
We propose the use of Markov Chain Monte-Carlo (MCMC) methods to sample from the posterior distribution (16). For the user interested simply in optimization the algorithm we propose may be viewed as a stochastic optimization method to reduce the model-data misfit. MCMC methods originated with the seminal paper [15] in which what is now termed the The Random walk Metropolis (RwM) algorithm was introduced for a specific high dimensional integral required in statistical physics. In our context the key desirable feature of the method is that it requires only solution of the forward OT (or regularized OT) problem, together with the generation of random numbers. Given a current (approximate) sample from the posterior distribution, a new sample is proposed by adding a mean zero Gaussian to the current one. This is rejected if the resulting new state leaves , and otherwise accepted with a probability designed to preserve detailed balance with respect to the posterior. The covariance of the Gaussian is an important tuning parameter: intuitively it should be chosen such that the acceptance rate is neither close to [math] or , as either of these limits lead to successive iterates which are highly correlated. The optimal scaling of RWM algorithms for different target densities has been investigated in [21, 22]; although the theory developed there applies in rather restricted scenarios, widespread experience and a variety of theories demonstrate that the work leads to useful rule-of-thumb for tuning acceptance probabilities within the RwM algorithm [29], arguably because it leads to average acceptance probabilities that stay away from [math] or .
In 1970 Hastings introduced a wide class of MCMC methods, now known as Metropolis-Hastings algorithms [13] and in principle this provides a wide-range of variants on RwM that may be used for our Bayesian formulation of inverse OT. A popular variation of MCMC that we have found useful in the inverse OT setting is Gibbs sampling. In high dimensional spaces it can be hard to design proposals which are accepted with a reasonable acceptance probability, and the idea of fixing subsets of the variables, and proposing moves in the remainder, is natural. The Gibbs sampler allows this to be achieved in a statistically consistent fashion. At each iteration one (or several) components of the unknown parameter is updated by sampling from its full conditional probability distribution, and cycling through all the variables. The method may be relaxed to allow a RwM step from the conditional probability distribution, rather than a full sample. The corresponding RwM-within-Gibbs method is outlined in Algorithm 1. In this algorithm we consecutively update , and (or ). We generate proposals for each variable, which we accept or reject. Note that in general, for all the methods described here, any proposal which descreases the value of and remains in is accepted with probability one. Thus the Algorithm 1 may be viewed as an optimization method which induces a stochastic gradient; the numerics will demonstrate that this acts to minimize the misfit.
4. Numerical Results
In this section we demonstrate the behavior of MCMC methods for inverse OT, and Algorithm 1 in particular. We start by presenting results for the migration flow example introduced at the beginning and use it as a ’proof-of-concept’ for the proposed framework. We then continue with systematic numerical investigation to study the identifiability of the cost matrix in a variety of scenarios, as well as discussing the behavior of the proposed methodology. We focus on the three cost criteria discussed in Section 2.2: Toeplitz cost (i), non-symmetric cost (ii) and graph-based cost (iii). We use the following functions implemented in the POT library [11] to solve the linear program (1) as well as its regularized version (4):
- •
emd - this solver for linear programs is based on the respective network OT flow formulation of the problem and was introduced in [5].
- •
sinkhorn - implements the Sinkhorn-Knopp scaling algorithm to solve the regularized OT problem (4) as proposed in [6].
We test the proposed methodologies using simulated data as well as real migration data. In making simulated data we compute the optimal transportation maps for a given set of vectors , and and add i.i.d. Gaussian noise with mean [math] and variance , see (12). Note that the resulting perturbed map may have negative entries and is not an element of . Therefore we set all negative entries to zero and normalize it, to ensure that it is an admissible solution.
We illustrate the performance of the methodologies with plots of the running means and the respective posterior distributions. All posterior distributions are calculated after RwM iterations with a burn-in of . The performed numerical experiments indicate that this number is sufficient for the convergence of MCMC. Note that we always plot the scaled vectors and matrices (unless stated otherwise). The penalty in (8) is set to . Numerical simulations show that its absolute magnitude does not influence the posterior distributions significantly once above a certain level.
4.1. European migration flows
We start by presenting estimates for the European network shown in Figure 1. We recall that vertices represent countries and that edges connect countries sharing a border. The weights of these edges correspond to the cost of moving from one country to another. The network shown in Figure 1 consists of countries, which are connected by edges. We use the estimated transportation map reported in [19] and assume that the noise level is . The variance for the proposals is set to . We perform two runs of the RwM-within-Gibbs algorithm, using the exact solver in the first and Sinkhorn’s algorithm with in the second. The acceptance rate of the exact solver is ((, , ) for the different components , and ), for Sinkhorn we have (, , ). The running average of three components of , and are shown in Figure 2 and the corresponding posterior distributions in Figure 3. We observe that both runs give comparable results, however the misfit for Sinkhorn is smaller, see Figure 4. This difference might be explained by the fact that we underestimate the noise level or that the actual transportation maps look more like solutions of regularized OT problems than the OT problem itself.
4.2. Graph-Based Cost
Next we investigate the behavior of the proposed methodologies for graph-based cost more thoroughly. We will see that
- •
The identification of , and is robust with respect to the sampling variances, see Figure 6.
- •
The posterior estimates are consistent using different solvers, see Figure 7.
These results are obtained using noisy transportation maps for a graph connecting nodes with edges. In doing so we solve problem (1) for given vectors , and and add noise. Note that this inverse problem is overdetermined since .
Influence of the Sampling Variance
We start by investigating the impact of the sampling variance . We perform MCMC runs for different combinations of , and (listed in Table 2) and compute the running average and posterior distributions of some components. Note that these parameters affect the rate of convergence of the algorithm, but not the posterior distribution itself. The variance of the samples determines how much new samples differ from the previous iterates - the larger the variance the more adventurous the search, but the less likely to accept leading to highly correlated samples because of rejections. On the other hand smaller variance has a higher probability of accepting but is not adventurous and hence leads to highly correlated samples. It is thus desirable to have an acceptance rate that is neither close to [math] or . The running averages of three components of , and are shown in Figure 5, the respective posteriors in Figure 6. We see that the results are consistent for all combinations of ’s. However the respective convergence rates vary, see Table 2. We observe a generally higher acceptance rate when sampling from the marginal distribution of , and a decreased rate when increasing the sampling variance.
Exact vs. Sinkhorn
Next we investigate the sensitivity of the results with respect to the forward solver used in Algorithm 1. We run two RwM simulations - the first one using the exact solver and the second one using the Sinkhorn algorithm. We observe that both runs give similar posterior distributions if we choose the regularization parameter in a sensible way, see Figure 7. Generally speaking it seems advisable to choose it similar to the noise level (as in the shown results). We will investigate the impact of the regularization parameter in the next subsection in more detail.
4.3. Toeplitz Cost
In the following we present more detailed numerical experiments if is Toeplitz. The findings of the numerical experiments performed in this subsection can be summarized as follows:
- •
The posterior distributions of , and are consistent for varying ranges of proposal variances , see Figure 8 and Figure 9.
- •
The exact solver and Sinkhorn’s algorithm converge to similar posteriors, if the entropic regularization parameter is chosen sensibly, see Figure 10 and Figure 11.
- •
The variance of the posteriors increases with the noise level in the data, as shown for example in Figure 13 and Figure 14.
- •
Sinkhorn’s algorithm gives a higher acceptance rate and a more monotone decrease of the data-misfit function, see Figure 15.
We underpin these statements with numerical simulations using generated noisy transportation maps. We recall that has degrees of freedom in case of Toeplitz cost (i). This defines, as in the case of graph-based cost (iii), a mapping from the vector to the cost matrix , that is with . Hence we generate proposals for the vector , which define the entries of .
We set and generate a noisy realization for a given set of vectors , and (which is mapped to the respective Toeplitz cost matrix ). Then is obtained by adding noise with variance (unless stated otherwise) and subsequent normalization of the distorted map. Note that this problem is overdetermined, since is Toeplitz and .
Influence of the Sample Variance
As in the case of graph-based cost, we investigate the performance of the RwM-within-Gibbs algorithm for different combinations of , and . Table 3 lists the considered -combinations together with the acceptance rates. The running average of the or different components of the posteriors are shown in Figures 8 and 9. The figures show, as expected, that the posterior distributions are independent of the choice of the parameters. They also show that, in the ranges chosen, the rate of convergence does not vary in any considerable way – the method is fairly robust.
Exact vs. Sinkhorn:
Next we take a closer look how results change if we use Sinkhorn’s algorithm instead of the exact solver. In particular we investigate how the size of the regularization parameter as well as the way we generate data effects the performance and results of the RwM algorithm.
We start by generating a noisy transportation map using the exact solver for (1). Then we compare the posterior distributions using the exact solver for the reconstruction in the first run and the Sinkhorn algorithm with and in the next two test runs. Figure 10 shows the running average of three components of the vectors , and (left to right). The color coding relates to the solver used - red corresponds to the exact forward solver, blue and yellow when the Sinkhorn algorithm was used. Figure 11 shows the posterior distribution of the second component of and as well as the fifth entry of the vector . We observe that we obtain similar posteriors when using the exact solver (LP) and Sinkhorn with . If the regularization parameter is chosen larger, which results in blurred (and therefore less sparse) transportation maps, the posterior distributions are less pronounced and close to uniform on the respective scaled intervals (due to the normalization constraint).
Next we generate the noisy transportation map using the Sinkhorn algorithm. We set the regularization parameter and we distort the computed map with and noise. In each case we perform two different RwM runs, first using the Sinkhorn algorithm and then the exact solver. The respective posterior distributions are shown in Figure 13 and Figure 14. We observe no significant difference in the quality of the posteriors. Figure 15 illustrates an interesting difference in the convergence behavior of the RwM algorithm. The data misfit term (13) shows multiple drops when using the exact solver. Such jumps haven not been observed when using the Sinkhorn algorithm. We recall that the Sinkhorn algorithm solves the respective regularized (convex) optimization problem, which has a unique minimum. We believe that the non-uniqueness of the exact forward problem leads to several local minima in the inverse problem, in which the RwM algorithm gets stuck.
4.4. General Cost
So far we investigated overdetermined problems only. Hence we conclude by considering general non-symmetric costs, that is case (ii), for . This identification problem is underdetermined and we expect poorer identifiability and quality of posteriors. This presumption is confirmed by our numerical experiments, see for example Figure 17.
We investigate the identification from generated data in case of noise. We perform two RwM test runs using the exact solver to calculate the posterior distributions of , and . In the first run we set the sample variance to and in the second to and . Figure 16 shows the running averages for different components of , and for both runs. We observe that the components of and converge much faster than the ones of and that the convergence is consistent for both sets of ’s. The posterior distributions of and give reasonable results, while the posteriors of the cost matrix are close to uniform on the respective scaled intervals (due to the normalization constraint). This indicates that the components of the cost matrix are difficult to identify. We expect that the identifiability gets worse as the dimension increases.
5. Conclusions
This paper introduces a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. It is based on the Bayesian framework for inverse problems and allows to quantify uncertainty in the obtained estimates; however the methodology may also be viewed as a stochastic optimization procedure in its own right, tuning the unknowns so that the optimal transport plan better fits the data. The performance of the developed methodologies is investigated using the example of international migration flows. In this context reported annual migration flow statistics can be interpreted as noisy observations of optimal transportation plans with cost related to the geographical position of countries. We formulate the graph-based problem, estimate the weights, which represent the costs of moving between neighbouring countries, and quantify uncertainty in the weights. Our numerical investigation show that the proposed methodologies are robust and consistent for different cost functions and parametrizations. We observed that the distributions as well as the costs can be accurately determined for a variety of settings, if the problem is overdetermined. The identifiability declines as the dimensionality increases or if the problem becomes underdetermined.
The proposed framework provides the basis for a multitude of future research directions in applied mathematics and other scientific disciplines. The next steps will focus on several questions related to the use of the Sinkhorn algorithm in the context of inverse optimal transport, such as the convergence rate of the regularized problem (4) as or the optimal choice of with respect to the noise level ; furthermore hierarchical algorithms which learn parameters such as these from the data would also be of interest. In the context of migration flows, different modeling aspects, such as the coupling to age structured population models or the formulation of the OT problem on the continuous level, will be investigated. Furthermore the application of the developed methodologies for general linear programs, which play an important role in transportation research, manufacturing, economics and demography, will be of interest.
Acknowledgments. The authors are grateful to Venkat Chandrasekaran helpful discussions about the literature in inverse linear programming. The work of AMS is funded by US National Science Foundation (NSF) grant DMS 1818977 and AFOSR Grant FA9550-17-1-0185. The work of MTW was partly supported by The Royal Society International Exchanges grant IE 161662.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Handbook on Measuring International Migration through Population Censuses . UN, New York, 2017.
- 2[2] G. J. Abel and N. Sander. Quantifying global international migration flows. Science , 343(6178):1520–1522, 2014.
- 3[3] R. K. Ahuja and J. B. Orlin. Inverse optimization. Operations Research , 49(5):771–783, 2001.
- 4[4] J. J. Azose and A. E. Raftery. Estimation of emigration, return migration, and transit migration between all pairs of countries. Proceedings of the National Academy of Sciences , 116(1):116–122, 2019.
- 5[5] N. Bonneel, M. van de Panne, S. Paris, and W. Heidrich. Displacement interpolation using lagrangian mass transport. ACM Trans. Graph. , 30(6):158:1–158:12, December 2011.
- 6[6] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems , pages 2292–2300, 2013.
- 7[7] G. Dantzig. Linear programming and extensions . Princeton University Press, 2016.
- 8[8] J. de Beer, J Raymer, R van der Erf, and L. van Wissen. Overcoming the problems of inconsistent international migration data: A new method applied to flows in europe. European Journal of Population / Revue européenne de Démographie , 26(4):459–481, Nov 2010.
