Quadratically regularized optimal transport
Dirk A. Lorenz, Paul Manns, Christian Meyer

TL;DR
This paper explores quadratic regularization in optimal transport, deriving dual formulations, proving strong duality, and developing algorithms that produce sparse transport plans, contrasting with entropic regularization.
Contribution
It introduces quadratic regularization for optimal transport, providing dual problem derivations, solution existence proofs, and two algorithms with numerical analysis.
Findings
Algorithms perform well even with small regularization parameters
Quadratic regularization yields sparse optimal transport plans
Dual problem formulation and strong duality established
Abstract
We investigate the problem of optimal transport in the so-called Kantorovich form, i.e. given two Radon measures on two compact sets, we seek an optimal transport plan which is another Radon measure on the product of the sets that has these two measures as marginals and minimizes a certain cost function. We consider quadratic regularization of the problem, which forces the optimal transport plan to be a square integrable function rather than a Radon measure. We derive the dual problem and show strong duality and existence of primal and dual solutions to the regularized problem. Then we derive two algorithms to solve the dual problem of the regularized problem: A Gauss-Seidel method and a semismooth quasi-Newton method and investigate both methods numerically. Our experiments show that the methods perform well even for small regularization parameters. Quadratic regularization is of…
| Coefficient | Solver Quantity | Conversion |
|---|---|---|
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
∎
11institutetext: Dirk A. Lorenz, 11email: [email protected] 22institutetext: TU Braunschweig, Institute of Analysis and Algebra, Germany33institutetext: Paul Manns, 33email: [email protected] 44institutetext: TU Braunschweig, Institute of Mathematical Optimization, Germany55institutetext: Christian Meyer, 55email: [email protected] 66institutetext: TU Dortmund, Fakultät für Mathematik, Germany
Quadratically regularized optimal transport
Dirk A. Lorenz
Paul Manns
Christian Meyer
Abstract
We investigate the problem of optimal transport in the so-called Kantorovich form, i.e. given two Radon measures on two compact sets, we seek an optimal transport plan which is another Radon measure on the product of the sets that has these two measures as marginals and minimizes a certain cost function.
We consider quadratic regularization of the problem, which forces the optimal transport plan to be a square integrable function rather than a Radon measure. We derive the dual problem and show strong duality and existence of primal and dual solutions to the regularized problem. Then we derive two algorithms to solve the dual problem of the regularized problem: A Gauss-Seidel method and a semismooth quasi-Newton method and investigate both methods numerically. Our experiments show that the methods perform well even for small regularization parameters. Quadratic regularization is of interest since the resulting optimal transport plans are sparse, i.e. they have a small support (which is not the case for the often used entropic regularization where the optimal transport plan always has full measure).
Keywords:
optimal transport, regularization, semismooth Newton method, Gauss-Seidel method, duality
MSC:
49Q20, 65D99, 90C25
1 Introduction
In this paper we will investigate a regularized version of the optimal transport problem. Optimal transport dates back to the work of Monge in 1781 but the problem formulation we use here is the one of Kantorovich kantorovich1942translocation . Let us fix some notation and formulate the problem: Let , be two compact domains, denote , and assume we are given two positive regular Radon measures and on and , respectively. Further we assume that a cost function is given that models the cost of transporting a unit of mass from to . The optimal transport problem asks to find a transport plan , which is a Radon measure on , such that it has minimal overall transport cost among all measures which have and as first and second marginals, respectively, i.e. for all Borel sets it holds that and for all Borel sets it holds that . This problem has been studied extensively and we refer to the books rachev1998massI ; rachev1998massII ; villani2003topics ; villani2008optimal ; santambrogio2015optimal . One particular result is, that an optimal plan exists and that the support of optimal plans is contained in the so-called -superdifferential of a -concave function (ambrosio2013user, , Theorem 1.13). For many cost functions , this means that optimal transport plans are supported on small sets and that they are in fact singular with respect to the Lebesgue measure on . This makes the numerical treatment of optimal transport problems difficult and one can employ regularization to obtain approximately optimal plans that are functions on . The regularization method that has got the most attention recently is regularization with the negative entropy of and we refer to papadakis2014proximal ; cuturi2016smoothed ; carlier2017convergence . Entropic regularization has gotten popular in machine learning applications due to the fact that it allows for the very simple Sinkhorn algorithm (in the discrete case), see cuturi2013sinkhorn ; genevay2018learning and also peyre2019computational for a recent and thorough review of the computational aspects of optimal transport.
Regularizations different from entropic regularization has been much less studied. We are only aware of works in the discrete case, e.g. blondel2018smoothOT ; essid2018quadratically . In this work we will investigate the case where we regularize the problem in . The paper is organized as follows: In Section 2 we state the problem and analyze existence and duality. It will turn out that existence of solutions of the dual problem will be quite tricky to show, but we will show that dual solutions exist in respective spaces and that a straightforward optimality system characterizes primal-dual optimality. In Section 3 we derive two different algorithms for the discrete version of the quadratically regularized optimal transport problem, and in Section 4 we comment on a simple discretization scheme and report numerical examples.
Notation.
We will abbreviate (and will apply this also to functions and to measures where + will mean the positive part from the Hahn-Jordan decomposition). By we denote that space of continuous functions on (and we will always work on compact sets) equipped with the supremum norm and by we denote the space of Radon measures on a compact domain and we use the norm . The Lebesgue measure will be (and we also use and to specify the Lebesgue measure on sets and , respectively). For convenience, we use for the Lebesgue measure of the set . Furthermore, for a Radon measure , we denote the absolutely and singular part arising from the Lebesgue decomposition with respect to the Lebesgue measure by and , i.e. they satisfy and . Duality pairings are denoted by . If both arguments of the duality pairing are positive and the duality pairing does not necessarily exist, e.g. for and , we set .
2 Quadratic regularization in the continuous case
For the quadratically regularized optimal transport problem we seek a transport plan which for a given cost function , a regularization parameter , and given functions , solves
[TABLE]
where the constraints are understood pointwise almost everywhere.
2.1 Solutions of the primal problem
It is straight forward to show, that optimal transport plans exist:
Lemma 1
Problem (1) has an optimal solution if and only if , , almost everywhere, and .
Proof
Assume that there is an optimal solution . By Jensen’s inequality we get
[TABLE]
which shows . The argument for is similar. Non-negativity of and follows from non-negativity of . Finally, by Fubini’s theorem
[TABLE]
Conversely, if and and we set . Then is feasible for (1) and since the objective is continuous, coercive, and strongly convex a (unique) minimizer exists.∎
2.2 Dual problem and existence of dual solutions
In the following section, we apply the classical Lagrange duality to the linear-quadratic program (1). To this end, let us define the Lagrangian associated with (1). In order to shorten the notation, we set
[TABLE]
Furthermore, we define
[TABLE]
and denote the the primal objective by
[TABLE]
Then, the Lagrangian associated with (1) is given by
[TABLE]
Then, by standard arguments, the primal problem in (1) is equivalent to
[TABLE]
while its (Lagrangian) dual is given by
[TABLE]
The main part of the upcoming analysis is devoted to the existence of solutions to (DP). Once this is established, the necessary and sufficient optimality condition associated with (1) in form of the variational inequality will allow us to derive an optimality system that is also amenable for numerical computations.
To show existence for (DP), we first reformulate the dual problem. Since is quadratic w.r.t. , the inner -problem is solved by
[TABLE]
where the mapping is defined via
[TABLE]
for almost all and all , .
Remark 1
The map is related to the adjoints of the projections and from (2) by .
Inserting (4) into (DP) yields
[TABLE]
Again, the inner optimization problem is quadratic w.r.t. so that its solution is given by
[TABLE]
Inserted in (6), this results in the following dual problem
[TABLE]
To prove existence of solutions for this problem, we need to require the following
Assumption 1
The domains and are compact. Moreover, the cost function is in and fulfills . Furthermore, the marginals and satisfy and , . In addition we assume that .
Remark 2
The last assumption on the normalization of the marginals is just to ease the subsequent analysis and can be relaxed by , which is needed anyway to ensure the existence of a solution to the primal problem, see Lemma 1.
Remark 3
Note that there is an obvious source of non-uniqueness for the dual problem (D): We can add a constant to and subtract it from and this does not change the dual objective, i.e for any constant it holds that . This non-uniqueness will not cause trouble in the proofs and when convenient, we remove it, e.g. by demanding that .
Observe that the objective in (D) is also well defined for functions in with . This gives rise to the following auxiliary dual problem:
[TABLE]
Our strategy to prove existence of solutions to (D) is now as follows:
First, we show that (D’) admits a solution , see Proposition 1. 2. 2.
Then, we prove that and possess higher regularity, namely that they are functions in , , cf. Theorem 2.1. 3. 3.
Thus, is feasible for (D) and, since the feasible set of (D’) contains the one of (D), while the objective of (D’) restricted to -functions coincides with the objective in (D), this finally gives that is indeed optimal for (D).
The reason to consider (D’) is essentially that the objective is not coercive in , but only in (at least w.r.t. the negative part of ). Therefore, we have to deal with weakly∗ converging sequences in the space of Radon measures within the proof of existence of solutions. For this purpose, we need to extend the objective to a suitable set. To that end, let us define
[TABLE]
Note that, thanks to , it holds
[TABLE]
Of course, is also well defined as a functional on the feasible set of (D’) and we will denote this functional by the same symbol to ease notation. In order to extend to the space of Radon measures, consider for a given measure , the Hahn-Jordan decomposition and assume that . Then, we set . With a slight abuse of notation, we denote this mapping by , too. Furthermore, for , is finite for as in Assumption 1. Regarding, the negative part, we define , where this expression is not properly defined, as and are both positive. Combining this, we obtain that .
Note in this context that, if the singular part of (w.r.t. the Lebesgue measure) vanishes, then also and -a.e. in so that both functionals coincide on , which justifies this notation. Furthermore, we also generalize the map to the measure space by setting
[TABLE]
Again, it is easily seen that, for , , this definition boils down to the one in (5). Also Remark 1 applies in that we can express in terms of the adjoints of and from (2) when defined appropriately.
The next lemma is rather obvious and covers the coercivity of in as indicated above.
Lemma 2
Let Assumption 1 hold and suppose that a sequence fulfills
[TABLE]
Then, the sequences and are bounded in and , respectively.
Proof
We rewrite as . The positivity of then implies
[TABLE]
which gives the first assertion. To see the second one, we use to estimate
[TABLE]
which finishes the proof.∎
The next lemma provides a lower semicontinuity result for w.r.t. weak∗ convergence in . Note that, here, we need the extension of as introduced above.
Lemma 3
Let Assumption 1 be fulfilled and a sequence be given such that in and for all . Then there holds and
[TABLE]
Proof
By virtue of Lemma 2, is bounded in and thus, there is a subsequence of , to ease notation denoted by the same symbol, that converges weakly in to some . Since the set is clearly weakly closed, we have a.e. in . With a little abuse of notation, we denote the Radon measure induced by by , too. If we define , then in with . Thus we have with two positive Radon measures , . The maximality property of the Hahn-Jordan decomposition then implies . Since is absolutely continuous w.r.t. , the same thus holds for , i.e. . Applying again , which clearly also holds for the densities pointwise -almost everywhere, we moreover deduce from the weak convergence of in that
[TABLE]
which implies as claimed. Since the above reasoning applies to every subsequence that is weakly converging in , (11) holds for the whole sequence , which together with the weak∗ convergence of and the definition of , gives (10).
∎
Before we are in the position to prove existence for (D’), we need two additional results on the -operator in the space of Radon measures.
Lemma 4
If , and , then it holds that
[TABLE]
Proof
We estimate
[TABLE]
Taking and using gives
[TABLE]
Now we start again at (12) and estimate from below by taking to get
[TABLE]
which implies
[TABLE]
which completes the proof.∎
The next lemma will be used to show that the negative part of the minimizer of (D) does not have a singular part.
Lemma 5
Let and for with Lebesgue decompositions, satisfying and for .
It holds that
[TABLE] 2. 2.
If is absolutely continuous for , then for for , it holds that
[TABLE]
Proof
We first proof point 1. The measures , exist by Lebesgue’s decomposition theorem, see Theorem 1.155 in fonseca2007calculusofvariations . We combine these decompositions with to arrive at Lebesgue’s decomposition of with respect to , namely
[TABLE]
(which holds true because ). Now, we consider the Hahn-Jordan decomposition of ,
[TABLE]
and obtain from (14) that
[TABLE]
Furthermore,
[TABLE]
where the singularity with respect to is due to (15) and (16) and the singularity with respect to is due to (17). Thus,
[TABLE]
as is a positive measure. Consequently,
[TABLE]
Repeating this argument with the Hahn-Jordan decomposition of yields the claim.
The second part of the lemma is a direct consequence of the first: Since , the first summand in the functional is equal for and . However, the second summand in can not decrease since , and if the duality pairing does not exist. ∎
Now we are ready to prove the existence result for (D’):
Proposition 1
Under Assumption 1 the minimization problem (D’) admits a solution .
Proof
We proceed via the classical direct method of the calculus of variations. For this purpose, let with be a minimizing sequence for (D’), where we shift and by adding and subtracting constants such that we obtain . Note that, due to its additive structure, this does not change the objective in (D’), cf. Remark 3.
Next, let us define . Then, thanks to (9) and Lemma 2, the sequence is bounded in . Hence, there is a weakly∗ converging subsequence, which we denote by the same symbol w.l.o.g., i.e. in . Now, Lemma 3 applies giving that
[TABLE]
Since is bounded in , the same holds for and, as is normalized, Lemma 4 gives that is bounded in , . Therefore, we can select a further (sub-)subsequence, still denoted by the same symbol to ease notation, such that
[TABLE]
Since the mapping is the adjoint of the projection mapping C(\Omega)\ni\varphi\mapsto\big{(}\int_{\Omega_{2}}\varphi\,\mathrm{d}\lambda_{2},\int_{\Omega_{1}}\varphi\,\mathrm{d}\lambda_{1}\big{)}\in C(\Omega_{1})\times C(\Omega_{2}), see Remark 1, it is weakly∗ continuous so that
[TABLE]
Next, we investigate the singular parts of and . We start with the positive part and employ Lebesgue’s decomposition of and :
[TABLE]
In the following we will see that the regular parts , , are exactly the solution of (D’). For this purpose, we first show that the positive parts of and vanish. We have , , and, by uniqueness of Lebesgue’s decomposition, . But from (18), we know that . Combining this fact with Lemma 5, applied to the case , , and , we obtain
[TABLE]
and consequently, for by positivity. Therefore, are -functions rather than measures and
[TABLE]
Now Lemma 5 shows feasibility of for (D’) and we also see that
[TABLE]
which demonstrates the optimality of . ∎
In the following, we assume that . If this is not the case, then we can again shift and without changing the value of , cf. Remark 3.
Theorem 2.1
Let Assumption 1 hold. Then every optimal dual solution from Proposition 1 satisfies , , and is therefore also a solution of the original dual problem (D). Moreover, the negative parts of are bounded and the function has marginals the and .
Proof
We again consider the positive and the negative part separately and start with . Let and be fixed, but arbitrary. Then, thanks to
[TABLE]
Proposition 1 implies that so that is feasible for (D’). Therefore, the optimality of for (D’) yields
[TABLE]
Owing to the continuous differentiability of , the first integrand converges to -a.e. in for . Moreover, the Lipschitz continuity of the -function gives that
[TABLE]
holds for . Hence, due to Lebesgue’s dominated convergence theorem, we are allowed to pass to the limit and obtain in this way
[TABLE]
Since was arbitrary, the fundamental lemma of the calculus of variations thus gives
[TABLE]
Next, define the following sequence of functions in :
[TABLE]
where is the lower bound for from Assumption 1. Then we have -a.e. and -a.e. in so that the monotone convergence theorem gives
[TABLE]
Thus there exists such that
[TABLE]
where is the threshold for from Assumption 1. Now assume that -a.e. on a set of of positive Lebesgue measure. Then
[TABLE]
which contradicts (23). Therefore, -a.e. in , which even implies that . Concerning , one can argue in exactly the same way to conclude that , too.
For the positive parts we find
[TABLE]
where we used (21) and the boundedness of the negative parts proven above. Note that the constant shift, potentially needed to ensure has no effect on the equation in (21) due to the additive structure of .
We have thus shown that is feasible for (D). Since solves (D’), whose objective is the same as in (D), while its feasible set is larger, this implies that we have found a solution to (D). ∎
We now show that, if is of the form with two functions , , and has the marginals and , respectively, then it solves the necessary and sufficient optimality conditions of the primal problem (1) in form of the following variational inequality:
[TABLE]
Herein, is the (convex) feasible set of (1), i.e.
[TABLE]
For this purpose, let be fixed but arbitrary. Multiplying the equality constraints in with and , respectively, integrating the arising equations and add them yields
[TABLE]
where we used for the last inequality. Using the feasibility of , we find similarly
[TABLE]
Combining (25) and (26) now yields (VI). As (1) is a strictly convex minimization problem, this shows that, if has the form with functions and satisfies , then it is a solution of (1). On the other hand, we know from Theorem 2.1 that, under Assumption 1 (more or less needed for the existence of solutions of (1) anyway), there always exist so that satisfies the equality constraints in . Therefore, in summary we have deduced the following:
Theorem 2.2 (Necessary and Sufficient Optimality Conditions for (1))
Under Assumption 1, is a solution of (1) if and only if there exist functions , , such that the following optimality system is fulfilled:
[TABLE]
The significance of Theorem 2.2 lies in the fact that we can characterize optimality of by just two equalities in and , respectively, namely (27b) and (27c). Thus, we effectively reduce the size of the problem from searching one function on to searching two functions, one on and one on (similarly as for entropic regularization, cf. carlier2017convergence ). This will be exploited numerically in Section 3.
2.3 Regularization of the dual problem
As seen before, the dual problem in (D) is not uniquely solvable. One source of non-uniqueness is of course the kernel of the map . This kernel is one-dimensional and is spanned by the function , which could be easily taken into account in an algorithmic framework. However, there is another source of non-uniqueness due to the max-operator that cuts of the negative part. Here is a simple example where dual solutions are not unique: For , , and
[TABLE]
one can show by a straight forward calculation that, for every , the tuple
[TABLE]
solves the optimality system (27b)–(27c). This shows that the potential structure of non-uniqueness might become fairly intricate. A situation like this can certainly happen in the discretized problem we will derive in Section 2.4 and can lead to problems when we derive algorithms for the discrete problem since non-unique solutions imply a degenerate Hessian at the optimum.
Therefore, we investigate the following regularization of the dual problem:
[TABLE]
with a regularization parameter . It is clear that the additional quadratic terms in the regularized objective yield that the latter is strictly convex and coercive in . Therefore, for every , (Dε) admits a unique solution.
Proposition 2
Let be a sequence converging to zero and denote the solutions of (Dε) with by . Then the sequence admits a weak accumulation point, every weak accumulation point is also strong one and a solution of the original dual problem (D).
Proof
Let denote an arbitrary globally optimal solution of (D) (whose existence is guaranteed by Theorem 2.1). Then the optimality of for (D) and of for (Dε) (with ) gives
[TABLE]
which implies
[TABLE]
Thus, the boundedness of in . This in turn gives the existence of a weak accumulation point as claimed.
Now assume that is such a weak accumulation point, i.e.
[TABLE]
(for a subsequence). Using again the optimality of and , respectively, we obtain
[TABLE]
On the other hand, by convexity and weak lower semicontinuity of we get from (29) and (30) that
[TABLE]
which gives in turn the optimality of the weak limit. Estimate (28) for the choice shows that
[TABLE]
and thus, we have
[TABLE]
but would imply
[TABLE]
and consequently, we have in . ∎
Theorem 2.3
Let be a sequence converging to zero and denote the solutions of (Dε) with again by . Moreover, define
[TABLE]
Then converges strongly in to the unique solution of (1).
Proof
From (28), we know that is bounded and hence, is bounded in . Thus,
[TABLE]
for some subsequence. Now we show that is the optimal for (1). Weak closedness of implies . Integrating the first-order optimality conditions for (Dε)
[TABLE]
against some , inserting the definition of , and integrating over yields
[TABLE]
Passing to the limit we obtain
[TABLE]
and thus, satisfies the first equality constraint in (1). The second equality constraint can be verified analogously.
To show optimality of , we test the optimality conditions in (33) and (34) with and , respectively, and get
[TABLE]
where is the primal objective from (3). Similarly, we get
[TABLE]
where is the unique solution of (1) and solves the dual problem (D). Now, putting everything so far together, we obtain
[TABLE]
On the other hand, is weakly lower semicontinuous, and therefore
[TABLE]
This gives the optimality of and by strict convexity also uniqueness, i.e. . Thus, the weak limit is unique and a well known argument by contradiction therefore implies the weak convergence of the whole sequence to . Finally, strong convergence follows from a standard argument. ∎
2.4 The discrete dual problem
We show a simple discretization of the quadratically regularized optimal transport problem (1) by piecewise constant approximation in Appendix A. To keep the notation concise, we state the corresponding discrete optimal transport problem and illustrate the duality already here. This will be the basis of our algorithms we derive in Section 3. A discrete version of the continuous problem (1) is the finite-dimensional problem
[TABLE]
where denotes the vector of all ones, , denote the discretized marginals with , and denotes the discretized cost. Note that we slightly changed the notation from and to and , respectively. For the discrete form of the optimality system (27) we further replace the Lagrange multipliers and by and , respectively, and get
[TABLE]
where , and is the “outer sum”. The discrete counterpart of from (D) is
[TABLE]
where denotes the Frobenius norm.
We write the optimality condition (36b)-(36c) as a non-smooth equation in with
[TABLE]
(note that and ). Since is the composition of Lipschitz continuous and semismooth functions, we have the following result (for the chain rule for semismooth functions, see e.g. (OPTpdecon, , Thm. 2.10)):
Lemma 6
The function (and thus, the gradient of ) is (globally) Lipschitz continuous and semismooth.
3 Algorithms
The optimality system (36b), (36c) for the smooth and convex problem (D) can be solved by different methods. In blondel2018smoothOT the authors propose to use a generic L-BFGS solver and also derive an alternating minimization scheme, which is similar to the non-linear Gauss-Seidel method in the next section, but differs slightly in the numerical realization and roberts2017gini also uses an off-the-shelf solver. Here we propose methods that exploit the special structure of the optimality system: A non-linear Gauss-Seidel method and a semismooth Newton method.
3.1 Non-linear Gauss-Seidel
The method in this section is similar to the one described in the Appendix of blondel2018smoothOT , but we describe it here for the sake of completeness. A close look at the optimality system
[TABLE]
shows that we can solve all equations in (38a) for the in parallel (for fixed ) since the th equation depends on only. Similarly, all equations in (38b) can be solved for the if is fixed. Hence, we can perform a non-linear Gauss-Seidel method for these non-smooth equations (also known as alternating minimization, nonlinear SOR or coordinate descent method for chen2002convergence ; wright2015coordinatedescent ), i.e. alternatingly solving the equations (38a) for (for fixed ) and then the equations (38b) for (for fixed ). The whole method is stated in Algorithm 1. Since is convex with Lipschitz continuous gradient (cf. Lemma 6) the convergence of the algorithm follows from results in bertsekas2016nlp .
Each equation for an or is just a single scalar equation for a scalar quantity and the structure of the equation is of the following form: For a given vector and right hand side , solve
[TABLE]
Of course, one can solve this problem by bisection, but here are two other, more efficient methods to solve equations of the type (39):
Direct search.
If we denote by the -th smallest entry of (i.e. we sort in an ascending way), we get that
[TABLE]
To obtain the solution of (39) we evaluate at the break points until we find the interval in which the solution lies (by finding such that ), and then setting
[TABLE]
The complexity of the method is dominated by the sorting of the vector , its complexity is .
Semismooth Newton.
Although is non-smooth, we may perform Newton’s method here. The function is piecewise linear and on each interval is has the slope (a simple situation with is shown in Figure 1). At the break points we may define and then we iterate
[TABLE]
If we start with , the method will produce a monotonically decreasing sequence which converges in at most steps. Actually, we can initialize the method with any that is strictly larger than .
Note that we do not need to sort the values of to calculate the derivative since we have . In practice, the method usually needs much less iterations than .
3.2 Semismooth Newton
As seen in Lemma 6, the mapping is semismooth and hence, we may use a semismooth Newton method chen1997 ; chen2000semismooth .
A simple calculation proves the following lemma.
Lemma 7
A Newton derivative of from (37) at is given by
[TABLE]
where is given by
[TABLE]
A step of the semismooth Newton method for the solution of would consist of setting
[TABLE]
However, the next lemma shows, that has a non-trivial kernel.
Lemma 8
Let be the Newton derivative of at defined in Lemma 7. Then the following holds true:
* is symmetric,* 2. 2.
* is positive semi-definite,* 3. 3.
* if and only if for all , .*
Proof
Symmetry of is clear by construction. To see that is positive semi-definite we calculate
[TABLE]
Due to the non-negativity of , this also shows the last point.∎
The third point of the lemma shows that the kernel of may have a high dimension, depending on the matrix . Hence we resort to a quasi Newton method where we regularize the Newton step arising from the dual problem from Section 2.2 by setting
[TABLE]
with a small . By chen1997 , the method still converges, but only a local linear rate is guaranteed. We note that we have not applied the semismooth Newton method to the regularized dual problem from Section 2.3. This would also be possible, but lead not only to the regularized Newton matrix from above but we would also have to adapt the objective in the computation of the update.
Let us make a few remarks on the the regularized Newton step and its numerical treatment.
- •
The matrix (and hence the Newton matrix ) is usually very sparse. The closer and are to the optimal ones, the closer is to the optimal regularized transport plan and for small this usually very sparse.
- •
Since is positive semi-definite, the regularized step could be done by the method of conjugate gradients. However, any linear solver that can exploit the sparsity of can be used.
As usual, the regularized semismooth Newton method may not converge globally. A simple globalization technique is an Armijo linesearch in the Newton direction. The full method is described in Algorithm 2.
4 Numerical examples
4.1 Illustration of
In our first numerical example we illustrate the how the solutions of the regularized problem converge for vanishing regularization parameter . We generate some marginals, fix a transport cost and compute solutions of the discretized transport problems (35) for a sequence and illustrate the optimal transport plans (and the related regularized transport costs). Our marginals are non-negative functions sampled at equidistant points , in the interval and we used and the cost is the squared distance between the sampling points. The results are shown in Figure 2. One observes that the optimal transport plans converge to a measure that is singular and is supported on the graph of a monotonically increasing function, exactly as the fundamental theorem of optimal transport ambrosio2013user predicts.
We repeat the same experiment where the cost is the (non-squared) distance . Here we had to choose larger regularization parameters as it turned out that values similar to Figure 2 would lead to almost undistinguishable results. The results are shown in Figure 3. Note the different structure of the transport plan (which is again in agreement with the predicted results from the fundamental theorem of optimal transport). In Figure 4 we show the results for the concave but increasing cost and again observe the expected effect that a concave transport cost encouraged that as much mass as possible stays in place (as can be seen by the concentration of mass along the diagonal of the transport plan).
4.2 Mesh independence and comparsion of SSN and NLGS
While we did not analyze our algorithms in the continuous case, we made an experiment to see how the methods converge when we change the mesh size of the discretization. To that end, we did a simple piecewise constant approximation of the marginals, the cost and the transport plan as described in Appendix A. This derivation shows that one has to scale up the marginals for finer discretization (or, equivalently, scale down the regularization parameter ) to get consistent results. We also took care to adapt the termination criteria so that we terminate the algorithms when the continuous counterpart of the termination criteria is satisfied (again, see Table 1 in Appendix A for details).
We used marginals of the form
[TABLE]
with varying and appropriate normalization factors and quadratic cost and discretized each instance of the problem with varying from to . We solved the problem for each size for regularization parameter with the semismooth Newton method from Algorithm 2 (with parameters and Armijo parameters and ) up to tolerance and report the number of iterations needed in Figure 5. As can be observed, the number of iterations is comparable for each instance of the problem. Moreover, it seems that the number of iterations does not grow with finer discretization (however, the number of iterations seems to oscillate unpredictable for coarse discretization). The would hint at mesh independence of the method and one could hope to prove this is future research. We performed a similar experiment for the nonlinear Gauss-Seidl method from Algorithm 1 (with larger regularization parameter and only up to and show the results in Figure 6. We see an overall increase of the number of iterations but only very slightly (with several instances where the number of iterations does not increasing with finer discretization).
4.3 Optimal transport between empirical distributions
As an example in two space dimensions, we consider two distributions . Instead of using these as marginals, we consider empirical distributions, i.e. we generate samples , sampled from and , sampled from . These samples give empirical approximations
[TABLE]
The optimal transport problem (1) with these two marginals does no fulfill Assumption 1, since the marginals are not -functions. However, we can consider it as a discrete problem optimal transport problem in the form (35) when we denote (for some cost ) and marginals and , respectively. We solve this discrete optimal transport problem and obtain a transport plan . Since we use quadratic regularization, the plan will be sparse and hence, we can visualize it by plotting arrows from to and we make the thickness of the arrows proportional to the size of the entry . In other words: The thickness of the arrow from to indicates how much of the mass in has been transported to . In Figure 7 we show the result for samples from an anisotropic Gaussian distribution (centered at the origin) and samples from a uniform distribution on a segment of an annulus. We used with the Euclidean norm and regularization paramater . The resulting plan has non-zero entries. For a comparison we show the result of entropically regularized optimal transport in the same situation in Figure 8. We used (which is the smallest value for which our naive implementation of Sinkhorn algorithms is still stable). The resulting plan has nonzero entries and we only plot lines for the transport which are larger than 1% of the largest entry in the optimal transport plan.
5 Conclusion
We analyzed the quadratically regularized optimal transport problem in Kantorovich form. While it is straight forward to derive the dual problem, our proof of existence of dual optima is quite intricate. We note that we are not aware of any proof of existence of the dual of other regularized transport problems in the continuous case besides the very recent clason2019entropic for entropic regularization. We derived two algorithms to solve the dual problems, both of which converge by standard results. It turns out that the semismooth quasi-Newton methods converges fast in all cases and that it behaves stably with respect to the regularization parameter in our numerical experiments. We even observe mesh independence of the method in the experiments. One drawback of the semismooth Newton method is (compared with, e.g. the Sinkhorn iteration cuturi2013sinkhorn ), is that we need to assemble the Newton matrix in each step. While this matrix is usually very sparse, one still needs to check cases, which may be too large for large scale problems. We did not investigate, how special structure of the cost function may help to reduce the cost to assemble the sparse matrix .
Acknowledgements.
We would like to thank the reviewer for helpful suggestions that lead to an improvement presentation and also Stephan Walther (TU Dortmund) for helping with the construction of the counterexample in Section 2.3.
Appendix A Discretization with piecewise-constant ansatz functions
For sake of brevity, we just consider an equidistant discretization of into intervals using piecewise constant ansatz functions, i.e.
[TABLE]
for coefficients and assume analogous definitions for the quantities , , , and . They have to coincide on average over the intervals. Again, we study this for and obtain that the identity
[TABLE]
holds. Again, analogous identities hold for the quantities , , , and . The ones with one-dimensional domain are scaled by instead of .
Now, we consider the discrete Algorithm 2, which operates on discrete quantities and establish a consistent mapping of the quantities from the discretization to the ones of the solver. We denote its input quantities by , , and its output quantities by , , , and . It solves for
[TABLE]
which we desire to correspond to
[TABLE]
We plug in the ansatz functions and obtain the identity
[TABLE]
We set and obtain
[TABLE]
Thus, the choice gives a consistent conversion. Similarly, we obtain . We proceed with the objective. Plugging in the ansatz functions into the continuous objective gives
[TABLE]
The solver computes
[TABLE]
Plugging in , and gives
[TABLE]
Thus, the consistent identity follows if we choose and . The solver computes as the solution of
[TABLE]
whereas the discretization of the corresponding continuous equation reads
[TABLE]
in terms of the coefficients. Plugging in the choices , , and yields equivalence of the latter equation to
[TABLE]
which is equivalent to the equation that is solved by Algorithm 2. The argument for is carried out analogously.
Regarding termination, the solver checks the criteria
[TABLE]
We only consider the first and plug the identity into it, which gives equivalence to
[TABLE]
This in turn is equivalent to
[TABLE]
Moreover, the ansatz functions for and are constant on , which induces equivalence to
[TABLE]
This implies that if the solver terminates, we have
[TABLE]
We summarize the choices for the consistent mapping of quantities arising from the discretization to quantities the solver operates on in Table 1.
Finally, we make a note on the calculation of the coefficients for the cost function :
[TABLE]
Conflict of Interest: The authors declare that they have no conflict of interest.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. In Modelling and optimisation of flows on networks , pages 1–155. Springer, 2013.
- 2[2] Dimitri P. Bertsekas. Nonlinear programming . Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA, third edition, 2016.
- 3[3] Mathieu Blondel, Vivien Seguy, and Antoine Rolet. Smooth and sparse optimal transport. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics , volume 84 of Proceedings of Machine Learning Research , pages 880–889, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
- 4[4] Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis , 49(2):1385–1418, 2017.
- 5[5] Xiaojun Chen. Superlinear convergence of smoothing quasi-newton methods for nonsmooth equations. Journal of Computational and Applied Mathematics , 80(1):105 – 126, 1997.
- 6[6] Xiaojun Chen. On convergence of SOR methods for nonsmooth equations. Numer. Linear Algebra Appl. , 9(1):81–92, 2002.
- 7[7] Xiaojun Chen, Zuhair Nashed, and Liqun Qi. Smoothing methods and semismooth methods for nondifferentiable operator equations. SIAM J. Numer. Anal. , 38(4):1200–1216, 2000.
- 8[8] Christian Clason, Dirk A Lorenz, Hinrich Mahler, and Benedikt Wirth. Entropic regularization of continuous optimal transport problems. ar Xiv preprint ar Xiv:1906.01333 , 2019.
