A Column Generation Approach to the Discrete Barycenter Problem
Steffen Borgwardt, Stephan Patterson

TL;DR
This paper introduces two column generation methods to efficiently compute the discrete Wasserstein barycenter, significantly reducing memory usage and computation time compared to traditional linear programming approaches.
Contribution
It develops and analyzes two novel column generation strategies, including a Dantzig-Wolfe decomposition, for solving large-scale discrete barycenter problems.
Findings
Both strategies outperform full linear programming in speed.
Memory requirements are dramatically reduced.
Performance varies depending on input data.
Abstract
The discrete Wasserstein barycenter problem is a minimum-cost mass transport problem for a set of discrete probability measures. Although an exact barycenter is computable through linear programming, the underlying linear program can be extremely large. For worst-case input, a best known linear programming formulation is exponential in the number of variables, but has a low number of constraints, making it an interesting candidate for column generation. In this paper, we devise and study two column generation strategies: a natural one based on a simplified computation of reduced costs, and one through a Dantzig-Wolfe decomposition. For the latter, we produce efficiently solvable subproblems, namely, a pricing problem in the form of a classical transportation problem. The two strategies begin with an efficient computation of an initial feasible solution. While the structure of the…
| LP (1.1) | Column Generation | |||||||||||
| 1-col | -col | all-col | DW-L | DW-A | ||||||||
| n | Var | Time | Var | Time | Var | Time | Var | Time | Var | Time | Var | Time |
| 12 | 2,177,280 | 5.22 | 270 | 22.96 | 825 | 7.01 | 1,199,800 | 10.60 | 646 | 18.78 | 478 | 14.67 |
| 14 | 4,976,640 | 13.68 | 234 | 50.42 | 634 | 11.73 | 2,688,032 | 35.39 | 581 | 39.70 | 638 | 41.35 |
| 14 | 5,971,968 | 56.68 | 232 | 60.31 | 669 | 14.51 | 3,272,679 | 29.11 | 684 | 56.50 | 422 | 49.91 |
| 12 | 25,288,704 | 235.17 | 322 | 354.28 | 903 | 90.70 | 12,727,161 | 344.33 | 489 | 162.41 | 625 | 218.18 |
| 14 | 28,449,792 | 358.84 | 308 | 423.30 | 1,004 | 109.97 | 14,572,552 | 215.50 | 803 | 302.01 | 545 | 252.43 |
| 15 | 31,850,496 | 378.90 | 309 | 480.10 | 961 | 106.10 | 17,992,481 | 258.98 | 1,604 | 689.25 | 779 | 303.90 |
| 17 | 63,700,992 | 1754.08 | 364 | 1,559.79 | 1,395 | 404.12 | 33,848,981 | 1,129.83 | 2,403 | 3,174.96 | 2,081 | 1,980.67 |
| 17 | 84,934,656 | 1858.22 | 348 | 2,225.41 | 1,209 | 364.50 | 44,303,632 | 1,945.65 | 2,023 | 2,882.76 | 1,209 | 1,538.60 |
| 18 | 127,401,984 | 3588.32 | 386 | 3,145.88 | 1,189 | 699.37 | 62,383,222 | 2,467.41 | 2,121 | 4,402.53 | 2,924 | 5,105.93 |
| 17 | 148,635,648 | * | 341 | 3,498.02 | 1,155 | 844.44 | 83,870,587 | 4,001.37 | 1,119 | 3,204.85 | 724 | 1,621.57 |
| 18 | 191,102,976 | * | 364 | 3,507.80 | 1,376 | 807.66 | 100,681,949 | 10,015.66 | 2,019 | 5,588.05 | 899 | 4,477.56 |
| LP (1.1) | 1-col | -col | all-col | DW-L | DW-A | |
|---|---|---|---|---|---|---|
| 2,177,280 | 2,380 | 42 | 44 | 1,330 | 33 | 30 |
| 4,976,640 | 6,080 | 85 | 86 | 1,530 | 63 | 56 |
| 5,971,968 | 7,360 | 100 | 102 | 2,820 | 70 | 72 |
| 25,288,704 | 28,210 | 398 | 399 | 12,690 | 212 | 223 |
| 28,449,792 | 36,040 | 447 | 450 | 10,720 | 244 | 267 |
| 31,850,496 | 42,820 | 499 | 501 | 20,790 | 286 | 274 |
| 63,700,992 | 94,310 | 990 | 995 | 28,810 | 542 | 564 |
| 84,934,656 | 125,740 | 1,290 | 1,290 | 57,220 | 691 | 751 |
| 127,401,984 | 199,450 | 1,920 | 1,930 | 84,130 | 1,020 | 1,060 |
| 148,635,648 | * | 2,240 | 2,250 | 67,200 | 1,150 | 1,260 |
| 191,102,976 | * | 2,880 | 2,880 | 86,810 | 1,520 | 1,920 |
| 1-col | -col | DW-L | |||
| None | With | None | With | None | |
| 2,177,280 | 42 | 41 | 44 | 42 | 33 |
| 4,976,640 | 85 | 84 | 86 | 84 | 63 |
| 5,971,968 | 100 | 100 | 102 | 100 | 70 |
| 25,288,704 | 398 | 398 | 399 | 399 | 212 |
| 28,449,792 | 447 | 446 | 450 | 446 | 244 |
| 31,850,496 | 499 | 499 | 501 | 500 | 286 |
| 63,700,992 | 990 | 988 | 995 | 989 | 542 |
| 84,934,656 | 1,290 | 1,280 | 1,290 | 1,280 | 691 |
| 127,401,984 | 1,920 | 1,920 | 1,930 | 1,920 | 1,020 |
| 148,635,648 | 2,240 | 2,240 | 2,250 | 2,240 | 1,150 |
| 191,102,976 | 2,880 | 2,880 | 2,870 | 2,870 | 1,520 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
11institutetext: 11email: [email protected]; University of Colorado Denver 22institutetext: 22email: [email protected]; Louisiana State University in Shreveport
A Column Generation Approach to the
Discrete Barycenter Problem
Steffen Borgwardt 11
Stephan Patterson 22
Abstract
The discrete Wasserstein barycenter problem is a minimum-cost mass transport problem for a set of discrete probability measures. Although an exact barycenter is computable through linear programming, the underlying linear program can be extremely large. For worst-case input, a best known linear programming formulation is exponential in the number of variables, but has a low number of constraints, making it an interesting candidate for column generation.
In this paper, we devise and study two column generation strategies: a natural one based on a simplified computation of reduced costs, and one through a Dantzig-Wolfe decomposition. For the latter, we produce efficiently solvable subproblems, namely, a pricing problem in the form of a classical transportation problem. The two strategies begin with an efficient computation of an initial feasible solution. While the structure of the constraints leads to the computation of the reduced costs of all remaining variables for setup, both approaches may outperform a computation using the full program in speed, and dramatically so in memory requirement. In our computational experiments, we exhibit that, depending on the input, either strategy can become a best choice.
Keywords: discrete barycenter, optimal transport, linear programming, column generation
MSC 2010: 49M27, 90B80, 90C05, 90C08, 90C46
1 Introduction
Optimal transport problems involving joint transport to a set of probability measures appear in a variety of fields, including recent work in image processing [16, 23], machine learning [15, 22, 29], and graph theory [27], to name but a few. The so-called Wasserstein barycenters take a central role in many of these applications: a barycenter is another probability measure which minimizes the total distance to all input measures (i.e, images) and acts as an average distribution in the probability space; the squared Wasserstein distance is of particular consideration due to the preservation of the geometric structure of the input. The wide scope of optimal transport problems makes challenging even a reasonably comprehensive summary; for recent monographs on the Wasserstein distance and computational optimal transport, we refer the reader to [14] and [20], in addition to the seminal work of Villani [28].
In almost all applications, the probability measures have discrete support, i.e., a finite number of points to which positive mass is associated. This leads to the so-called discrete barycenter problem, defined as follows: Given a set of probability measures , each with a finite set of support points in and associated masses, and a set of nonnegative weights with , find a probability measure on , that is, a (Wasserstein) barycenter, satisfying
[TABLE]
where is the quadratic Wasserstein distance and is the set of all probability measures on with finite second moments [1]. Since have finite sets of support points, we call the measures discrete. As the are measures, the total mass of their support points sums up to . With representing the mass of , this can be denoted as . Because are discrete, the solution measure also has a finite set of support points, and the Wasserstein distance is the squared Euclidean distance [3, 8, 28].
The discrete barycenter problem is a multi-marginal optimal transport problem, and as such is significantly more challenging than the classical two-marginal optimal transport problem referenced later in Section 2. In fact, multi-marginal optimal transport is no longer a network flow problem [18], and a variant of the problem – finding optimal solutions with a bound on the size of the support set – has recently been shown to be NP-hard [7]. Further, there is current work on NP-hardness in all situations [2].
Therefore, considerable activity continues on exact, approximate and heuristic methods of computation, such as alternating minimization algorithms [21, 26, 30]. State-of-the art approximation methods solve the entropy regularized optimal mass transport problem introduced in [9]. Entropic regularization leads to a strongly convex program, and its smoothing effects give qualitatively different solutions from exact barycenters [4]. Entropy regularized transport problems can be solved efficiently, with a linear-in- complexity bound, using iterative Bregman projection algorithms [4, 10, 25], although work continues on the stability and complexity of these algorithms, e.g., [17]. In contrast, exact solutions to the discrete barycenter problem are commonly used for benchmarking purposes and only possible for small input; one of their advantages is that they find a barycenter of provably sparse support and associated transport that is non-mass splitting (see Definition 1). Exact barycenters can be computed by linear programming [1, 3, 8], but all known LP formulations scale exponentially.
The vast majority of algorithms in the literature, including the above examples, are based on an explicit specification of a discrete set of support points that may be allocated mass; see, e.g., [4, 5, 6, 8, 10, 17, 24]. The search for an optimal in Eq. 1 is replaced by a search over . The size of typically is the main bottleneck for the practical performance of algorithms [2].
Different types of input lead to a different level of challenge. In image processing, for example, the probability measures are supported on the same structured set (a pixel grid). This highly structured support is a best-case input. In this setting, a barycenter can be computed exactly in polynomial time [6]. In practice, the cost is still prohibitive: an exact barycenter lies in an -times finer grid. It is common practice to use a coarser grid to find an approximate barycenter. The original grid already contains a -approximation [5].
By contrast, a worst-case input occurs for measures with no known structure, such as in wildfire ignition points or crime locations [6]. Then it becomes difficult to specify a small set of possible support points for a ‘good’ approximation or one that allows for the computation of an exact barycenter. However, the existence of sparse solutions to the problem [3] for any input indicates that strategies which dynamically introduce support points, or collections of support points, would be promising to approach these difficult instances.
In this paper, we use linear programming theory and column generation techniques to take a step in this direction. We will advance the state-of-the-art on the computation of exact barycenters for such worst-case input. These computations will still remain costly.
1.1 Linear Programming for The Discrete Barycenter Problem
The discrete barycenter problem can be solved exactly by linear programming [1, 3, 6, 8], but all known LP formulations may require an exponential number of variables, scaling by the product of the sizes of the support sets of the input measures [6]. Some formulations also have an exponential number of constraints, but for any input there exists one with an extremely low number of constraints. These dimensions indicate the linear program is a promising candidate for column generation.
The so-called non-mass-splitting property, satisfied by all exact barycenters, is a crucial tool for the linear programming approach to the problem that we use in this paper. It states that any optimal transport plan (the support points in each to which each support point in transports mass, and the amount of mass transported) of a barycenter may not send mass to more than one support point in each measure [1, 3]. This property is fundamental to the modeling of many physical applications where a mass split would be infeasible.
Definition 1 (The Non-mass-splitting Property)
The mass of each barycenter support point is transported fully to a single support point in each measure; that is, for each with corresponding mass , , there exists exactly one , to which the entire mass is transported in any optimal transport plan.
Since the non-mass-splitting property holds for all barycenters, each support point in a barycenter is associated with a single combination of input support points, consisting of the points to which its mass is transported. The set of combinations of input support points is denoted , with elements , . Each combination has an associated weighted mean . The weighted mean is the optimal location for joint mass transport to the points in the combination . Therefore, specifying the set to contain all distinct weighted means makes it the set of all possible support points for the barycenter.
This notation allows us to formally describe the worst-case setting to which our algorithm will be tailored: when each combination produces a different weighted mean . Then we say the measures are in general position, and using to denote the size of the support set of , the number of distinct is . Thus the number of weighted means is exponential in the number of input measures – without additional knowledge, the set of possible support points would be of size .
We provide an example of a discrete measure in in Figure 1 (left), and three measures in general position in Figure 1 (right). For this tiny example, verifying that the measures are in general position is elementary but somewhat tedious, as the weighted means of all 27 combinations of support points must be computed and verified as unique; in general, verifying whether a particular set contains the correct possible support points is NP-hard [6]. A barycenter for these measures is displayed in Figure 2 (left), shown with associated transport in Figure 2 (right).
In this paper, we build on an LP formulation from [6] that, among known formulations, requires the fewest variables and constraints for general position measures. In this formulation, which we call LP (1.1), a variable is introduced for each combination of support points from : each has a corresponding variable representing the mass assigned to and transported fully to each , . The total transport cost of a unit of mass from is given by .
Constraints arise from the requirement that the total transport to each support point in each measure is exactly equal to its mass . This produces one equality constraint for each in each measure; that is,
[TABLE]
Recall that for each . Thus there exists a feasible solution to the union of all these constraints, and it has to satisfy . The constraints can be represented as a real matrix times a vector , equal to right-hand side . In , column contains ones in the rows where , and zeroes otherwise. The matrix is highly structured, which will be both good and bad for our purposes; we take a closer look at it in Section 2. With as the vector of variable masses , as the vector of associated costs , and as the vector of masses of the support points in the input, a full formulation of LP (1.1) is as follows:
[TABLE]
The number of constraints scales linearly, equal to the total number of support points in the input measures. Meanwhile, the number of variables scales exponentially in the number of input measures. This extreme difference in the scaling of number of rows and columns further motivates our interest in a column generation approach.
1.2 Outline
In Section 2, we develop two column generation algorithms and discuss how to exploit the structure of the problem for efficient pricing and a memory-efficient implementation. In Section 3, we devise a simple greedy algorithm to find an initial feasible solution to start these algorithms. Section 4 contains some computational experiments. The results demonstrate the practical advantages of the algorithms over direct computations using LP (1.1), and exhibit that the different algorithms become a best choice depending on the input. We finish with some concluding remarks in Section 5.
2 Column Generation
We briefly recall the basics of a column generation strategy for solving a linear program; for additional details see for instance [12]. Column generation is the process of dynamically adding variables to any linear program, and begins with a version of the linear program containing a small subset of the variables. This (reduced) linear program is called the master problem. Additional variables are chosen using a sub-problem, which we call the pricing problem, in which the current optimum of the restricted master problem is used to produce a new, potentially improving column. Column generation terminates when the optimal value of the pricing problem is no longer negative.
The pricing problem must contain enough information for its solution to be meaningful, but also remain sufficiently simple to be efficiently solvable. In our first column generation algorithm, we apply column generation naturally to LP (1.1), described in Section 2.1. In our second column generation algorithm, we apply column generation to an alternative linear program, described in Section 2.2. We produce an efficiently solvable pricing problem for the alternative LP by exploiting the structure of the constraint matrix; this structure is described in Section 2.3.
2.1 Column Generation for LP (1.1)
We first devise a strategy to apply column generation directly to LP (1.1). At any iteration of the column generation process, the master problem contains variables which correspond to a subset of the possible combinations of support points , and because the measures are assumed to be in general position, each variable corresponds to a unique point in the set of all possible support points . Thus an improving column produced by column generation is precisely a new support point to include in , and column generation for LP (1.1) corresponds directly to the stated goal of the dynamic generation of .
We produce a pricing problem as follows. Let y be the vector containing the dual values associated with the mass transport constraints of any master problem based on LP (1.1). Then a pricing problem using the vector of reduced costs is:
[TABLE]
where and is the column of . Since the vectors y and contain elements, each individual evaluation of is efficient, even if has not yet been computed. For instance, as in [6], can be computed directly and efficiently from the combination using
[TABLE]
Therefore the exponential scaling of with the number of measures is the sole source of inefficiency for using Equation (2) in column generation. The structure of , described in Section 2.3, allows a memory-efficient evaluation of the product .
Equation (2) produces one column to introduce to the master problem each iteration of column generation: the variable with index where is minimal. To reduce the number of times the exponential vector must be processed, in our computational experiments, we also consider two alternative pricing problems: one where we introduce all columns where is negative, and one where we introduce the best columns to the master problem at each iteration.
2.2 Dantzig-Wolfe Reformulation
In this section, we present a linear program based on the convex hull of vertices of a polyhedron. If the polyhedron is generated by the constraints of LP (1.1), that is, , then the linear program is an alternative to LP (1.1). Vertex-form linear programs are not generally considered for computation, since the majority of such polyhedra have exponentially many vertices. However, for general position measures, LP (1.1) already scales exponentially, so computations using a vertex formulation face the same challenges, and as we will see, have the same potential benefits from column generation with a low number of constraints and large number of variables.
In our second column generation algorithm, in addition to reformulating LP (1.1) for a new vertex-form master problem, we also perform a decomposition of the constraints. A decomposition of a vertex-form linear program is called a Dantzig-Wolfe reformulation, presented in [11]. The decomposition begins by partitioning the constraint matrix as and right-hand side . A preselected number of rows are assigned to the matrix for use in the separate pricing problem. Note that reordering the rows of does not affect the underlying polytope, so the matrix does not need to be precisely the first rows of ; however, in our analysis in Section 2.3 we will assume that the measures have been ordered such that those chosen for the pricing problem are first. The remaining rows of are assigned to the matrix and remain in the master problem.
The pricing problem produces vectors p that are vertices of the polyhedron ). These vectors represent potential distributions of mass to each possible combination in , but are typically not (individually) feasible for the full problem . They are added to the master problem through the products and in the objective and constraints, respectively. These products for all produced p are then combined in a convex combination with weights in the new variable vector to produce fully feasible solutions. The resulting master problem has both a limited number of variables due to the column generation, and a slightly reduced number of constraints due to the decomposition, and is now called the restricted master problem.
[TABLE]
We confirm that the structure that makes LP (1.1) a prime candidate for column generation is preserved in LP (2.2): the number of constraints is bounded above by , as LP (2.2) has just one additional constraint for convexity and a (possibly improper) subset of the rows. Ideally, only a fraction of the total number of vertices are used in LP (2.2), so that the number of columns remains low, as well.
Recall that the pricing problem uses the current optimum of the restricted master problem to produce a new column to introduce to LP (2.2). Specifically, the objective function of the pricing problem requires the dual solution to LP (2.2), where y was the dual solution corresponding to the constraints in LP (1.1). We will now denote the dual solution to by , where contains the dual values associated with the mass transport constraints , and is the dual value associated with the convexity constraint in LP (2.2). Then the base form of the pricing problem is:
[TABLE]
LP (2.2) is still an exponential-sized linear program: The constraint matrix has an exponential number of columns, as does the matrix , and the cost vector has an exponential number of elements. In fact, LP (2.2) contains the same number of variables as LP (1.1). We now develop an improved pricing problem using information specific to the barycenter problem.
2.3 The Structure of the Coefficient Matrix
Recall that contains only elements and [math]: in column , there is a when is in the tuple , that is, , and [math] otherwise. In fact, each column contains exactly nonzero coefficients. The pattern created within the matrix is displayed in Example 1: each row has consecutive ones alternating with consecutive zeros. For each measure, the consecutive ones start in the first column for the first constraint in each measure, then start in the second row immediately after the end of the previous consecutive ones, continuing to the last constraint of the measure, forming a block. The width of the block depends on the measure with which the constraints are associated. The number of consecutive ones equals the product of the sizes of the measures with a higher index: the rows of associated with , , contain consecutive ones. The block for the final measure is the identity matrix.
Example 1
The matrix for four measures with sizes and contains blocks of ones and zeros. The width of block structure for particular constraints depends on the index of the corresponding measure . Here there are 36 total columns, and the number of consecutive ones for each measure is 18, 6, 3, and 1, respectively.
[TABLE]
∎
Since the number of consecutive ones for each row of can be generated from the sizes of the support sets , the columns of are easily generated solely from the problem input. Formulas for generating a given column are provided in Algorithm 1. While Algorithm 1 is written with the sums and products in their respective formulas, recalculating these values on repeated runs can be avoided by storing the number of consecutive ones and the width of a full block for each measure.
Because is easily generated, the matrix is not required in memory. Instead, in the objective function, is calculated using the to determine which dual values should be added. For further computational efficiency, updates to elements of are only required for those values of which have changed from the previous iteration; due to the sparse nature of , many elements may remain unchanged.
By taking as the first rows of , the pattern of consecutive ones also guarantees that always has many duplicate columns. Continuing with the matrix from Example 1, in Example 2, we assign the constraints for the first two measures to , resulting in a matrix with six unique columns, each repeated six times. For any number of measures , partitioning the constraints for measures, , to results in unique columns, while the number of times each column is duplicated is . For a fixed , the number of unique columns is no longer exponential; we justify the choice momentarily.
Example 2
Using the matrix from Example 1, a decomposition of all constraints associated with the first two measures into the pricing problem gives this matrix .
[TABLE]
Every column is repeated six times: . The matrix of unique columns is .
[TABLE]
∎
Replacing the constraint matrix in LP (2.2) with the matrix of unique columns requires a corresponding change to the objective function. Noting that LP (2.2) is the minimization of a linear objective, only the most negative coefficient for each unique column is required: in an optimal solution, all mass is assigned to such a column. Thus, it suffices to keep a best-cost vector for the unique columns. These two substitutions produce LP (2.3).
[TABLE]
Using LP (2.3) improves solvability and memory requirements in two major ways: when , LP (2.3) requires just variables, a tremendous reduction from . The number of variables does not depend on . When the input measures have support sets of equal size , this eliminates variables. Additionally, the constraint matrix is stored in memory, so the benefit of replacing with is significant.
Using LP (2.3) instead of LP (2.2) requires additional preprocessing each iteration to construct the best-cost vector , which does use the exponential-sized vector . The preprocessing for LP (2.3), repeated each iteration, is given in Algorithm 2. In particular, Algorithm 2 highlights the important selection of which indices correspond to the unique columns used in LP (2.3).
2.4 Decomposition of Constraints for Exactly Two Measures
As in Example 2, we partition with ; that is, we always partition where , and subsequently the matrix of unique columns , contains all rows of constraints associated with exactly two measures. LP (2.3) has a linear objective function ; because of the linear objective and the structure of the constraints for two measures, LP (2.3) is a classical transportation problem [13, 19], a special case of a minimum-cost flow problem. Therefore, LP (2.3) can be solved in strongly polynomial time [3, 6].
Theorem 2.1
Let be the constraints associated with exactly two measures. Then LP (2.3) is a classical transportation problem and can be solved in strongly polynomial time.
We conclude this discussion with an examination of the efficiency of adding a column produced by LP (2.3) to LP (2.2). Once the column generation process has begun, the previous pricing problem LP (2.3) produces a solution q containing the nonzero elements of a new to be introduced to LP (2.2). This q has, trivially, at most nonzero elements, and in fact, there must exist a smaller solution of size . Recall from Section 2 that is easily generated, so the pricing problem does not require to be stored in memory. The restricted master problem also does not require to be stored; instead, Algorithm 1 is used to calculate the new column . Combined with the small number of nonzero elements of , a computation of , as well as of , can be done efficiently. For additional efficiency, the solver for LP (2.2) uses the previous solution as a warm start. Using the primal simplex method then typically finds a new optimal solution in just a few simplex steps for each update of LP (2.2).
Next, we turn to the master problem and describe a method for generating an initial feasible start for both column generation algorithms.
3 Constructing a Feasible Solution
To initialize column generation, both algorithms require enough variables such that an initial feasible solution exists, along with a feasible solution.
For LP (1.1), a feasible solution is any w which solves the full system , and a feasible master problem is produced by the variables associated with a positive value in w. A feasible solution for LP (2.2) is related to the feasible solution w as follows. First, consider the introduction of a single initial column (so begins at ). Then the convexity constraint requires , and the remaining constraints subsequently require . Furthermore, all p need to satisfy , and thus should also be a solution to the full system .
We considered two methods for constructing a vertex: a greedy construction and the 2-approximation algorithm from [5]. The appeal of the 2-approximation algorithm lies in its ability to efficiently provide a good potential optimum. However, we found in experiments that initialization with a 2-approximation vertex is consistently outperformed by initialization with a greedily constructed vertex (this was not due to increased time to generate the vertex but because additional iterations were required before strictly improving columns were found). Therefore, we present just the greedy construction algorithm, a generalization of the north-west corner rule.
The following algorithm greedily constructs a solution to . The process begins by generating a combination . In the first step, each has corresponding mass , and the maximum mass that can be assigned to without violating the non-mass-splitting property is the minimum mass among . That minimum mass is placed at index in , denoting that mass is transported to from the optimal location for such mass assignment (the corresponding weighted mean ). The algorithm than computes the remaining mass each of the points still needs to receive full mass . Then the combination is updated; for each measure, if the current support point has not yet received full mass (), the support point remains in the combination. However, at least one measure’s support point has been fully supplied by the greedy mass assignment; for these measures a new support point is chosen, guaranteeing a new combination. The process then repeats, assigning the minimum mass not yet received at each support point to new combinations until all mass has been supplied.
This process is given in Algorithm 3. Note that in the first step of each repeat of the algorithm, a combination of support points is formed before the corresponding index . Therefore Algorithm 3 uses the double indexed notation as the support point in measure with corresponding mass , and computes the index for the combination.
Theorem 3.1
Let be discrete probability measures. Then Algorithm 3 runs in in the arithmetic model of computation.
Proof
First, we show that the number of nonzero elements produced by Algorithm 3, which is also the number of repetitions of the outer loop of Algorithm 3, is between and The lower bound, , is an immediate consequence of the non-mass-splitting property maintained by Algorithm 3. For the upper bound, , note that the last iteration must fully supply the mass to points, one from all , because the total mass for each is the same (one). In each previous iteration, the minimum number of support points whose index changes is one, for a total of iterations.
Thus the outer loop runs in time. Since each step inside the loop of Algorithm 3 requires at most linear-in- elementary operations, we obtain Theorem 3.1. ∎
Corollary 1
For probability measures with support sets of size at most , Algorithm 3 runs in in the arithmetic model of computation.
Proof
Let be discrete probability measures with a bound on the size of their support sets. Then , and becomes ∎
As an additional consequence of the iteration bound , the number of nonzero mass elements of w are bounded. Therefore the setup of the master problem LP (2.2) using a result produced by Algorithm 3 is efficient.
We now show that Algorithm 3 produces a vertex of the polytope generated by the constraints .
Theorem 3.2
Algorithm 3 generates a vertex of the polytope .
Proof
Let , be given and let w be generated using Algorithm 3. We show there exists a such that w is the unique optimal solution to:
[TABLE]
Let be the set of nonzero elements of w, with size . Order the elements of in order of construction by Algorithm 3, . Also order the associated indices as calculated by Algorithm 3.
First, we show that w is a feasible solution to the above system. For each support point in each measure , the current value is initialized as its full mass . The algorithm begins with a combination of support points from each measure, identifies the smallest mass among them (line ) and sets the mass for this combination to (line to get the correct index of the combination; line for the assignment). Then the current masses of all support points in the combination are reduced by (line ). The current mass of at least one of the support points must have dropped to [math]; then a new support point is picked from the respective measure (lines and ) and the process is repeated. To see why this yields a feasible solution, recall that the total mass in each measure is precisely and note that, in line , the total current mass in each measure is dropped by the same value . The algorithm runs until (line ), i.e., until the total mass of each support point in each measure is fully accounted for. This gives feasibility of w.
Next, we construct a such that w is a unique optimal solution. Let , , …, and . Let all other , those whose -index is not in , be (Recall: ).
By construction, removing mass from a combination with a lower index and assigning it to a combination with higher index in , that is, from to with , will strictly increase the value of . This includes moving mass to a combination with no mass in w, that is, with an index not in .
So it suffices to show that mass cannot be reassigned from to , . The mass is chosen such that for at least one , the mass has been fully supplied. Therefore cannot be increased without violating the constraints .
Therefore w minimizes subject to , since the maximum mass allowable is assigned to the cheapest costs. Furthermore, w does so uniquely, since any change in its elements will strictly increase the value of due to the construction of . Therefore w is a vertex. ∎
In Figure 3 (left), we display an example with three measures, two with 10 support points and one with 11 support points. Each measure has equally distributed mass. Applying Algorithm 3 results in a feasible solution supported on 20 weighted means of varying mass, displayed in Figure 3 (right), along with the transport for three sample points.
4 Computations
The primary goal of these experiments is to demonstrate, for general position measures, the computational benefits of column generation algorithms over the full linear program. To this end, we construct measures from a real-world data set containing event locations given in longitude and latitude. Because the events occur without known structure, probability measures with these support points are in general position. The generated measures have varying numbers of support points with uniformly distributed mass, and the weight of each measure is inversely proportional to the number of support points. All computations have been run on a laptop (MacBook Pro, 2.4 GHz Intel Core i9, 32 GB of RAM, SSD). Data processing and the setup of the LPs were implemented in C++ and the LPs were solved using Gurobi 8.0. The source code is available at https://github.com/StephanPatterson/Barycenter-Formulations. For a meaningful comparison, we set Gurobi to run without presolvers and using the same algorithm (primal simplex method) in all experiments.
We want comparisons to exact computations, which as previously discussed, are hard [2, 7]. Even when the measures contain a small number of support points, LP (1.1) may contain millions of variables. Therefore, the following analysis focuses primarily on measures with small support sets (2-12 support points per measure); the improved scaling on our column generation algorithms would allow for measures of more moderate size, but not orders of magnitude larger. Throughout this section, we use the number of variables in LP (1.1) as a reference label for a particular instance.
The second goal of these experiments is to examine the practical behavior of variations in implementation. To this end, we compare three variants of column generation applied directly to LP (1.1) and two versions using Dantzig-Wolfe decomposition. The two variants for the Dantzig-Wolfe reformulation differ only in the choice of which two measures are moved to the pricing problem. In the “DW-L” variant, the two measures with the largest number of support points are moved to the pricing problem, while the “DW-A” variant makes an arbitrary choice of two measures.
The three variants for column generation directly on LP (1.1) vary on the number of columns introduced per iteration; “1-col” refers to the standard column generation strategy of introducing the variable with the greatest reduced cost, thus introducing one variable per iteration. We have also included the strategy introducing all variables with improved cost, labeled “all-col”, and a heuristic compromise between the strategies, introducing the best columns per iteration, labeled “-col”. We also considered, but ultimately discarded, a variant introducing the first columns each iteration; while this has the benefit of avoiding the processing of the full exponential-sized cost vector each iteration, several times more variables were introduced, leading to slower solution times and larger problem sizes than the best variant in all but one of our experiments. We believe this is due to the highly structured nature of .
The total running times for these experiments are shown in Table 1. All of the column generation algorithms are able to find solutions to experiments for which LP (1.1) is too large for the laptop (*). The classic column generation algorithm, 1-col, typically does not show an improvement in solving speed over a direct computation using LP (1.1) for these experiments, which was our motivation for considering the other, heuristic strategies for introducing columns. The Dantzig-Wolfe reformulation algorithms, DW-L and DW-A, usually show minor improvements over LP (1.1), though one variant does not reliably outperform the other. The fastest run times consistently come from the approach that introduces the best columns per iteration. The all-columns approach also typically outperforms a direct solve; the size of the problem is approximately half of the full linear program and the algorithm completes in a handful of iterations (at most 4).
All column generation algorithms dramatically reduce the number of variables introduced, which results in significantly lower memory requirements. The maximum memory used during the execution of each experiment is shown in Table 2; the Dantzig-Wolfe reformulation is the most memory efficient algorithm due to its compact restricted master problem LP (2.2) and condensed pricing problem LP (2.3).
Since the fastest running times came from column generation on LP (1.1), but the best memory efficiency from the Dantzig-Wolfe reformulation, we re-ran the experiments with columns deletion – the removal of columns in the master problem when they leave the basis during the simplex method – to see if the memory requirements of column generation on LP (1.1) could be further reduced. The results of these experiments is given in Table 3; however, column deletion resulted in a very minor reduction in maximum memory usage while dramatically increasing running times. The memory reduction was not sufficient for the algorithms on LP (1.1) to be as efficient as a Dantzig-Wolfe implementation.
In the column generation algorithms on LP (1.1), the bottleneck for faster running times is the explicit choosing of new columns, which is dependent on the exponential-sized cost vector . The efficiency of each step of the Dantzig-Wolfe reformulation algorithm is somewhat less apparent; we examine the breakdown of running times each iteration in Table 4. The processing of to produce the updated, unique best-cost vector for the pricing problem is the majority of computational effort, while solving the pricing problem and subsequent master problem are efficient.
5 Concluding Remarks
The computation of an exact barycenter is costly in practice, and provably hard for data in general position [2, 7]. In this paper, we studied two column generation strategies - one on a suitable linear programming formulation for such data, one based on a Dantzig-Wolfe reformulation. While both of these provide significant improvements in scalability, especially though a memory-efficient implementation, computations remain hard. In this work, we used a couple of standard column generation techniques to improve the practical performance, such as the generation of multiple columns in each iteration, simple deletion strategies, or the generation of any (not necessarily best) improving columns. They typically have a positive impact, and we believe further refinements of the presented approach are interesting direction of future work, but they cannot overcome the underlying hardness of the problem.
As most barycenter algorithms require an explicit specification of a set of possible support points, and the size of this set is a bottleneck to computations, the direct and efficient generation of support points remains a key interest in the community. It translates to an efficient generation of columns for LP (1.1). It remains open whether it is possible to efficiently generate a (single) improving column; hardness of an exact barycenter computation implies that either the generation of a column itself or the number of columns that have to be generated cannot be polynomial.
The methods to do so will require a quite different approach: while we showed that it is efficient to evaluate the reduced cost for any given combination , the challenge lies in finding an improving one without an explicit evaluation of each combination in . We see potential for a competitive algorithm through an approximation of the data going into the reduced cost vector computation, which may lead to a heuristic algorithm, or through the setup and solution of an integer program for pricing, which may lead to further improvements for an exact computation.
Acknowledgments
We would like to thank Ethan Anderes for the implementation of a visualization basis for barycenters used in [3], which we modified to produce the figures of Sections 1 and 3. We would also like to thank Jon Lee for the many helpful discussions about transportation problems and total unimodularity.
The authors gratefully acknowledge support of this work by the National Science Foundation, Algorithmic Foundations, Division of Computing and Communication Foundations, under grant 2006183 Circuit Walks in Optimization; by the Airforce Office of Scientific Research under grant FA9550-21-1-0233 The Hirsch Conjecture for Totally-Unimodular Polyhedra; and by the Simons Foundation under Collaboration Grant 524210 Polyhedral Theory in Data Analytics before.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis , 43(2):904–924, 2011.
- 2[2] J. Altschuler and E. Boix-Adserà. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science (SIMODS) , in press, 2021.
- 3[3] E. Anderes, S. Borgwardt, and J. Miller. Discrete Wasserstein Barycenters: Optimal Transport for Discrete Data. Mathematical Methods of Operations Research , 84(2):389–409, 2016.
- 4[4] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman Projections for Regularized Transportation Problems. SIAM Journal on Scientific Computing , 37(2):A 1111–A 1138, 2015.
- 5[5] S. Borgwardt. An LP-based, Strongly Polynomial 2-Approximation Algorithm for Sparse Wasserstein Barycenters. Operational Research , in press, 2020.
- 6[6] S. Borgwardt and S. Patterson. Improved Linear Programs for Discrete Barycenters. INFORMS Journal on Optimization , 2:14–33, 2020.
- 7[7] S. Borgwardt and S. Patterson. On the Computational Complexity of Finding a Sparse Wasserstein Barycenter. Journal of Combinatorial Optimization , 41:736––761, 2021.
- 8[8] G. Carlier, A. Oberman, and E. Oudet. Numerical methods for matching for teams and Wasserstein barycenters. ESAIM: Mathematical Modeling and Numerical Analysis , 49(6):1621–1642, 2015.
