Convergence Rates of Smooth Message Passing with Rounding in Entropy-Regularized MAP Inference
Jonathan N. Lee, Aldo Pacchiano, Michael I. Jordan

TL;DR
This paper analyzes the convergence rates of smooth message passing algorithms with rounding for entropy-regularized MAP inference in graphical models, providing theoretical guarantees on iteration complexity for recovering the true MAP solution.
Contribution
It offers what the authors believe to be the first analysis with convergence rates that guarantee recovery of the true MAP assignment for smooth message passing methods, including conditions for exact recovery.
Findings
Convergence rates depend on the regularization parameter and on problem structure such as the maximum degree of the graph.
When the LP relaxation is tight and its solution is unique, rounding the algorithm's output recovers the true MAP solution.
Provides bounds on the number of iterations needed for $\epsilon$-optimality.
Jonathan N. Lee∗
Aldo Pacchiano∗
Michael I. Jordan
Stanford University
UC Berkeley
UC Berkeley
Abstract
Maximum a posteriori (MAP) inference is a fundamental computational paradigm for statistical inference. In the setting of graphical models, MAP inference entails solving a combinatorial optimization problem to find the most likely configuration of the discrete-valued model. Linear programming (LP) relaxations in the Sherali-Adams hierarchy are widely used to attempt to solve this problem, and smooth message passing algorithms have been proposed to solve regularized versions of these LPs with great success. This paper leverages recent work in entropy-regularized LPs to analyze convergence rates of a class of edge-based smooth message passing algorithms to $\epsilon$-optimality in the relaxation. With an appropriately chosen regularization constant, we present a theoretical guarantee on the number of iterations sufficient to recover the true integral MAP solution when the LP is tight and the solution is unique.
1 INTRODUCTION
Undirected graphical models are a central modeling formalism in machine learning, providing a compact and powerful way to model dependencies between variables. Here we focus on the important class of discrete-valued pairwise models. Inference in discrete-valued graphical models has applications in many areas including computer vision, statistical physics, information theory, and genome research (Antonucci et al., 2014; Wainwright and Jordan, 2008; Mezard and Montanari, 2009).
We focus on the problem of identifying a configuration of all variables that has highest probability, termed maximum a posteriori (MAP) inference. This problem has an extensive literature across multiple communities, where it is described by various names, including energy minimization (Kappes et al., 2013) and constraint satisfaction (Schiex et al., 1995). In the binary case, the MAP problem is sometimes described as quadratic pseudo-Boolean optimization (Hammer et al., 1984) and it is known to be NP-hard to compute exactly (Kolmogorov and Zabih, 2004; Cooper, 1990) or even to approximate (Dagum and Luby, 1993). Consequently, much work has attempted to identify settings where polynomial-time methods are feasible. We call such settings “tractable” and the methods “efficient.” A general framework for obtaining tractable methodology involves “relaxation”: the MAP problem is formulated as an integer linear program (ILP) and is then relaxed to a linear program (LP). If the vertex at which the LP achieves optimality is integral, then it provides an exact solution to the original problem. In this case we say that the LP is tight. If the LP is solved over the convex hull of all integral assignments, otherwise known as the marginal polytope, then it will always be tight. Inference over the marginal polytope is generally intractable because it requires exponentially many constraints to enforce global consistency.
A popular workaround is to relax the marginal polytope to the local polytope (Wainwright and Jordan, 2008). Instead of enforcing global consistency, the local polytope enforces consistency only over pairs of variables, thus yielding pseudo-marginals which are pairwise consistent but may not correspond to any true global distribution. The number of constraints needed to specify the local polytope is linear in the number of edges. More generally, Sherali and Adams (1990) introduced a series of successively tighter relaxations of the marginal polytope, or convex hull, while retaining control on the number of constraints. However, even with these relaxations, it has been observed that standard LP solvers do not scale well (Yanover et al., 2006), motivating the study of solvers that exploit the structure of the problem, such as message passing algorithms.
Of particular interest to this paper are smooth message passing algorithms, i.e., algorithms derived from regularized versions of the relaxed LP (Meshi et al., 2012; Savchynskyy et al., 2011, 2012; Hazan and Shashua, 2008; Ravikumar et al., 2010). These regularized LPs are conducive to efficient optimization in practice and have the special property that their fixed points are unique and optimal; however, this comes at the cost of solving an approximation of the true MAP problem and, without rounding, they do not recover integral solutions in general. Non-asymptotic convergence rates to the optimal regularized function value have been studied (Meshi et al., 2012), but to our knowledge guarantees on the number of iterations sufficient to recover the optimal integral assignment of the true MAP problem have not been considered.
In this work we provide a sharp analysis of the entropy-regularized MAP inference problem with Sherali-Adams relaxations. We first characterize the approximation error of the regularized LP in $\ell_1$ distance, based on new results on entropy-regularized LPs (Weed, 2018). We then analyze an edge-based smooth message passing algorithm, modified from the algorithms described in Werner (2007) and Ravikumar et al. (2010). We prove a rate of convergence of the iterates in $\ell_1$ distance. Combining the approximation error and convergence results, we present a guarantee on the number of iterations sufficient to recover the true integral MAP assignment using a standard vertex rounding scheme when the LP relaxation is tight and the solution is unique.
2 RELATED WORK
The idea of entropy regularization to aid optimization in inference problems is well studied. It is well known that solving a scaled and entropy-regularized linear program over the marginal polytope yields the scaled Gibbs free energy, intimately related to the log partition function, when the temperature parameter equals one (Wainwright and Jordan, 2008). As the temperature parameter is driven to zero, the calculation of the free energy reduces to the value of the MAP problem. However, this problem is intractable due to the difficulty of both computing the exact entropy and characterizing the marginal polytope (Deza and Laurent, 2009). Therefore, there has been much work in trying to turn this observation into tractable inference algorithms. The standard Bethe approximation instead minimizes an approximation of the true entropy (Bethe, 1935). It was shown by Yedidia et al. (2003) that fixed points of loopy belief propagation correspond to its stationary points, but the optimization problem resulting from this approximation is still non-convex and convergence is not always guaranteed.
To alleviate convergence issues, much work has considered convexifying the free energy problem leading to classes of convergent convex belief propagation often derived directly from convex regularizers (Meshi et al., 2009; Heskes, 2006; Hazan and Shashua, 2008; Johnson and Willsky, 2008; Savchynskyy et al., 2012). For instance, Weiss et al. (2007) proposed a general convexified belief propagation and explored some sufficient conditions that enable heuristically recovering the MAP solution of the LP via a convex sum-product variant. However, the approximation error was still unclear and non-asymptotic convergence rates were not considered. A number of algorithms have also been proposed to directly optimize the unregularized LP relaxation often with only asymptotic convergence guarantees such as block-coordinate methods (Werner, 2007; Globerson and Jaakkola, 2008; Kovalevsky and Koval, 1975; Tourani et al., 2018; Kappes et al., 2013) and tree-reweighted message passing (Wainwright et al., 2005; Kolmogorov, 2006). The relationship between the regularized and unregularized problems can equivalently be viewed as applying a soft-max to the dual objective typically considered in the latter to recover that of the former (Nesterov, 2005; Sontag et al., 2011). Many other convergent methods exist such as augmented Lagrangian (Martins et al., 2011; Meshi and Globerson, 2011), bundle (Kappes et al., 2012), and steepest descent (Schwing et al., 2012, 2014) approaches, but again they are difficult to compare without rates.
Most closely related to our work is recent work in convergence analysis of certain smoothed message passing algorithms that aim to solve the regularized LP objective. Savchynskyy et al. (2011) proposed an accelerated gradient method that achieves convergence to the optimal regularized dual objective value. Convergence of the primal iterates was only shown asymptotically. Meshi et al. (2012) considered a general dual coordinate minimization algorithm based on the entropy-regularized MAP objective. They proved upper bounds on the rate of convergence to the optimal regularized dual objective value; however, closeness to the true MAP assignment was not formally characterized. Furthermore, convergence in the dual objective value again does not make it easy to determine when the true MAP assignment can be recovered. Meshi et al. (2015) later studied the benefits of adding a quadratic term to the LP objective instead and proved similar guarantees. Ravikumar et al. (2010) also considered entropic and quadratic regularization, using a proximal minimization scheme with inner and outer loops. They additionally provided rounding guarantees to recover true primal solutions. However, as noted by the authors, the inexact calculation of the inner loop prevents a convergence rate analysis once combined with the outer loop. Additionally, rates on the inner loop convergence were not addressed.
The approach of this paper can be understood as bridging the gap between Meshi et al. (2012) and Ravikumar et al. (2010). Our first contribution is a characterization of the approximation error of the entropy-regularized MAP inference problem. We then study an edge-based message passing algorithm that solves the regularized LP, which is essentially a smoothed max-sum diffusion (Werner, 2007) or the inner loop of the proximal steps of Ravikumar et al. (2010). For our main contribution, we provide non-asymptotic guarantees of convergence to the integral MAP assignment for this message passing algorithm when the LP is tight and the solution is unique. To our knowledge, this is the first analysis with rates guaranteeing recovery of the true MAP assignment for smooth methods.
3 BACKGROUND
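We denote the $d$-dimensional probability simplex as $\Sigma_d := \{p \in \mathbb{R}^d_{\geq 0} : \sum_i p_i = 1\}$. The set of joint distributions which give rise to $p, q \in \Sigma_d$ is defined as $\mathcal{U}_d(p, q) := \{P \in \mathbb{R}^{d \times d}_{\geq 0} : P\mathbf{1} = p,\ P^\top \mathbf{1} = q\}$. For any two vectors or matrices $p$ and $q$ having the same number of elements, we use $\langle p, q \rangle$ to denote the dot product, i.e., elementwise multiplication followed by a sum over all elements. We use $\|p\|_1$ to denote the sum of absolute values of the elements of $p$. The Bregman divergence between $p$ and $q$ with respect to a strictly convex function $\Phi$ is $D_\Phi(p, q) := \Phi(p) - \Phi(q) - \langle \nabla\Phi(q), p - q \rangle$. We will consider the Bregman divergence with respect to the negative entropy $\Phi(p) := \sum_i p_i (\log p_i - 1)$, where $p$ need not be a distribution. When $p$ and $q$ are distributions, this corresponds to the Kullback-Leibler (KL) divergence. The Bregman projection with respect to $\Phi$ of $q$ onto the set $\mathcal{X}$ is defined as $P_{\mathcal{X}}(q) := \arg\min_{p \in \mathcal{X}} D_\Phi(p, q)$. The Hellinger distance between $p, q \in \Sigma_d$ is defined as $h(p, q) := \frac{1}{\sqrt{2}} \|\sqrt{p} - \sqrt{q}\|_2$, where $\|\cdot\|_2$ is the $\ell_2$-norm. We denote the square of the Hellinger distance by $h^2(p, q)$. We will often deal with marginal vectors, which are ordered collections of joint and marginal distributions in the form of matrices and vectors, respectively.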
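The definitions in this paragraph can be made concrete with a small numerical sketch. It assumes the negative entropy carries a linear offset, sum_i x_i (log x_i - 1), and that the Hellinger distance is normalized by 1/sqrt(2); both conventions, and all function names, are assumptions chosen for illustration. The check at the end illustrates that the Bregman divergence reduces to the KL divergence when both arguments are distributions.

```python
import numpy as np

def neg_entropy(x):
    """Negative entropy with a linear offset: sum_i x_i * (log x_i - 1)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x * (np.log(x) - 1.0)))

def bregman_divergence(x, y):
    """Bregman divergence induced by neg_entropy (all entries must be > 0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grad_y = np.log(y)  # gradient of neg_entropy evaluated at y
    return float(neg_entropy(x) - neg_entropy(y) - np.sum(grad_y * (x - y)))

def hellinger(p, q):
    """Hellinger distance with the assumed 1/sqrt(2) normalization."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
kl_pq = float(np.sum(p * np.log(p / q)))   # KL divergence, computed directly
breg_pq = bregman_divergence(p, q)         # should agree with kl_pq
```

For unnormalized positive vectors the two quantities differ by the term sum(y) - sum(x), which is why the text notes that the arguments need not be distributions.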
3.1 Pairwise Models
For a set of vertices, , and edges , a pairwise graphical model, , is a Markov random field that represents the joint distribution of variables , taking on values from the set of states . We assume that each vertex has at least one edge. For pairwise models, the joint distribution can be written as a function of doubletons and singletons: We wish to find maximum a posteriori (MAP) estimates of this model. That is, we consider the integer program:
[TABLE]
The maximization in (Int) can be written as a linear program by defining a marginal vector over variable vertices and variable edges . The vector represents the marginal distribution probabilities on vertex while the matrix represents the joint distribution probabilities shared between vertices and . We follow the notation of Globerson and Jaakkola (2008) and denote indexing into the vector and matrix variables with parentheses, e.g. for . The set of marginal vectors that are valid probability distributions is known as the marginal polytope and is defined as
[TABLE]
We can think of as the set of mean parameters of the model for which there exists a globally consistent distribution . We abuse notation slightly and dually view as a potential “vector.” The edge matrix is indexed as , indicating the element at the th row and th column. The vertex vector is indexed as , indicating the th element. The MAP problem in (Int) can be shown to be equivalent to the following LP (Wainwright and Jordan, 2008):
[TABLE]
where .
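As a minimal sketch of the equivalence between the integer program and the linear objective over marginal vectors, the snippet below enumerates all configurations of a tiny chain model and checks that the score of a configuration equals the inner product of the potentials with the corresponding integral (indicator) marginal vector. The names `theta_v`, `theta_e`, and the specific potentials are hypothetical and chosen only for illustration; brute-force enumeration is exponential in the number of variables and is not a method proposed in the paper.

```python
import itertools
import numpy as np

def score(x, theta_v, theta_e):
    """Sum of vertex and edge potentials for a configuration x."""
    s = sum(theta_v[i][x[i]] for i in theta_v)
    s += sum(theta_e[(i, j)][x[i], x[j]] for (i, j) in theta_e)
    return float(s)

def brute_force_map(theta_v, theta_e, d):
    """Enumerate all d**n configurations and return a highest-scoring one."""
    n = len(theta_v)
    return max(itertools.product(range(d), repeat=n),
               key=lambda x: score(x, theta_v, theta_e))

def indicator_marginals(x, theta_e, d):
    """Integral marginal vector (a vertex of the marginal polytope) for x."""
    eye = np.eye(d)
    mu_v = {i: eye[xi] for i, xi in enumerate(x)}
    mu_e = {(i, j): np.outer(eye[x[i]], eye[x[j]]) for (i, j) in theta_e}
    return mu_v, mu_e

def linear_objective(mu_v, mu_e, theta_v, theta_e):
    """Inner product of the potentials with a marginal vector."""
    val = sum(float(np.dot(theta_v[i], mu_v[i])) for i in theta_v)
    val += sum(float(np.sum(theta_e[e] * mu_e[e])) for e in theta_e)
    return val

# Hypothetical 3-vertex chain with binary states.
theta_v = {0: np.array([0.0, -1.0]), 1: np.array([-0.5, 0.0]), 2: np.array([-1.0, 0.0])}
disagree_penalty = -0.8 * (1.0 - np.eye(2))
theta_e = {(0, 1): disagree_penalty, (1, 2): disagree_penalty}

x_map = brute_force_map(theta_v, theta_e, d=2)
mu_v, mu_e = indicator_marginals(x_map, theta_e, d=2)
```

Because every configuration maps to one integral marginal vector with the same objective value, maximizing the linear objective over the convex hull of these vectors attains the same optimum as the enumeration.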
3.2 Sherali-Adams Relaxations
The number of constraints in is unfortunately superpolynomial (Sontag, 2010). This motivates considering relaxations of the marginal polytope to outer polytopes that involve fewer constraints. For example, the local outer polytope is obtained by enforcing consistency only on edges and vertices:
[TABLE]
Relaxations of higher orders have also been studied, in particular by Sherali and Adams (1990) who introduced a hierarchy of polytopes by enforcing consistency on joint distributions of increasing order up to : . The corresponding Sherali-Adams LP relaxation of order is then
[TABLE]
where . Because is an outer polytope of , we no longer have that the solution to (LP) recovers the true MAP solution of (Int) in general. However if the solution to (LP) is integral, then recovers the optimal solution of the true MAP problem. In this case, we say is tight.
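To illustrate why a relaxation can fail to be tight, the following sketch checks membership in the local polytope using only pairwise consistency, following the standard definition (non-negativity, normalized vertex marginals, and edge marginals whose row and column sums match the vertex marginals). The function name and the example are assumptions for illustration. The triangle example is pairwise consistent, yet no joint distribution over three binary variables can make all three pairs disagree with probability one, so it lies outside the marginal polytope.

```python
import numpy as np

def in_local_polytope(mu_v, mu_e, tol=1e-9):
    """Check non-negativity, vertex normalization, and edge-vertex consistency."""
    for p in mu_v.values():
        if np.any(p < -tol) or abs(p.sum() - 1.0) > tol:
            return False
    for (i, j), P in mu_e.items():
        if np.any(P < -tol):
            return False
        # Row sums must match the marginal of i, column sums the marginal of j.
        if not np.allclose(P.sum(axis=1), mu_v[i], rtol=0.0, atol=tol):
            return False
        if not np.allclose(P.sum(axis=0), mu_v[j], rtol=0.0, atol=tol):
            return False
    return True

# Fractional pseudo-marginals on a triangle: every pair "always disagrees".
half = np.array([0.5, 0.5])
always_disagree = np.array([[0.0, 0.5], [0.5, 0.0]])
mu_v = {0: half, 1: half, 2: half}
mu_e = {(0, 1): always_disagree, (1, 2): always_disagree, (0, 2): always_disagree}
fractional_ok = in_local_polytope(mu_v, mu_e)

# Breaking one vertex marginal violates consistency on its edges.
mu_v_bad = dict(mu_v)
mu_v_bad[0] = np.array([0.9, 0.1])
broken_ok = in_local_polytope(mu_v_bad, mu_e)
```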
4 ENTROPY-REGULARIZED MAP
In this section, we present our first main technical contribution, characterizing the approximation error in the entropy-regularized MAP problem for Sherali-Adams relaxations. In contrast to solving the exact (LP), we aim to solve the entropy-regularized LP:
[TABLE]
where and . The hyperparameter adjusts the level of regularization. Denote by the solution of (Reg) where we omit the reference to to alleviate notation. In addition to their extensive history in inference problems, entropy-regularized LPs have arisen in a number of other fields to aid optimization when standard LP solvers are insufficient. For example, recent work in optimal transport has relied on entropy regularization to derive alternating projection algorithms (Cuturi, 2013; Benamou et al., 2015) which admit almost linear time convergence guarantees in the size of the cost matrix (Altschuler et al., 2017). Some of our theoretical results draw inspiration from these works.
4.1 Approximation Error
When the relaxation is tight and the solution is unique, we show that approximate solutions from solving (Reg) are not necessarily detrimental because we can apply standard vertex rounding schemes to yield consistent integral solutions. It was shown by Cominetti and San Martín (1994), and later refined by Weed (2018), that the approximation error of general entropy-regularized linear programs converges to zero at an exponential rate in the regularization parameter. Furthermore, it is possible to determine how large the regularization parameter should be chosen in order for rounding to exactly recover the optimal solution to (Int). The result is summarized in the following extension of Theorem 1 of Weed (2018). (Footnote: The entropy is defined without the linear offset in Weed (2018).)
Theorem 1.
Let , , be the set of vertices of , and the set of optimal vertices with respect to . Let be the smallest gap in objective value between an optimal vertex and any suboptimal vertex of . Suppose is tight and . If , the following rounded solution is a MAP assignment:
[TABLE]
Proof.
Define , where denotes an all-ones vector with the same dimensions as . If then , the set of optimal vertices of with respect to , satisfies and . If ; and , then . Let . If , and then . And therefore, by Corollary 9 of Weed (2018) . Since is assumed to be tight and contains a single integral vertex , the last equation implies . ∎
Consequently, since and (footnote: for we can get tighter bounds corresponding to the number of edges in the graph), we have:
Corollary 1.
If is tight, , and , the rounded solution is a MAP assignment.
In general the dependence of on suggested by Theorem 1 is not improvable (Weed, 2018). Nevertheless, when and , since all vertices in have entries equal to either or —see Padberg (1989) or Theorem 3 of Weller et al. (2016)—if the entries of are all integral, we have , thus yielding a more concrete guarantee. The disadvantage of choosing exorbitantly large is that efficient computation of solutions often becomes more difficult in practice (Weed, 2018; Benamou et al., 2015; Altschuler et al., 2017). Thus, in practice, there exists a trade-off between computation time and approximation error that is controlled by . We will provide a precise theoretical characterization of the trade-off in Section 6. In our guarantees, multiplying by a constant (and therefore multiplying by ) is equivalent to multiplying by the same value.
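The exponential decay of the approximation error can be seen in the simplest possible setting, an entropy-regularized LP over a single probability simplex. This is a toy stand-in, not the Sherali-Adams polytope treated by Theorem 1: the regularized minimizer is assumed to solve min <c, x> + (1/eta) * sum_i x_i (log x_i - 1) over the simplex, which has the closed form x proportional to exp(-eta * c). The names `eta`, `gap`, and the cost vector are assumptions for illustration.

```python
import numpy as np

def regularized_simplex_solution(c, eta):
    """Closed-form minimizer of the entropy-regularized LP over the simplex."""
    c = np.asarray(c, dtype=float)
    w = np.exp(-eta * (c - c.min()))  # shift by the minimum for numerical stability
    return w / w.sum()

c = np.array([0.3, 0.1, 0.7])
gap = 0.2  # difference between the best and second-best vertex cost
vertex = np.eye(len(c))[np.argmin(c)]  # optimal vertex of the unregularized LP

etas = [1.0, 10.0, 50.0, 100.0]
solutions = [regularized_simplex_solution(c, eta) for eta in etas]
errors = [float(np.abs(x - vertex).sum()) for x in solutions]          # l1 error
bounds = [2 * (len(c) - 1) * float(np.exp(-eta * gap)) for eta in etas]
rounded = [int(np.argmax(x)) for x in solutions]
```

In this toy case the l1 error is bounded by 2(d - 1) exp(-eta * gap), so it shrinks exponentially as the regularization parameter grows, while the exponent inside `np.exp` also grows, which hints at the numerical difficulty of very large values. On a simplex, rounding by `argmax` succeeds for every positive `eta`; the content of Theorem 1 is that a comparable statement holds over the relaxed polytope once the parameter is large enough.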
4.2 Equivalent Bregman Projection
The objective (Reg) can be interpreted as a Bregman projection. This interpretation has been explored by Ravikumar et al. (2010) as a basis for proximal updates and also Benamou et al. (2015) for the optimal transport problem. The objective is equivalent to
[TABLE]
where . The derivation, based on a mirror descent step, can be found in the appendix. The projection, however, cannot be computed in closed form in general due to the complex geometry of .
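The equivalence between the regularized objective and a Bregman projection can be checked numerically. The sketch below assumes the regularized objective has the form <C, mu> + (1/eta) * sum mu (log mu - 1) and that the point being projected is exp(-eta * C); under those assumptions the Bregman divergence equals eta times the regularized objective plus a constant that does not depend on mu, so both problems share the same minimizer over any feasible set. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.uniform(-1.0, 1.0, size=6)   # hypothetical cost vector
eta = 5.0                            # hypothetical regularization parameter
y = np.exp(-eta * C)                 # unprojected point

def regularized_objective(mu):
    """Linear cost plus scaled negative entropy (with linear offset)."""
    return float(C @ mu + (1.0 / eta) * np.sum(mu * (np.log(mu) - 1.0)))

def bregman_to_y(mu):
    """Bregman divergence (negative entropy) from mu to y."""
    return float(np.sum(mu * np.log(mu / y) - mu + y))

# The difference should be the same constant, sum(y), for every positive mu.
test_points = [rng.uniform(0.05, 1.0, size=6) for _ in range(20)]
offsets = [bregman_to_y(mu) - eta * regularized_objective(mu) for mu in test_points]
```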
Ravikumar et al. (2010) proposed using the Bregman method (Bregman, 1966), which has been applied in many fields to solve difficult constrained problems (Benamou et al., 2015; Goldstein and Osher, 2009; Osher et al., 2005, 2010), to compute for the inner loop calculation of their proximal algorithm. While the outer loop proximal algorithm can be shown to converge at least linearly, the inner loop rate was not analyzed and the constants (possibly dependent on dimension) were not made clear. Furthermore, the Bregman method is in general inexact, which makes the approximation and the effect on the outer loop unclear (Liu and Ihler, 2013).
5 SMOOTH MESSAGE PASSING
We are interested in analyzing a class of algorithms closely inspired by max-sum diffusion (MSD) as presented by Werner (2007) and the proximal updates of Ravikumar et al. (2010) to solve (Proj) over the local polytope. We describe it in detail here, with a few minor modifications and variations to facilitate theoretical analysis. In the local polytope, the constraints occur only over edges between vertices. (Footnote: Written explicitly, the constraints actually occur between any pair of vertices, but these variables play no role in the objective or constraints.) Given an edge , we must enforce the constraints prescribed by (7), which is the intersection of the following sets:
[TABLE]
The normalization of the joint distribution in (b) and (d) is actually a redundant constraint, but it facilitates analysis as we demonstrate in Section 6. For each of these affine constraints, we can compute the Bregman projections in closed form with simple multiplicative updates.
Proposition 1.
For a given edge , the closed-form solutions of the Bregman projections for each of the above individual constraints are given below.
(a) Left consistency: If , then for all , and .
(b) Left normalization: If , then for all , and .
(c) Right consistency: If , then for all , and .
(d) Right normalization: If , then for all , and .
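The multiplicative form of these projections can be sketched as follows. The formulas here are derived from the Lagrangian conditions of the negative-entropy Bregman projection under two assumptions: that the consistency set requires the row (or column) sums of the edge matrix to equal the adjacent vertex vector, and that the normalization set requires both the vertex vector and the edge matrix to sum to one. Under those assumptions the consistency projection moves both quantities to their geometric mean. Function names are hypothetical, and the exact statements in Proposition 1 may differ in detail.

```python
import numpy as np

def project_left_consistency(mu_i, mu_ij):
    """Make row sums of mu_ij agree with mu_i via a geometric-mean update."""
    row_sums = mu_ij.sum(axis=1)
    new_ij = mu_ij * np.sqrt(mu_i / row_sums)[:, None]
    new_i = np.sqrt(mu_i * row_sums)
    return new_i, new_ij

def project_right_consistency(mu_j, mu_ij):
    """Same update applied to column sums and the right vertex."""
    new_j, new_ij_t = project_left_consistency(mu_j, mu_ij.T)
    return new_j, new_ij_t.T

def project_normalization(mu_v, mu_ij):
    """Rescale the vertex vector and the edge matrix so each sums to one."""
    return mu_v / mu_v.sum(), mu_ij / mu_ij.sum()

rng = np.random.default_rng(0)
mu_i = rng.uniform(0.1, 1.0, size=3)
mu_j = rng.uniform(0.1, 1.0, size=3)
mu_ij = rng.uniform(0.1, 1.0, size=(3, 3))

left_i, left_ij = project_left_consistency(mu_i, mu_ij)
right_j, right_ij = project_right_consistency(mu_j, mu_ij)
norm_i, norm_ij = project_normalization(mu_i, mu_ij)
```

Each update touches only one vertex vector and one edge matrix, which is what makes the scheme a local message passing procedure.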
These update rules are similar to a number of algorithms throughout the literature on LP relaxations. Notably, they can be viewed as a smoothed version of MSD (Werner, 2007; Kovalevsky and Koval, 1975) in that the updates enforce agreement between variables on the edges and vertices. Nearly identical smoothed updates were also initially proposed by Ravikumar et al. (2010). As in MSD, it is common for message passing schemes derived from LP relaxations to operate on dual objective instead. We presented the primal view here as the Bregman projections lend semantic meaning to the updates and ultimately the stopping conditions in the algorithms. An equivalent dual view is presented in Appendix C.1.
Based on these update rules, we formally outline the algorithms we wish to analyze, which we call edge-based message passing (EMP) for convenience. We consider two variants: EMP-cyclic (Algorithm 1), which cyclically applies the updates to each edge in each iteration, and EMP-greedy (Algorithm 2), which applies a single projection update to only the edge with the greatest constraint violation in each iteration. We emphasize that these algorithms are not fundamentally new; our main contribution is the analysis in the next section. EMP-cyclic is the Bregman method, almost exactly the inner loop proposed by Ravikumar et al. (2010). In both variants, is defined as the normalized value of . The GreedyEdge operation in EMP-greedy is defined as
[TABLE]
These procedures are repeated until the stopping criterion is met, namely that the current iterate is $\epsilon$-close to satisfying the constraint that the joint distributions sum to the marginals for all edges. Both algorithms also conclude with a rounding operation. Any fixed point of EMP must correspond to an optimal solution (see details in appendix). Computationally, EMP-greedy requires a search over the edges to identify the greatest constraint violation, which can be efficiently implemented using a max-heap (Nutini et al., 2015).
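A compact sketch of the cyclic variant on a tiny model may help fix ideas. It is written under several assumptions that are not confirmed by the text: the iterate is initialized at exp(-eta * cost), each edge visit applies consistency followed by normalization on the left and then on the right, the stopping test sums the l1 row and column violations per edge, and rounding takes the `argmax` of each vertex vector. The function name, the cost dictionaries, and the example chain are all hypothetical.

```python
import numpy as np

def emp_cyclic(cost_v, cost_e, eta, eps=1e-4, max_sweeps=2000):
    """Cyclic Bregman projections over edges, followed by vertex rounding."""
    # Start from the unprojected point exp(-eta * cost); all entries are positive.
    mu_v = {i: np.exp(-eta * np.asarray(c, dtype=float)) for i, c in cost_v.items()}
    mu_e = {e: np.exp(-eta * np.asarray(c, dtype=float)) for e, c in cost_e.items()}
    violation = np.inf
    for _ in range(max_sweeps):
        for (i, j) in mu_e:
            # axis=1 gives row sums (left vertex i); axis=0 gives column sums (right vertex j).
            for axis, v in ((1, i), (0, j)):
                m = mu_e[i, j].sum(axis=axis)
                scale = np.sqrt(mu_v[v] / m)
                # Consistency: move the vertex vector and the edge sums to their geometric mean.
                mu_e[i, j] = mu_e[i, j] * (scale[:, None] if axis == 1 else scale[None, :])
                mu_v[v] = np.sqrt(mu_v[v] * m)
                # Normalization: rescale both so that each sums to one.
                mu_v[v] = mu_v[v] / mu_v[v].sum()
                mu_e[i, j] = mu_e[i, j] / mu_e[i, j].sum()
        # Largest l1 mismatch between edge sums and vertex vectors over all edges.
        violation = max(
            np.abs(P.sum(axis=1) - mu_v[i]).sum() + np.abs(P.sum(axis=0) - mu_v[j]).sum()
            for (i, j), P in mu_e.items()
        )
        if violation <= eps:
            break
    rounded = tuple(int(np.argmax(mu_v[i])) for i in sorted(mu_v))
    return rounded, float(violation)

# Hypothetical 3-vertex chain with binary states; costs are to be minimized.
cost_v = {0: [0.0, 1.0], 1: [0.5, 0.0], 2: [1.0, 0.0]}
disagree_cost = 0.8 * (1.0 - np.eye(2))
cost_e = {(0, 1): disagree_cost, (1, 2): disagree_cost}

assignment, violation = emp_cyclic(cost_v, cost_e, eta=20.0)
```

For this chain, enumerating all eight configurations shows that the minimum-cost assignment is (0, 1, 1) with cost 0.8, and the relaxation is tight because the graph is a tree, so the rounded output can be compared against that assignment. A greedy variant would replace the inner loop over all edges with a single update on the edge that currently has the largest violation.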
6 THEORETICAL ANALYSIS
We now present our main contribution, a theoretical analysis of EMP-cyclic and EMP-greedy. This result combines two aspects. First, we present a convergence guarantee on the number of iterations sufficient to solve (Proj), satisfying the constraints with $\epsilon$ error in $\ell_1$ distance. We note that, in finite iterations, the pseudo-marginals of EMP are not primal feasible in general due to this $\epsilon$-error. We then combine this result with our guarantee on the approximation error in Theorem 1 to show a bound on the number of iterations sufficient to recover the true integral MAP assignment by rounding, assuming the LP is tight and the solution is unique. This holds with sufficient iterations and a sufficiently large regularization constant even though the pseudo-marginals may not be primal feasible. We emphasize that these theorems are a departure from usual convergence rates in the literature (Meshi et al., 2012, 2015). Prior work has guaranteed convergence in objective value to the optimum of the regularized objective (Proj), making it unclear whether the optimal MAP assignment can be recovered, e.g., by rounding. We address this ambiguity in our results.
We begin with the upper bound on the number of iterations needed to obtain $\epsilon$-close solutions, which follows from two facts that we establish. The first is that the updates in Proposition 1 monotonically improve a Lyapunov (potential) function by an amount proportional to the constraint violation as measured via the Hellinger distance. The second is that the difference between the initial and optimal values of the Lyapunov function is bounded.
Let denote the maximum degree of graph and define:
[TABLE]
Theorem 2.
For any , EMP is guaranteed to satisfy and for all in iterations for EMP-cyclic and iterations for EMP-greedy.
Here, . In this theorem, we give our guarantee in terms of $\ell_1$ distance rather than function value convergence. As we will see, this is significant, allowing us to relate this result to Theorem 1 in order to derive the main result. The proof is similar in style to Altschuler et al. (2017). We leave the full proof for EMP-cyclic to the appendix because it requires handling tedious edge cases, but we state several intermediate results and sketch the proof for EMP-greedy for intuition, as it suggests how similar message passing algorithms might be analyzed. We first introduce a Lyapunov function written in terms of dual variables , indexed by the edges and vertices to which they belong in . We denote the iteration-indexed dual variables as . For a given edge , constraints enforcing row and column consistency correspond to , respectively. Normalizing constraints correspond to . The Lyapunov function, , is shown in Figure 3.
We note that maximizing over satisfies all constraints and yields the solution to (Proj) by first-order optimality conditions. We now present a result that establishes the monotone improvement in due to the updates in Proposition 1.
Lemma 1.
For a given edge , let and denote the updated primal and dual variables after a projection from one of (a)–(d) in Proposition 1. We have the following improvements on . If is equal to:
(a) , then
(b) , then
(c) , then
(d) , then .
This result shows that improves monotonically after each of the four updates in Proposition 1. Furthermore, at every update, improves by twice the squared Hellinger distance of the constraint violation between the joint and the marginals.
Lemma 2.
Let , denote the maximizers of . The difference in function value between the optimal value of and the first iteration value is upper bounded
[TABLE]
Turning to Theorem 2, the result is obtained by observing that as long as the constraints are violated by an amount (i.e., the algorithm has not terminated), then the Lyapunov function must improve by a known positive amount at each iteration. We provide a proof sketch for EMP-greedy.
Proof Sketch of Theorem 2 for EMP-greedy.
We now show how to combine the results of Lemma 1 and Lemma 2 to obtain Theorem 2. Let be the first iteration such that the termination condition in Algorithm 2 holds with respect to some . Then, for any satisfying , we have that selects such that either or .
Without loss of generality, suppose . Therefore, we have
[TABLE]
where again denotes the squared Hellinger distance and the last inequality is the Hellinger inequality. Since and are normalized for each iteration, this inequality is valid. Thus, improves by when occurs and by a non-negative amount when occurs by Lemma 1. Therefore, we can guarantee improvement of at least each iteration. Since the optimality gap is at most by Lemma 2, this means the algorithm must terminate in iterations.∎
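The step that converts a Hellinger improvement into an l1 constraint violation can be checked numerically. Assuming the squared Hellinger distance is normalized as h^2(p, q) = (1/2) * sum (sqrt(p) - sqrt(q))^2, the standard inequality for probability vectors reads ||p - q||_1^2 <= 8 * h^2(p, q); the constant used in the proof depends on the normalization adopted there, which is not recoverable from this text. The sketch below samples random pairs of distributions and records the largest observed ratio.

```python
import numpy as np

def squared_hellinger(p, q):
    """Squared Hellinger distance with the assumed 1/2 normalization."""
    return 0.5 * float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(1)
worst_ratio = 0.0
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    l1_squared = float(np.abs(p - q).sum()) ** 2
    # The inequality predicts this ratio never exceeds one.
    worst_ratio = max(worst_ratio, l1_squared / (8.0 * squared_hellinger(p, q)))
```

This is the mechanism behind the iteration bound: while some edge violates its constraint by more than the tolerance in l1 distance, the potential function must increase by an amount of order the tolerance squared, and the total increase is capped by the bounded optimality gap.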
We now turn to our main theoretical result. We combine our approximation and iteration convergence guarantees to fully characterize the convergence of EMP for to the optimal MAP assignment when the relaxation is tight and the solution is unique.
Theorem 3.
Let , and . If is tight and , the EMP algorithm returns a MAP assignment after iterations for EMP-cyclic and after iterations for EMP-greedy.
When is integral, , yielding a bound in terms of known parameters. The main technical challenge in producing this result is to relate the termination condition of EMP to the distance between and (the MAP assignment), as this may lie outside the polytope . It does not suffice to provide convergence guarantees in function value as the goal of MAP inference is to produce integral assignments. The proof proceeds in two steps. First, we show that is the entropy-regularized solution to objective over a “slack” polytope , where the slack vector corresponds to the constraint violations of . We use this characterization to “project” onto a nearby feasible point . Second, we use the properties of the primal objective to bound and . The proof is in the appendix.
7 NUMERICAL EXPERIMENTS
We illustrate our theoretical results in a practical application of the EMP algorithms. Ravikumar et al. (2010) already gave empirical evidence that the basic EMP-cyclic is competitive with standard solvers. Therefore, the objective of these experiments is to understand how graph and algorithm properties affect approximation (Theorem 1) and convergence (Theorem 2). We consider the family of multi-label Potts models (Wainwright et al., 2005) with labels on . For each trial, the cost vector is , and
[TABLE]
where the parameters are random and . The graphs considered are structured as grids (Globerson and Jaakkola, 2008; Ravikumar et al., 2010; Erdogdu et al., 2017) and as Erdős-Rényi random graphs with edge probability . To evaluate recovery of the optimal MAP assignment, we first solved each graph with the ECOS LP solver (Domahidi et al., 2013) and selected graphs that were tight. Solving the LP to find the ground-truth was the main computational bottleneck. Further details can be found in Appendix E.
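For readers who want to set up a comparable test bed, a grid-structured Potts-style model can be generated along the following lines. The distributions used here (uniform on [-1, 1] for both vertex costs and edge couplings) and the sign convention of the coupling are placeholders: the parameter distributions and model sizes used in the experiments are not recoverable from this text, so every numeric choice below is an assumption.

```python
import numpy as np

def potts_grid(n, d, rng):
    """Build an n-by-n grid with random vertex costs and Potts-style edge costs."""
    def vertex(r, c):
        return r * n + c

    edges = []
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                edges.append((vertex(r, c), vertex(r, c + 1)))  # horizontal edge
            if r + 1 < n:
                edges.append((vertex(r, c), vertex(r + 1, c)))  # vertical edge

    # Placeholder distributions; the actual experimental settings are unknown here.
    cost_v = {i: rng.uniform(-1.0, 1.0, size=d) for i in range(n * n)}
    # Potts structure: the edge cost depends only on whether the two labels agree.
    cost_e = {e: rng.uniform(-1.0, 1.0) * np.eye(d) for e in edges}
    return cost_v, cost_e

rng = np.random.default_rng(0)
cost_v, cost_e = potts_grid(n=3, d=3, rng=rng)
```

Tightness of the relaxation for a sampled instance would still need to be verified separately, for example with an LP solver as described in the text.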
Approximation
In Figure 4, we evaluate the effect of regularization and graph size on the quality of the nearly converged solution from EMP for over iterations on grids. The box-plots indicate that large choices of often yield the exact MAP solution (cyan and purple). Moderate choices still yield competitive solutions but not optimal for larger graphs (orange and green). Low choices generally give poor solutions with high spread for all graph sizes (red and blue).
Convergence
We then investigate the effects of regularization on convergence for both variants. Figure 5 illustrates the distance of the rounded solution to the optimal MAP solution over projection steps on grids of size . EMP-greedy converges sharply and varying regularization has less of an effect on its convergence rate. Finally, in Figure 6, we look at Erdős-Rényi random graphs to observe the effect of the graph structure for both variants. We considered degree-limited random graphs with and . The figure shows convergence over projection steps for graphs of size . For both variants, the convergence rate deteriorates for higher degrees.
8 CONCLUSION
In this paper, we investigated the approximation effects of entropy regularization on MAP inference objectives. We combined these approximation guarantees with a convergence analysis of an edge-based message passing algorithm that solves the regularized objective to derive guarantees on the number of iterations sufficient to recover the true MAP assignment. We also showed empirically the effect of regularization and graph properties on both the approximation and convergence. In future work, we wish to extend the analyses and proof techniques to higher order polytopes and general block-coordinate minimization algorithms.
Acknowledgements
We thank the anonymous reviewers and Marco Pavone for their invaluable feedback.
Appendix A Bregman Projection Derivation
The objective (Reg) can be equivalently interpreted as a Bregman projection. This interpretation has been explored by Ravikumar et al. (2010) as a basis for proximal updates and also Benamou et al. (2015) for the optimal transport problem. Here, we review the transformation because it is central to the algorithm of Ravikumar et al. (2010), upon which our main theoretical results are based.
By definition of the Bregman projection with respect to the negative entropy, , we have
[TABLE]
where is a vector of ones of the same size as the marginal vector and denotes the two sides are equal up to a constant. Substituting this into (Reg) and multiplying through by yields the objective:
[TABLE]
Note the similarity to a projected mirror descent update over starting from (Nemirovsky and Yudin, 1983; Bubeck, 2015). Using this insight and performing a single gradient update in the dual, we can transform the problem into a single Bregman projection of the vector. The unprojected marginal vector satisfies
[TABLE]
where is the dual map and is the inverse dual map. We have and the solution to the mirror descent update is . Therefore it is sufficient to solve the following Bregman projection problem:
[TABLE]
The projection, however, cannot be computed in closed form due to the complex geometry of . Sinkhorn-like algorithms such as those used in Cuturi (2013) are unavailable because the transportation polytopes are dependent on variables and which are also involved in the projection operation.
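The equivalence between the regularized objective and a Bregman projection can be checked numerically. The following is a minimal sketch, assuming (Reg) has the form of a linear cost minus the Shannon entropy scaled by the inverse regularization constant, and that the projection target is the elementwise exponential of the negated, scaled cost vector. The names `C`, `eta`, `reg_objective`, and `bregman_objective` are illustrative and do not come from the paper, and a single probability vector stands in for the full marginal vector.

```python
import numpy as np

def reg_objective(mu, C, eta):
    """<C, mu> - (1/eta) * H(mu), with H the Shannon entropy (assumed form of (Reg))."""
    return float(C @ mu + np.sum(mu * np.log(mu)) / eta)

def bregman_objective(mu, C, eta):
    """Generalized KL divergence from mu to the assumed target exp(-eta * C)."""
    target = np.exp(-eta * C)
    return float(np.sum(mu * np.log(mu / target) - mu + target))

rng = np.random.default_rng(0)
C = rng.uniform(-1.0, 1.0, size=5)   # hypothetical cost vector
eta = 3.0                            # hypothetical regularization constant

# eta * (Reg) and the Bregman objective should differ by the same constant
# for every normalized mu, so both have the same minimizer.
gaps = []
for _ in range(4):
    mu = rng.dirichlet(np.full(5, 2.0))
    gaps.append(bregman_objective(mu, C, eta) - eta * reg_objective(mu, C, eta))
```

Under these assumptions every entry of `gaps` equals the sum of the target vector minus one, which does not depend on `mu`.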
Appendix B Derivation of EMP Update Rules
We present the derivations of the update rules similar to Ravikumar et al. (2010) for a given edge based on the Bregman projections onto the individual constraint sets , , , . We refer the reader to Ravikumar et al. (2010) for the original algorithm and derivation. We derive only the first two projections; the last two can be found by exchanging the indices.
For the projection , where
[TABLE]
there are no constraints on any edges or vertices other than and . Therefore, , . Similarly, , .
The Lagrangian of the projection is given in terms of primal variables and dual variables :
[TABLE]
By the first-order optimality condition, the primal solution in terms of the dual variables is
[TABLE]
Substituting this solution back into the Lagrangian, we have
[TABLE]
Again, by the first-order optimality condition, the dual solution is
[TABLE]
Substituting this value for into the primal solution yields the desired result.
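As an illustration of this kind of derivation, the sketch below works out a consistency projection for a single edge and vertex. It assumes the projection minimizes the sum of the generalized KL divergences of the edge block and the vertex block with equal weight, subject to the row sums of the edge joint matching the vertex marginal. Under that assumption the first-order conditions give a geometric-mean update. The function name `project_row_consistency` and the array names are hypothetical, and the closed form may differ in notation from Proposition 1.

```python
import numpy as np

def project_row_consistency(mu_ij, mu_i):
    """KL projection of (mu_ij, mu_i) onto {row sums of mu_ij equal mu_i}.

    Assumes an equally weighted sum of generalized KL divergences
    for the edge block and the vertex block.
    """
    row_sums = mu_ij.sum(axis=1)
    scale = np.sqrt(mu_i / row_sums)        # per-row multiplicative factor
    new_mu_ij = mu_ij * scale[:, None]
    new_mu_i = np.sqrt(mu_i * row_sums)     # geometric mean of the two marginals
    return new_mu_ij, new_mu_i

rng = np.random.default_rng(1)
mu_ij = rng.dirichlet(np.full(9, 2.0)).reshape(3, 3)   # toy edge joint
mu_i = rng.dirichlet(np.full(3, 2.0))                  # toy vertex marginal
new_mu_ij, new_mu_i = project_row_consistency(mu_ij, mu_i)
```

After the update the row sums of `new_mu_ij` agree with `new_mu_i`, but neither block is guaranteed to sum to one, which is why a separate normalization projection is needed.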
Again, for the projection onto
[TABLE]
only and are affected. enforces that the variables and each sum to one. It is well known and easy to show that the Bregman projection with respect to the negative entropy is simply the and normalized by their sums. This normalization can also be written as a multiplicative update of the same form by observing that
[TABLE]
where and . Again, these can be derived via the Lagrangian.
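The claim that the Bregman projection onto a sum-to-one constraint is plain normalization can be checked on a small example. The sketch below uses the generalized KL divergence as the Bregman divergence of the negative entropy. The helper names `generalized_kl` and `project_normalization` are illustrative only.

```python
import numpy as np

def generalized_kl(p, q):
    """Bregman divergence of the negative entropy for positive vectors."""
    return float(np.sum(p * np.log(p / q) - p + q))

def project_normalization(q):
    """Candidate projection onto {p : sum(p) = 1}: divide by the sum."""
    return q / q.sum()

rng = np.random.default_rng(2)
q = rng.uniform(0.1, 2.0, size=4)            # unnormalized positive vector
p_star = project_normalization(q)
best = generalized_kl(p_star, q)

# Compare against many other feasible points on the simplex.
others = [generalized_kl(rng.dirichlet(np.full(4, 2.0)), q) for _ in range(200)]
```

No sampled feasible point attains a smaller divergence than `p_star`, consistent with the normalization being the minimizer.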
Appendix C Extensions of EMP
C.1 Dual EMP
We may also equivalently interpret the multiplicative updates in Algorithm 1 and Algorithm 2 as additive updates of the dual variables. The dual interpretation is consistent with past work in dual MAP algorithms (Sontag et al., 2011) and may be more practical to avoid numerical issues in implementation. Instead of tracking the primal variables , we track a sum of the dual variables with for each vertex and edge. Enforcing consistency between a given joint distribution and its marginals in (a) yields updated dual variable sums
[TABLE]
where again . The same is done for the vertex in (c) with indices exchanged. The normalization step in (b) yields
[TABLE]
where and . Again, the same is done for (d). The primal marginal vector is recovered with
[TABLE]
We will later make explicit the dual formulation as it will aid in the theoretical analysis.
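The numerical advantage of additive updates can be seen in the normalization step alone. The following is a minimal sketch, assuming a hypothetical vector `theta` that stores the logarithm of an unnormalized marginal. It is not the paper's exact dual parameterization, only an illustration of why working in the log domain avoids overflow.

```python
import numpy as np

def log_normalize(theta):
    """Subtract a stable logsumexp(theta) so that exp(result) sums to one."""
    shift = np.max(theta)
    log_z = shift + np.log(np.sum(np.exp(theta - shift)))
    return theta - log_z

theta_small = np.array([0.5, -1.0, 2.0])
theta_large = np.array([1000.0, 1001.0, 999.0])   # exp() of these overflows directly

p_small = np.exp(log_normalize(theta_small))
p_large = np.exp(log_normalize(theta_large))
```

For moderate values the result matches the multiplicative normalization, while for large values the additive form still returns a finite, normalized distribution.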
C.2 Clique Constraints
The version of EMP presented in the paper is for the local polytope, which enforces only pairwise consistency among the variables with edges, but this can be fairly easily extended. In this section, we discuss higher order pseudo-marginals and their constraints. Consider the polytope that enforces consistency on all subsets of of size and below, denoted by . We use the notation of Meshi et al. (2012). The constraint set is written as
[TABLE]
where denotes a marginalization over all variables except . For convenience, we may also now account for higher-order interactions in the model itself:
[TABLE]
The projection operation in (Proj) is the same for . Analogous update rules to Proposition 1 can be derived with exactly the same procedure. For a given subset and vertex , we have that constitutes the update
[TABLE]
The normalization updates are identical as well. As in the presented EMP algorithm, we can design greedy and cyclic algorithms around these update equations. The theoretical analysis in Section 6 will focus on the case with edges only. We leave the general analysis of for future work.
Appendix D Omitted Proofs and Derivations from Section 6
D.1 Derivation of the Lyapunov function (8)
For convenience, is restated here:
[TABLE]
The Lagrangian of (Proj) with primal variables and dual variables can be written as
[TABLE]
where
[TABLE]
The partial derivatives with respect to and are given by
[TABLE]
Setting the derivatives to zero gives the solution in terms of the dual variables:
[TABLE]
By substituting in , we obtain the Lyapunov function .
D.2 Proof of Lemma 1
In this section we prove Lemma 1. We restate the result for the reader’s convenience.
Lemma 3.
For a given edge , let and denote the updated primal and dual variables after a projection from one of (a)–(d) in Proposition 1. We have the following improvements on . If is equal to:
(a) , then
(b) , then
(c) , then
(d) , then .
Proof.
Let and denote the values of the Lyapunov function before and after the projection in each case.
(a) Due to the projection , only and change values.
[TABLE]
(b) Due to the projection , again only and change, but they are simply normalized. From the derivation of the updates, we can see that only dual variables and are updated in order for the normalization to occur. We have, from the update rule in Proposition 1
[TABLE]
The improvement on the Lyapunov function can then be written as
[TABLE]
where the second equality uses the fact that and both sum to one. This last expression can be shown to be non-negative by recognizing the classical inequality for all .
(c) The proof of improvement is identical to (a); however, we replace vertex with and all row sums with column sum .
(d) The proof of improvement is identical to (b), but we replace with for the vertex marginal normalization.
∎
D.3 Fixed points of EMP
We start this section by noting that all fixed points of EMP correspond to valid (constraint satisfying) primal solutions and therefore must equal global optima of the dual function.
First note that any fixed point of EMP corresponds to a candidate solution all of whose constraints are satisfied. Indeed, at optimality satisfy:
[TABLE]
with . Since all constraints are satisfied, for all projection types in Lemma 1, .
For the converse, we proceed by contradiction. Let be a fixed point of EMP. As such, all the normalization constraints (ensuring the edge and node distributions each sum to one) must be satisfied. Assume then that a constraint of type (a) or (c) is not satisfied. Without loss of generality let be the unsatisfied constraint. As a consequence of Lemma 1, the Lyapunov objective can be strictly increased by performing the corresponding Bregman projection, and therefore EMP could not possibly have been at a fixed point. We summarize these observations in the following proposition:
Proposition 2.
All maxima of are fixed points of EMP and all fixed points of EMP are maxima of .
D.4 Proof of Lemma 2
In this section we prove Lemma 2. We restate it here for readability.
Lemma 4.
Let , denote the maximizers of . The difference in function value between the optimal value of and the value at the first iteration is upper bounded as
[TABLE]
Proof.
We start by showing the upper bound:
[TABLE]
We have that when before any updates to the primal variables. By Lemma 1, . Then we have
[TABLE]
We may establish an upper bound on by finding a feasible point in the primal objective (Proj). It is easy to verify that is in if and , and . With this choice of , the value of (Proj) is
[TABLE]
where denotes the uniform distribution and the last inequality follows from the fact that . Therefore,
[TABLE]
We now proceed to show the following (direct) bound on :
[TABLE]
We work under the assumption that at any time , all the component distributions of are normalized so its entries sum to . Notice that in this case
[TABLE]
If we initialize our algorithm to , and let and be the normalization factors corresponding to this choice of , then
[TABLE]
Notice that at optimality , for all and, for all ,
[TABLE]
And for all and for all ,
[TABLE]
Therefore, for all and for all :
[TABLE]
For all and for all :
[TABLE]
Summing Equations (14) and (15) over all , and yields:
[TABLE]
And, therefore,
[TABLE]
Notice that the RHS of the equation above is positive since for all and all . Combining Equations (13) and (17) with the observation that (by virtue of Lemma 1), we obtain the final result. ∎
In the case when all entries of are positive it may be the case that .
D.5 Complete Proof of Theorem 2
In this section, we will complete the proof of Theorem 2 by handling the case of EMP-cyclic. We require two additional technical lemmas on the distance between updated variables. We will use and to denote row and column sums respectively of joint distribution matrices.
Lemma 5.
Let be two points in the simplex and let s.t. for all . Let be defined as . Then:
[TABLE]
Proof.
We only need to prove that . From we obtain:
[TABLE]
Let . The following relationships hold:
[TABLE]
Note that
[TABLE]
and
[TABLE]
Therefore,
[TABLE]
The result follows.
∎
Let with elements be a matrix representing joint distribution probabilities. For , define
[TABLE]
where is a normalization term, such that the new probabilities matrix sums to one. The notation denotes the th element of row sum vector .
Lemma 6.
The following inequality holds on the difference between and :
[TABLE]
Proof.
[TABLE]
∎
This proof of Theorem 2 relies heavily on the primal and dual variables at given times throughout the algorithm. As such, it is necessary to define precise notation for these temporal events. We note that there are two loops in the algorithm: an outer loop that controls the iterations and an inner one that loops over all edges in . The outer loop’s current iteration is given by , as defined and updated in Algorithm 1. We denote the current step of the inner loop by where . This is due to the fact that there are four projections for each edge (, , , and ) in one full iteration for . Thus the algorithm alternates between enforcing consistency between an edge and vertex and normalizing the local distributions.
The value of at iteration and step within iteration is denoted by . For example, at the very start of the algorithm, we are at iteration and step with initial value , which is equal to with normalized vertex marginal and edge joint distributions. The constraint set onto which a projection is made at in any iteration is denoted by . Note that we drop in the constraint set notation because the order in which the projections occur is always the same.
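The two-level indexing described above can be made concrete with a small schedule generator. This is only a sketch: the projection labels "a" through "d" follow Proposition 1, with (a) and (c) taken to be the consistency projections and (b) and (d) the normalizations, while the zero-based iteration counter and one-based step counter are assumptions made for illustration. The function name `cyclic_schedule` is hypothetical.

```python
def cyclic_schedule(edges, num_iterations):
    """Yield (iteration, step, edge, projection) in a fixed cyclic order.

    Each edge contributes four projections per iteration, so the step
    index runs from 1 to 4 * len(edges) within every iteration.
    """
    projections = ("a", "b", "c", "d")
    for k in range(num_iterations):
        step = 0
        for edge in edges:
            for proj in projections:
                step += 1
                yield k, step, edge, proj

edges = [(0, 1), (1, 2)]                       # toy edge set
schedule = list(cyclic_schedule(edges, num_iterations=2))
```

With this ordering the normalization projections always fall on even steps, which matches the observation used later that the variables are normalized at every even step.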
Proof of Theorem 2.
Let be the first iteration such that the termination condition in Algorithm 1 with respect to is met. For such that , there exists such that or .
First consider the case where . Let be chosen such that . Note that can move within the -ball of between times and of the th iteration due to earlier projections involving vertex . However, for all because it is only updated at step where . Then, by repeatedly applying the triangle inequality, we have
[TABLE]
where and are sets of times before where a projection (for row and column consistency, respectively) caused to be updated:
[TABLE]
Therefore, is the result of enforcing consistency with another edge of and then normalizing . Let denote the edge (incident on ) onto which projections are occurring at step . From Lemma 5, if , then
[TABLE]
If , then
[TABLE]
Similarly, by combining Lemma 5 and Lemma 6, we have
[TABLE]
Note that since the variables are normalized at every even step, they are individually valid probability distributions, and so the Hellinger inequality can be applied. For distributions and , the inequality states
[TABLE]
Therefore,
[TABLE]
The last inequality follows from telescoping over all steps in iteration due to Lemma 1. This proof was for the case when . For the case when , the procedure is identical except we may ignore the term since is constant within iteration until the projection onto . Thus, the improvement lower bound still holds.
Putting these results together with Lemma 2, we see that as long as a single constraint is violated above the threshold at the start of an iteration, it is possible to show that the value of increases by at least during the iteration. This implies that EMP-cyclic terminates in at most iterations.
∎
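The proof above relies on a Hellinger-type inequality for normalized distributions. As a sanity check of the standard relation between the squared Hellinger distance and the L1 distance, the sketch below samples random distribution pairs and records both quantities. The constants used here are the textbook ones and may differ from the exact form used in the proof. The helper names are illustrative.

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance, 0.5 * sum (sqrt(p) - sqrt(q))^2."""
    return 0.5 * float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def l1_distance(p, q):
    return float(np.sum(np.abs(p - q)))

rng = np.random.default_rng(3)
records = []
for _ in range(500):
    p = rng.dirichlet(np.full(5, 2.0))
    q = rng.dirichlet(np.full(5, 2.0))
    records.append((hellinger_sq(p, q), l1_distance(p, q)))

# Textbook bounds: h^2 <= ||p - q||_1 / 2   and   ||p - q||_1^2 <= 8 * h^2.
lower_ok = all(h2 <= l1 / 2 + 1e-12 for h2, l1 in records)
upper_ok = all(l1 ** 2 <= 8 * h2 + 1e-12 for h2, l1 in records)
```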
D.6 Proof of Theorem 3
We start by defining a version of with slack vectors. Let be a vector indexed in a similar way as , where . We define the slack as and . Then we define the slack polytope as
[TABLE]
Notice that by definition the slack vectors satisfy that, for all , . The main difference between and lies in that the joints do not marginalize exactly to the vertex probabilities but do so up to a slack. Consider the entropy-regularized linear program corresponding to :
[TABLE]
Introducing the exact same ensemble of dual variables as in the Lyapunov function derivation, its dual function equals
[TABLE]
Furthermore, if were a set of optimal dual variables, the optimal primal can be computed via
[TABLE]
They satisfy the same formulae as the problem without slack variables. Since dual optimality is equivalent to primal feasibility, whenever an iterate of EMP satisfies slack of , its corresponding primal solution is optimal for (Reg-slack).
We start with a useful manipulation lemma:
Lemma 7.
Let be two slack vectors and let . Assume .
1. If for all and , , then there exists a vector such that
[TABLE]
2. If (footnote 4: we do not require that ), then there exists a vector such that
[TABLE]
Proof.
First we consider the case when for all , is a valid distribution (in other words, all its entries are in and its values sum to ). In this case, we can argue for the existence of via the following:
Let for all . Let and observe that and . We invoke Lemma 7 in Altschuler et al. (2017) to claim the existence of such that and and
[TABLE]
Setting to be the ensemble with values and the result follows.
Now we consider the case when there exist such that does not lie in the probability simplex. In this case we will have to define different from . Consider some . Let be the set of neighbouring vertices to and we abuse notation slightly and use for to denote the slack on as of the edge marginal shared by and . We define in the following way:
1. If for all then let .
2. Otherwise, let be the entries of such that for all there exists at least one for which . Therefore, we must define such that
[TABLE]
which can be done by taking the convex combination of with the uniform distribution:
[TABLE]
Setting guarantees this outcome because we are given that . Furthermore, we have
[TABLE]
This, in turn, implies . Then, we apply the result of Altschuler et al. (2017) again to achieve existence of such that
[TABLE]
Summing over these yields . Therefore
∎
We additionally require a similar lemma which allows us to project from one polytope to another while bounding the probabilities away from zero.
Lemma 8.
Fix such that and a slack vector such that . If , then there exists a vector such that
[TABLE]
If , then there exists a vector such that
[TABLE]
Proof.
We address each case individually.
1. We use the first result from Lemma 7, which yields such that . If the probabilities are already bounded below , then we are done; however, we must handle the worst case. As in the proof of Lemma 7, we compute a convex combination of with the uniform distribution to draw the distribution away from zero values. Define
[TABLE]
where we set which ensures that and . Then, note that
[TABLE]
By the triangle inequality, we have
[TABLE]
2. In the second case, we start by constructing a distribution , which is nearly uniform but lives in the slack polytope and is bounded away from zero by at least .
For each , we take to be the uniform distribution where . Since , we perturb the uniform distribution with for each , generating . Again, we are abusing notation slightly by using to denote marginalization of edge to vertex . Note that , so we can define the product distribution , which, by construction, marginalizes such that the full vector given by the ensemble and is in . Furthermore, each component can be bounded below as
[TABLE]
Now, as before, we know there exists such that from Lemma 7. Therefore, we can take the convex combination of to get such that in all entries.
Taking ensures that . Furthermore, the difference can be computed as
[TABLE]
Therefore, we have , which by triangle inequality implies
[TABLE]
∎
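The device used in both proofs above, taking a convex combination with the uniform distribution, can be illustrated directly. In the sketch below the mixing weight `theta` and the function name `mix_with_uniform` are hypothetical. The example shows that the mixed distribution stays normalized, has every entry at least `theta / d`, and moves by at most `2 * theta` in L1 distance.

```python
import numpy as np

def mix_with_uniform(p, theta):
    """Return (1 - theta) * p + theta * uniform for a distribution p."""
    d = p.shape[0]
    return (1.0 - theta) * p + theta * np.full(d, 1.0 / d)

p = np.array([0.0, 0.0, 0.3, 0.7])    # distribution with zero entries
theta = 0.1                           # hypothetical mixing weight
p_mixed = mix_with_uniform(p, theta)

min_entry = float(p_mixed.min())
l1_shift = float(np.abs(p_mixed - p).sum())
```

The L1 bound follows because the difference equals `theta` times the gap between the uniform distribution and `p`, and any two distributions are at most 2 apart in L1 distance.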
We now have the necessary ingredients to prove the first theorem of this section, which provides a bound on the distance between the final iterate of Algorithms 1 and 2 and the solution of (Reg). Crucially, we analyze these iterates under the assumption that all their component distributions for and for are normalized.
Theorem 4.
Let be the th iterate of EMP and let be the slack vector corresponding to such that . In other words,
[TABLE]
Fix such that . Let be the pseudo-marginal vector in produced by the first case of Lemma 8 when fed with and . Then,
[TABLE]
Proof.
By definition . In fact, is the optimizer of the following regularized linear program:
[TABLE]
This observation follows because is in and its elements can be written as in (24) and (25), thus satisfying dual feasibility.
Recall that after every iteration all the component distributions are normalized. Recall that
[TABLE]
where is the negative entropy. The point is the optimal point of the information projection for points in . By the properties of information projections,
[TABLE]
Since for and , the sum of their entries is the same, by Pinsker’s inequality (applied to each of the component vertex and edge distributions) this in turn implies that
[TABLE]
Let in be the vector produced by Lemma 8 applied to . Note that we utilize the existence of and for analysis but we need not actually compute them. Expanding yields
[TABLE]
Term is negative since is the optimal point in the slack polytope. Because and were constructed such that all their probabilities are lower bounded by , it holds that the entropies are -Lipschitz in . Terms and can then be bounded:
[TABLE]
The result then follows as
[TABLE]
∎
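Pinsker's inequality, used in the proof above to pass from a divergence bound to an L1 bound, states that the KL divergence between two distributions is at least half the squared L1 distance (with natural logarithms). A quick numerical check on random pairs, with illustrative helper names, is sketched below.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(4)
pairs = [
    (rng.dirichlet(np.full(6, 2.0)), rng.dirichlet(np.full(6, 2.0)))
    for _ in range(500)
]

# Collect any pair that would violate KL(p || q) >= 0.5 * ||p - q||_1^2.
violations = [
    (p, q)
    for p, q in pairs
    if kl_divergence(p, q) < 0.5 * np.abs(p - q).sum() ** 2 - 1e-12
]
```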
Theorem 4, combined with the EMP algorithm’s optimality condition, can provide convergence guarantees for the case when is tight and the solution is unique. We restate the main result, Theorem 3, for readability.
Theorem 5.
Let , and . If is tight and , the EMP algorithm returns a MAP assignment after iterations for EMP-cyclic and after iterations for EMP-greedy.
Proof.
Let be the last internal iterate of the EMP algorithm before rounding. Since the stopping condition has been met, the slack vector corresponding to must satisfy for all so that .
Let be defined as in Theorem 4 and choose (footnote 5: as long as at least, this guarantees , so we are free to use Theorem 4). Then, the bound from Theorem 4 becomes
[TABLE]
where the last inequality used the fact that for . Choosing ensures that
[TABLE]
Consequently for all
[TABLE]
and for all
[TABLE]
We also have
[TABLE]
which implies and (by the condition on , see Theorem 1). Putting these inequalities together by triangle inequality,
[TABLE]
for all . A similar statement holds for all :
[TABLE]
Therefore, assuming (the solution of ) is integral,
[TABLE]
and
[TABLE]
∎
Appendix E Experiment Details
In this section, we provide some additional details for the experiments in Section 7. As mentioned, empirical comparisons between state-of-the-art solvers and EMP-like algorithms have been studied extensively (Meshi et al., 2012; Ravikumar et al., 2010; Werner, 2007; Kappes et al., 2013). For instance, Meshi et al. (2012) found that the regularized star-based message passing algorithms greatly outperform standard optimization techniques such as FISTA and gradient descent, which do not exploit the coordinate structure of the problem.
The primary purpose of these experiments is to understand how the theoretical results in Section 6 manifest in a practical setting. In particular, we would like to understand how the convergence rates, in terms of the ability to round to the solution, behave as a function of the parameters of the problem such as graph size, choice of regularization , and connectivity of the graph. In all experiments, we ran an LP solver on the graph in order to obtain the ground-truth MAP assignment. We only considered problems that were tight. The solver specifically is the ECOS solver through a CVXPY wrapper.
E.1 Grid Experiments
As mentioned, our first set of experiments considered solving the MAP problem on grids, totaling vertices. The vertices were connected by edges to their vertical and horizontal neighbors in the grid. This setting is fairly standard in the literature (Erdogdu et al., 2017; Globerson and Jaakkola, 2008; Ravikumar et al., 2010).
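The grid topology described above can be sketched as follows, assuming a square grid with side length `n` (the specific sizes used in the experiments are not restated here) and vertices indexed in row-major order. The function name `grid_edges` is hypothetical.

```python
def grid_edges(n):
    """Edges of an n-by-n grid with vertices indexed row-major as r * n + c."""
    edges = []
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:
                edges.append((v, v + 1))   # horizontal neighbor
            if r + 1 < n:
                edges.append((v, v + n))   # vertical neighbor
    return edges

edges_3x3 = grid_edges(3)
```

An n-by-n grid built this way has n * n vertices and 2 * n * (n - 1) edges, and no vertex has degree above four.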
We considered the MAP problem with labels and chose a cost vector in the family of multi-label Potts models, another well-studied application (Wainwright and Jordan, 2008). Potts models typically have diagonal edge potentials. That is, we only penalize/reward when the labels on two connected vertices agree. We randomly generated the actual values of the vector. For vertex costs, we chose and for the edge costs we chose
[TABLE]
where . In the approximation results, we ran the algorithms until they had effectively converged after 80 iterations, where each iteration consisted of a full pass over the edges. For EMP-cyclic, this means simply going through all the edges once. For EMP-greedy, one iteration means the opportunity to update each edge exactly once (although in practice the algorithm selects edges greedily). Thus both algorithms update the same number of edges, though their choices will be different. Regardless, we found that 80 iterations was reasonably sufficient to observe the approximation properties. We measured the results in terms of the average Hamming distance between the LP’s solution, which is integral, and the rounded solution returned by the algorithms.
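The evaluation metric can be sketched in a few lines. The example assumes that rounding picks the most probable label of each vertex marginal, which is a common choice but is stated here as an assumption. The array values and the names `round_marginals` and `hamming_distance` are made up for illustration.

```python
import numpy as np

def round_marginals(vertex_marginals):
    """Assumed rounding scheme: most probable label for each vertex."""
    return np.argmax(vertex_marginals, axis=1)

def hamming_distance(x, y):
    """Number of vertices on which two assignments disagree."""
    return int(np.sum(np.asarray(x) != np.asarray(y)))

# Toy vertex marginals for three vertices and three labels.
mu_v = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
])
x_lp = np.array([0, 2, 2])          # hypothetical integral LP solution

rounded = round_marginals(mu_v)
dist = hamming_distance(rounded, x_lp)
```

In this toy case the rounded assignment disagrees with the reference on the second vertex only, so the Hamming distance is one.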
E.2 Random Graph Experiments
While the grid topology offers a consistent platform to evaluate the algorithms, we also considered randomly generated graphs, specifically Erdős-Rényi random graphs. These graphs are constructed by iterating through every pair of the vertices. Then, an edge is drawn between vertex and with probability . Specifically, we chose , which is just large enough that the graph is almost surely connected. We found this to be a useful hyperparameter setting: with any lower value the graph would largely be disconnected, and with any higher value we typically found that the LP was not tight. We chose the same multi-label Potts model for generating the cost vector .
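The construction described above, together with a connectivity check, can be sketched with the standard library alone. The edge probability is left as a parameter `p` because its experimental value is not restated here. The function names and the `seed` argument are illustrative.

```python
import random
from itertools import combinations

def erdos_renyi_edges(num_vertices, p, seed=None):
    """Visit every vertex pair once and keep it as an edge with probability p."""
    rng = random.Random(seed)
    return [
        (i, j)
        for i, j in combinations(range(num_vertices), 2)
        if rng.random() < p
    ]

def is_connected(num_vertices, edges):
    """Depth-first search from vertex 0 over an undirected edge list."""
    if num_vertices == 0:
        return True
    adjacency = {v: [] for v in range(num_vertices)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == num_vertices

complete = erdos_renyi_edges(5, 1.0, seed=0)
empty = erdos_renyi_edges(5, 0.0, seed=0)
```

Sampling can be repeated with different seeds until `is_connected` returns true, which mirrors the need for connected instances in these experiments.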
With these experiments, we intended to understand how diverse graph topologies would affect convergence due to randomness. In particular, we restricted the degrees of the graph to to observe how the algorithms behave on denser graphs.
References
- Altschuler et al. (2017) Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.
- Antonucci et al. (2014) Alessandro Antonucci, Cassio De Campos, and Marco Zaffalon. Probabilistic graphical models. In Introduction to Imprecise Probabilities. Wiley, 2014.
- Benamou et al. (2015) Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
- Bethe (1935) Hans Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A-Mathematical and Physical Sciences, 150(871):552–575, 1935.
- Bregman (1966) Lev M Bregman. A relaxation method of finding a common point of convex sets and its application to problems of optimization. In Soviet Mathematics Doklady, volume 7, pages 1578–1581, 1966.
- Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
- Cominetti and San Martín (1994) Roberto Cominetti and Jaime San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Mathematical Programming, 67(1-3):169–187, 1994.
- Cooper (1990) Gregory Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990.
