COSMO: A conic operator splitting method for convex conic problems

Michael Garstka; Mark Cannon; Paul Goulart

arXiv:1901.10887·math.OC·August 31, 2021·J. Optim. Theory Appl.

COSMO: A conic operator splitting method for convex conic problems

Michael Garstka, Mark Cannon, Paul Goulart

PDF

2 Repos

TL;DR

COSMO introduces an efficient operator splitting solver for large convex conic problems, leveraging sparsity and decomposition techniques, with promising benchmarks and an open-source Julia implementation.

Contribution

The paper presents COSMO, a novel conic operator splitting method that efficiently solves large structured convex conic problems using sparsity exploitation and decomposition.

Findings

01

Outperforms existing solvers on benchmark problems

02

Efficient for large-scale semidefinite programs

03

Open-source Julia implementation available

Abstract

This paper describes the Conic Operator Splitting Method (COSMO) solver, an operator splitting algorithm for convex optimisation problems with quadratic objective function and conic constraints. At each step the algorithm alternates between solving a quasi-definite linear system with a constant coefficient matrix and a projection onto convex sets. The low per-iteration computational cost makes the method particularly efficient for large problems, e.g. semidefinite programs that arise in portfolio optimisation, graph theory, and robust control. Moreover, the solver uses chordal decomposition techniques and a new clique merging algorithm to effectively exploit sparsity in large, structured semidefinite programs. A number of benchmarks against other state-of-the-art solvers for a variety of problems show the effectiveness of our approach. Our Julia implementation is open-source, designed…

Tables3

Table 1. Table 1: Normalized shifted geometric mean and failure rates for solvers tested on the Maros and Mészáros QP test set.

	OSQP	COSMO	MOSEK	SCS
Normalized shifted geometric mean	$1.0$	$1.168 839 870 811 628 3$	$1.896 856 715 452 622$	$75.015 173 814 748$
Number of fastest solve time	$75$	$27$	$33$	$0$
Failure rates $f_{r, s}$ [%]	$4.444 444 444 444 445$	$5.185 185 185 185 185$	$10.370 370 370 370 37$	$83.703 703 703 703 7$

Table 2. Table 5: Benchmark results for nearest correlation matrix problems.

	Solve time ( $s$ )			Iterations			Max error²
n	COSMO	MOSEK	SCS	COSMO	MOSEK	SCS	COSMO	MOSEK	SCS
$50$	$0.5$	$1.89$	\cellcolorred!25 $0.07$	25	4	20	$2.047 776 372 028 258 2 × 10^{- 5}$	$0.001 352 074 179 025 200 7$	$6.183 471 303 045 888 × 10^{- 7}$
$100$	$0.45$	$35.68$	\cellcolorred!25 $0.18$	25	4	20	$2.024 643 461 273 149 × 10^{- 5}$	$0.006 258 493 529 849 161 5$	$4.244 552 375 162 742 × 10^{- 6}$
$150$	\cellcolorred!25 $0.39$	$244.35$	$0.46$	25	4	20	$5.147 714 778 508 484 4 × 10^{- 5}$	$0.008 607 245 600 801 9$	$9.001 628 710 297 762 × 10^{- 6}$
$200$	$0.94$	$1032.25$	\cellcolorred!25 $0.74$	25	4	20	$7.586 271 996 820 478 × 10^{- 5}$	$0.003 641 904 488 103 727$	$8.412 082 256 913 692 × 10^{- 5}$
$400$	\cellcolorred!25 $1.14$	$* * * †$	$3.07$	25	***	20	$0.000 262 094 375 537 953 3$	$, , *$	$8.521 790 746 760 832 × 10^{- 5}$
$600$	\cellcolorred!25 $3.05$	$* * * †$	$7.38$	25	***	20	$0.000 153 427 735 171 351 84$	$, , *$	$0.000 101 140 427 136 423 36$
$800$	\cellcolorred!25 $5.08$	$* * * †$	$13.85$	25	***	20	$0.000 403 398 791 685 376 67$	$, , *$	$0.000 107 365 675 166 775 56$
$1000$	\cellcolorred!25 $8.43$	$* * * †$	$21.94$	25	***	20	$0.000 810 745 541 838 395 2$	$, , *$	$0.000 140 400 122 683 310 96$
$1200$	\cellcolorred!25 $12.1$	$* * * †$	$34.58$	25	***	20	$0.001 023 757 730 316 243$	$, , *$	$0.000 172 926 485 792 784 75$
$1400$	\cellcolorred!25 $17.19$	$* * * †$	$49.23$	25	***	20	$0.001 022 303 707 120 552 7$	$, , *$	$0.000 192 942 969 492 905 5$
$1600$	\cellcolorred!25 $22.69$	$* * * †$	$61.07$	25	***	20	$0.000 800 266 586 296 017 2$	$, , *$	$0.000 221 405 713 637 240 54$
$1800$	\cellcolorred!25 $29.41$	$* * * †$	$74.0$	25	***	20	$0.000 514 447 457 488 027 8$	$, , *$	$0.000 243 422 070 937 715 6$
$2000$	\cellcolorred!25 $37.74$	$* * * †$	$101.18$	25	***	20	$0.000 331 235 747 684 322 1$	$, , *$	$0.000 256 735 495 766 549 14$

Table 3. Table 6: Benchmark results for block arrow sparse SDPs with varying number of blocks N b subscript 𝑁 𝑏 N_{b} and block size d 𝑑 d .

	Solve time ( $s$ )				Iterations				Max error²
$d$	COSMO	COSMO^cd	MOSEK	SCS	COSMO	COSMO^cd	MOSEK	SCS	COSMO	COSMO^cd	MOSEK	SCS
$10$	$12.71$	\cellcolorred!25 $1.22$	$16.5$	$56.99$	125	175	9	1340	$0.000 150 072 085 836 178 37$	$0.000 559 958 921 577 102 9$	$0.001 164 423 021 143 846 4$	$0.000 900 749 508 068 270 9$
$12$	$23.95$	\cellcolorred!25 $2.3$	$28.72$	$68.72$	125	225	10	1120	$0.000 229 156 255 539 896 53$	$0.000 752 007 170 370 981 5$	$3.653 030 782 170 664 6 × 10^{- 5}$	$0.000 537 967 341 929 155 7$
$14$	$27.4$	\cellcolorred!25 $2.39$	$41.66$	$147.72$	150	175	10	1780	$7.915 695 395 411 079 × 10^{- 5}$	$0.000 860 162 011 584 706 3$	$0.000 684 317 980 750 578$	$0.000 546 108 038 611 922 5$
$16$	$36.4$	\cellcolorred!25 $3.6$	$58.04$	$254.93$	150	275	10	2360	$4.755 751 338 248 006 × 10^{- 5}$	$0.000 534 059 050 660 824 1$	$0.000 263 730 037 172 260 66$	$0.000 691 310 390 829 264 7$
$18$	$42.96$	\cellcolorred!25 $4.42$	$85.81$	$472.77$	150	275	11	3440	$0.000 266 229 348 802 400 8$	$0.000 576 826 639 568 577$	$0.000 254 824 980 419 294 97$	$0.000 325 376 397 877 594$
$20$	$41.13$	\cellcolorred!25 $5.08$	$112.56$	$251.39$	125	275	11	1460	$0.000 260 818 291 211 720 55$	$0.000 534 794 541 981 283 4$	$6.328 217 577 546 178 × 10^{- 5}$	$0.000 726 192 018 424 364$
$22$	$50.7$	\cellcolorred!25 $11.47$	$142.14$	$520.81$	125	700	11	2460	$0.000 286 708 260 976 857 7$	$0.000 584 522 529 190 938 4$	$0.000 331 061 847 584 176 3$	$0.000 485 706 361 682 841 4$
$24$	$125.12$	\cellcolorred!25 $10.52$	$178.75$	$445.37$	275	500	11	1780	$7.455 716 137 394 613 × 10^{- 5}$	$0.000 723 270 596 283 732 9$	$0.000 356 767 244 307 746 04$	$0.000 509 563 519 333 417$
$26$	$88.74$	\cellcolorred!25 $6.84$	$220.26$	$623.46$	150	200	11	2060	$8.169 110 186 283 036 × 10^{- 5}$	$0.000 327 481 590 424 160 86$	$2.887 518 519 592 912 3 × 10^{- 6}$	$0.000 138 527 976 122 053 54$
$28$	$98.83$	\cellcolorred!25 $7.47$	$263.52$	$1626.82$	150	200	11	4660	$3.733 949 719 961 775 7 × 10^{- 5}$	$0.000 892 534 449 832 218 5$	$2.617 454 968 281 645 6 × 10^{- 5}$	$0.000 459 199 558 705 318 05$

Equations186

\begin{array}[]{lll}\mbox{minimize}&f(x)&\\ \mbox{subject to}&g_{i}(x)\leq 0,&i=1,\ldots,l\\ &h_{i}(x)=0,&i=1,\ldots,k,\end{array}

\begin{array}[]{lll}\mbox{minimize}&f(x)&\\ \mbox{subject to}&g_{i}(x)\leq 0,&i=1,\ldots,l\\ &h_{i}(x)=0,&i=1,\ldots,k,\end{array}

svec (V) : = [V_{11}, 2 V_{12}, V_{22}, 2 V_{13}, 2 V_{23}, V_{33}, 2 V_{14}, \dots, V_{nn}]^{⊤} .

svec (V) : = [V_{11}, 2 V_{12}, V_{22}, 2 V_{13}, 2 V_{23}, V_{33}, 2 V_{14}, \dots, V_{nn}]^{⊤} .

S_{+}^{n} : = {s \in R^{\frac{n ( n + 1 )}{2}} : smat (s) \in S_{+}^{n}} .

S_{+}^{n} : = {s \in R^{\frac{n ( n + 1 )}{2}} : smat (s) \in S_{+}^{n}} .

K^{\circ} : = {y \in R^{n} ∣ x \in K sup ⟨ x, y ⟩ \leq 0},

K^{\circ} : = {y \in R^{n} ∣ x \in K sup ⟨ x, y ⟩ \leq 0},

N_{K} (x) : = {y \in R^{n} ∣ \overset{x}{ˉ} \in K sup ⟨ \overset{x}{ˉ} - x, y ⟩ \leq 0},

N_{K} (x) : = {y \in R^{n} ∣ \overset{x}{ˉ} \in K sup ⟨ \overset{x}{ˉ} - x, y ⟩ \leq 0},

K^{\infty} : = {y \in R^{n} ∣ x + a y \in K, \forall x \in K, \forall a \geq 0} .

K^{\infty} : = {y \in R^{n} ∣ x + a y \in K, \forall x \in K, \forall a \geq 0} .

prox_{f} (x) : = y argmin {f (y) + \frac{1}{2} ∥ y - x ∥_{2}^{2}} .

prox_{f} (x) : = y argmin {f (y) + \frac{1}{2} ∥ y - x ∥_{2}^{2}} .

I_{C} (x) : = {0 + \infty x \in C otherwise,

I_{C} (x) : = {0 + \infty x \in C otherwise,

Π_{C} (x) : = y \in C argmin ∥ x - y ∥_{2}^{2} .

Π_{C} (x) : = y \in C argmin ∥ x - y ∥_{2}^{2} .

σ_{C} (x) : = y \in C sup ⟨ x, y ⟩ .

σ_{C} (x) : = y \in C sup ⟨ x, y ⟩ .

\begin{array}[]{ll}\mbox{minimize}&\textstyle{\frac{1}{2}}x^{\top}Px+q^{\top}x\\ \mbox{subject to}&Ax+s=b\\ &s\in\mathcal{K},\end{array}

\begin{array}[]{ll}\mbox{minimize}&\textstyle{\frac{1}{2}}x^{\top}Px+q^{\top}x\\ \mbox{subject to}&Ax+s=b\\ &s\in\mathcal{K},\end{array}

K = K_{1}^{m_{1}} \times K_{2}^{m_{2}} \times \dots \times K_{N}^{m_{N}},

K = K_{1}^{m_{1}} \times K_{2}^{m_{2}} \times \dots \times K_{N}^{m_{N}},

- \frac{1}{2} x^{⊤} P x + b^{⊤} y - σ_{K} (y)

- \frac{1}{2} x^{⊤} P x + b^{⊤} y - σ_{K} (y)

P x - A^{⊤} y = - q

y \in (K^{\infty})^{\circ},

A x + s

A x + s

P x + q - A^{⊤} y

s \in K,

P

P

D

\frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + I_{A x + s = b} (\tilde{x}, \tilde{s}) + I_{K} (s)

\frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + I_{A x + s = b} (\tilde{x}, \tilde{s}) + I_{K} (s)

(\tilde{x}, \tilde{s}) = (x, s),

L (x, s, \tilde{x}, \tilde{s}, λ, y) = \frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + I_{A x + s = b} (\tilde{x}, \tilde{s}) + I_{K} (s) + \frac{σ}{2} \tilde{x} - x + \frac{1}{σ} λ_{2}^{2} + \frac{ρ}{2} \tilde{s} - s + \frac{1}{ρ} y_{2}^{2},

L (x, s, \tilde{x}, \tilde{s}, λ, y) = \frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + I_{A x + s = b} (\tilde{x}, \tilde{s}) + I_{K} (s) + \frac{σ}{2} \tilde{x} - x + \frac{1}{σ} λ_{2}^{2} + \frac{ρ}{2} \tilde{s} - s + \frac{1}{ρ} y_{2}^{2},

(\tilde{x}^{k + 1},

(\tilde{x}^{k + 1},

x^{k + 1}

s^{k + 1}

λ^{k + 1}

y^{k + 1}

\frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + \frac{σ}{2} \tilde{x} - x^{k}_{2}^{2} + \frac{ρ}{2} \tilde{s} - s^{k} + \frac{1}{ρ} y^{k}_{2}^{2}

\frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + \frac{σ}{2} \tilde{x} - x^{k}_{2}^{2} + \frac{ρ}{2} \tilde{s} - s^{k} + \frac{1}{ρ} y^{k}_{2}^{2}

A \tilde{x} + \tilde{s} = b .

L (\tilde{x}, \tilde{s}, ν) = \frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + \frac{σ}{2} \tilde{x} - x^{k}_{2}^{2} + \frac{ρ}{2} \tilde{s} - s^{k} + \frac{1}{ρ} y^{k}_{2}^{2} + ν^{⊤} (A \tilde{x} + \tilde{s} - b),

L (\tilde{x}, \tilde{s}, ν) = \frac{1}{2} \tilde{x}^{⊤} P \tilde{x} + q^{⊤} \tilde{x} + \frac{σ}{2} \tilde{x} - x^{k}_{2}^{2} + \frac{ρ}{2} \tilde{s} - s^{k} + \frac{1}{ρ} y^{k}_{2}^{2} + ν^{⊤} (A \tilde{x} + \tilde{s} - b),

\frac{\partial L}{\partial x ~} = P \tilde{x}^{k + 1} + q + σ (\tilde{x}^{k + 1} - x^{k}) + A^{⊤} ν^{k + 1}

\frac{\partial L}{\partial x ~} = P \tilde{x}^{k + 1} + q + σ (\tilde{x}^{k + 1} - x^{k}) + A^{⊤} ν^{k + 1}

\frac{\partial L}{\partial s ~} = ρ (\tilde{s}^{k + 1} - s^{k} + \frac{1}{ρ} y^{k}) + ν^{k + 1}

A \tilde{x}^{k + 1} + \tilde{s}^{k + 1} - b

[P + σ I A A^{⊤} - \frac{1}{ρ} I] [\tilde{x}^{k + 1} ν^{k + 1}]

[P + σ I A A^{⊤} - \frac{1}{ρ} I] [\tilde{x}^{k + 1} ν^{k + 1}]

\tilde{s}^{k + 1} = s^{k} - \frac{1}{ρ} (ν^{k + 1} + y^{k}) .

\tilde{s}^{k + 1} = s^{k} - \frac{1}{ρ} (ν^{k + 1} + y^{k}) .

s^{k + 1} = Π_{K} (α \tilde{s}^{k + 1} + (1 - α) s^{k} + \frac{1}{ρ} y^{k}) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

COSMO: A conic operator splitting method

for convex conic problems

Michael Garstka∗

Mark Cannon∗

Paul Goulart The authors are with the Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ, UK. Email: {michael.garstka, mark.cannon, paul.goulart}@eng.ox.ac.uk

Abstract

This paper describes the Conic Operator Splitting Method (COSMO) solver, an operator splitting algorithm for convex optimisation problems with quadratic objective function and conic constraints. At each step the algorithm alternates between solving a quasi-definite linear system with a constant coefficient matrix and a projection onto convex sets. The low per-iteration computational cost makes the method particularly efficient for large problems, e.g. semidefinite programs that arise in portfolio optimisation, graph theory, and robust control. Moreover, the solver uses chordal decomposition techniques and a new clique merging algorithm to effectively exploit sparsity in large, structured semidefinite programs. A number of benchmarks against other state-of-the-art solvers for a variety of problems show the effectiveness of our approach. Our Julia implementation is open-source, designed to be extended and customised by the user, and is integrated into the Julia optimisation ecosystem.

1 Introduction

We consider convex optimisation problems in the form

[TABLE]

where we assume that both the objective function $f:\mathbb{R}^{n}\to\mathbb{R}$ and the inequality constraint functions $g_{i}:\mathbb{R}^{n}\to\mathbb{R}$ are convex, and that the equality constraints $h_{i}(x):=a_{i}^{\top}x-b_{i}$ are affine. We will denote an optimal solution to this problem (if it exists) as $x^{*}$ . Convex optimisation problems feature heavily in a wide range of research areas and industries, including problems in machine learning [CV95], finance [BBD*+*17], optimal control [BEGFB94], and operations research [BHH01]. Concrete examples of problems fitting the general form (1) include linear programming (LP), convex quadratic programming (QP), second-order cone programming (SOCP), and semidefinite programming (SDP) problems. Methods to solve each of these standard problem classes are well known and a number of open- and closed-source solvers are widely available. However, the trend for data and training sets of increasing size in decision making problems and machine learning poses a challenge for state-of-the-art software.

Algorithms for LPs were first used to solve military planning and allocation problems in the 1940s [Dan63]. In 1947 Danzig developed the simplex method that solves LPs by searching for the optimal solution along the vertices of the inequality polytope. Extensions to the method led to the general field of active-set methods [Wol59] that are able to solve both LPs and QPs, and which search for an optimal point by iteratively constructing a set of active constraints. Although often efficient in practice, a major theoretical drawback is that the worst-case complexity increases exponentially with the problem size [WN99].

The most common approach taken by modern convex solvers is the * interior-point method* [WN99], which stems from Karmarkar’s original projective algorithm [Kar84], and is able to solve both LPs and QPs in polynomial time. Interior point methods have since been extended to problems with positive semidefinite (PSD) constraints in [HRVW96] and [AHO98]. The primal-dual interior point methods apply variants of Newton’s method to iteratively find a solution to a set of optimality KKT conditions. At each iteration the algorithm alternates between a Newton step that involves factoring a Jacobian matrix and a line search to determine the magnitude of the step to ensure a feasible iterate. Most notably, the Mehrotra predictor-corrector method in [Meh92] forms the basis of several implementations because of its strong practical performance [Wri97]. However, interior-point methods typically do not scale well for very large problems since the Jacobian matrix has to be calculated and factored at each step.

Two main approaches to overcome this limitation are active research areas. Firstly, a renewed focus on first-order methods with computationally cheaper per-iteration-cost, and secondly the exploitation of sparsity in the problem data. First-order methods are known to handle larger problems well at the expense of reduced accuracy compared to interior-point methods. In the 1960s Everett [Eve63] proposed a dual decomposition method that allows one to decompose a separable objective function, making each iteration cheaper to compute. Augmented Lagrangian methods by Miele ([MCIL71, MCL71, MMLC72]), Hestenes [Hes69], and Powell [Pow69] are more robust and helped to remove the strict convexity conditions on problems, while losing the decomposition property. By splitting the objective function, the alternating direction method of multipliers (ADMM), first described in [GM75], allowed the advantages of dual decomposition to be combined with the superior convergence and robustness of augmented Lagrangian methods. Subsequently, it was shown that ADMM can be analysed from the perspective of monotone operators and that it is a special case of Douglas-Rachford splitting [Eck89] as well as of the proximal point algorithm in [Roc76], which allowed further insight into the method.

ADMM methods are simple to implement and computationally cheap, even for large problems. However, they tend to converge slowly to a high accuracy solution and the detection of infeasibility is more involved compared to interior-point methods. They are therefore most often used in applications where a modestly accurate solution is sufficient [PB*+*14]. Most of the early advances in first-order methods such as ADMM happened in the 1970s/80s long before the demand for large scale optimisation, which may explain why they stayed less well-known and have only recently resurfaced.

A method of exploiting the sparsity pattern of PSD constraints in an interior-point algorithm was developed in [FKMN01]. This work showed that if the coefficient matrices of an SDP exhibit an aggregate sparsity pattern represented by a chordal graph, then both primal and dual problems can be decomposed into a problem with many smaller PSD constraints on the nonzero blocks of the original matrix variable. These blocks are associated with subsets, called cliques, of graph vertices. Moreover, it can be advantageous to merge some of these blocks, for example if they overlap significantly. The optimal way to merge the blocks, or equivalently the corresponding graph cliques, after the initial decomposition depends on the solver algorithm and is still an open question. Sparsity in SDPs has also been studied for problems with underlying graph structure, e.g. optimal power-flow problems in [MHLD13] and graph optimisation problems in [Ali95].

Related Work

Widely used solvers for conic problems, especially SDPs, include SeDuMi [Stu99], SDPT3 [TTT99] (both open source, MATLAB), and MOSEK [MOS17] (commercial, C) among others. All of these solvers implement primal-dual interior-point methods.

Both Fukuda [FKMN01] and Sun [SAV14] developed interior-point solvers that exploit chordal sparsity patterns in PSD constraints. Some heuristic methods to merge cliques have been proposed for interior-point methods in [MHLD13] and been implemented in the SparseCoLO package [FKK*+*09] and the CHOMPACK package [AV15].

Several solvers based on the ADMM method have been released recently. The solver OSQP [SBG*+*18] is implemented in C and detects infeasibility based on the differences of the iterates [BGSB17], but only solves LPs and QPs. The C-based SCS [OCPB16] implements an operator splitting method that solves the primal-dual pair of conic programs in order to provide infeasibility certificates. The underlying homogeneous self-dual embedding method has been extended by [ZFP*+*19] to exploit sparsity and implemented in the MATLAB solver CDCS. The conic solvers SCS and CDCS are unable to handle quadratic cost functions directly. Instead they are forced to reformulate problem with quadratic objective functions by adding a second-order cone constraint, which increases the problem size. Moreover, they rely on primal-dual formulations to detect infeasibility.

Outline

In Section 2 we define the general conic problem format, its dual problem, as well as optimality and infeasibility conditions. Section 3 describes the ADMM algorithm that is used by COSMO. Section 4 explains how to decompose SDPs in a preprocessing step provided the problem data has an aggregated sparsity pattern. In Section 5 we describe a new clique merging strategy and compare it to existing approaches. Implementation details and code related design choices are discussed in Section 6. Section 7 shows benchmark results of COSMO vs. other state-of-the art solvers on a number of test problems. Section 8 concludes the paper.

Contributions

With the solver package described in this paper we make the following contributions:

We implement a first-order method for large conic problems that is able to detect infeasibility without the need of a homogeneous self-dual embedding. 2. 2.

COSMO directly supports quadratic objective functions, thus reducing overheads for applications with both quadratic objective function and PSD constraints. This also avoids a major disadvantage of conic solvers compared to native QP solvers, i.e no additional matrix factorisation for the conversion is needed and favourable sparsity in the objective can be maintained. 3. 3.

Large structured positive semidefinite programs are analysed and, if possible, chordally decomposed. This typically allows one to solve very large sparse and structured problems orders of magnitudes faster than competing solvers. For complex sparsity patterns, further performance improvements are achieved by recombining some of the sub-blocks of the initial decomposition in an optimal way. For this purpose, we propose a new clique graph based merging strategy and compare it to existing heuristic approaches. 4. 4.

The open-source solver is written in a modular way in the fast and flexible programming language Julia. The design allows users to extend the solver by specifying a specific linear system solver and by defining their own convex cones or custom projection methods.

Notation

The following notation and definitions will be used throughout this paper. Denote the space of real numbers $\mathbb{R}$ , the n-dimensional real space $\mathbb{R}^{n}$ , the n-dimensional zero cone $\{0\}^{n}$ , the nonnegative orthant $\mathbb{R}_{+}^{n}$ , the space of symmetric matrices $\mathbb{S}^{n}$ , and the set of positive semidefinite matrices $\mathbb{S}^{n}_{+}$ .

In some of the following sections matrix data is considered in vectorized form. Denote the vectorization of a matrix $X$ by stacking it columns as $x\coloneqq\text{vec}(X)$ and the inverse operation as $\text{vec}^{-1}(X)=\text{mat}(x)$ . For symmetric matrices it is often computationally beneficial to work only with the upper-triangular elements of the matrix. Denote the transformation of a symmetric matrix $V\in\mathbb{S}^{n}$ with i,j-th element $V_{ij}$ into a vector of upper-triangular elements as

[TABLE]

Here the scaling factor of $\sqrt{2}$ preserves the matrix inner product, i.e. $\mathrm{tr}(AB)=\mathrm{svec}(A)^{\top}\mathrm{svec}(B)$ for symmetric matrices $A,B\in\mathbb{S}^{n}$ . The inverse operation is denoted by $\mathrm{svec}^{-1}(s)\eqqcolon\mathrm{smat}(s)$ . Denote the Kronecker product of two matrices $A$ and $B$ as $A\otimes B$ . The Frobenius norm of a matrix $A$ is given by $||A||_{F}=\bigl{(}\sum_{ij}|a_{ij}|^{2})\bigr{)}^{1/2}$ .

Sometimes we consider positive semidefinite constraints in vector form, so we define the space of vectorized symmetric positive semidefinite matrices as

[TABLE]

For a convex cone $\mathcal{K}$ denote the polar cone by

[TABLE]

the normal cone of $\mathcal{K}$ by

[TABLE]

and, following [Roc70], the recession cone of $\mathcal{K}$ by

[TABLE]

The proximal operator of a convex, closed and proper function $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}$ is given by

[TABLE]

We denote the indicator function of a nonempty, closed convex set $\mathcal{C}\subseteq\mathbb{R}^{n}$ by

[TABLE]

and the projection of $x\in\mathbb{R}^{n}$ onto $\mathcal{C}$ by:

[TABLE]

We further denote the support function of $\mathcal{C}$ by:

[TABLE]

2 Conic Problems

We will address convex optimisation problems with a quadratic objective function and a number of conic constraints in the form:

[TABLE]

where $x\in\mathbb{R}^{n}$ is the primal decision variable and $s\in\mathbb{R}^{m}$ is the primal slack variable. The objective function is defined by positive semidefinite matrix $P\in\mathbb{S}^{n}_{+}$ and vector $q\in\mathbb{R}^{n}$ . The constraints are defined by matrix $A\in\mathbb{R}^{m\times n}$ , vector $b\in\mathbb{R}^{m}$ and a non-empty, closed, convex cone $\mathcal{K}$ which itself can be a Cartesian product of cones in the form

[TABLE]

with cone dimensions $\sum_{i=1}^{N}m_{i}=m$ . Note that any LP, QP, SOCP, or SDP can be written in the form (3) using an appropriate choice of cones, as well as problems involving the power or exponential cones and their duals.

The dual problem associated with (3) is given by:

[TABLE]

with dual variable $y\in\mathbb{R}^{m}$ .

The conditions for optimality (assuming linear independence constraint qualification) follow from the Karush-Kuhn-Tucker (KKT) conditions:

[TABLE]

Assuming strong duality, if there exists a $x^{*}\in\mathbb{R}^{n}$ , $s^{*}\in\mathbb{R}^{m}$ , and $y^{*}\in\mathbb{R}^{m}$ that fulfil (6a)–(6c) then the pair $(x^{*},s^{*})$ is called the primal solution and $y^{*}$ is called the dual solution of problem (3).

2.1 Infeasibility certificates

Primal and dual infeasibility conditions were developed for ADMM in [BGSB17]. These conditions are directly applicable to problems of the form (3). To simplify the notation of the conditions, define the cone $\bar{\mathcal{K}}\coloneqq\mathcal{-K}+\{b\}$ . Then, the following sets provide certificates for primal and dual infeasibility:

[TABLE]

The existence of some $y\in\mathcal{D}$ certifies that problem (3) is primal infeasible, while the existence of some $x\in\mathcal{P}$ certifies dual infeasibility.

3 ADMM Algorithm

We use the same splitting as in [SBG*+*18] to transform problem (3) into standard ADMM format. The problem is rewritten by introducing the dummy variables $\tilde{x}=x$ and $\tilde{s}=s$ :

[TABLE]

where the indicator functions of the sets $\{(x,s)\in\mathbb{R}^{n}\times\mathbb{R}^{m}\mid Ax+s=b\}$ and $\mathcal{K}$ were used to move the constraints of (3) into the objective function. The augmented Lagrangian of (9) is given by

[TABLE]

with step size parameters $\rho>0$ and $\sigma>0$ and dual variables $\lambda\in\mathbb{R}^{n}$ and $y\in\mathbb{R}^{m}$ . The corresponding ADMM iteration is given by:

[TABLE]

where we relaxed the $z$ -update and the dual variable update with relaxation parameter $\alpha\in(0,2)$ according to [EB92]. Notice from (11b) and (11d) that the dual variable corresponding to the constraint $x=\tilde{x}$ satisfies $\lambda^{k}=0\text{ for all }k$ .

3.1 Solution of the equality constrained QP

The minimization problem in (11a) has the form of an equality-constrained quadratic program:

[TABLE]

The solution of (12) can be obtained by solving a single linear system. The corresponding Lagrangian is given by:

[TABLE]

where the Lagrangian multiplier $\nu\in\mathbb{R}^{m}$ accounts for the equality-constraint $Ax+s=b$ . Thus, the KKT optimality conditions for this equality constrained QP are given by:

[TABLE]

Elimination of $\tilde{s}^{k+1}$ from these equations leads to the linear system:

[TABLE]

with

[TABLE]

Note that the introduction of the dummy variable $\tilde{x}$ led to the term $\sigma I$ in the upper-left corner of the coefficient matrix in (17). Consequently, the coefficient matrix in (17) is always quasi-definite [Van95], i.e. it always has a positive definite upper-left block and a negative definite lower-right block, and is therefore full rank even when $P=0$ or $A$ is rank deficient. Following [Van95] the left hand side of (17) always has a well-defined $LDL^{\top}$ factorization with a diagonal $D$ .

3.2 Projection step

As shown in [PB*+*14] the minimization problem in (11c) can be interpreted as the $\rho$ -weighted proximal operator of the indicator function $I_{\mathcal{K}}$ . It is therefore equivalent to the Euclidean projection $\Pi_{\mathcal{K}}$ onto the cone $\mathcal{K}$ , i.e.

[TABLE]

If $\mathcal{K}$ is a Cartesian product of cones as in (4) this projection is equivalent to the projection of the relevant components of the argument of $\Pi_{\mathcal{K}}(\cdot)$ onto each cone $\mathcal{K}_{i}$ . A problem with $N$ SDP constraints therefore requires $N$ projections, but, since each of these operates on an independent segment of the input vector, they can be performed in parallel.

3.3 Algorithm steps

The calculations performed at each iteration are summarized in Algorithm 1.

Observe that the coefficient matrix of the linear system in (17) is constant, so that one can precompute and cache the $LDL^{\top}$ factorization and efficiently evaluate line 2 with changing right hand sides. A refactorisation is only necessary if the step size parameters $\rho$ and $\sigma$ are updated.

Lines 3, 4, and 6 are computationally inexpensive since they involve only vector addition and scalar-vector multiplication. The projection in line 5 is crucial to the performance of the algorithm depending on the particular cones employed in the model: projections onto the zero-cone or the nonnegative orthant are inexpensive, while a projection onto the positive-semidefinite cone of dimension $N$ involves an eigenvalue decomposition. Since direct methods for eigen-decompositions have a complexity of approximately $\mathcal{O}(N^{3})$ , this turns line 5 into the most computationally expensive operation of the algorithm for large SDPs, and improving the efficiency of this step will be the objective of much of Sections 4 and 5.

3.4 Algorithm convergence

For feasible problems, Algorithm 1 produces a sequence of iterates $(x^{k},s^{k},y^{k})$ that converges to a limit satisfying the optimality conditions in (6) as $k\rightarrow\infty$ . The authors in [SBG*+*18] show convergence of Algorithm 1 by applying the Douglas-Rachford splitting to a problem reformulation.

In the Douglas-Rachford formulation, lines 5 and 6 become projections onto $\mathcal{K}$ and $\mathcal{K}^{\circ}$ respectively, so that the conic constraints in (6c) always hold. Furthermore, convergence of the residual iterates in (6a)–(6b) can be concluded from the convergence of the splitting variables

[TABLE]

which generally holds for Douglas-Rachford splitting [BC11].

For infeasible problems, [BGSB17] showed that Algorithm 1 leads to convergence of the successive differences between iterates

[TABLE]

For primal infeasible problems $\delta y=\lim_{k\rightarrow\infty}\delta y^{k}$ will satisfy condition (8), whereas for dual infeasible problems $\delta x=\lim_{k\rightarrow\infty}\delta x^{k}$ is a certificate of (7).

3.5 Scaling the problem data

The rate of convergence of ADMM and other first-order methods depends in practice on the scaling of the problem data; see [GB14]. Particularly for badly conditioned problems, this suggests a preprocessing step where the problem data is scaled in order to improve convergence. For certain problem classes an optimal scaling has been found, see [GB14, GB15, Gre97]. However, the computation of the optimal scaling is often more complicated than solving the original problem. Consequently, most algorithms rely on heuristic methods such as matrix equilibration.

We scale the equality constraints by diagonal positive definite matrices $D$ and $U$ . The scaled form of (3) is given by:

[TABLE]

with scaled problem data

[TABLE]

and the scaled convex cone $U\mathcal{K}\coloneqq\{Uv\in\mathbb{R}^{m}\mid v\in\mathcal{K}\}$ . After solving (22) the original solution is obtained by reversing the scaling:

[TABLE]

One heuristic strategy that has been shown to work well in practice is to choose the scaling matrices $D$ and $U$ to equilibrate, i.e. reduce the condition number of, the problem data. The Ruiz equilibration technique described in [Rui01] iteratively scales the rows and columns of a matrix to have an infinity-norm of $1$ and converges linearly. We apply the modified Ruiz algorithm shown in Algorithm 2 to reduce the condition number of the symmetric matrix $R$

[TABLE]

which represents the problem data.

Since $R$ is symmetric it suffices to consider the columns $R_{c,i}$ of $R$ . At each iteration the scaling routine calculates the norm of each column. For the columns with norms higher than the tolerance $\tau$ scaling vector $c$ is updated with the inverse square root of the norm111For the presented results a value of $\tau=10^{-6}$ was chosen. . If the norm is below the tolerance, the corresponding column will be scaled by 1.

Since the matrix $U$ scales the (possibly composite) cone constraint, the scaling must ensure that if $s\in\mathcal{K}$ then $U^{-1}s\in\mathcal{K}$ . Let $\mathcal{K}$ be a Cartesian product of $N$ cones as in (4) and partition $U$ into blocks

[TABLE]

with block $U_{i}\in\mathbb{R}^{m_{i}\times m_{i}}$ which scales the constraint corresponding to $\mathcal{K}_{i}$ . For each cone $\mathcal{K}_{i}\in\mathbb{R}^{m_{i}}$ that requires a scalar or symmetric scaling, e.g. a second-order cone or positive semidefinite cone, the corresponding block $U_{i}$ is replaced with

[TABLE]

where the mean value of the diagonal entries of the original block in $U$ , $u_{i}=\text{tr}(U_{i})/m_{i}$ , was chosen as a heuristic scaling factor.

3.6 Termination criteria

The termination criteria discussed in this section are based on the unscaled problem data and iterates. Thus, before checking for termination the solver first reverses the scaling according to equations (23)–(24). To measure the progress of the algorithm, we define the primal and dual residuals of the problem as:

[TABLE]

According to Section 3.3 of [BPC*+*11] a valid termination criterion is that the size of the norms of the residual iterates in (28) are small. Our algorithm terminates if the residual norms are below the sum of an absolute and a relative tolerance term:

[TABLE]

where $\epsilon_{\mathrm{abs}}$ and $\epsilon_{\mathrm{rel}}$ are user defined tolerances.

Following [BGSB17], the algorithm determines if the one-step differences $\delta x^{k}$ and $\delta y^{k}$ of the primal and dual variable fulfil the normalized infeasibility conditions (8)–(7) up to certain tolerances $\epsilon_{\text{p,inf}}$ and $\epsilon_{\mathrm{d,inf}}$ . The solver returns a primal infeasibility certificate if

[TABLE]

holds and a dual infeasibility certificate if

[TABLE]

holds.

4 Chordal Decomposition

As noted in Section 3.3, for large SDPs the eigen-decomposition in the projection step (19) is the principal performance bottleneck for the algorithm. However, since large-scale SDPs often exhibit a certain structure or sparsity pattern, a sensible strategy is to exploit any such structure to alleviate this bottleneck. If the aggregated sparsity pattern is chordal, Agler’s [AHMR88] and Grone’s [GJSW84] theorems can be used to decompose a large PSD constraint into a collection of smaller PSD constraints and additional coupling constraints. The projection step applied to the set of smaller PSD constraints is usually significantly faster than when applied to the original constraint. Since the projections are independent of each other, further performance improvement can be achieved by carrying them out in parallel. Our approach is to apply chordal decomposition to a standard form SDP in the form (3) to produce a decomposed problem, and then transform the resulting decomposed problem back to a problem in the form (3) but with more variables and a collection of smaller PSD cone constraints. This process allows us to apply chordal decomposition as a preprocessing step before the problem is handed to the solver. As discussed in Section 4.2, a chordal sparsity pattern can be imposed on any PSD constraint.

4.1 Graph preliminaries

In the following we define graph-related concepts that are useful to describe the sparsity structure of a problem. A good overview of this topic is given in the survey paper [VA*+*15]. Consider the undirected graph $G(V,E)$ with vertex set $V=\{1,\ldots,n\}$ and edge set $E\subseteq V\times V$ . If all vertices are pairwise adjacent, i.e. $E=\left\{\{v,u\}\mid v,u\in V,\,v\neq u\right\}$ , the graph is called complete. We follow the convention of [VA*+*15] by defining a clique as a subset of vertices $\mathcal{C}\subseteq V$ that induces a maximal complete subgraph of $G$ . The number of vertices in a clique is given by the cardinality $|\mathcal{C}|$ . A cycle is a path of edges (i.e. a sequence of distinct edges) joining a sequence of vertices in which only the first and last vertices are repeated.

The following decomposition theory relies on a subset of graphs that exhibit the important property of chordality. A graph $G$ is chordal (or triangulated) if every cycle of length greater than three has a chord, which is an edge between nonconsecutive vertices of the cycle. A non-chordal graph can always be made chordal by adding extra edges.

An undirected graph with $n$ vertices can be used to represent the sparsity pattern of a symmetric matrix $S\in\mathbb{S}^{n}$ . Every nonzero entry $S_{ij}\neq 0$ in the lower triangular part of the matrix introduces an edge $(i,j)\in E$ . An example of a sparsity pattern and the associated graph is shown in Figure 1(a–b).

For a given sparsity pattern $G(V,E)$ , we define the following symmetric sparse matrix cones:

[TABLE]

Given the definition in (32) and a matrix $S\in\mathbb{S}^{n}\left(E,0\right)$ , note that the diagonal entries $S_{ii}$ and the off-diagonal entries $S_{ij}$ with $(i,j)\in E$ may be zero or nonzero. Moreover, we define the cone of positive semidefinite completable matrices as

[TABLE]

For a matrix $Y\in\mathbb{S}_{+}^{n}(E,?)$ we can find a positive semidefinite completion by choosing appropriate values for all entries $(i,j)\notin E$ . An algorithm to find this completion is described in [VA*+*15]. An important structure that conveys a lot of information about the nonzero blocks of a matrix, or equivalently the cliques of a chordal graph, is the clique tree (or junction tree). For a chordal graph $G$ let $\mathcal{B}=\{\mathcal{C}_{1},\ldots,\mathcal{C}_{p}\}$ be the set of cliques. The clique tree $\mathcal{T}(\mathcal{B},\mathcal{E})$ is formed by taking the cliques as vertices and by choosing edges from $\mathcal{E}\subseteq\mathcal{B}\times\mathcal{B}$ such that the tree satisfies the running-intersection property:

Definition 4.1 (Running intersection property).

For each pair of cliques $\mathcal{C}_{i}$ , $\mathcal{C}_{j}\in\mathcal{B}$ , the intersection $\mathcal{C}_{i}\cap\mathcal{C}_{j}$ is contained in all the cliques on the path in the clique tree connecting $\mathcal{C}_{i}$ and $\mathcal{C}_{j}$ .

This property is also referred to as the clique-intersection property in [NFF*+*03] and the induced subtree property in [VA*+*15]. For a given chordal graph, a clique tree can be computed using the algorithm described in [PS90]. The clique tree for an example sparsity pattern is shown in Figure 1(c).

In a post-ordered clique tree the descendants of a node are given consecutive numbers, and a suitable post-ordering can be found via depth-first search. For a clique $\mathcal{C}_{\ell}$ we refer to the first clique encountered on the path to the root as its parent clique $\mathcal{C}_{\mathrm{par}}$ . Conversely $\mathcal{C}_{\ell}$ is called the child of $\mathcal{C}_{\mathrm{par}}$ . If two cliques have the same parent clique we refer to them as siblings.

We define the function $\mathrm{par}:2^{V}\to 2^{V}$ and the multivalued function $\mathrm{ch}:2^{V}\rightrightarrows 2^{V}$ such that $\mathrm{par}(\mathcal{C}_{\ell})$ and $\mathrm{ch}(\mathcal{C}_{\ell})$ return the parent clique and set of child cliques of $\mathcal{C}_{\ell}$ , respectively, where $2^{V}$ is the power set (set of all subsets) of $V$ .

Note that each clique in Figure 1(c) has been partitioned into two sets. The upper row represents the separators $\eta_{\ell}=\mathcal{C}_{\ell}\cap\mathrm{par}(\mathcal{C}_{\ell})$ , i.e. all clique elements that are also contained in the parent clique. We call the sets of the remaining vertices shown in the lower rows the clique residuals or supernodes $\nu_{\ell}=\mathcal{C}_{\ell}\setminus\eta_{\ell}$ . Keeping track of which vertices in a clique belong to the supernode and the separator is useful as the information is needed to perform a positive semidefinite completion. Following the authors in [HS12], we say that two cliques $\mathcal{C}_{i}$ , $\mathcal{C}_{j}$ form a separating pair $\mathcal{P}_{ij}=\{\mathcal{C}_{i},\mathcal{C}_{j}\}$ if their intersection is non-empty and if every path in the underlying graph $G$ from a vertex $\mathcal{C}_{i}\setminus\mathcal{C}_{j}$ to a vertex $\mathcal{C}_{j}\setminus\mathcal{C}_{i}$ contains a vertex of the intersection $\mathcal{C}_{i}\cap\mathcal{C}_{j}$ .

4.2 Agler’s and Grone’s theorems

To explain how the concepts in the previous section can be used to decompose a positive semidefinite constraint, we first consider an SDP in standard primal form:

[TABLE]

with variable $X$ and coefficient matrices $A_{k},C\in\mathbb{S}^{n}$ . The corresponding dual problem is

[TABLE]

with dual variable $y\in\mathbb{R}^{m}$ and slack variable $S$ . Let us assume that the problem matrices in (35) and (36) each have their own sparsity pattern

[TABLE]

The aggregate sparsity of the problem is given by the graph $G(V,E)$ with edge set

[TABLE]

In general $G(V,E)$ will not be chordal, but a chordal extension can be found by adding edges to the graph. We denote the extended graph as $G(V,\bar{E})$ , where $E\subseteq\bar{E}$ . Finding the minimum number of additional edges required to make the graph chordal is an NP-complete problem [Yan81]. Consider a [math]– $1$ matrix $M$ with edge set $E$ . A commonly used heuristic method to compute the chordal extension is to perform a symbolic Cholesky factorisation of $M$ [FKMN01], with the edge set of the Cholesky factor then providing the chordal extension $\bar{E}$ . To simplify the notation in the remainder of the article, we henceforward assume that $E$ represents a chordal graph or has been appropriately extended.

Using the aggregate sparsity of the problem we can modify the constraints on the matrix variables in (35) and (36) to be in the respective sparse positive semidefinite matrix spaces:

[TABLE]

We further define the entry-selector matrices $T_{\ell}\in\mathbb{R}^{|\mathcal{C}_{\ell}|\times n}$ for a clique $\mathcal{C}_{\ell}$ as

[TABLE]

where $\mathcal{C}_{\ell}(i)$ is the $i$ th vertex of $\mathcal{C}_{\ell}$ . We can express the constraints in (37) in terms of multiple smaller coupled constraints using Grone’s and Agler’s theorems.

Theorem 1 (Grone’s theorem [GJSW84]).

Let $G(V,E)$ be a chordal graph with a set of maximal cliques $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{p}\}$ . Then $X\in\mathbb{S}_{+}^{n}(E,?)$ if and only if

[TABLE]

for all $\ell=1,\ldots,p$ .

Applying this theorem to (35) while restricting $X$ to the positive semidefinite completable matrix cone as in (37) yields the decomposed problem:

[TABLE]

For the dual problem we utilise Agler’s theorem, which is the dual to Theorem 1:

Theorem 2 (Agler’s theorem [AHMR88]).

Let $G(V,E)$ be a chordal graph with a set of maximal cliques $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{p}\}$ . Then $S\in\mathbb{S}^{n}_{+}(E,0)$ if and only if there exist matrices $S_{\ell}\in\mathbb{S}_{+}^{|\mathcal{C}_{\ell}|}$ for $\ell=1,\ldots,p$ such that

[TABLE]

Note that the matrices $T_{\ell}$ serve to extract the submatrix $S_{\ell}$ such that $S_{\ell}=T_{\ell}ST_{\ell}^{\top}$ has rows and columns corresponding to the vertices of the clique $C_{\ell}$ . With this theorem, we transform the dual form SDP in (36) with the restriction on $S$ in (37) to arrive at:

[TABLE]

4.3 Returning a decomposed problem into standard SDP form

After the decomposition results of Section 4.2 have been applied, the SDP problem (42) has to be transformed back into standard form (3). For the undecomposed problem in (36), this is achieved by first relabeling $y$ as $x$ . We then choose $P=0_{n\times n}$ , $q=-b$ , define $A=[\mathrm{svec}({A_{1}}),\ldots,\mathrm{svec}({A_{m}})]$ , $b=\mathrm{svec}({C})$ and $s=\mathrm{svec}({S})$ .

To transform the decomposed dual problem (42), we will make use of the fact that the decision variable $S\in\mathbb{S}^{n}$ and all the submatrices $S_{\ell}$ are symmetric and consider instead $s_{\ell}=\mathrm{svec}(S_{\ell})$ . The main challenge is to keep track of the overlapping entries in the individual blocks and ensure they sum to the original entry in $S$ . Different possible transformations achieving this are described in [KKMY11].

All the necessary information about the overlapping entries is stored in the clique tree $\mathcal{T}(\mathcal{B},\mathcal{E})$ that represents the sparsity pattern of $S$ . We assume that the clique tree is post-ordered with cliques $\mathcal{C}_{1},\ldots,\mathcal{C}_{p}$ . Define a vector to represent slack variables for overlapping entries $\theta\coloneqq[\theta_{1},\ldots,\theta_{n_{o}}]^{\top}$ , where $n_{o}\coloneqq\sum_{\ell=1}^{p}\frac{\left|\eta_{\ell}\right|(\left|\eta_{\ell}\right|+1)}{2}$ is the total number of overlapping entries in the upper triangle of the sparsity pattern. The $s$ vector of the decomposed problem is created by stacking the vectorized subblocks $s_{\ell}$ according to the postordering of the clique tree. Define $\omega(i,j,\ell)$ as the index of $s$ corresponding to the $(i,j)$ th element of block $S_{\ell}$ . The equality constraint in problem (42) can be represented in the equivalent standard form (3) as:

[TABLE]

with

[TABLE]

The matrices $\tilde{A}_{k}^{\ell}$ and $\tilde{B}^{\ell}$ take on the values of $A_{k}$ and $B$ in the locations corresponding to the elements in the submatrix $S_{\ell}$ . If a matrix entry of clique $\mathcal{C}_{l}$ in $A_{k,ij}^{\ell}$ and $B_{ij}^{\ell}$ is overlapped by an entry of the parent clique, i.e. both $(i,j)$ are included in $\mathrm{sep}({\mathcal{C}_{l}})$ , it is set to zero. Each column of the linking matrix $L\in\mathbb{R}^{m_{d}\times n_{o}}$ links one overlapping entry in the clique tree. $L$ is generated by first collecting all the matrix indices $(i,j)_{\ell}$ of the separators $\mathrm{sep}({\mathcal{C}_{\ell}})$ in the clique tree:

[TABLE]

Then $L$ is constructed column-by-column, each representing one overlapping entry. The column vector $l_{c}$ is equal to $1$ in the row $r$ corresponding to the $(i,j)_{\ell}$ -th entry of $S_{\ell}$ and $-1$ in the row corresponding to the same entry of the parent block $S_{k}$ , where $\mathcal{C}_{\ell}=\mathrm{ch}(\mathcal{C}_{k})$ . Thus, each element in $l_{c}$ is defined by:

[TABLE]

As an example consider a problem with $m=1$ and $p=3$ and the simple sparsity pattern and clique tree given by:

Using the transformation in (43) this constraint can be represented by three constraints on the submatrices $S_{1}$ , $S_{2}$ and $S_{3}$ and six overlap variables $\theta_{1},\dots,\theta_{6}$ :

[TABLE]

Notice how the overlap variables drop out if the original matrix $S$ is reassembled according to Theorem 2 by adding the entries of each subblock.

5 Clique Merging

Given an initial decomposition with (chordally completed) edge set $E$ and a set of cliques $\{\mathcal{C}_{1},\ldots,\mathcal{C}_{p}\}$ , we are free to re-merge any number of cliques back into larger blocks. This is equivalent to treating structural zeros in the problem as numerical zeros, leading to additional edges in the graph. Looking at the decomposed problem in (40) and (42), the effects of merging two cliques $\mathcal{C}_{i}$ and $\mathcal{C}_{j}$ are twofold:

We replace two positive semidefinite matrix constraints of dimensions $\left|\mathcal{C}_{i}\right|$ and $\left|\mathcal{C}_{j}\right|$ with one constraint on a larger clique with dimension $\left|\mathcal{C}_{i}\cup\mathcal{C}_{j}\right|$ , where the increase in dimension depends on the size of the overlap. 2. 2.

We remove consistency constraints for the overlapping entries between $\mathcal{C}_{i}$ and $\mathcal{C}_{j}$ , thus reducing the size of the linear system of equality constraints.

When merging cliques these two factors have to be balanced, and the optimal merging strategy depends on the particular SDP solution algorithm employed. In [NFF*+*03] and [SAV14] a clique tree is used to search for favourable merge candidates; We will refer to their two approaches as SparseCoLO and the parent-child strategy, respectively, in the following sections. We will then propose a new merging method in Section 5.2 whose performance is superior to both methods when used in ADMM. Given a merging strategy, Algorithm 3 describes how to merge a set of cliques within the set $\mathcal{B}$ and update the edge set $\mathcal{E}$ accordingly.

5.1 Existing clique tree-based strategies

The parent-child strategy described in [SAV14] traverses the clique tree in a depth-first order and merges a clique $\mathcal{C}_{\ell}$ with its parent clique $\mathcal{C}_{\mathrm{par}(\ell)}\coloneqq\mathrm{par}(\mathcal{C}_{\ell})$ if at least one of the two following conditions are met:

[TABLE]

with heuristic parameters $t_{\mathrm{fill}}$ and $t_{\mathrm{size}}$ . These conditions keep the amount of extra fill-in and the supernode cardinalities below the specified thresholds. The SparseCoLO strategy described in [NFF*+*03] and [FFN06] considers parent-child as well as sibling relationships. Given a parameter $\sigma>0$ , two cliques $\mathcal{C}_{i},\mathcal{C}_{j}$ are merged if the following merge criterion holds

[TABLE]

This approach traverses the clique tree depth-first, performing the following steps for each clique $\mathcal{C}_{\ell}$ :

For each clique pair $\left\{(\mathcal{C}_{i},\mathcal{C}_{j})\mid\mathcal{C}_{i},\mathcal{C}_{j}\in\mathrm{ch}\left(\mathcal{C}_{\ell}\right)\right\}$ , check if (49) holds, then:

•

$\mathcal{C}_{i}$ and $\mathcal{C}_{j}$ are merged, or

•

if $(\mathcal{C}_{i}\cup\mathcal{C}_{j})\supseteq\mathcal{C}_{\ell}$ , then $\mathcal{C}_{i}$ , $\mathcal{C}_{j}$ , and $\mathcal{C}_{\ell}$ are merged. 2. 2.

For each clique pair $\left\{\left(\mathcal{C}_{i},\mathcal{C}_{\ell}\right)\mid\mathcal{C}_{i}\in\mathrm{ch}\left(\mathcal{C}_{\ell}\right)\right\}$ , merge $\mathcal{C}_{i}$ and $\mathcal{C}_{\ell}$ if (49) is satisfied.

We note that the open-source implementation of the SparseCoLO algorithm described in [NFF*+*03] follows the algorithm outlined here, but also employs a few additional heuristics.

An advantage of these two approaches is that the clique tree can be computed easily and the conditions are inexpensive to evaluate. However, a disadvantage is that choosing parameters that work well on a variety of problems and solver algorithms is difficult. Secondly, clique trees are not unique and in some cases it is beneficial to merge cliques that are not directly related on the particular clique tree that was computed. To see this, consider a chordal graph $G(V,E)$ consisting of three connected subgraphs:

[TABLE]

and some additional vertices $\{1,2,m_{a}+1\}$ . The graph is connected as shown in Figure 2(a), where the complete subgraphs are represented as nodes $V_{a},V_{b},V_{c}$ . A corresponding clique tree is shown in Figure 2(b).

By choosing the cardinality $\left|V_{c}\right|$ , the overlap between cliques $\mathcal{C}_{1}=\{1,2\}\cup V_{c}$ and $\mathcal{C}_{3}=\{m_{a}+1\}\cup V_{c}$ can be made arbitrarily large while $\left|V_{a}\right|$ , $\left|V_{b}\right|$ can be chosen so that any other merge is disadvantageous. However, neither the parent-child strategy nor SparseCoLO would consider merging $\mathcal{C}_{1}$ and $\mathcal{C}_{3}$ since they are in a “nephew-uncle” relationship. In fact for the particular sparsity graph in Figure 2(a) eight different clique trees can be computed. Only in half of them do the cliques $\mathcal{C}_{1}$ and $\mathcal{C}_{3}$ appear in a parent-child relationship. Therefore, a merge strategy that only considers parent-child relationships would miss this favorable merge in half the cases.

5.2 A new clique graph-based strategy

To overcome the limitations of existing strategies we propose a merging strategy based on the reduced clique graph $\mathcal{G}(\mathcal{B},\xi)$ , which is defined as the union of all possible clique trees of $G$ ; see [HS12] for a detailed discussion. The set of vertices of this graph is given by the maximal cliques of the sparsity pattern. We then create the edge set $\xi$ by introducing edges between pairs of cliques $(\mathcal{C}_{i},\mathcal{C}_{j})$ if they form a separating pair $\mathcal{P}_{ij}$ . We remark that $\xi$ is a subset of the edges present in the clique intersection graph which is obtained by introducing edges for every two cliques that intersect. However, the reduced clique graph has the property that it remains a valid reduced clique graph of the altered sparsity pattern after performing a permissible merge between two cliques. This is not always the case for the clique intersection graph. For convenience, we will refer to the reduced clique graph simply as the clique graph in the following sections. Based on the permissibility condition for edge reduction in [HS12] we define a permissibility condition for clique merges:

Definition 5.1 (Permissible merge).

Given a reduced clique graph $\mathcal{G}(\mathcal{B},\xi)$ , a merge between two cliques $(\mathcal{C}_{i},\mathcal{C}_{j})\in\xi$ is permissible if for every common neighbour $\mathcal{C}_{k}$ it holds that $\mathcal{C}_{i}\cap\mathcal{C}_{k}=\mathcal{C}_{j}\cap\mathcal{C}_{k}$ .

We further define a monotone edge weighting function $e\colon 2^{V}\times 2^{V}\rightarrow\mathbb{R}$ that assigns a weight $w_{ij}$ to each edge $(\mathcal{C}_{i},\mathcal{C}_{j})\in\xi$ :

[TABLE]

This function is used to estimate the per-iteration computational savings of merging a pair of cliques depending on the targeted algorithm and hardware. It evaluates to a positive number if a merge reduces the per-iteration time and to a negative number otherwise. For a first-order method, whose per-iteration cost is dominated by an eigenvalue factorisation with complexity $\mathcal{O}\bigl{(}\left|\mathcal{C}\right|^{3}\bigr{)}$ , a simple choice would be:

[TABLE]

More sophisticated weighting functions can be determined empirically; see Section 7.5. After a weight has been computed for each edge $(\mathcal{C}_{i},\mathcal{C}_{j})$ in the clique graph, we merge cliques as outlined in Algorithm 4.

This strategy considers the edges in terms of their weights, starting with the permissible clique pair $(\mathcal{C}_{i},\mathcal{C}_{j})$ with the highest weight $w_{ij}$ . If the weight is positive, the two cliques are merged and the edge weights for all edges connected to the merged clique $\mathcal{C}_{m}=\mathcal{C}_{i}\cup\mathcal{C}_{j}$ are updated. This process continues until no edges with positive weights remain.

The clique graph for the clique tree in Figure 1(c) is shown in Figure 3(a) with the edge weighting function in (50). Following Algorithm 4 the edge with the largest weight is considered first and the corresponding cliques are merged, i.e. $\{3,6,7,8\}$ and $\{6,7,8,9\}$ . Note that the merge is permissible because both cliques intersect with the only common neighbour $\{4,5,8\}$ in the same way. The revised clique graph $\mathcal{G}(\hat{\mathcal{B}},\hat{\xi})$ is shown in Figure 3(b). Since no edges with positive weights remain, the algorithm stops.

After Algorithm 4 has terminated, it is possible to recompute a valid clique tree from the revised clique graph. This can be done in two steps. First, the edge weights in $\mathcal{G}(\hat{\mathcal{B}},\hat{\xi})$ are replaced with the cardinality of their intersection:

[TABLE]

Second, a clique tree is then given by any maximum weight spanning tree of the newly weighted clique graph, e.g. using Kruskal’s algorithm described in [Kru56].

Our merging strategy has some clear advantages over competing approaches. Since the clique graph covers a wider range of merge candidates, it will consider edges that do not appear in clique tree-based approaches such as the “nephew-uncle” example in Figure 2. Moreover, the edge weighting function allows one to make a merge decision based on the particular solver algorithm and hardware used. One downside is that this approach is more computationally expensive than the other methods. However, our numerical experiments show that the time spent on finding the clique graph, merging the cliques, and recomputing the clique tree represent only a very small fraction of the total computational savings relative to other merging methods when solving SDPs.

6 Open-Source Implementation

We have implemented our algorithm in the Conic Operator Splitting Method (COSMO), an open-source package written in Julia [BEKS17]. Julia allows the solver to be written in a flexible, modular and extensible way, while still maintaining the benefits of a fast compiled language. The source code and documentation are available at

https://github.com/oxfordcontrol/COSMO.jl.

COSMO offers the user two interfaces to describe the constraints of the optimisation problem: a direct interface, and an interface to the modelling languages JuMP [DHL17] and Convex.jl [UMZ*+*14]. These interfaces connect the solver to the Julia optimisation ecosystem which provide flexible problem description and automatic problem reformulation.

As shown in Algorithm 1 the two main steps of the algorithm are solving a linear system and projecting onto a Cartesian product of cones. The implementation of these two parts allows customisation by the user. For the solution of the linear system in (17) the user can either use the QDLDL [SBG*+*18] solver provided with COSMO, the standard sparse solver from SuiteSparse [D*+*15], or choose one of the provided interfaces to direct and indirect solvers, e.g. Pardiso [SGK*+*10, KAA*+*15], conjugate gradient, minimal residual method, or link their own implementation.

The second important part of the algorithm is the projection step onto a Cartesian product of convex sets. By default COSMO supports the zero cone, the nonnegative orthant, the hyperbox, the second-order cone, the PSD cone, the exponential cone and its dual, and the power cone and its dual. Our Julia implementation also allows the user to define their own convex cones222To allow infeasibility detection the user has to either define a convex cone, a convex compact set or a composition of the two. and custom projection functions. To implement a custom cone $\mathcal{K}_{c}$ the user has to provide:

•

a projection function that projects an input vector onto the cone

•

a function that determines if a vector is inside the dual cone $\mathcal{K}_{c}^{*}$

•

a function that determines if a vector is inside the recession cone of $-\mathcal{K}_{c}$

The latter two functions are required for our solver to implement checks for infeasibility as described in Sections 2.1 and 3.4. An example that shows the advantages of defining a custom cone is provided in Section 7.2.

The authors in [RGN19] used COSMO’s algorithm with a specialized implementation of the projection function for positive semidefinite constraints. The projection method used approximate matrix eigendecompositions to significantly reduce the projection time, while maintaining all the features of COSMO such as scaling, infeasibility detection and interfaces to linear system solvers. It was demonstrated that this can provide a significant, up to 20x, reduction in solve time.

Given a problem with multiple constraints, the projection step (19) can be carried out in parallel. This is particularly advantageous when used in combination with chordal decomposition, which typically yields a large number of smaller PSD constraints. For the eigendecomposition involved in the projection step of a PSD constraint, the LAPACK [ABB*+*99] function sveyr is used, which can also utilise multiple threads. Consequently, this leads to two-level parallelism in the computation, i.e on the higher level the projection functions are carried out in parallel and each projection function independently calls sveyr. Determining the optimal allocation of the CPU cores to each of these tasks depends on the number of PSD constraints and their dimensions and is a difficult problem. For the problem sets considered in section 7 we achieved the best performance by running sveyr single-threaded and using all physical CPUs to carry out the projection functions in parallel.

Moreover, Julia’s type abstraction features are used to enable the solver to solve problems of arbitrary floating-point precision. This allows for example to reduce the memory usage of the solver for very large problems by switching to 32-bit single-precision floating-point format.

7 Numerical Results

This section presents benchmark results of COSMO against the interior-point solver MOSEK v $9.0$ and the accelerated first-order ADMM solver SCS v $2.1.1$ . When applied to a quadratic program, COSMO’s main algorithm becomes very similar to the first-order QP solver OSQP. To test the performance penalty of using a pure Julia implementation against a similar C implementation we also compare our solver against OSQP v $0.6.0$ on QP problems.

We selected a number of problem sets to test different aspects of COSMO. The advantage of supporting a quadratic cost function in a conic solver is shown by solving QPs from the Maros and Mészáros QP repository [MM99] in Section 7.1 and SDPs with quadratic objectives in the form of nearest correlation matrix problems in Section 7.3.

To highlight the advantages of implementing custom constraints, we consider a problem set with doubly-stochastic matrices in Section 7.2. We then show how chordal decomposition can speed up the solver for SDPs that exhibit a block-arrow sparsity pattern in Section 7.4.

The performance of chordal decomposition is further explored in Section 7.5 by solving large structured problems from the SDPLib benchmark set [Bor99] as well as some non-chordal SDPs generated with sparsity patterns from the SuiteSparse Matrix Collections [DH11]. Using the same problems we additionally evaluate the performance of different clique merging strategies.

All the experiments were carried out on a computing node of the University of Oxford ARC-HTC cluster with 16 logical Intel Xeon E5-2560 cores and $64\text{\,}\mathrm{G}\mathrm{B}$ of DDR3 RAM. All the problems were run using Julia v $1.3$ and the problems were passed to the solvers via MathOptInterface [LDGL20].

To evaluate the accuracy of the returned solution we compute three errors adapted from the DIMACS error measures for SDPs [JPA00]:

[TABLE]

where $A_{a}$ , $b_{a}$ and $y_{a}$ correspond to the rows of $A$ , $b$ and $y$ that represent active constraints. This is to ensure meaningful values even if the problem contains inactive constraints with very large, or possibly infinite, values $b_{i}$ . The maximum of the three errors for each problem and solver is reported in the results below.

We configured COSMO, MOSEK, SCS and OSQP to achieve an accuracy of $\epsilon=$ {10}^{-3}$$. We set the maximum allowable solve time for the Maros and Mészáros problems to $5\text{\,}\mathrm{min}$ and to $30\text{\,}\mathrm{min}$ for the other problem sets. All other solver parameters were set to the solvers’ standard configurations. COSMO uses a Julia implementation of the QDLDL solver to factor the quasi-definite linear system. Similarly, we configured SCS to use a direct solver for the linear system.

7.1 Maros and Mészáros QP test set

The Maros and Mészáros test problem set [MM99] is a repository of challenging convex QP problems that is widely used to compare the performance of QP solvers. For comparison metrics we compute the failure rate, the number of fastest solve time and the normalized shifted geometric mean for each solver. The shifted geometric mean is more robust against large outliers (compared to the arithmetic mean) and against small outliers (compared to the geometric mean) and is commonly used in optimisation benchmarks; see [SBG*+*18, Mit]. The shifted geometric mean $\mu_{g,s}$ is defined as:

[TABLE]

with total solver time $t_{p,s}$ of solver $s$ and problem $p$ , shifting factor $\mathrm{sh}$ and size of the problem set $n$ . In the reported results a shifting factor of $sh=10$ was chosen and the maximum allowable time $t_{p,s}=$ 300\text{,}\mathrm{s}$$ was used if solver $s$ failed on problem $p$ . Lastly, we normalize the shifted geometric mean for solver $s$ by dividing by the geometric mean of the fastest solver. The failure rate $f_{r,s}$ is given by the number of unsolved problems compared to the total number of problems in the problem set. As unsolved problems we count instances where the algorithm does not converge within the allowable time or fails during the setup or solve phase. Table 7.1 shows the normalized shifted geometric mean and the failure rate for each solver. Additionally, the number of cases where solver $s$ was the fastest solver is shown.

Bibliography74

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[ABB + 99] Edward Anderson, Zhaojun Bai, Christian Bischof, L Susan Blackford, James Demmel, Jack Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, Alan Mc Kenney, et al. LAPACK Users’ guide . SIAM, 1999.
2[ADV 10] Martin S Andersen, Joachim Dahl, and Lieven Vandenberghe. Implementation of nonsymmetric interior-point methods for linear optimization over sparse matrix cones. Mathematical Programming Computation , 2(3-4):167–201, 2010.
3[AHMR 88] Jim Agler, William Helton, Scott Mc Cullough, and Leiba Rodman. Positive semidefinite matrices with a given sparsity pattern. Linear Algebra and its Applications , 107:101–149, 1988.
4[AHO 98] Farid Alizadeh, Jean-Pierre A Haeberly, and Michael L Overton. Primal-dual interior-point methods for semidefinite programming: convergence rates, stability and numerical results. SIAM Journal on Optimization , 8(3):746–768, 1998.
5[Ali 95] Farid Alizadeh. Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM Journal on Optimization , 5(1):13–51, 1995.
6[AV 15] M. S. Andersen and L. Vandenberghe. CHOMPACK: A python package for chordal matrix computations, 2015.
7[BBD + 17] Stephen Boyd, Enzo Busseti, Steve Diamond, Ronald N Kahn, Kwangmoo Koh, Peter Nystrup, Jan Speth, et al. Multi-period trading via convex optimization. Foundations and Trends® in Optimization , 3(1):1–76, 2017.
8[BC 11] HH Bauschke and PL Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces . Springer, 1 edition, 2011.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

COSMO: A conic operator splitting method

Abstract

1 Introduction

Related Work

Outline

Contributions

Notation

2 Conic Problems

2.1 Infeasibility certificates

3 ADMM Algorithm

3.1 Solution of the equality constrained QP

3.2 Projection step

3.3 Algorithm steps

3.4 Algorithm convergence

3.5 Scaling the problem data

3.6 Termination criteria

4 Chordal Decomposition

4.1 Graph preliminaries

Definition 4.1** (Running intersection property).**

4.2 Agler’s and Grone’s theorems

Theorem 1** (Grone’s theorem [GJSW84]).**

Theorem 2** (Agler’s theorem [AHMR88]).**

4.3 Returning a decomposed problem into standard SDP form

5 Clique Merging

5.1 Existing clique tree-based strategies

5.2 A new clique graph-based strategy

Definition 5.1** (Permissible merge).**

6 Open-Source Implementation

7 Numerical Results

7.1 Maros and Mészáros QP test set

Definition 4.1 (Running intersection property).

Theorem 1 (Grone’s theorem [GJSW84]).

Theorem 2 (Agler’s theorem [AHMR88]).

Definition 5.1 (Permissible merge).