Asynchronous parallel primal-dual block coordinate update methods for affinely constrained convex programs

Yangyang Xu

arXiv:1705.06391·math.OC·October 17, 2019

TL;DR

This paper introduces an asynchronous parallel primal-dual block coordinate update method for convex problems with nonseparable linear constraints, demonstrating convergence and improved speed-up over synchronous methods.

Contribution

It proposes a novel randomized primal-dual BCU method with adaptive stepsize for multi-block affinely constrained problems, extending async-parallel optimization to nonseparable constraints.

Findings

01. Convergence in probability to the optimal value and zero constraint residual.

02. Ergodic $O(1/k)$ convergence rate established.

03. Numerical experiments show superior speed-up compared to synchronous methods.

Abstract

Recent years have witnessed a surge of asynchronous (async-) parallel computing methods, driven by the extremely large data sets involved in many modern applications and by the advancement of multi-core machines and computer clusters. In optimization, most works on async-parallel methods address unconstrained problems or those with block separable constraints. In this paper, we propose an async-parallel method based on block coordinate update (BCU) for solving convex problems with a nonseparable linear constraint. Running on a single node, the method becomes a novel randomized primal-dual BCU with adaptive stepsize for multi-block affinely constrained problems. For these problems, Gauss-Seidel cyclic primal-dual BCU needs strong convexity to converge. In contrast, merely assuming convexity, we show that the objective value sequence generated by the proposed algorithm…

Tables

Table 1: Characteristics of two LIBSVM datasets

Name	#samples	#features	#nonzeros
rcv1	20,242	47,236	1,498,952
news20	19,996	1,355,191	9,097,916

Equations

$$\min_{\mathbf{x}}\ f({\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{m})+\sum_{i=1}^{m}g_{i}({\mathbf{x}}_{i}),\quad\mbox{s.t.}\ \sum_{i=1}^{m}{\mathbf{A}}_{i}{\mathbf{x}}_{i}={\mathbf{b}},$$

$$\min_{\mathbf{x}}\ \|{\mathbf{x}}\|_{1},\quad\mbox{s.t.}\ {\mathbf{A}}{\mathbf{x}}={\mathbf{b}}.$$

$$\min_{\mathbf{x}}\ \frac{1}{2}{\mathbf{x}}^{\top}\boldsymbol{\Sigma}{\mathbf{x}},\quad\mbox{s.t.}\ \sum_{i=1}^{m}x_{i}\leq 1,\ \sum_{i=1}^{m}\xi_{i}x_{i}\geq c,\ x_{i}\geq 0,\,\forall i.$$

$$\min_{\boldsymbol{\theta}}\ \frac{1}{2}{\boldsymbol{\theta}}^{\top}\mathrm{Diag}({\mathbf{y}}){\mathbf{X}}^{\top}{\mathbf{X}}\mathrm{Diag}({\mathbf{y}}){\boldsymbol{\theta}}-{\mathbf{e}}^{\top}{\boldsymbol{\theta}},\quad\mbox{s.t.}\ {\mathbf{y}}^{\top}{\boldsymbol{\theta}}=0,\ 0\leq\theta_{i}\leq C,\,\forall i,$$

$$\Phi(\bar{{\mathbf{x}}},{\mathbf{x}},{\boldsymbol{\lambda}})=F(\bar{{\mathbf{x}}})-F({\mathbf{x}})-\langle{\boldsymbol{\lambda}},{\mathbf{A}}\bar{{\mathbf{x}}}-{\mathbf{b}}\rangle.$$

$${\mathbf{prox}}_{\psi}({\mathbf{x}})=\operatorname*{arg\,min}_{\mathbf{y}}\ \psi({\mathbf{y}})+\frac{1}{2}\|{\mathbf{x}}-{\mathbf{y}}\|^{2}.$$

$$\mathcal{L}_{\beta}({\mathbf{x}},{\boldsymbol{\lambda}})=f({\mathbf{x}})+g({\mathbf{x}})-\langle{\boldsymbol{\lambda}},{\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}\rangle+\frac{\beta}{2}\|{\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}\|^{2},$$

$${\mathbf{x}}_{i}^{k+1}\in\operatorname*{arg\,min}_{{\mathbf{x}}_{i}}\ \langle\nabla_{i}f({\mathbf{x}}^{k})-{\mathbf{A}}_{i}^{\top}({\boldsymbol{\lambda}}^{k}-\beta{\mathbf{r}}^{k}),{\mathbf{x}}_{i}\rangle+g_{i}({\mathbf{x}}_{i})+\frac{1}{2}\|{\mathbf{x}}_{i}-{\mathbf{x}}_{i}^{k}\|_{{\mathbf{P}}_{i}}^{2},$$

$${\mathbf{r}}^{k+1}={\mathbf{r}}^{k}+{\mathbf{A}}_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}^{k}),$$

$${\boldsymbol{\lambda}}^{k+1}={\boldsymbol{\lambda}}^{k}-\rho{\mathbf{r}}^{k+1}.$$

$${\mathbf{x}}_{i}^{k+1}\in\operatorname*{arg\,min}_{{\mathbf{x}}_{i}}\ \langle{\mathbf{v}}^{k}-{\mathbf{A}}_{i}^{\top}({\boldsymbol{\lambda}}^{k}-\beta{\mathbf{r}}^{k}),{\mathbf{x}}_{i}\rangle+g_{i}({\mathbf{x}}_{i})+\frac{1}{2}\|{\mathbf{x}}_{i}-{\mathbf{x}}_{i}^{k}\|_{{\mathbf{P}}_{i}}^{2},$$

$$\|\nabla_{i}f({\mathbf{x}}+{\mathbf{U}}_{i}{\mathbf{y}})-\nabla_{i}f({\mathbf{x}})\|\leq L_{i}\|{\mathbf{y}}_{i}\|,\quad i=1,\ldots,m,$$

$$\|\nabla f({\mathbf{x}}+{\mathbf{U}}_{i}{\mathbf{y}})-\nabla f({\mathbf{x}})\|\leq L_{r}\|{\mathbf{y}}_{i}\|,\quad i=1,\ldots,m.$$

$$f({\mathbf{x}}+{\mathbf{U}}_{i}{\mathbf{y}})\leq f({\mathbf{x}})+\langle\nabla_{i}f({\mathbf{x}}),{\mathbf{y}}_{i}\rangle+\frac{L_{i}}{2}\|{\mathbf{y}}_{i}\|^{2},\quad\forall i,\,\forall{\mathbf{x}},{\mathbf{y}}.$$

$$\mathbb{E}_{i_{k}}\left\langle\nabla_{i_{k}}f({\mathbf{x}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\right\rangle\geq-\Big(1-\frac{1}{m}\Big)[f({\mathbf{x}}^{k})-f({\mathbf{x}})]+\mathbb{E}_{i_{k}}\left[f({\mathbf{x}}^{k+1})-f({\mathbf{x}})-\frac{1}{2}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|_{\mathbf{L}}^{2}\right].$$

$$\mathbb{E}_{i_{k}}\langle\nabla_{i_{k}}f({\mathbf{x}}^{k}),{\mathbf{x}}_{i_{k}}^{k}-{\mathbf{x}}_{i_{k}}\rangle=\frac{1}{m}\langle\nabla f({\mathbf{x}}^{k}),{\mathbf{x}}^{k}-{\mathbf{x}}\rangle\geq\frac{1}{m}\big[f({\mathbf{x}}^{k})-f({\mathbf{x}})\big],$$

$$\langle\nabla_{i_{k}}f({\mathbf{x}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}^{k}\rangle\geq$$

$$\mathbb{E}_{i_{k}}\langle-{\mathbf{A}}_{i_{k}}^{\top}({\boldsymbol{\lambda}}^{k}-\beta{\mathbf{r}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\rangle$$

$$-\frac{\beta}{2}\mathbb{E}_{i_{k}}\big[\|{\mathbf{r}}^{k+1}\|^{2}-\|{\mathbf{r}}^{k}\|^{2}+\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|_{{\mathbf{A}}^{\top}{\mathbf{A}}}^{2}\big].$$

$$\mathbb{E}_{i_{k}}\langle{\mathbf{y}}_{i_{k}}^{k},{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\rangle=$$

$$\langle{\mathbf{y}}^{k},{\mathbf{x}}^{k+1}-{\mathbf{x}}\rangle=\langle-{\mathbf{A}}^{\top}{\boldsymbol{\lambda}}^{k+1},{\mathbf{x}}^{k+1}-{\mathbf{x}}\rangle+(\beta-\rho)\|{\mathbf{r}}^{k+1}\|^{2}-\beta\big\langle{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}),{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}})\big\rangle.$$

$$\big\langle{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}),{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}})\big\rangle=\frac{1}{2}\big[\|{\mathbf{r}}^{k+1}\|^{2}-\|{\mathbf{r}}^{k}\|^{2}+\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|_{{\mathbf{A}}^{\top}{\mathbf{A}}}^{2}\big],$$

$$\mathbb{E}_{i_{k}}\left\langle\tilde{\nabla}g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\right\rangle\geq\mathbb{E}_{i_{k}}[g({\mathbf{x}}^{k+1})-g({\mathbf{x}})]-\Big(1-\frac{1}{m}\Big)[g({\mathbf{x}}^{k})-g({\mathbf{x}})],$$

$$\mathbb{E}_{i_{k}}\left\langle\tilde{\nabla}g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\right\rangle\geq\mathbb{E}_{i_{k}}\big[g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})-g_{i_{k}}({\mathbf{x}}_{i_{k}})\big].$$

$$\mathbb{E}_{i_{k}}\big[g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})-g_{i_{k}}({\mathbf{x}}_{i_{k}})\big]=\frac{1}{m}\big[g({\mathbf{x}}^{k})-g({\mathbf{x}})\big]+\mathbb{E}_{i_{k}}\big[g({\mathbf{x}}^{k+1})-g({\mathbf{x}}^{k})\big].$$

$$\mathbb{E}_{i_{k}}\Big[F({\mathbf{x}}^{k+1})-F({\mathbf{x}})-\langle{\boldsymbol{\lambda}}^{k+1},{\mathbf{r}}^{k+1}\rangle+(\beta-\rho)\|{\mathbf{r}}^{k+1}\|^{2}-\frac{\beta}{2}\|{\mathbf{r}}^{k+1}\|^{2}\Big]+\frac{1}{2}\mathbb{E}_{i_{k}}\Big[\|{\mathbf{x}}^{k+1}-{\mathbf{x}}\|_{\mathbf{P}}^{2}+\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|_{{\mathbf{P}}-{\mathbf{L}}-\beta{\mathbf{A}}^{\top}{\mathbf{A}}}^{2}\Big]\leq$$

$$\nabla_{i_{k}}f({\mathbf{x}}^{k})-{\mathbf{A}}_{i_{k}}^{\top}({\boldsymbol{\lambda}}^{k}-\beta{\mathbf{r}}^{k})+\tilde{\nabla}g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})+{\mathbf{P}}_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}^{k})=0,$$

$$\mathbb{E}_{i_{k}}\Big\langle\nabla_{i_{k}}f({\mathbf{x}}^{k})-{\mathbf{A}}_{i_{k}}^{\top}({\boldsymbol{\lambda}}^{k}-\beta{\mathbf{r}}^{k})+\tilde{\nabla}g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})+{\mathbf{P}}_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\Big\rangle=0.$$


Full text

Asynchronous parallel primal-dual block coordinate update methods for affinely constrained convex programs

Yangyang Xu, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY

This work is partly supported by NSF grant DMS-1719549.

Abstract

Recent years have witnessed a surge of asynchronous (async-) parallel computing methods, driven by the extremely large data sets involved in many modern applications and by the advancement of multi-core machines and computer clusters. In optimization, most works on async-parallel methods address unconstrained problems or those with block separable constraints.

In this paper, we propose an async-parallel method based on block coordinate update (BCU) for solving convex problems with a nonseparable linear constraint. Running on a single node, the method becomes a novel randomized primal-dual BCU for multi-block affinely constrained problems. For these problems, Gauss-Seidel cyclic primal-dual BCU is not guaranteed to converge to an optimal solution unless additional assumptions, such as strong convexity, are made. In contrast, assuming only convexity and the existence of a primal-dual solution, we show that the objective value sequence generated by the proposed algorithm converges in probability to the optimal value and that the constraint residual converges to zero. In addition, we establish an ergodic $O(1/k)$ convergence result, where $k$ is the number of iterations. Numerical experiments demonstrate the efficiency of the proposed method and its significantly better speed-up performance compared with its sync-parallel counterpart.

Keywords: asynchronous parallel, block coordinate update, primal-dual method

Mathematics Subject Classification: 90C06, 90C25, 68W40, 49M27.

1 Introduction

Modern applications in data science and engineering can involve huge amounts of data and/or variables [43]. Driven by these very large-scale problems and by the advancement of multi-core computers, parallel computing has gained tremendous attention in recent years. In this paper, we consider the affinely constrained multi-block structured problem:

$$\min_{\mathbf{x}}\ f({\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{m})+\sum_{i=1}^{m}g_{i}({\mathbf{x}}_{i}),\quad\mbox{s.t.}\ \sum_{i=1}^{m}{\mathbf{A}}_{i}{\mathbf{x}}_{i}={\mathbf{b}},\tag{1}$$

where the variable ${\mathbf{x}}$ is partitioned into multiple disjoint blocks ${\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{m}$ , $f$ is a continuously differentiable and convex function, and each $g_{i}$ is a lower semi-continuous extended-valued convex but possibly non-differentiable function. Besides the nonseparable affine constraint, (1) can also include certain block separable constraint by letting part of $g_{i}$ be an indicator function of a convex set, e.g., nonnegativity constraint.

We will present a novel asynchronous (async-) parallel primal-dual method (see Algorithm 2) towards finding a solution to (1). Suppose there are multiple nodes (or cores, CPUs). We let one node (called master node) update both primal and dual variables and all the remaining ones (called worker nodes) compute and provide block gradients of $f$ to the master node. We assume each $g_{i}$ is proximable (see the definition in (5) below). When there is a single node, our method reduces to a novel serial primal-dual BCU for solving (1); see Algorithm 1.

1.1 Motivating examples

Problems in the form of (1) arise in many areas including signal processing, machine learning, finance, and statistics. For example, the basis pursuit problem [8] seeks a sparse solution on an affine subspace through solving the linearly constrained program:

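$$\min_{\mathbf{x}}\ \|{\mathbf{x}}\|_{1},\quad\mbox{s.t.}\ {\mathbf{A}}{\mathbf{x}}={\mathbf{b}}.\tag{2}$$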

Partitioning ${\mathbf{x}}$ into multiple disjoint blocks in an arbitrary way, one can formulate (2) into the form of (1) with $f({\mathbf{x}})=0$ and each $g_{i}({\mathbf{x}}_{i})=\|{\mathbf{x}}_{i}\|_{1}$ .

Another example is portfolio optimization [29]. Suppose we have a unit of capital to invest in $m$ assets. Let $x_{i}$ be the fraction of capital invested in the $i$ -th asset and $\xi_{i}$ be the expected return rate of the $i$ -th asset. The goal is to minimize the risk measured by $\sqrt{{\mathbf{x}}^{\top}\boldsymbol{\Sigma}{\mathbf{x}}}$ subject to the total unit capital and a minimum expected return $c$ , where ${\mathbf{x}}=(x_{1};\ldots;x_{m})$ and $\boldsymbol{\Sigma}$ is the covariance matrix. To find the optimal ${\mathbf{x}}$ , one can solve the problem:

$$\min_{\mathbf{x}}\ \frac{1}{2}{\mathbf{x}}^{\top}\boldsymbol{\Sigma}{\mathbf{x}},\quad\mbox{s.t.}\ \sum_{i=1}^{m}x_{i}\leq 1,\ \sum_{i=1}^{m}\xi_{i}x_{i}\geq c,\ x_{i}\geq 0,\,\forall i.\tag{3}$$

Introducing slack variables to the first two inequalities, one can easily write (3) into the form of (1) with a quadratic $f$ and each $g_{i}$ being an indicator function of the nonnegativity constraint set.
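As a worked illustration of this reformulation (the slack names $s_{1},s_{2}$ and the stacked variable $\tilde{{\mathbf{x}}}$ are notation introduced here, not taken from the paper), the two inequalities of (3) can be turned into equalities as follows:

$$\sum_{i=1}^{m}x_{i}+s_{1}=1,\qquad\sum_{i=1}^{m}\xi_{i}x_{i}-s_{2}=c,\qquad s_{1}\geq 0,\ s_{2}\geq 0.$$

With $\tilde{{\mathbf{x}}}=({\mathbf{x}};s_{1};s_{2})$ , this matches (1) by taking $f(\tilde{{\mathbf{x}}})=\frac{1}{2}{\mathbf{x}}^{\top}\boldsymbol{\Sigma}{\mathbf{x}}$ , each $g_{i}$ as the indicator function of $[0,\infty)$ , and

$${\mathbf{A}}=\begin{bmatrix}{\mathbf{1}}^{\top}&1&0\\ \boldsymbol{\xi}^{\top}&0&-1\end{bmatrix},\qquad{\mathbf{b}}=\begin{bmatrix}1\\ c\end{bmatrix}.$$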

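In addition, (1) includes as a special case the dual support vector machine (SVM) [10]. Given a training data set $\{({\mathbf{x}}_{i},y_{i})\}_{i=1}^{N}$ with $y_{i}\in\{-1,+1\},\,\forall i$ , let ${\mathbf{X}}=[{\mathbf{x}}_{1},\ldots,{\mathbf{x}}_{N}]$ and ${\mathbf{y}}=[y_{1};\ldots;y_{N}]$ . The dual form of the linear SVM can be written as

$$\min_{\boldsymbol{\theta}}\ \frac{1}{2}{\boldsymbol{\theta}}^{\top}\mathrm{Diag}({\mathbf{y}}){\mathbf{X}}^{\top}{\mathbf{X}}\mathrm{Diag}({\mathbf{y}}){\boldsymbol{\theta}}-{\mathbf{e}}^{\top}{\boldsymbol{\theta}},\quad\mbox{s.t.}\ {\mathbf{y}}^{\top}{\boldsymbol{\theta}}=0,\ 0\leq\theta_{i}\leq C,\,\forall i,\tag{4}$$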

where ${\boldsymbol{\theta}}=[\theta_{1};\ldots;\theta_{N}]$ , and $C$ is a given number relating to the soft margin size. It is easy to formulate (4) into the form of (1) with $f$ being the quadratic objective function and each $g_{i}$ the indicator function of the set $[0,C]$ .

Finally, the penalized and constrained (PAC) regression problem [22] is also an example of (1), with $f({\mathbf{x}})=\frac{1}{N}\sum_{j=1}^{N}f_{j}({\mathbf{x}})$ and a linear constraint of $J$ equations. When $N\gg J$ (which often holds for problems with massive training data), the PAC regression satisfies Assumption 0 (stated in section 2.2). In addition, if $m\gg 1$ and $N\gg 1$ , both (3) and (4) satisfy the assumption, and thus the proposed async-parallel method will be efficient when applied to these problems. Although Assumption 0 does not hold for (2) when $p>1$ , our method running on a single node can still outperform state-of-the-art non-parallel solvers; see the numerical results in section 4.1.

1.2 Block coordinate update

The block coordinate update (BCU) method breaks a possibly very high-dimensional variable into small pieces and renews one at a time while all the remaining blocks are fixed. Although the problem (1) can be extremely large-scale and complicated, BCU solves a sequence of small-sized and easier subproblems. When (1) has nice structure, e.g., it is coordinate friendly [31], BCU can not only have low per-update complexity but also enjoy faster overall convergence than a method that updates the whole variable every time. BCU has been applied to many unconstrained or block-separably constrained optimization problems (e.g., [40, 41, 30, 34, 45, 36, 46, 21]), and it has also been used to solve affinely constrained separable problems, i.e., in the form of (1) without the $f$ term (e.g., [13, 12, 17, 18, 19]). However, only a few existing works (e.g., [20, 15, 14]) have studied BCU for affinely constrained problems with a nonseparable objective function.

1.3 Asynchronization

Parallel computing methods distribute computation over and collect results from multiple nodes. Synchronous (sync) parallel methods require all nodes to keep the same pace. Once all nodes finish their own computation, they proceed together to the next step. This way, the faster nodes have to wait for the slowest one, and a lot of time is wasted in waiting. In contrast, async-parallel methods keep all nodes continuously working and eliminate the idle waiting time. Numerous works (e.g., [35, 27, 28, 32]) have demonstrated that async-parallel methods can achieve significantly better speed-up than their sync-parallel counterparts.

Due to the lack of synchronization, the information used by a certain node may be outdated. Hence the convergence of an async-parallel method cannot be easily inherited from its non-parallel counterpart but often requires new analysis tools. Most existing works only analyze such methods for unconstrained or block-separably constrained problems. Exceptions include [42, 48, 4, 5], which consider separable problems with special affine constraints.

1.4 Related works

Recent years have witnessed a surge of async-parallel methods, partly due to the increasingly large scale of data and variables involved in modern applications. However, only a few existing works discuss such methods for affinely constrained problems. Below we review the literature on async-parallel BCU methods in optimization and on primal-dual BCU methods for affinely constrained problems.

It appears that the first async-parallel method was proposed by Chazan and Miranker [6] for solving linear systems. Later, such methods were applied in many other fields. In optimization, the first async-parallel BCU method was due to Bertsekas and Tsitsiklis [1] for problems with a smooth objective. It was shown that the objective gradient sequence converges to zero. Tseng [39] further analyzed its convergence rate and established local linear convergence by assuming isocost surface separation and a local Lipschitz error bound on the objective. Recently, [28, 27] developed async-parallel methods based on randomized BCU for convex problems with possibly block separable constraints. They established convergence and also rate results by assuming a bounded delay on the outdated block gradient information. The results have been extended to the case with unbounded probabilistic delay in [33], which also shows convergence of the async-parallel BCU methods for nonconvex problems. For problems with a convex separable objective and linear constraints, [42] proposed to apply the alternating direction method of multipliers (ADMM) in an asynchronous and distributed way. Assuming a special structure on the linear constraint, it established an $O(1/k)$ ergodic convergence result, where $k$ is the total number of iterations. In [48, 4, 5, 2], the async-ADMM is applied to distributed multi-agent optimization, which can be equivalently formulated into (1) with $f=0$ and a consensus constraint. Among them, [2] proved an almost sure convergence result, [48] showed sublinear convergence of the async-ADMM for convex problems, and [5] established its linear convergence for strongly convex problems. Besides convex problems, [4] also considered nonconvex cases. Assuming certain structure on the problem and choosing appropriate parameters, it showed that any limit point of the iterates satisfies first-order optimality conditions. The works [32, 9] developed async-parallel BCU methods for fixed-point or monotone inclusion problems. Although these settings are more general (including convex optimization as a special case), no convergence rate results have been shown under the monotonicity assumption (similar to convexity in optimization); in [32], a linear convergence result is established under a strong monotonicity assumption, which is similar to strong convexity in optimization.

Running on a single node, the proposed async-parallel method reduces to a serial randomized primal-dual BCU. In the literature, various Gauss-Seidel (GS) cyclic BCU methods have been developed for solving separable convex programs with linear constraints. Although a cyclic primal-dual BCU can work well empirically, in general it may diverge [13, 7, 44]. Using an example of a $3\times 3$ linear system, [7] showed that the direct extension of ADMM could diverge on problems with more than 2 blocks. The works [13, 44] showed that even with proximal terms, the cyclic primal-dual BCU can still diverge. Hence, to guarantee convergence, additional assumptions besides convexity must be made, such as strong convexity on part of the objective [16, 3, 23, 26, 25, 11] and orthogonality properties of block matrices in the linear constraint [7]. Assuming strong convexity of each block component function and choosing the penalty parameter within a region, [16] showed the convergence of ADMM to an optimal solution for problems with multiple blocks. For 3-block problems, [3, 23, 11] established the convergence of ADMM and/or its variant by assuming strong convexity of one block component function. For general $m$ -block problems, [26] showed that if $m-1$ block component functions are strongly convex, then ADMM with an appropriate penalty parameter is guaranteed to converge. Without these assumptions, modifications to the algorithm are necessary for convergence. For example, [18, 19] performed a correction step after each cycle of updates. For solving linear systems or quadratic programs, [38] proposed, at each iteration, to first randomly permute all block variables and then perform a cyclic update. A Jacobi-type update together with proximal terms was used in [12, 17] to ensure the convergence of the algorithm, which turns out to be a linearized augmented Lagrangian method (ALM). In addition, a hybrid Jacobi-GS update was performed in [37, 24, 44]. Different from these modifications, our algorithm simply employs randomization in selecting the block variable and can perform significantly better than Jacobi-type methods. In addition, its convergence is guaranteed under the convexity assumption alone, which is a weaker requirement than those needed by GS-type methods.

1.5 Contributions

The contributions are summarized as follows.

– We propose an async-parallel BCU method for solving multi-block structured convex programs with linear constraints. The algorithm is the first async-parallel primal-dual method for affinely constrained problems with a nonseparable objective. When there is only one node, it reduces to a novel serial primal-dual BCU method.

– With convexity and the existence of a primal-dual solution, convergence of the proposed method is guaranteed. We first establish convergence of the serial BCU method. We show that the objective value converges in probability to the optimal value and that the constraint residual converges to zero. In addition, we establish an ergodic convergence rate result. Then, by bounding a cross term involving the delayed block gradient, we prove that similar convergence results hold for the async-parallel BCU method if a delay-dependent stepsize is chosen.

– We implement the proposed algorithm and apply it to basis pursuit, quadratic programming, and support vector machine problems. Numerical results demonstrate that the serial BCU is comparable to or better than state-of-the-art methods. In addition, the async-parallel BCU method can achieve significantly better speed-up performance than its sync-parallel counterpart.

1.6 Notation and Outline

We use bold small letters ${\mathbf{x}},{\mathbf{y}},{\boldsymbol{\lambda}},\ldots$ for vectors and bold capital letters ${\mathbf{A}},{\mathbf{L}},{\mathbf{P}},\ldots$ for matrices. $[m]$ denotes the integer set $\{1,2,\ldots,m\}$ . ${\mathbf{U}}_{i}{\mathbf{x}}$ represents a vector with ${\mathbf{x}}_{i}$ for its $i$ -th block and zero for all other $m-1$ blocks. $\mathrm{blkdiag}({\mathbf{P}}_{1},\ldots,{\mathbf{P}}_{m})$ denotes a block diagonal matrix with ${\mathbf{P}}_{1},\ldots,{\mathbf{P}}_{m}$ on the diagonal blocks. We denote $\|{\mathbf{x}}\|$ as the Euclidean norm of ${\mathbf{x}}$ and $\|{\mathbf{x}}\|_{\mathbf{P}}=\sqrt{{\mathbf{x}}^{\top}{\mathbf{P}}{\mathbf{x}}}$ for a symmetric positive semidefinite matrix ${\mathbf{P}}$ . We reserve ${\mathbf{I}}$ for the identity matrix, and its size is clear from the context. $\mathbb{E}_{i_{k}}$ stands for the expectation about $i_{k}$ conditional on previous history $\{i_{1},\ldots,i_{k-1}\}$ . We use $\boldsymbol{\xi}^{k}\overset{p}{\to}\boldsymbol{\xi}$ for convergence in probability of a random vector sequence $\boldsymbol{\xi}^{k}$ to $\boldsymbol{\xi}$ .

For ease of notation, we let $g({\mathbf{x}})=\sum_{i=1}^{m}g_{i}({\mathbf{x}}_{i})$ , $F=f+g$ , and ${\mathbf{A}}=[{\mathbf{A}}_{1},\ldots,{\mathbf{A}}_{m}]$ . Denote

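$$\Phi(\bar{{\mathbf{x}}},{\mathbf{x}},{\boldsymbol{\lambda}})=F(\bar{{\mathbf{x}}})-F({\mathbf{x}})-\langle{\boldsymbol{\lambda}},{\mathbf{A}}\bar{{\mathbf{x}}}-{\mathbf{b}}\rangle.$$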

Then $({\mathbf{x}}^{*},{\boldsymbol{\lambda}}^{*})$ is a saddle point of (1) if ${\mathbf{A}}{\mathbf{x}}^{*}={\mathbf{b}}$ and $\Phi({\mathbf{x}},{\mathbf{x}}^{*},{\boldsymbol{\lambda}}^{*})\geq 0,\,\forall{\mathbf{x}}$ .

The proximal operator of a function $\psi$ is defined as

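$${\mathbf{prox}}_{\psi}({\mathbf{x}})=\operatorname*{arg\,min}_{\mathbf{y}}\ \psi({\mathbf{y}})+\frac{1}{2}\|{\mathbf{x}}-{\mathbf{y}}\|^{2}.\tag{5}$$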

If ${\mathbf{prox}}_{\psi}({\mathbf{x}})$ has a closed-form solution or is easy to compute, we call $\psi$ proximable.
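Two standard proximable functions mentioned in this paper are the $\ell_{1}$ -norm and the indicator function of a box. The following minimal sketch shows their closed-form proximal operators; the function names are illustrative choices, not names from the paper.

```python
def soft_threshold(v, t):
    """Proximal operator of psi(y) = t * |y| evaluated at the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox_l1(x, t):
    """Proximal operator of t * ||.||_1, applied coordinate by coordinate."""
    return [soft_threshold(v, t) for v in x]


def project_box(v, lo, hi):
    """Proximal operator of the indicator of [lo, hi], i.e. the projection."""
    return min(max(v, lo), hi)
```

Both maps cost a constant amount of work per coordinate, which is what makes the block subproblems cheap when $g_{i}$ has one of these forms.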

Outline. The rest of the paper is organized as follows. In section 2, we present the serial and also async-parallel primal-dual BCU methods for (1). Convergence results of the algorithms are shown in section 3. Section 4 gives experimental results, and finally section 5 concludes the paper.

2 Algorithm

In this section, we propose an async-parallel primal-dual method for solving (1). Our algorithm is a BCU-type method based on the augmented Lagrangian function of (1):

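$$\mathcal{L}_{\beta}({\mathbf{x}},{\boldsymbol{\lambda}})=f({\mathbf{x}})+g({\mathbf{x}})-\langle{\boldsymbol{\lambda}},{\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}\rangle+\frac{\beta}{2}\|{\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}\|^{2},$$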

where ${\boldsymbol{\lambda}}$ is the multiplier (or augmented Lagrangian dual variable), and $\beta$ is a penalty parameter.

2.1 Non-parallel method

For ease of understanding, we first present a non-parallel method in Algorithm 1. At every iteration, the algorithm chooses one of the $m$ blocks uniformly at random and renews it by (6) while fixing all the remaining blocks. Upon finishing the update to ${\mathbf{x}}$ , it immediately updates the multiplier ${\boldsymbol{\lambda}}$ . Linearizing the possibly complicated smooth term $f$ greatly eases the ${\mathbf{x}}$ -subproblem. Depending on the form of $g_{i}$ , we can choose an appropriate ${\mathbf{P}}_{i}$ to make (6) simple to solve. Since each $g_{i}$ is proximable, one can always easily find a solution to (6) if ${\mathbf{P}}_{i}=\eta_{i}{\mathbf{I}}$ . For even simpler $g_{i}$ such as the $\ell_{1}$ -norm and the indicator function of a box constraint set, we can set ${\mathbf{P}}_{i}$ to a diagonal matrix and have a closed-form solution to (6). Note that the algorithm is a special case of Algorithm 1 in [14] with only one group of variables. We include it here for ease of understanding our parallel method.

Randomly choosing a block to update has advantages over the cyclic way from both theoretical and empirical perspectives. We will show that this randomized BCU has guaranteed convergence under convexity alone, rather than the strong convexity assumed for the cyclic primal-dual BCU. In addition, randomization enables us to parallelize the algorithm in an efficient way, as shown in Algorithm 2.
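The listing of Algorithm 1 is not reproduced in the text above, so the following is only a minimal sketch of one plausible reading of it. It assumes the block update is a proximal step on the linearized augmented Lagrangian with ${\mathbf{P}}_{i}=\eta_{i}{\mathbf{I}}$ , followed by ${\mathbf{r}}\leftarrow{\mathbf{r}}+{\mathbf{A}}_{i}({\mathbf{x}}_{i}^{k+1}-{\mathbf{x}}_{i}^{k})$ and ${\boldsymbol{\lambda}}\leftarrow{\boldsymbol{\lambda}}-\rho{\mathbf{r}}$ . The toy data, the scalar blocks, and the choices $\rho=\beta/m$ and $\eta_{i}=L_{i}+\beta\|{\mathbf{A}}_{i}\|^{2}$ are assumptions made for illustration only.

```python
import random


def bcu_step(i, x, r, lam, grad_i, A, beta, rho, eta, prox):
    """Update scalar block i in place, then the residual r and multiplier lam."""
    rows = range(len(r))
    # A_i^T (lam - beta * r): column i of A applied to the shifted multiplier
    shifted = sum(A[j][i] * (lam[j] - beta * r[j]) for j in rows)
    # With P_i = eta * I, the x-subproblem is one proximal step on g_i / eta
    x_new = prox(x[i] - (grad_i - shifted) / eta)
    delta = x_new - x[i]
    x[i] = x_new
    for j in rows:
        r[j] += A[j][i] * delta   # keeps r equal to A x - b
        lam[j] -= rho * r[j]      # multiplier step uses the new residual


def run_serial_bcu(num_iters=200, seed=0):
    """Toy problem (hypothetical data): min 0.5 x^T Q x  s.t.  sum(x) = 1, x >= 0."""
    rng = random.Random(seed)
    Q = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
    A, b = [[1.0, 1.0, 1.0]], [1.0]
    m = len(Q)
    beta = 1.0
    rho = beta / m  # assumed dual stepsize, not taken from the text
    x = [0.0] * m
    r = [sum(A[j][i] * x[i] for i in range(m)) - b[j] for j in range(len(b))]
    lam = [0.0] * len(b)
    for _ in range(num_iters):
        i = rng.randrange(m)  # block chosen uniformly at random
        grad_i = sum(Q[i][j] * x[j] for j in range(m))
        # assumed proximal weight: L_i + beta * ||A_i||^2 with L_i = Q[i][i]
        eta = Q[i][i] + beta * sum(A[j][i] ** 2 for j in range(len(b)))
        bcu_step(i, x, r, lam, grad_i, A, beta, rho, eta,
                 prox=lambda v: max(v, 0.0))
    return x, r, lam
```

Note that the residual is maintained incrementally rather than recomputed from ${\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}$ , so each iteration touches only the column ${\mathbf{A}}_{i}$ of the selected block.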

2.2 Async-parallel method

Assume there are $p$ nodes. Let the data and variables be stored in a global memory accessible to every node. We let one node (called master node) update both primal variable ${\mathbf{x}}$ and dual variable ${\boldsymbol{\lambda}}$ and the remaining ones (called worker nodes) compute block gradients of $f$ and provide them to the master node. The method is summarized in Algorithm 2.

To achieve nice practical speed-up performance, we make the following assumption:

Assumption 0

The cost of computing $\nabla_{i}f({\mathbf{x}})$ is at least roughly $p-1$ times the cost of updating ${\mathbf{x}}_{i}$ , ${\mathbf{r}}$ , and ${\boldsymbol{\lambda}}$ respectively by (9), (7) and (8), for all $i$ , where $p$ is the number of nodes.

Note that our theoretical analysis does not require this assumption. Roughly speaking, the above assumption means that the worker nodes compute block gradients no faster than the master node can use them. When it holds, the master node can quickly digest the block gradient information fed by all worker nodes. Without this assumption, Algorithm 2 may not perform well in terms of parallel efficiency. For example, if $p>2$ , and computing $\nabla_{i}f({\mathbf{x}})$ takes similar time as updating ${\mathbf{x}}_{i}$ , ${\mathbf{r}}$ and ${\boldsymbol{\lambda}}$ , then until the $k$ -th iteration, there would be roughly $k(p-2)$ partial gradients that have been sent to but not used by the master node. In this case, a lot of computation will be wasted.
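The count $k(p-2)$ can be checked with a simple tally, sketched here under the stated premise that one block-gradient computation takes about as long as one master update. During the time the master performs $k$ updates, each of the $p-1$ worker nodes finishes about $k$ gradients, while the master consumes only one gradient per update:

$$\underbrace{k(p-1)}_{\text{gradients produced}}-\underbrace{k}_{\text{gradients used}}=k(p-2).$$

For instance, with $p=4$ nodes and $k=1000$ iterations, roughly $2000$ computed block gradients would still be waiting unused.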

We make a few remarks on Algorithm 2 as follows.

–

Special case: If there is only one node (i.e., $p=1$ ), the algorithm simply reduces to the non-parallel Algorithm 1. In this case, Assumption ‣ 2.2 trivially holds.

–

Iteration number: Only the master node increases the iteration number $k$ , which counts the times ${\boldsymbol{\lambda}}$ is updated and also the number of used block gradients. The sync-parallel method (e.g., in [14]) chooses to update multiple blocks every time, and the computation is distributed over multiple nodes. It generally requires larger weight in the proximal term for convergence. Hence, even if ${\mathbf{v}}^{k}=\nabla_{i_{k}}f({\mathbf{x}}^{k}),\,\forall k$ , Algorithm 2 does not reduce to its sync-parallel counterpart.

–

Delayed information: Since all worker nodes provide block gradients to the master node, we cannot guarantee every computed block gradient will be immediately used to update ${\mathbf{x}}$ . Hence, in (9), ${\mathbf{v}}^{k}$ may not equal $\nabla_{i}f({\mathbf{x}}^{k})$ but can be a delayed (i.e., outdated) block gradient. The delay is usually in the same order of $p$ and can affect the stepsize, but the affect is negligible as the block number $m$ is greater than the delay in an order (see Theorem 3.8).

Because ${\mathbf{x}}$ -blocks are computed in the master node, the values of ${\mathbf{r}}$ and ${\boldsymbol{\lambda}}$ used in the update are always up-to-date. One can let worker nodes compute new ${\mathbf{x}}_{i}$ ’s and then feed them (or also the changes in ${\mathbf{r}}$ ) to the master node. That way, ${\mathbf{r}}$ and ${\boldsymbol{\lambda}}$ will also be outdated when computing ${\mathbf{x}}$ -blocks.

– Load balance: Under the assumption made in section 2.2, if (9) is easy to solve (e.g., ${\mathbf{P}}_{i}=\eta_{i}{\mathbf{I}}$) and all nodes have similar computing power, the master node will have used all received block gradients before a new one arrives. We let the master node itself also compute a block gradient if no new one has been sent from any worker node. This way, all nodes work continuously without idle waiting. Compared to its sync-parallel counterpart, which typically suffers from serious load imbalance, the async-parallel method can achieve better speed-up; see the numerical results in section 4.3.

3 Convergence analysis

In this section, we present convergence results of the proposed algorithm. First, we analyze the non-parallel Algorithm 1. We show that the objective value $F({\mathbf{x}}^{k})$ and the residual ${\mathbf{A}}{\mathbf{x}}^{k}-{\mathbf{b}}$ converge to the optimal value and zero respectively in probability. In addition, we establish a sublinear convergence rate result based on an averaged point. Then, through bounding a cross term involving the delayed block gradient, we establish similar results for the async-parallel Algorithm 2.

Throughout our analysis, we make the following assumptions.

Assumption 1 (Existence of a solution)

There exists a primal-dual solution pair $({\mathbf{x}}^{*},{\boldsymbol{\lambda}}^{*})$ such that ${\mathbf{A}}{\mathbf{x}}^{*}={\mathbf{b}}$ and $\Phi({\mathbf{x}},{\mathbf{x}}^{*},{\boldsymbol{\lambda}}^{*})\geq 0,\,\forall{\mathbf{x}}$ .

Assumption 2 (Gradient Lipschitz continuity)

There exist constants $L_{i}$ ’s and $L_{r}$ such that for any ${\mathbf{x}}$ and ${\mathbf{y}}$ ,

[TABLE]

and

[TABLE]

Denote ${\mathbf{L}}=\textnormal{diag}(L_{1},\ldots,L_{m})$ . Then under the above assumption, it holds that

[TABLE]

3.1 Convergence results of Algorithm 1

Although Algorithm 1 is a special case of the method in [14], its convergence analysis is easier and can be made more succinct. In addition, our analysis for Algorithm 2 is based on that for Algorithm 1. Hence, we provide a complete convergence analysis for Algorithm 1. First, we establish several lemmas, which will be used to show our main convergence results.

Lemma 3.1

Let $\{{\mathbf{x}}^{k}\}$ be the sequence generated from Algorithm 1. Then for any ${\mathbf{x}}$ independent of $i_{k}$ , it holds that

[TABLE]

Proof. We write $\langle\nabla_{i_{k}}f({\mathbf{x}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\rangle=\langle\nabla_{i_{k}}f({\mathbf{x}}^{k}),{\mathbf{x}}_{i_{k}}^{k}-{\mathbf{x}}_{i_{k}}\rangle+\langle\nabla_{i_{k}}f({\mathbf{x}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}^{k}\rangle$ . For the first term, we use the uniform distribution of $i_{k}$ on $[m]$ and the convexity of $f$ to have

[TABLE]

and for the second term, we use (10) to have

[TABLE]

Combining the above two inequalities gives the desired result. $\Box$

Lemma 3.2

For any ${\mathbf{x}}$ independent of $i_{k}$ such that ${\mathbf{A}}{\mathbf{x}}={\mathbf{b}}$ , it holds

[TABLE]

Proof. Let ${\mathbf{y}}^{k}=-{\mathbf{A}}^{\top}({\boldsymbol{\lambda}}^{k}-\beta{\mathbf{r}}^{k})$ . Then

[TABLE]

Note ${\mathbf{y}}^{k}=-{\mathbf{A}}^{\top}{\boldsymbol{\lambda}}^{k+1}+(\beta-\rho){\mathbf{A}}^{\top}{\mathbf{r}}^{k+1}-\beta{\mathbf{A}}^{\top}({\mathbf{r}}^{k+1}-{\mathbf{r}}^{k})$ and ${\mathbf{r}}^{k+1}-{\mathbf{r}}^{k}={\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{k})$ . In addition, from ${\mathbf{A}}{\mathbf{x}}={\mathbf{b}}$ , we have ${\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}})={\mathbf{r}}^{k+1}$ . Hence,

[TABLE]

Noting

[TABLE]

we complete the proof by plugging (16) into (13). $\Box$

Lemma 3.3

For any ${\mathbf{x}}$ independent of $i_{k}$ , it holds

[TABLE]

where $\tilde{\nabla}g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})$ denotes a subgradient of $g_{i_{k}}$ at ${\mathbf{x}}_{i_{k}}^{k+1}$ .

Proof. From the convexity of $g_{i_{k}}$ and definition of subgradient, it follows that

[TABLE]

Writing $g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})-g_{i_{k}}({\mathbf{x}}_{i_{k}})=g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k})-g_{i_{k}}({\mathbf{x}}_{i_{k}})+g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})-g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k})$ and taking the conditional expectation give

[TABLE]

We obtain the desired result by plugging the above equation into (17). $\Box$

Using the above three lemmas, we show an inequality after each iteration of the algorithm.

Theorem 3.4 (Fundamental result)

Let $\{({\mathbf{x}}^{k},{\mathbf{r}}^{k},{\boldsymbol{\lambda}}^{k})\}$ be the sequence generated from Algorithm 1. Then for any ${\mathbf{x}}$ such that ${\mathbf{A}}{\mathbf{x}}={\mathbf{b}}$ , it holds

[TABLE]

where ${\mathbf{P}}=\mathrm{blkdiag}({\mathbf{P}}_{1},\ldots,{\mathbf{P}}_{m})$ .

Proof. Since ${\mathbf{x}}_{i_{k}}^{k+1}$ is one solution to (6), there is a subgradient $\tilde{\nabla}g_{i_{k}}({\mathbf{x}}_{i_{k}}^{k+1})$ of $g_{i_{k}}$ at ${\mathbf{x}}_{i_{k}}^{k+1}$ such that

[TABLE]

Hence,

[TABLE]

In the above equation, using Lemmas 3.1 through 3.3 and noting

[TABLE]

we have the desired result. $\Box$

Now we are ready to show the convergence results of Algorithm 1.

Theorem 3.5 (Global convergence in probability)

Let $\{({\mathbf{x}}^{k},{\mathbf{r}}^{k},{\boldsymbol{\lambda}}^{k})\}$ be the sequence generated from Algorithm 1. If $0<\rho\leq\frac{\beta}{m}$ and ${\mathbf{P}}_{i}\succeq L_{i}{\mathbf{I}}+\beta{\mathbf{A}}_{i}^{\top}{\mathbf{A}}_{i},\,\forall i$ , then

[TABLE]

Before proving the theorem, we make a remark. The dual stepsize $\rho$ can be at most $\frac{\beta}{m}$, so it could be much smaller than $\beta$ when $m$ is large. However, note that ${\boldsymbol{\lambda}}$ is renewed more frequently than ${\mathbf{x}}$: it is updated once immediately after each change to ${\mathbf{x}}$. Hence, if $\rho=\frac{\beta}{m}$, after one epoch of ${\mathbf{x}}$-updates, the dual variable ${\boldsymbol{\lambda}}$ has been updated $m$ times and moved a step of size $\beta$. That is why we can still observe fast convergence of the algorithm to the optimal solution even though a small $\rho$ is used; see the numerical results in section 4.
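To make the epoch-level argument explicit, the following worked step unrolls the dual recursion ${\boldsymbol{\lambda}}^{k+1}={\boldsymbol{\lambda}}^{k}-\rho{\mathbf{r}}^{k+1}$ over $m$ consecutive iterations. Treating the residual as roughly constant over the epoch is a simplifying assumption made here only for illustration; it is not used in the analysis.

```latex
\boldsymbol{\lambda}^{k+m}
  = \boldsymbol{\lambda}^{k} - \rho\sum_{j=1}^{m}\mathbf{r}^{k+j}
  = \boldsymbol{\lambda}^{k} - \beta\cdot\frac{1}{m}\sum_{j=1}^{m}\mathbf{r}^{k+j}
  \quad\text{for } \rho=\tfrac{\beta}{m},
\qquad\text{so}\qquad
\boldsymbol{\lambda}^{k+m}\approx\boldsymbol{\lambda}^{k}-\beta\,\mathbf{r}
  \quad\text{if } \mathbf{r}^{k+j}\approx\mathbf{r},\ j=1,\ldots,m.
```

In other words, one epoch with $\rho=\frac{\beta}{m}$ moves ${\boldsymbol{\lambda}}$ by $\beta$ times the average residual over that epoch.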

Proof. Note that

[TABLE]

Hence, taking expectation over both sides of (18) and summing up from $k=0$ through $K$ yield

[TABLE]

Since ${\boldsymbol{\lambda}}^{K+1}={\boldsymbol{\lambda}}^{K}-\rho{\mathbf{r}}^{K+1}$ , it follows from Young’s inequality that

[TABLE]

In addition,

[TABLE]

Plugging (24) and (25) into (3.1) and using ${\boldsymbol{\lambda}}^{0}=0$ , we have

[TABLE]

Letting $({\mathbf{x}},{\boldsymbol{\lambda}})=({\mathbf{x}}^{*},{\boldsymbol{\lambda}}^{*})$ in the above equality, we have from ${\mathbf{P}}_{i}\succeq L_{i}{\mathbf{I}}+\beta{\mathbf{A}}_{i}^{\top}{\mathbf{A}}_{i}$ and $\beta\geq m\rho$ that

[TABLE]

which together with $|\mathbb{E}\xi|^{2}\leq\mathbb{E}\xi^{2}$ implies that

[TABLE]

For any $\epsilon>0$, it follows from Markov’s inequality that

[TABLE]

and

[TABLE]

where in the first inequality, we have used the fact $F({\mathbf{x}})-F({\mathbf{x}}^{*})-\langle{\boldsymbol{\lambda}}^{*},{\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}\rangle\geq 0,\,\forall{\mathbf{x}}$ , and the last equation follows from (28) and Markov’s inequality. This completes the proof. $\Box$

Given any $\epsilon>0$ and $\sigma\in(0,1)$ , we can also estimate the number of iterations for the algorithm to produce a solution satisfying an error bound $\epsilon$ with probability no less than $1-\sigma$ .

Definition 3.1 ( $(\epsilon,\sigma)$ -solution)

Given $\epsilon>0$ and $0<\sigma<1$ , a random vector ${\mathbf{x}}$ is called an $(\epsilon,\sigma)$ -solution to (1) if ${\mathrm{Prob}}(|F({\mathbf{x}})-F({\mathbf{x}}^{*})|\geq\epsilon)\leq\sigma$ and ${\mathrm{Prob}}(\|{\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}\|\geq\epsilon)\leq\sigma.$
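As a hedged illustration of Definition 3.1, the two probabilities can be estimated empirically from repeated independent runs. The helper below is a sketch and not part of the paper: `gaps` and `residuals` are assumed to hold $F({\mathbf{x}})-F({\mathbf{x}}^{*})$ and $\|{\mathbf{A}}{\mathbf{x}}-{\mathbf{b}}\|$ from such runs, and empirical frequencies only approximate the true probabilities.

```python
from typing import Sequence


def is_eps_sigma_solution(gaps: Sequence[float], residuals: Sequence[float],
                          eps: float, sigma: float) -> bool:
    """Empirical check of the two probability bounds in Definition 3.1.

    gaps[t] and residuals[t] come from the t-th independent run (assumed input).
    """
    n = len(gaps)
    if n == 0 or n != len(residuals):
        raise ValueError("need two non-empty samples of equal length")
    # Fraction of runs whose objective error is at least eps.
    freq_gap = sum(1 for g in gaps if abs(g) >= eps) / n
    # Fraction of runs whose constraint residual is at least eps.
    freq_res = sum(1 for r in residuals if r >= eps) / n
    return freq_gap <= sigma and freq_res <= sigma
```

With few runs the frequencies are coarse, so this check is only meaningful when the number of runs is large relative to $1/\sigma$.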

Theorem 3.6 (Ergodic convergence rate)

Let $\{({\mathbf{x}}^{k},{\mathbf{r}}^{k},{\boldsymbol{\lambda}}^{k})\}$ be the sequence generated from Algorithm 1. Assume $0<\rho\leq\frac{\beta}{m}$ and ${\mathbf{P}}_{i}\succeq L_{i}{\mathbf{I}}+\beta{\mathbf{A}}_{i}^{\top}{\mathbf{A}}_{i},\,\forall i$ . Let $\bar{{\mathbf{x}}}^{K+1}=\frac{{\mathbf{x}}^{K+1}+\sum_{k=1}^{K}{\mathbf{x}}^{k+1}/m}{1+K/m}$ and

[TABLE]

Then

[TABLE]

In addition, given any $\epsilon>0$ and $0<\sigma<1$ , if

[TABLE]

then $\bar{{\mathbf{x}}}^{K+1}$ is an $(\epsilon,\sigma)$ -solution to (1).
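The averaged point $\bar{{\mathbf{x}}}^{K+1}$ gives the last iterate weight 1 and each of ${\mathbf{x}}^{2},\ldots,{\mathbf{x}}^{K+1}$ an extra weight $1/m$. A minimal sketch of that formula follows, assuming the iterates are stored in a list `iterates` with `iterates[k]` holding ${\mathbf{x}}^{k}$ for $k=0,\ldots,K+1$; the storage layout and the function name are illustrative, not from the paper.

```python
from typing import List, Sequence


def ergodic_average(iterates: Sequence[Sequence[float]], m: int) -> List[float]:
    """Return (x^{K+1} + sum_{k=1}^{K} x^{k+1} / m) / (1 + K / m).

    iterates[k] is assumed to hold x^k for k = 0, ..., K + 1.
    """
    K = len(iterates) - 2
    if K < 0 or m < 1:
        raise ValueError("need at least x^0, x^1 and m >= 1")
    n = len(iterates[0])
    acc = list(iterates[K + 1])            # the x^{K+1} term with weight 1
    for k in range(1, K + 1):              # add x^{k+1} / m for k = 1, ..., K
        for j in range(n):
            acc[j] += iterates[k + 1][j] / m
    return [v / (1.0 + K / m) for v in acc]
```

In practice the running sum can be updated in place after each iteration, so the full history need not be stored.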

Proof. Since $F$ is convex, it follows from (3.1) that

[TABLE]

which with ${\mathbf{x}}={\mathbf{x}}^{*}$ and ${\boldsymbol{\lambda}}=0$ implies the second inequality in (34). From $\Phi({\mathbf{x}},{\mathbf{x}}^{*},{\boldsymbol{\lambda}}^{*})\geq 0,\,\forall{\mathbf{x}}$ and the Cauchy-Schwarz inequality, we have that

[TABLE]

Letting ${\mathbf{x}}={\mathbf{x}}^{*}$ and ${\boldsymbol{\lambda}}=-\frac{1+\|{\boldsymbol{\lambda}}^{*}\|}{\|{\mathbf{A}}\bar{{\mathbf{x}}}^{K+1}-{\mathbf{b}}\|}({\mathbf{A}}\bar{{\mathbf{x}}}^{K+1}-{\mathbf{b}})$ in (37) and using (38) give (35), where we have used the convention $\frac{0}{0}=0$ . By Markov’s inequality,

[TABLE]

and thus to have ${\mathrm{Prob}}(\|{\mathbf{A}}\bar{{\mathbf{x}}}^{K+1}-{\mathbf{b}}\|\geq\epsilon)\leq\sigma$ , it suffices to let

[TABLE]

Similarly, letting ${\mathbf{x}}={\mathbf{x}}^{*}$ and ${\boldsymbol{\lambda}}=-\frac{2\|{\boldsymbol{\lambda}}^{*}\|}{\|{\mathbf{A}}\bar{{\mathbf{x}}}^{K+1}-{\mathbf{b}}\|}({\mathbf{A}}\bar{{\mathbf{x}}}^{K+1}-{\mathbf{b}})$ in (37) and using (38) give

[TABLE]

which together with (38) implies the first inequality in (34). Through the same arguments that show (29), we have

[TABLE]

Hence, to have ${\mathrm{Prob}}(|F(\bar{{\mathbf{x}}}^{K+1})-F({\mathbf{x}}^{*})|\geq\epsilon)\leq\sigma$ , it suffices to let

[TABLE]

which together with (39) gives the desired result and thus completes the proof. $\Box$

3.2 Convergence results of Algorithm 2

The key difference between Algorithms 1 and 2 is that ${\mathbf{v}}^{k}$ used in (9) may not equal the block gradient of $f$ at ${\mathbf{x}}^{k}$ but rather the block gradient at an outdated vector, which we denote as $\hat{{\mathbf{x}}}^{k}$. This delayed vector may not be any iterate that ever existed in memory, i.e., inconsistent reads can happen [27]. Besides Assumptions 1 and 2, we make an additional assumption on the delayed vector.

Assumption 3 (Bounded delay)

The delay is uniformly bounded by an integer $\tau$ , and $\hat{{\mathbf{x}}}^{k}$ can be related to ${\mathbf{x}}^{k}$ by the equation

[TABLE]

where $J(k)$ is a subset of $\{k-\tau,k-\tau+1,\ldots,k-1\}$ .

The boundedness of the delay holds if there is no “dead” node. The relation between ${\mathbf{x}}^{k}$ and $\hat{{\mathbf{x}}}^{k}$ in (41) is satisfied if the read of each block variable is consistent, which can be guaranteed by a dual memory approach; see [32].

Similar to (21), we have from the optimality condition of (9) that

[TABLE]

where we have used ${\mathbf{v}}^{k}=\nabla_{i_{k}}f(\hat{{\mathbf{x}}}^{k})$ . Except $\mathbb{E}_{i_{k}}\langle\nabla_{i_{k}}f(\hat{{\mathbf{x}}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\rangle$ , all the other terms in (42) can be bounded in the same ways as those in section 3.1. We first show how to bound this term and then present the convergence results of Algorithm 2.

Lemma 3.7

Under Assumptions 2 and 3, we have for any $\alpha>0$ that

[TABLE]

where $L_{c}=\max_{i}L_{i}>0$ , and $\kappa=\frac{L_{r}}{L_{c}}$ denotes the condition number.

Proof. We split $\mathbb{E}_{i_{k}}\langle\nabla_{i_{k}}f(\hat{{\mathbf{x}}}^{k}),{\mathbf{x}}_{i_{k}}^{k+1}-{\mathbf{x}}_{i_{k}}\rangle$ into four terms:

[TABLE]

and we bound each of the four cross terms in (46). The first is bounded in (11). Secondly, from the convexity of $f$ , we have

[TABLE]

For the other two terms, we use the relation between $\hat{{\mathbf{x}}}^{k}$ and ${\mathbf{x}}^{k}$ in (41). From the result in [28, p. 306], it holds that

[TABLE]

Hence by Young’s inequality, we have for any $\alpha>0$ that

[TABLE]

Let $\tau_{k}=|J(k)|$ and order the elements in $J(k)$ as $d_{1}<d_{2}<\ldots<d_{\tau_{k}}$ . Define $\hat{{\mathbf{x}}}^{k,0}=\hat{{\mathbf{x}}}^{k}$ and $\hat{{\mathbf{x}}}^{k,j}=\hat{{\mathbf{x}}}^{k}+\sum_{t=1}^{j}({\mathbf{x}}^{d_{t}+1}-{\mathbf{x}}^{d_{t}}),\,j=1,\ldots,\tau_{k}.$ Then we have

[TABLE]

Since $\hat{{\mathbf{x}}}^{k,j+1}-\hat{{\mathbf{x}}}^{k,j}={\mathbf{x}}^{d_{j+1}+1}-{\mathbf{x}}^{d_{j+1}}$ , it follows from (10) that

[TABLE]

Note $\nabla f(\hat{{\mathbf{x}}}^{k,j})-\nabla f(\hat{{\mathbf{x}}}^{k})=\sum_{t=0}^{j-1}(\nabla f(\hat{{\mathbf{x}}}^{k,t+1})-\nabla f(\hat{{\mathbf{x}}}^{k,t}))$ . Thus, by the Cauchy-Schwarz inequality and Young’s inequality, we have

[TABLE]

Plugging (59) and (60) into (55) gives

[TABLE]

Noting $\tau_{k}\leq\tau$ , we have the desired result by plugging (11), (49), (51), and (64) into (46). $\Box$

From Lemmas 3.2, 3.3, and 3.7, and also the equation (22), we can easily have the following result.

[TABLE]

Regard ${\mathbf{x}}^{k}={\mathbf{x}}^{0},\,\forall k<0$ . Hence,

[TABLE]

Using (65) and following the same arguments in the proofs of Theorems 3.5 and 3.6, we obtain the two theorems below.

Theorem 3.8 (Global convergence in probability)

Let $\{({\mathbf{x}}^{k},{\mathbf{r}}^{k},{\boldsymbol{\lambda}}^{k})\}$ be the sequence generated from Algorithm 2 with $0<\rho\leq\frac{\beta}{m}$ and ${\mathbf{P}}_{i}$ ’s satisfying

[TABLE]

for some $\alpha>0$. Then

[TABLE]

Theorem 3.9 (Ergodic convergence rate)

Under the assumptions of Theorem 3.8, let $\bar{{\mathbf{x}}}^{K+1}=\frac{{\mathbf{x}}^{K+1}+\sum_{k=1}^{K}{\mathbf{x}}^{k+1}/m}{1+K/m}$ and

[TABLE]

Then we have the same results as those in (34) and (35). In addition, given any $\epsilon>0$ and $0<\sigma<1$ , if $K$ satisfies (36), then $\bar{{\mathbf{x}}}^{K+1}$ is an $(\epsilon,\sigma)$ -solution to (1).

Remark 3.1

Comparing the settings of ${\mathbf{P}}_{i}$ ’s in Theorems 3.5 and 3.8, we see that they are only weakly affected by the delay if $\tau=o(\sqrt{m})$ , which holds for problems involving extremely many variables. If all $p$ nodes compute at the same rate, $\tau$ is of the same order as $p$ [33], and thus Theorem 3.8 indicates that nearly linear speed-up can be achieved on $O(\sqrt{m})$ nodes. Even without the nonseparable affine constraint, this quantity is better than that required in [27]. In addition, when $\tau=0$ , Algorithm 2 reduces to Algorithm 1, and their convergence results coincide.

4 Numerical experiments

In this section, we test the proposed methods on the basis pursuit problem (2), the nonnegativity constrained quadratic programming, and also the dual SVM (4). We demonstrate their efficacy by comparing to several other existing algorithms.

4.1 Basis pursuit

The tests in this subsection compare Algorithm 1 to the linearized augmented Lagrangian method (LALM) and the open-source solver YALL1 [49] on the basis pursuit problem (2). Putting all variables into a single block, we can regard LALM as a special case of Algorithm 1 with $m=1$ , and YALL1 is a linearized ADMM with penalty parameter adaptively updated based on primal and dual residuals.

The matrix ${\mathbf{A}}\in\mathbb{R}^{q\times 1000}$ in (2) was randomly generated with $q$ varying among $\{200,300,400\}$ , and its entries independently follow the standard Gaussian distribution. We normalized each row of ${\mathbf{A}}$ . A sparse vector ${\mathbf{x}}^{o}$ was then generated with 30 nonzero entries that follow the standard Gaussian distribution and whose locations are chosen uniformly at random. The vector ${\mathbf{b}}$ was set to ${\mathbf{A}}{\mathbf{x}}^{o}$ . We evenly partitioned the variable ${\mathbf{x}}$ into 100 blocks, and we set $\rho=\frac{\beta}{100}$ and ${\mathbf{P}}_{i}=\beta\|{\mathbf{A}}_{i}\|^{2}{\mathbf{I}},\,i=1,\ldots,100$ , where $\|{\mathbf{A}}_{i}\|$ denotes the spectral norm of ${\mathbf{A}}_{i}$ . For LALM, we treated it as a special case of Algorithm 1 with a single block and set $\rho=\beta$ and ${\mathbf{P}}=\beta\|{\mathbf{A}}\|^{2}{\mathbf{I}}$ . The same values of $\beta$ were used for both Algorithm 1 and LALM. The parameters of YALL1 were set to the default values.

To compare the performance of the three algorithms, we plot their values of $|F({\mathbf{x}}^{t})-F({\mathbf{x}}^{*})|$ and $\|{\mathbf{A}}{\mathbf{x}}^{t}-{\mathbf{b}}\|$ with respect to $t$ , where $t$ denotes the epoch number (each epoch is equivalent to updating $m$ ${\mathbf{x}}$ -blocks). Since the three algorithms have roughly the same per-epoch complexity, the plot in terms of running time will be similar. In Figure 1, we fixed $q=300$ and varied $\beta$ among $\{1,10,100\}$ . From the results, we see that the proposed algorithm performs significantly better than LALM and comparably to YALL1. In addition, the parameter $\beta$ affected both Algorithm 1 and LALM, but the former was only weakly affected. In Figure 2, we set $\beta=\sqrt{q}$ and varied $q$ among $\{200,300,400\}$ . Again we see that the proposed algorithm is significantly better than LALM. For $q=200$ , Algorithm 1 is slightly better than YALL1, and for $q=300$ and $400$ , they perform equally well.
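For readers who want to experiment, the following is a minimal sketch of a randomized primal-dual coordinate update on basis pursuit, $\min\|{\mathbf{x}}\|_{1}$ subject to ${\mathbf{A}}{\mathbf{x}}={\mathbf{b}}$. It is not the author's code. It assumes $f=0$ and $g_{i}=\|\cdot\|_{1}$, uses the recursions ${\mathbf{r}}^{k+1}={\mathbf{r}}^{k}+{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{k})$ and ${\boldsymbol{\lambda}}^{k+1}={\boldsymbol{\lambda}}^{k}-\rho{\mathbf{r}}^{k+1}$ from section 3, and, unlike the 100-block setting of this subsection, treats every coordinate as its own block so that $\|{\mathbf{A}}_{i}\|$ is simply a column norm. The function names are hypothetical.

```python
import math
import random
from typing import List, Tuple


def soft_threshold(z: float, t: float) -> float:
    """Proximal map of t * |.|: shrink z toward zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def randomized_pd_bcu_bp(A: List[List[float]], b: List[float], beta: float,
                         epochs: int, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Sketch for min ||x||_1 s.t. Ax = b with single-coordinate blocks (m = n).

    Returns the final x and the maintained residual r = Ax - b.
    """
    q, n = len(A), len(A[0])
    m = n                                    # one block per coordinate (assumption)
    rho = beta / m                           # dual stepsize rho = beta / m
    cols = [[A[i][j] for i in range(q)] for j in range(n)]
    # eta_j = beta * ||a_j||^2 plays the role of P_i = beta * ||A_i||^2 * I.
    eta = [beta * sum(v * v for v in col) for col in cols]
    if min(eta) <= 0.0:
        raise ValueError("beta must be positive and A must have no zero column")
    rng = random.Random(seed)
    x = [0.0] * n
    r = [-bi for bi in b]                    # r = Ax - b at x = 0
    lam = [0.0] * q                          # lambda^0 = 0
    for _ in range(epochs * m):
        j = rng.randrange(n)                 # block chosen uniformly at random
        a = cols[j]
        # Linear term -a_j^T (lambda - beta * r); no smooth f in this example.
        grad = -sum(a[i] * (lam[i] - beta * r[i]) for i in range(q))
        x_new = soft_threshold(x[j] - grad / eta[j], 1.0 / eta[j])
        delta = x_new - x[j]
        x[j] = x_new
        for i in range(q):
            r[i] += a[i] * delta             # keep r = Ax - b up to date
            lam[i] -= rho * r[i]             # dual step uses the new residual
    return x, r
```

Maintaining ${\mathbf{r}}$ incrementally keeps the per-iteration cost proportional to the number of rows of ${\mathbf{A}}$ rather than to the size of the whole matrix.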

4.2 Quadratic programming

In this subsection, we simulate the performance of Algorithm 2 with different delays on solving the nonnegativity constrained quadratic programming (NCQP):

[TABLE]

where ${\mathbf{Q}}$ is a positive semidefinite matrix. We set ${\mathbf{Q}}={\mathbf{H}}{\mathbf{H}}^{\top}$ with ${\mathbf{H}}\in\mathbb{R}^{2000\times 2000}$ randomly generated from standard Gaussian distribution, and the vector ${\mathbf{c}}$ was generated from Gaussian distribution. The matrix ${\mathbf{A}}=[{\mathbf{B}},{\mathbf{I}}]\in\mathbb{R}^{200\times 2000}$ with the entries of ${\mathbf{B}}$ independently following standard Gaussian distribution, and ${\mathbf{b}}$ was generated from uniform distribution on $[0,1]$ . This way, we guarantee the feasibility of (69).
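The feasibility claim can be verified directly. Assuming the constraints of (69) are ${\mathbf{A}}{\mathbf{x}}={\mathbf{b}}$ and ${\mathbf{x}}\geq{\mathbf{0}}$ (the standard NCQP form suggested by the text), with ${\mathbf{B}}\in\mathbb{R}^{200\times 1800}$ and ${\mathbf{I}}$ the $200\times 200$ identity, the point below is feasible because the entries of ${\mathbf{b}}$ lie in $[0,1]$; the symbol ${\mathbf{x}}^{\mathrm{feas}}$ is introduced here only for illustration.

```latex
\mathbf{x}^{\mathrm{feas}} = \begin{bmatrix}\mathbf{0}\\ \mathbf{b}\end{bmatrix}\geq\mathbf{0},
\qquad
\mathbf{A}\mathbf{x}^{\mathrm{feas}}
  = \mathbf{B}\,\mathbf{0} + \mathbf{I}\,\mathbf{b}
  = \mathbf{b}.
```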

We partitioned ${\mathbf{x}}$ into 2,000 blocks, namely, every coordinate was treated as one block. To see how the algorithm is affected by delayed block gradients, $\tau+1$ most recent iterates were kept, and $\hat{{\mathbf{x}}}^{k}$ was set to one of these iterates that was chosen uniformly at random. We varied $\tau$ among $\{0,5,10,20,40\}$ . $\beta$ was tuned to $\sqrt{2}$ , $\rho=\frac{\beta}{2000}$ was used, and ${\mathbf{P}}_{i}$ ’s were set in two different ways. Figure 3 plots the results by Algorithm 2 with ${\mathbf{P}}_{i}$ ’s set according to (68) with $\alpha=1$ . Note that for this instance, we have $L_{i}=Q_{ii}$ , i.e., the $i$ -th diagonal entry of ${\mathbf{Q}}$ for each $i$ , and $L_{r}=\max_{i}\|{\mathbf{q}}_{i}\|$ where ${\mathbf{q}}_{i}$ denotes the $i$ -th column of ${\mathbf{Q}}$ . From the figure, we see that the convergence speed of the algorithm is affected by the delays. Larger $\tau$ gives smaller stepsize and leads to slower convergence. However, the algorithm is hardly affected by delayed block gradient if the same ${\mathbf{P}}_{i}$ ’s were used, as shown in Figure 4. Practically, the maximum delay $\tau$ is unknown, but the results in Figure 4 indicate that we can simply set ${\mathbf{P}}_{i}$ ’s according to Theorem 3.5 regardless of the delay. This implies that our analysis may not be tight.
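The delay simulation described above can be sketched with a bounded buffer of recent iterates. This is an illustrative reading of the setup, not the author's code; the class name `DelayedIterates` and its methods are hypothetical.

```python
import random
from collections import deque
from typing import Deque, List


class DelayedIterates:
    """Keep the tau + 1 most recent iterates and sample one uniformly at random."""

    def __init__(self, x0: List[float], tau: int, seed: int = 0) -> None:
        if tau < 0:
            raise ValueError("tau must be nonnegative")
        # A deque with maxlen drops the oldest iterate automatically.
        self._buf: Deque[List[float]] = deque([list(x0)], maxlen=tau + 1)
        self._rng = random.Random(seed)

    def push(self, x: List[float]) -> None:
        """Record a new iterate; a copy is stored so later in-place
        updates of x do not alter the history."""
        self._buf.append(list(x))

    def sample(self) -> List[float]:
        """Return the simulated delayed vector: one stored iterate chosen uniformly."""
        return self._rng.choice(self._buf)
```

With `tau=0` the buffer holds only the current iterate, so the sampled vector is never outdated, which corresponds to the non-delayed case.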

4.3 Support vector machine

In this subsection, we compare the performance of the async-parallel Algorithm 2 and its sync-parallel counterpart on solving the dual SVM (4). Another way of parallel computing on solving (4) is to directly distribute computation of an algorithm (that may not be BCU type) over multiple nodes, such as the method in [47]. In the test, we used two LIBSVM datasets, rcv1 and news20, whose characteristics are listed in Table 1. The data can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

We partitioned the variable into blocks of size 50 or 51. For both sync and async-parallel methods, $\beta=0.1$ and $\rho=\frac{\beta}{m}$ were set, where $m$ is the number of blocks. As suggested in section 4.2, for the async-parallel method, we set ${\mathbf{P}}_{i}=(L_{i}+\beta\|{\mathbf{A}}_{i}\|^{2}){\mathbf{I}},\,\forall i$ according to Theorem 3.5. For the sync-parallel method, if there are $p$ cores, we selected a set $S$ of $p$ blocks at every iteration and set ${\mathbf{P}}_{i}=\sum_{j\in S}(L_{j}+\beta\|{\mathbf{A}}_{j}\|^{2}){\mathbf{I}}$ for all $i\in S$ . We also used ${\mathbf{P}}_{i}$ ’s the same as those by the async-parallel method but noticed that the sync-parallel method diverged. The larger weight matrices are also suggested in [14] to be proportional to the number of blocks. Note that in the dual SVM (4), if we let ${\mathbf{X}}_{i}$ and ${\mathbf{y}}_{i}$ contain the data points and labels corresponding to the $i$ -th block variable, then $L_{i}$ equals the spectral norm of the matrix $\textnormal{diag}({\mathbf{y}}_{i}){\mathbf{X}}_{i}^{\top}{\mathbf{X}}_{i}\textnormal{diag}({\mathbf{y}}_{i})$ . Since every block only has 50 or 51 coordinates, it is easy to compute $L_{i}$ ’s.
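Since each block has only about 50 coordinates, $L_{i}$ can be obtained from a small Gram matrix. The paper does not say how the $L_{i}$ ’s were computed; the sketch below uses power iteration, which is an assumption made here for illustration. It takes the data points of one block as rows of `points` (a hypothetical layout) and returns an estimate of the spectral norm of $\textnormal{diag}({\mathbf{y}}_{i}){\mathbf{X}}_{i}^{\top}{\mathbf{X}}_{i}\textnormal{diag}({\mathbf{y}}_{i})$.

```python
import math
import random
from typing import Sequence


def block_lipschitz(points: Sequence[Sequence[float]], labels: Sequence[float],
                    iters: int = 200, seed: int = 0) -> float:
    """Estimate || diag(y) X^T X diag(y) ||_2 for one block by power iteration.

    points[s] is the s-th data point of the block, labels[s] its label.
    """
    b = len(points)
    # Gram matrix of the label-scaled points: G[s][t] = y_s * y_t * <p_s, p_t>.
    G = [[labels[s] * labels[t] * sum(u * v for u, v in zip(points[s], points[t]))
          for t in range(b)] for s in range(b)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(b)]   # random start vector
    est = 0.0
    for _ in range(iters):
        w = [sum(G[s][t] * v[t] for t in range(b)) for s in range(b)]
        norm_w = math.sqrt(sum(c * c for c in w))
        if norm_w == 0.0:
            return 0.0                            # G maps v to zero
        est = norm_w / math.sqrt(sum(c * c for c in v))
        v = [c / norm_w for c in w]               # normalize for the next step
    return est
```

Power iteration approaches the largest eigenvalue from below, so when the value is used in ${\mathbf{P}}_{i}$ a small safety margin may be prudent.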

We ran the tests on a machine with 20 cores. Figure 5 plots the results by the sync and async-parallel algorithms on the rcv1 dataset. From the figure, we see that in terms of epoch number, the sync-parallel method converges more slowly if more cores are used, while the async-parallel one converges at almost the same rate with different numbers of cores. As shown in Figure 6, similar results were observed for the news20 dataset. We also measured the speed-up of the two parallel methods in terms of running time. The results are plotted in Figure 7. From the results, we see that the async-parallel method achieves significantly better speed-up than the sync-parallel one, because synchronization at every iteration wastes much time on waiting.

5 Conclusions

We have proposed an async-parallel primal-dual BCU method for convex programming with nonseparable objective and arbitrary linear constraint. As a special case on a single node, the method reduces to a randomized primal-dual BCU for multi-block linearly constrained problems. Convergence and also rate results in probability have been established under convexity assumption. We have also numerically compared the proposed algorithm to several existing methods. The experimental results demonstrate the superior performance of our algorithm over other ones.

Bibliography

[1] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[2] P. Bianchi, W. Hachem, and F. Iutzeler. A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, 2016.
[3] X. Cai, D. Han, and X. Yuan. On the convergence of the direct extension of ADMM for three-block separable convex minimization models with one strongly convex function. Computational Optimization and Applications, 66(1):39–73, 2017.
[4] T.-H. Chang, M. Hong, W.-C. Liao, and X. Wang. Asynchronous distributed ADMM for large-scale optimization, Part I: Algorithm and convergence analysis. IEEE Transactions on Signal Processing, 64(12):3118–3130, 2016.
[5] T.-H. Chang, W.-C. Liao, M. Hong, and X. Wang. Asynchronous distributed ADMM for large-scale optimization, Part II: Linear convergence analysis and numerical performance. IEEE Transactions on Signal Processing, 64(12):3131–3144, 2016.
[6] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2(2):199–222, 1969.
[7] C. Chen, B. He, Y. Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 155(1-2):57–79, 2016.
[8] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

This work is partly supported by NSF grant DMS-1719549.
