Adaptive Relaxed ADMM: Convergence Theory and Practical Implementation

Zheng Xu; Mario A. T. Figueiredo; Xiaoming Yuan; Christoph Studer; Tom Goldstein

arXiv:1704.02712 · cs.CV · April 11, 2017

TL;DR

This paper introduces ARADMM, an adaptive version of relaxed ADMM that automatically tunes parameters for optimal performance, supported by convergence theory and demonstrated through practical applications.

Contribution

It proposes a novel adaptive relaxed ADMM method with automatic parameter tuning, backed by convergence analysis and empirical validation.

Findings

01. ARADMM achieves faster convergence in practice.

02. The method automatically adjusts parameters without user intervention.

03. Numerical experiments confirm the theoretical convergence and efficiency.

Abstract

Many modern computer vision and machine learning applications rely on solving difficult optimization problems that involve non-differentiable objective functions and constraints. The alternating direction method of multipliers (ADMM) is a widely used approach to solve such problems. Relaxed ADMM is a generalization of ADMM that often achieves better performance, but its efficiency depends strongly on algorithm parameters that must be chosen by an expert user. We propose an adaptive method that automatically tunes the key algorithm parameters to achieve optimal performance without user oversight. Inspired by recent work on adaptivity, the proposed adaptive relaxed ADMM (ARADMM) is derived by assuming a Barzilai-Borwein style linear gradient. A detailed convergence analysis of ARADMM is provided, and numerical results on several applications demonstrate fast practical convergence.

Tables

Table 1: Iterations (and runtime in seconds) for various applications. Absence of convergence after $n$ iterations is indicated as $n+$.

| Application | Dataset | #samples × #features¹ | Vanilla ADMM | Relaxed ADMM | Residual balance | Adaptive ADMM | Proposed ARADMM |
|---|---|---|---|---|---|---|---|
| Elastic net regression | Synthetic | 50 × 40 | 2000+(.642) | 2000+(.660) | 424(.144) | 102(.051) | 70(.026) |
| | MNIST | 60000 × 784 | 1225(29.4) | 816(19.9) | 94(2.28) | 41(.943) | 21(.549) |
| | CIFAR10 | 10000 × 3072 | 2000+(690) | 2000+(697) | 556(193) | 2000+(669) | 94(31.7) |
| | News20 | 19996 × 1355191 | 2000+(1.21e4) | 2000+(9.16e3) | 227(914) | 104(391) | 71(287) |
| | Rcv1 | 20242 × 47236 | 2000+(1.20e3) | 1823(802) | 196(79.1) | 104(35.7) | 64(26.0) |
| | Realsim | 72309 × 20958 | 2000+(4.26e3) | 2000+(4.33e3) | 341(355) | 152(125) | 107(88.2) |
| Low rank least squares | Synthetic | 1000 × 200 | 2000+(118) | 2000+(116) | 268(15.1) | 26(1.55) | 18(1.04) |
| | German† | 1000 × 24 | 2000+(4.72) | 642(1.52) | 130(.334) | 52(.125) | [missing] |
| | Spectf | 80 × 44 | 2000+(2.70) | 2000+(2.74) | 336(.455) | 162(.236) | 105(.150) |
| | MNIST | 60000 × 784 | 200+(1.86e3) | 200+(2.08e3) | 200+(3.29e3) | 200+(3.46e3) | 38(658) |
| | CIFAR10 | 10000 × 3072 | 200+(7.24e3) | 200+(1.33e4) | 53(1.60e3) | 8(208) | 6(156) |
| QP and dual SVM | Synthetic | 250 × 500 | 1224(11.5) | 823(7.49) | 626(5.93) | 170(1.57) | 100(.914) |
| | German | 1000 × 24 | 2000+(58.8) | 2000+(61.8) | 1592(45.0) | 1393(38.9) | 1238(34.9) |
| | Spectf | 80 × 44 | 2000+(.846) | 2000+(.777) | 169(.070) | 175(.086) | 53(.026) |
| Consensus logistic regression | Synthetic | 1000 × 25 | 590(9.93) | 391(6.97) | 70(1.23) | 35(.609) | 20(.355) |
| | German | 1000 × 24 | 2000+(34.3) | 2000+(66.6) | 151(2.60) | 35(.691) | 26(.580) |
| | Spectf | 80 × 44 | 1005(20.1) | 667(14.4) | 117(1.98) | 145(1.63) | 85(1.07) |
| | MNIST | 60000 × 784 | 200+(2.99e3) | 200+(3.47e3) | 200+(1.37e3) | 49(536) | 28(333) |
| | CIFAR10 | 10000 × 3072 | 200+(593) | 200+(2.08e3) | 200+(1.54e3) | 131(165) | 19(33.7) |
| Unwrapping SVM | Synthetic | 1000 × 25 | 2000+(1.13) | 1418(.844) | 2000+(1.16) | 355(.229) | 147(.094) |
| | German | 1000 × 24 | 753(1.88) | 560(1.37) | 2000+(4.98) | 572(1.44) | 213(.545) |
| | Spectf | 80 × 44 | 567(.203) | 367(.112) | 567(.185) | 207(.068) | 149(.052) |
| | MNIST | 60000 × 784 | 128(130) | 118(111) | 163(153) | 200+(217) | 67(71.0) |
| | CIFAR10 | 10000 × 3072 | 200+(512) | 200+(532) | 200+(516) | 89(285) | 57(143) |
| Image denoising | Barbara | 512 × 512 | 262(35.0) | 175(23.6) | 74(10.0) | 59(8.67) | 38(5.57) |
| | Cameraman | 256 × 256 | 311(8.96) | 208(5.89) | 82(2.29) | 88(2.76) | 35(1.08) |
| | Lena | 512 × 512 | 347(46.3) | 232(31.3) | 94(12.5) | 68(9.70) | 39(5.58) |
| Robust PCA | FaceSet1 | 64 × 1024 | 2000+(41.1) | 1507(30.3) | 560(11.1) | 561(11.9) | 267(5.65) |
| | FaceSet2 | 64 × 1024 | 2000+(41.1) | 2000+(41.4) | 263(5.54) | 388(9.00) | 188(4.02) |
| | FaceSet3 | 64 × 1024 | 2000+(39.4) | 1843(36.3) | 375(7.44) | 473(9.89) | 299(6.27) |

† Only four of the five entries for this row survive in this copy. They are listed in their original order; which column is actually missing is unknown.

Equations

$$\min_{u\in\mathbb{R}^{n},\,v\in\mathbb{R}^{m}} h(u)+g(v),\quad\text{subject to}\quad Au+Bv=b.$$

$$u_{k+1}=\arg\min_{u}\; h(u)+\frac{\tau_{k}}{2}\left\|b-Au-Bv_{k}+\frac{\lambda_{k}}{\tau_{k}}\right\|^{2}$$

$$\tilde{u}_{k+1}=\gamma_{k}Au_{k+1}+(1-\gamma_{k})(b-Bv_{k})$$

$$v_{k+1}=\arg\min_{v}\; g(v)+\frac{\tau_{k}}{2}\left\|b-\tilde{u}_{k+1}-Bv+\frac{\lambda_{k}}{\tau_{k}}\right\|^{2}$$

$$\lambda_{k+1}=\lambda_{k}+\tau_{k}(b-\tilde{u}_{k+1}-Bv_{k+1})$$

$$r_{k}=b-Au_{k}-Bv_{k}\quad\text{and}\quad d_{k}=\tau_{k}A^{T}B(v_{k}-v_{k-1}).$$

$$\|r_{k}\|\leq\epsilon^{tol}\max\{\|Au_{k}\|,\|Bv_{k}\|,\|b\|\}\quad\text{and}\quad\|d_{k}\|\leq\epsilon^{tol}\|A^{T}\lambda_{k}\|,$$

$$1\leq\gamma_{k}<2,\quad\lim_{k\to\infty}\frac{1}{\tau_{k}^{2}}<\infty,\quad\sum_{k=1}^{\infty}\eta_{k}^{2}<\infty,\quad\text{where }\eta_{k}^{2}=\frac{\gamma_{k}}{2-\gamma_{k}}\max\left(\frac{\tau_{k}^{2}}{\tau_{k-1}^{2}},1\right)-1.$$

$$1\leq\gamma_{k}<2,\quad\lim_{k\to\infty}\tau_{k}^{2}<\infty,\quad\sum_{k=1}^{\infty}\theta_{k}^{2}<\infty,\quad\text{where }\theta_{k}^{2}=\frac{\gamma_{k}}{2-\gamma_{k}}\max\left(\frac{\tau_{k-1}^{2}}{\tau_{k}^{2}},1\right)-1.$$

$$\min_{\zeta\in\mathbb{R}^{p}}\;\underbrace{h^{*}(A^{T}\zeta)-\langle\zeta,b\rangle}_{\hat{h}(\zeta)}+\underbrace{g^{*}(B^{T}\zeta)}_{\hat{g}(\zeta)},$$

$$0\in\frac{\hat{\zeta}_{k+1}-\zeta_{k}}{\tau_{k}}+\partial\hat{h}(\hat{\zeta}_{k+1})+\partial\hat{g}(\zeta_{k}),$$

$$0\in\frac{\zeta_{k+1}-\zeta_{k}}{\tau_{k}}+\gamma_{k}\,\partial\hat{h}(\hat{\zeta}_{k+1})-(1-\gamma_{k})\,\partial\hat{g}(\zeta_{k})+\partial\hat{g}(\zeta_{k+1}),$$

$$\partial\hat{h}(\hat{\zeta})=\alpha_{k}\hat{\zeta}+\Psi_{k}\quad\text{and}\quad\partial\hat{g}(\zeta)=\beta_{k}\zeta+\Phi_{k},$$

$$\partial\hat{h}(\hat{\zeta})=\alpha\hat{\zeta}+\Psi\quad\text{and}\quad\partial\hat{g}(\zeta)=\beta\zeta+\Phi.$$

$$\tau_{k}=\arg\min_{\tau}\frac{1+\alpha\beta\tau^{2}}{(\alpha+\beta)\tau}=\frac{1}{\sqrt{\alpha\beta}}.$$

$$\gamma_{k}=1+\frac{1+\alpha\beta\tau_{k}^{2}}{(\alpha+\beta)\tau_{k}}=1+\frac{2\sqrt{\alpha\beta}}{\alpha+\beta}\leq 2.$$

$$\Delta\hat{\lambda}_{k}:=\hat{\lambda}_{k}-\hat{\lambda}_{k_{0}}\quad\text{and}\quad\Delta\hat{h}_{k}:=A(u_{k}-u_{k_{0}}),$$

$$\hat{\alpha}_{k}=\begin{cases}\hat{\alpha}_{k}^{\mbox{\scriptsize MG}} & \text{if } 2\hat{\alpha}_{k}^{\mbox{\scriptsize MG}}>\hat{\alpha}_{k}^{\mbox{\scriptsize SD}}\\ \hat{\alpha}_{k}^{\mbox{\scriptsize SD}}-\hat{\alpha}_{k}^{\mbox{\scriptsize MG}}/2 & \text{otherwise,}\end{cases}$$

$$\hat{\alpha}_{k}^{\mbox{\scriptsize SD}}=\frac{\langle\Delta\hat{\lambda}_{k},\Delta\hat{\lambda}_{k}\rangle}{\langle\Delta\hat{h}_{k},\Delta\hat{\lambda}_{k}\rangle}\quad\text{and}\quad\hat{\alpha}_{k}^{\mbox{\scriptsize MG}}=\frac{\langle\Delta\hat{h}_{k},\Delta\hat{\lambda}_{k}\rangle}{\langle\Delta\hat{h}_{k},\Delta\hat{h}_{k}\rangle}.$$

$$\alpha_{k}^{\mbox{\scriptsize cor}}=\frac{\langle\Delta\hat{h}_{k},\Delta\hat{\lambda}_{k}\rangle}{\|\Delta\hat{h}_{k}\|\,\|\Delta\hat{\lambda}_{k}\|}\quad\text{and}\quad\beta_{k}^{\mbox{\scriptsize cor}}=\frac{\langle\Delta\hat{g}_{k},\Delta\lambda_{k}\rangle}{\|\Delta\hat{g}_{k}\|\,\|\Delta\lambda_{k}\|}.$$

$$\tau_{k+1}=\begin{cases}\sqrt{\hat{\alpha}_{k}\hat{\beta}_{k}} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ \hat{\alpha}_{k} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\\ \hat{\beta}_{k} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ \tau_{k} & \text{otherwise,}\end{cases}$$

$$\gamma_{k+1}=\begin{cases}1+\frac{2\sqrt{\hat{\alpha}_{k}\hat{\beta}_{k}}}{\hat{\alpha}_{k}+\hat{\beta}_{k}} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ 1.9 & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\\ 1.1 & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ 1.5 & \text{otherwise,}\end{cases}$$

$$\hat{\tau}_{k+1}=\min\{\tau_{k+1},\,(1+C_{cg}/k^{2})\,\tau_{k}\},\qquad\hat{\gamma}_{k+1}=\min\{\gamma_{k+1},\,1+C_{cg}/k^{2}\},$$

$$y=\begin{pmatrix}u\\ v\end{pmatrix}\in\mathbb{R}^{n+m},\quad z=\begin{pmatrix}u\\ v\\ \lambda\end{pmatrix}\in\mathbb{R}^{n+m+p}.$$

$$f(y)=h(u)+g(v),\quad F(z)=\begin{pmatrix}-A^{T}\lambda\\ -B^{T}\lambda\\ Au+Bv-b\end{pmatrix}.$$

$$\forall z,\quad f(y)-f(y^{*})+(z-z^{*})^{T}F(z^{*})\geq 0.$$

$$\forall u,\quad h(u)-h(u_{k+1})+(u-u_{k+1})^{T}\left(\tau_{k}A^{T}(Au_{k+1}+Bv_{k}-b)-A^{T}\lambda_{k}\right)\geq 0,$$

$$\forall v,\quad g(v)-g(v_{k+1})+(v-v_{k+1})^{T}\left(\tau_{k}B^{T}(\tilde{u}_{k+1}+Bv_{k+1}-b)-B^{T}\lambda_{k}\right)\geq 0.$$

$$f(y)-f(y_{k+1})+(z-z_{k+1})^{T}\left(F(z_{k+1})+\Omega(\Delta z_{k}^{+},\tau_{k},\gamma_{k})\right)\geq 0,$$

$$\Omega(\Delta z_{k}^{+},\tau_{k},\gamma_{k})=\begin{pmatrix}\frac{\gamma_{k}-1}{\gamma_{k}}A^{T}\Delta\lambda_{k}^{+}-\frac{\tau_{k}}{\gamma_{k}}A^{T}B\Delta v_{k}^{+}\\ 0\\ \frac{1}{\gamma_{k}\tau_{k}}\Delta\lambda_{k}^{+}-\frac{\gamma_{k}-1}{\gamma_{k}}B\Delta v_{k}^{+}\end{pmatrix}.$$

$$(B\Delta v_{k}^{+})^{T}\Delta\lambda_{k}^{+}\geq 0.$$

$$\frac{2-\gamma_{k}}{\gamma_{k}}\|\tau_{k}B\Delta v_{k}^{+}+\Delta\lambda_{k}^{+}\|^{2}\leq\gamma_{k}\left(\|\tau_{k}B\Delta v_{k}^{*}\|^{2}+\|\Delta\lambda_{k}^{*}\|^{2}\right)-(2-\gamma_{k})\left(\|\tau_{k}B\Delta v_{k+1}^{*}\|^{2}+\|\Delta\lambda_{k+1}^{*}\|^{2}\right).$$

Taxonomy

Methods: Alternating Direction Method of Multipliers

Full text

Adaptive Relaxed ADMM: Convergence Theory and Practical Implementation

Zheng Xu¹, Mário A. T. Figueiredo², Xiaoming Yuan³, Christoph Studer⁴, and Tom Goldstein¹

¹Department of Computer Science, University of Maryland, College Park, MD

²Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal

³Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong

⁴School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

1 Introduction

Modern methods in computer vision and machine learning often require solving difficult optimization problems involving non-differentiable objective functions and constraints. Some popular applications include sparse models [48, 54, 8, 36], low-rank models [47, 23, 53, 31], and support vector machines (SVMs) [4, 3]. The alternating direction method of multipliers (ADMM) is one of the most prominent optimization tools to solve such problems, and tackles problems in the following form:

$$\min_{u\in\mathbb{R}^{n},\,v\in\mathbb{R}^{m}} h(u)+g(v),\quad\text{subject to}\quad Au+Bv=b. \tag{1}$$

Here, $h:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}}$ and $g:{\mathbb{R}}^{m}\rightarrow{\mathbb{R}}$ are closed, proper, and convex functions, $A\in{\mathbb{R}}^{p\times n}$ , $B\in{\mathbb{R}}^{p\times m}$ , and $b\in{\mathbb{R}}^{p}$ . ADMM was first introduced in [16] and [12], and has found applications in a variety of optimization problems in machine learning, image processing, computer vision, wireless communications, and many other areas [2, 21].

Relaxed ADMM is a popular practical variant of ADMM, and proceeds with the following steps:

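$$u_{k+1}=\arg\min_{u}\; h(u)+\frac{\tau_{k}}{2}\left\|b-Au-Bv_{k}+\frac{\lambda_{k}}{\tau_{k}}\right\|^{2} \tag{2}$$

$$\tilde{u}_{k+1}=\gamma_{k}Au_{k+1}+(1-\gamma_{k})(b-Bv_{k}) \tag{3}$$

$$v_{k+1}=\arg\min_{v}\; g(v)+\frac{\tau_{k}}{2}\left\|b-\tilde{u}_{k+1}-Bv+\frac{\lambda_{k}}{\tau_{k}}\right\|^{2} \tag{4}$$

$$\lambda_{k+1}=\lambda_{k}+\tau_{k}(b-\tilde{u}_{k+1}-Bv_{k+1}) \tag{5}$$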

Here, $\lambda_{k}\!\in\!{\mathbb{R}}^{p}$ denotes the dual variables (Lagrange multipliers) on iteration $k$ , and $(\tau_{k},\gamma_{k})$ are sequences of penalty and relaxation parameters. Relaxed ADMM coincides with the original non-relaxed version if $\gamma_{k}=1$ .
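To make the iteration concrete, here is a minimal sketch of relaxed ADMM with fixed parameters on a small lasso problem. Several things are assumptions for illustration and not taken from the paper: the test problem itself ($h(u)=\tfrac{1}{2}\|Du-c\|^{2}$, $g(v)=\rho\|v\|_{1}$, $A=I$, $B=-I$, $b=0$), the values `tau=20.0` and `gamma=1.5`, the function names, and the use of the standard relaxed ADMM updates in the sign convention $\lambda_{k+1}=\lambda_{k}+\tau_{k}(b-\tilde{u}_{k+1}-Bv_{k+1})$.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def relaxed_admm_lasso(D, c, rho, tau=20.0, gamma=1.5, iters=500):
    """Relaxed ADMM for min 0.5*||D u - c||^2 + rho*||v||_1  s.t.  u - v = 0.

    Illustrative splitting: A = I, B = -I, b = 0 in the notation of problem (1).
    """
    n = D.shape[1]
    u = np.zeros(n)
    v = np.zeros(n)
    lam = np.zeros(n)
    H = D.T @ D + tau * np.eye(n)   # constant because tau is held fixed here
    Dtc = D.T @ c
    for _ in range(iters):
        # u-step: minimize h(u) + tau/2 * ||b - A u - B v + lam/tau||^2
        u = np.linalg.solve(H, Dtc + tau * v + lam)
        # relaxation: u_tilde = gamma * A u + (1 - gamma) * (b - B v)
        u_tilde = gamma * u + (1.0 - gamma) * v
        # v-step: minimize g(v) + tau/2 * ||b - u_tilde - B v + lam/tau||^2
        v = soft_threshold(u_tilde - lam / tau, rho / tau)
        # dual step: lam <- lam + tau * (b - u_tilde - B v)
        lam = lam + tau * (v - u_tilde)
    return u, v, lam

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 10))
c = rng.standard_normal(30)
u, v, lam = relaxed_admm_lasso(D, c, rho=2.0)
```

Setting `gamma=1.0` reduces the loop to non-relaxed ADMM, since then `u_tilde` equals `u`.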

Convergence of (relaxed) ADMM is guaranteed under fairly general assumptions [6, 25, 26, 10], if the penalty and relaxation parameters are held constant. However, the practical performance of ADMM depends strongly on the choice of these parameters, as well as on the problem being solved. Good penalty choices are known for certain ADMM formulations, such as strictly convex quadratic problems [40, 14], and for the gradient descent parameter in the “linearized” ADMM [32, 34].

Adaptive penalty methods (in which the penalty parameters are tuned automatically as the algorithm proceeds) achieve good performance without user oversight. For non-relaxed ADMM, the authors of [24] propose methods that modulate the penalty parameter so that the primal and dual residuals (i.e., derivatives of the Lagrangian with respect to primal and dual variables) are of approximately equal size. This “residual balancing” approach has been generalized to work with preconditioned variants of ADMM [20] and distributed ADMM [44]. In [51], a spectral penalty parameter method is proposed that uses the local curvature of the objective to achieve fast convergence. All of these methods are specific to (non-relaxed) vanilla ADMM, and do not apply to the more general case involving a relaxation parameter.

1.1 Overview & contributions

In this paper, we study adaptive parameter choices for the relaxed ADMM that jointly and automatically tune both the penalty parameter $\tau_{k}$ and relaxation parameter $\gamma_{k}$ . In Section 3, we address theoretical questions about the convergence of ADMM with non-constant penalty and relaxation parameters. In Section 4, we discuss practical methods for choosing these parameters. In Section 6, we apply the proposed ARADMM to several problems in machine learning, computer vision, and image processing. Finally, in Section 7, we compare ARADMM to other ADMM variants and examine the benefits of the proposed approach for real-world regression, classification, and image processing problems.

2 Related work

Sparse and low rank methods are widely used in computer vision [48, 54, 8, 47, 23, 36, 53, 31], machine learning [7, 57, 43, 9, 33], and image processing [42, 21]. ADMM has been extensively applied to solve such problems [2, 21, 51, 50], and has recently found applications in neural networks [56, 45], tensor decomposition [18, 35, 52], structure from motion [19], and other vision problems.

The $O(1/k)$ convergence rate of non-relaxed ADMM is established under mild conditions for convex problems [25, 26]. The $O(1/k^{2})$ convergence rate is discussed in [17, 21, 27, 46], where at least one of the functions is assumed either strongly convex or smooth. For the general relaxed ADMM formulation, an $O(1/k)$ convergence rate is provided under mild conditions [10]. Linear convergence can be achieved with strong convexity assumptions [5, 38, 15]. All of these results assume constant parameters; it is considerably harder to prove convergence when the algorithm parameters are adaptive.

Fixed optimal parameters are discussed in the literature. For the specific case in which the objective is quadratic, a criterion is proposed in [40, 14]. The authors of [38] suggest a grid search and semidefinite programming based method to determine the optimal relaxation and penalty parameters. These methods, however, make strong assumptions about the objective and require knowledge of condition numbers.

Adaptive penalty methods are proposed to accelerate the practical convergence of non-relaxed ADMM [24, 51]. For the relaxation parameter, it has been suggested in [6] that over-relaxation ( $\gamma\in(1,2)$ ) may accelerate convergence and $\gamma=1.5$ achieves faster convergence in a specific distributed computing application. The proposed ARADMM simultaneously adapts both the penalty and the relaxation parameter, thus being fully automated.

3 Convergence theory

We study conditions under which ADMM converges with adaptive penalty and relaxation parameters. Our approach utilizes the variational inequality (VI) methods put forward in [24, 25, 26]. Our results measure convergence using the primal and dual “residuals,” which are defined as

$$r_{k}=b-Au_{k}-Bv_{k}\quad\text{and}\quad d_{k}=\tau_{k}A^{T}B(v_{k}-v_{k-1}). \tag{6}$$

It has been observed that these residuals approach zero as the algorithm approaches a true solution [2]. Typically, the iterative process is stopped if

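$$\|r_{k}\|\leq\epsilon^{tol}\max\{\|Au_{k}\|,\|Bv_{k}\|,\|b\|\}\quad\text{and}\quad\|d_{k}\|\leq\epsilon^{tol}\|A^{T}\lambda_{k}\|, \tag{7}$$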

where $\epsilon^{tol}>0$ is the stopping tolerance [2]. For this reason, it is important to know that the method converges in the sense that the residuals approach zero as $k\to\infty.$
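The residuals and the relative stopping test can be sketched as follows, using $r_{k}=b-Au_{k}-Bv_{k}$, $d_{k}=\tau_{k}A^{T}B(v_{k}-v_{k-1})$, and the criterion $\|r_{k}\|\leq\epsilon^{tol}\max\{\|Au_{k}\|,\|Bv_{k}\|,\|b\|\}$, $\|d_{k}\|\leq\epsilon^{tol}\|A^{T}\lambda_{k}\|$. The function names and the default `eps_tol=1e-5` are assumptions for illustration.

```python
import numpy as np

def residuals(A, B, b, u, v, v_prev, tau):
    """Primal residual r_k and dual residual d_k of (relaxed) ADMM."""
    r = b - A @ u - B @ v
    d = tau * (A.T @ (B @ (v - v_prev)))
    return r, d

def should_stop(A, B, b, u, v, v_prev, lam, tau, eps_tol=1e-5):
    """Relative stopping test on both residuals (eps_tol is a hypothetical default)."""
    r, d = residuals(A, B, b, u, v, v_prev, tau)
    primal_scale = max(np.linalg.norm(A @ u), np.linalg.norm(B @ v), np.linalg.norm(b))
    dual_scale = np.linalg.norm(A.T @ lam)
    return bool(np.linalg.norm(r) <= eps_tol * primal_scale
                and np.linalg.norm(d) <= eps_tol * dual_scale)
```

Because both tests are relative, the same tolerance can be reused across problems whose variables differ in scale.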

In the sequel, we prove that relaxed ADMM converges in the residual sense, provided that the algorithm parameters satisfy one of the following two assumptions.

Assumption 1.

The relaxation sequence $\gamma_{k}$ and penalty sequence $\tau_{k}$ satisfy

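$$1\leq\gamma_{k}<2,\quad\lim_{k\to\infty}\frac{1}{\tau_{k}^{2}}<\infty,\quad\sum_{k=1}^{\infty}\eta_{k}^{2}<\infty,\quad\text{where }\eta_{k}^{2}=\frac{\gamma_{k}}{2-\gamma_{k}}\max\left(\frac{\tau_{k}^{2}}{\tau_{k-1}^{2}},1\right)-1. \tag{8}$$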

Assumption 2.

The relaxation sequence $\gamma_{k}$ and penalty sequence $\tau_{k}$ satisfy

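$$1\leq\gamma_{k}<2,\quad\lim_{k\to\infty}\tau_{k}^{2}<\infty,\quad\sum_{k=1}^{\infty}\theta_{k}^{2}<\infty,\quad\text{where }\theta_{k}^{2}=\frac{\gamma_{k}}{2-\gamma_{k}}\max\left(\frac{\tau_{k-1}^{2}}{\tau_{k}^{2}},1\right)-1. \tag{9}$$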

In Section 5, we prove adaptive relaxed ADMM converges if the algorithm parameters satisfy either Assumption 1 or Assumption 2. Before presenting the proof, we show how to choose the relaxation parameters that lead to efficient performance in practice.

4 ARADMM: Adaptive relaxed ADMM

Spectral stepsize selection methods for vanilla ADMM were discussed in [51]. Here, we modify the adaptive ADMM framework in two important ways. First, we discuss the selection of penalty parameters in the presence of the relaxation term. Second, we discuss adaptive methods for automatically selecting the relaxation parameter as well.

The proposed method works by assuming a local linear model for the dual optimization problem, and then selecting an optimal stepsize under this assumption. A safeguarding method is adopted to ensure that bad stepsizes are not chosen in case these linearity assumptions fail to hold.

4.1 Dual interpretation of relaxed ADMM

We derive our adaptive stepsize rules by examining the close relationship between relaxed ADMM and the relaxed Douglas-Rachford Splitting (DRS) [6, 5, 15]. The dual of the general constrained problem (1) is

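$$\min_{\zeta\in\mathbb{R}^{p}}\;\underbrace{h^{*}(A^{T}\zeta)-\langle\zeta,b\rangle}_{\hat{h}(\zeta)}+\underbrace{g^{*}(B^{T}\zeta)}_{\hat{g}(\zeta)}, \tag{10}$$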

with $f^{*}$ denoting the Fenchel conjugate of $f$ , defined as $f^{*}(y)=\sup_{x}\langle x,y\rangle-f(x)$ [41].

The relaxed DRS algorithm solves (10) by generating two sequences, $(\zeta_{k})_{k\in{\mathbb{N}}}$ and $(\hat{\zeta}_{k})_{k\in{\mathbb{N}}},$ according to

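$$0\in\frac{\hat{\zeta}_{k+1}-\zeta_{k}}{\tau_{k}}+\partial\hat{h}(\hat{\zeta}_{k+1})+\partial\hat{g}(\zeta_{k}), \tag{11}$$

$$0\in\frac{\zeta_{k+1}-\zeta_{k}}{\tau_{k}}+\gamma_{k}\,\partial\hat{h}(\hat{\zeta}_{k+1})-(1-\gamma_{k})\,\partial\hat{g}(\zeta_{k})+\partial\hat{g}(\zeta_{k+1}), \tag{12}$$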

where $\gamma_{k}$ is a relaxation parameter, and $\partial f(x)$ denotes the subdifferential of $f$ evaluated at $x$ [41]. Referring back to ADMM in (2)–(5), and defining $\hat{\lambda}_{k+1}=\lambda_{k}+\tau_{k}(b-Au_{k+1}-Bv_{k})$ , the sequences $(\lambda_{k})_{k\in{\mathbb{N}}}$ and $(\hat{\lambda}_{k})_{k\in{\mathbb{N}}}$ satisfy the same conditions (11) and (12) as $(\zeta_{k})_{k\in{\mathbb{N}}}$ and $(\hat{\zeta}_{k})_{k\in{\mathbb{N}}}$ , thus ADMM for the problem (1) is equivalent to DRS on its dual (10). A detailed proof of this is provided in the supplementary material.

4.2 Spectral adaptive stepsize rule

Adaptive stepsize rules of the “spectral” type were originally proposed for simple gradient descent on smooth problems by Barzilai and Borwein [1], and have been found to dramatically outperform constant stepsizes in many applications [11, 49]. Spectral stepsize methods work by modeling the gradient of the objective as a linear function, and then selecting the optimal stepsize for this simplified linear model.

Spectral methods were recently used to determine the penalty parameter for the non-relaxed ADMM in [51]. Inspired by that work, we derive spectral stepsize rules assuming a linear model/approximation for $\partial\hat{h}(\hat{\zeta})$ and $\partial\hat{g}(\zeta)$ at iteration $k$ given by

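$$\partial\hat{h}(\hat{\zeta})=\alpha_{k}\hat{\zeta}+\Psi_{k}\quad\text{and}\quad\partial\hat{g}(\zeta)=\beta_{k}\zeta+\Phi_{k}, \tag{13}$$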

where $\alpha_{k}>0$ , $\beta_{k}>0$ are local curvature estimates of $\hat{h}$ and $\hat{g}$ , respectively, and $\Psi_{k},\Phi_{k}\subset{\mathbb{R}}^{p}$ . Once we obtain these curvature estimates, we will exploit the following simple proposition whose proof is given in the supplementary material.

Proposition 1.

Suppose the DRS steps (11)–(12) are applied to problem (10), where (omitting iteration $k$ from $\alpha_{k},\beta_{k},\Psi_{k},\Phi_{k}$ to lighten the notation in what follows)

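$$\partial\hat{h}(\hat{\zeta})=\alpha\hat{\zeta}+\Psi\quad\text{and}\quad\partial\hat{g}(\zeta)=\beta\zeta+\Phi. \tag{14}$$

Then, the residual of $\,\hat{h}(\zeta_{k+1})+\hat{g}(\zeta_{k+1})$ will be zero if $\tau_{k}$ and $\gamma_{k}$ are chosen to satisfy $\gamma_{k}=1+\frac{1+\alpha\beta\tau_{k}^{2}}{(\alpha+\beta)\tau_{k}}.$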

Our adaptive method works by fitting a linear model to the gradient (or subgradient) of our objective, and then using Proposition 1 to select an optimal stepsize pair that obtains zero residual on the model problem. For our convergence theory to hold, we need $\gamma<2.$ For fixed values of $\alpha$ and $\beta,$ the minimal value of $\gamma_{k}$ that is still optimal for the linear model occurs if we choose

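$$\tau_{k}=\arg\min_{\tau}\frac{1+\alpha\beta\tau^{2}}{(\alpha+\beta)\tau}=\frac{1}{\sqrt{\alpha\beta}}. \tag{15}$$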

Note this is the same “optimal” penalty parameter proposed for non-relaxed ADMM in [51]. Under this choice of $\tau_{k},$ we then have the “optimal” relaxation parameter

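$$\gamma_{k}=1+\frac{1+\alpha\beta\tau_{k}^{2}}{(\alpha+\beta)\tau_{k}}=1+\frac{2\sqrt{\alpha\beta}}{\alpha+\beta}\leq 2. \tag{16}$$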
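The closed forms for the minimizing penalty and the resulting relaxation parameter follow from the arithmetic-geometric mean inequality. The short derivation below is a worked check added for clarity, for $\alpha,\beta,\tau>0$, and starts from the expression $\frac{1+\alpha\beta\tau^{2}}{(\alpha+\beta)\tau}$ in Proposition 1.

```latex
\frac{1+\alpha\beta\tau^{2}}{(\alpha+\beta)\tau}
  = \frac{1}{\alpha+\beta}\left(\frac{1}{\tau}+\alpha\beta\tau\right)
  \;\geq\; \frac{2\sqrt{\alpha\beta}}{\alpha+\beta},
\qquad\text{with equality iff}\quad \frac{1}{\tau}=\alpha\beta\tau
  \;\Longleftrightarrow\; \tau=\frac{1}{\sqrt{\alpha\beta}} .

\text{Since } 2\sqrt{\alpha\beta}\leq\alpha+\beta:\qquad
\gamma = 1+\frac{2\sqrt{\alpha\beta}}{\alpha+\beta}\;\leq\; 2,
\qquad\text{with equality iff}\quad \alpha=\beta .
```

In particular, the bound $\gamma<2$ required by the convergence theory is strict whenever the two curvature estimates differ.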

4.3 Estimation of stepsizes

We now propose a simple method for fitting a linear model to the dual objective terms so that the formulas in Section 4.2 can be used to obtain stepsizes. Once these linear models are formed, the optimal penalty parameter and relaxation term can be calculated by (15) and (16), thanks to the equivalence of relaxed ADMM and DRS.

In what follows, we let $\hat{\alpha}_{k}=1/\alpha_{k}$ and $\hat{\beta}_{k}=1/\beta_{k}$ to simplify notation. The optimal stepsize choice is then written as $\tau_{k}=(\hat{\alpha}_{k}\,\hat{\beta}_{k})^{1/2}$ and $\gamma_{k}=1+\frac{2\sqrt{\hat{\alpha}_{k}\hat{\beta}_{k}}}{\hat{\alpha}_{k}+\hat{\beta}_{k}}$ .

The estimation of $\hat{\alpha}_{k}$ and $\hat{\beta}_{k}$ for the dual components $\hat{h}(\hat{\lambda}_{k})$ and $\hat{g}(\lambda_{k})$ at the $k$ -th iteration of primal ADMM has been described in [51]. It is easy to verify that the model parameters $\hat{\alpha}_{k}$ and $\hat{\beta}_{k}$ of relaxed ADMM can be estimated based on the results from iteration $k$ and an older iteration $k_{0}<k$ in a similar way. If we define

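$$\Delta\hat{\lambda}_{k}:=\hat{\lambda}_{k}-\hat{\lambda}_{k_{0}}\quad\text{and}\quad\Delta\hat{h}_{k}:=A(u_{k}-u_{k_{0}}), \tag{17}$$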

then the parameter $\hat{\alpha}_{k}$ is obtained from the formula

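$$\hat{\alpha}_{k}=\begin{cases}\hat{\alpha}_{k}^{\mbox{\scriptsize MG}} & \text{if } 2\hat{\alpha}_{k}^{\mbox{\scriptsize MG}}>\hat{\alpha}_{k}^{\mbox{\scriptsize SD}}\\ \hat{\alpha}_{k}^{\mbox{\scriptsize SD}}-\hat{\alpha}_{k}^{\mbox{\scriptsize MG}}/2 & \text{otherwise,}\end{cases} \tag{18}$$

where

$$\hat{\alpha}_{k}^{\mbox{\scriptsize SD}}=\frac{\langle\Delta\hat{\lambda}_{k},\Delta\hat{\lambda}_{k}\rangle}{\langle\Delta\hat{h}_{k},\Delta\hat{\lambda}_{k}\rangle}\quad\text{and}\quad\hat{\alpha}_{k}^{\mbox{\scriptsize MG}}=\frac{\langle\Delta\hat{h}_{k},\Delta\hat{\lambda}_{k}\rangle}{\langle\Delta\hat{h}_{k},\Delta\hat{h}_{k}\rangle}. \tag{19}$$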

For a detailed derivation of these formulas, see [51].

The spectral stepsize $\hat{\beta}_{k}$ of $\hat{g}(\lambda_{k})$ is similarly estimated with $\Delta\hat{g}_{k}\!:=\!B(v_{k}-v_{k_{0}})$ , and $\Delta\lambda_{k}\!:=\!\lambda_{k}-\lambda_{k_{0}}$ . It is important to note that $\hat{\alpha}_{k}$ and $\hat{\beta}_{k}$ are obtained from the iterates of ADMM alone, i.e., our scheme does not require the user to supply the dual problem.
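The hybrid estimate can be sketched in a few lines. It takes a change in the dual objective's (sub)gradient, such as $A(u_{k}-u_{k_{0}})$ or $B(v_{k}-v_{k_{0}})$, and the matching change in the dual variable, forms the steepest descent (SD) and minimum gradient (MG) quotients, and returns MG when $2\,\mathrm{MG}>\mathrm{SD}$ and $\mathrm{SD}-\mathrm{MG}/2$ otherwise. The function name `spectral_stepsize` is hypothetical, and the sketch assumes the inner product of the two inputs is nonzero (the correlation safeguard is expected to screen out that case).

```python
import numpy as np

def spectral_stepsize(d_obj, d_dual):
    """Hybrid SD/MG spectral stepsize.

    d_obj  : change in the dual (sub)gradient, e.g. A @ (u_k - u_k0)
    d_dual : change in the dual variable, e.g. lam_hat_k - lam_hat_k0
    Raises ZeroDivisionError if <d_obj, d_dual> or ||d_obj|| is exactly zero.
    """
    cross = float(np.dot(d_obj, d_dual))
    sd = float(np.dot(d_dual, d_dual)) / cross   # steepest descent quotient
    mg = cross / float(np.dot(d_obj, d_obj))     # minimum gradient quotient
    return mg if 2.0 * mg > sd else sd - mg / 2.0
```

When the linear model holds exactly with curvature $\alpha$, so that `d_obj == alpha * d_dual`, both quotients equal $1/\alpha$ and the function returns $\hat{\alpha}=1/\alpha$.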

4.4 Safeguarding

Spectral stepsize methods for simple gradient descent are paired with a backtracking line search to guarantee convergence in case the linear model assumptions break down and an unstable stepsize is produced. ADMM methods have no analog of backtracking. Rather, we adopt the correlation criterion proposed in [51] to test the validity of the local linear assumption, and only rely on the adaptive model when the assumptions are deemed valid. To this end, we define

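$$\alpha_{k}^{\mbox{\scriptsize cor}}=\frac{\langle\Delta\hat{h}_{k},\Delta\hat{\lambda}_{k}\rangle}{\|\Delta\hat{h}_{k}\|\,\|\Delta\hat{\lambda}_{k}\|}\quad\text{and}\quad\beta_{k}^{\mbox{\scriptsize cor}}=\frac{\langle\Delta\hat{g}_{k},\Delta\lambda_{k}\rangle}{\|\Delta\hat{g}_{k}\|\,\|\Delta\lambda_{k}\|}. \tag{20}$$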

When the model assumptions (14) hold perfectly, the vectors $\Delta\hat{h}_{k}$ and $\Delta\hat{\lambda}_{k}$ should be highly correlated and we get $\alpha^{\mbox{\scriptsize cor}}_{k}=1.$ When $\alpha^{\mbox{\scriptsize cor}}_{k}$ or $\beta^{\mbox{\scriptsize cor}}_{k}$ is small, the model assumptions are invalid and the spectral stepsize may not be effective.

The proposed method uses the following update rules

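$$\tau_{k+1}=\begin{cases}\sqrt{\hat{\alpha}_{k}\hat{\beta}_{k}} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ \hat{\alpha}_{k} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\\ \hat{\beta}_{k} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ \tau_{k} & \text{otherwise,}\end{cases} \tag{21}$$

$$\gamma_{k+1}=\begin{cases}1+\frac{2\sqrt{\hat{\alpha}_{k}\hat{\beta}_{k}}}{\hat{\alpha}_{k}+\hat{\beta}_{k}} & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ 1.9 & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\\ 1.1 & \text{if }\alpha_{k}^{\mbox{\scriptsize cor}}\leq\epsilon^{\mbox{\scriptsize cor}}\text{ and }\beta_{k}^{\mbox{\scriptsize cor}}>\epsilon^{\mbox{\scriptsize cor}}\\ 1.5 & \text{otherwise,}\end{cases} \tag{22}$$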

where $\epsilon^{\mbox{\scriptsize cor}}$ is a quality threshold for the curvature estimates, while $\hat{\alpha}_{k}$ and $\hat{\beta}_{k}$ are the spectral stepsizes estimated in Section 4.3. The update for $\tau_{k+1}$ only uses model parameters that have been accurately estimated. When the model is effective for $h$ but not $g,$ we use a large $\gamma_{k}=1.9$ to make the $v$ update conservative relative to the $u$ update. When the model is effective for $g$ but not $h,$ we use a small $\gamma_{k}=1.1$ to make the $v$ update aggressive relative to the $u$ update.
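The four-way safeguarded update can be sketched as below: both estimates trusted gives $\tau=\sqrt{\hat{\alpha}\hat{\beta}}$ and $\gamma=1+2\sqrt{\hat{\alpha}\hat{\beta}}/(\hat{\alpha}+\hat{\beta})$; only $\hat{\alpha}$ trusted gives $(\hat{\alpha},1.9)$; only $\hat{\beta}$ trusted gives $(\hat{\beta},1.1)$; otherwise the penalty is kept and $\gamma=1.5$. The function names are hypothetical, and returning a correlation of `0.0` for a zero-length difference vector is an assumption made here so that the fallback branch is taken.

```python
import math
import numpy as np

def correlation(d_obj, d_dual):
    """Cosine between a gradient change and a dual-variable change."""
    denom = float(np.linalg.norm(d_obj)) * float(np.linalg.norm(d_dual))
    # Assumption: treat a degenerate (zero) difference as "model not valid".
    return float(np.dot(d_obj, d_dual)) / denom if denom > 0.0 else 0.0

def safeguarded_update(alpha_hat, beta_hat, alpha_cor, beta_cor, tau_k, eps_cor=0.2):
    """Return (tau_next, gamma_next) using only curvature estimates that pass the test."""
    a_ok = alpha_cor > eps_cor
    b_ok = beta_cor > eps_cor
    if a_ok and b_ok:
        root = math.sqrt(alpha_hat * beta_hat)
        return root, 1.0 + 2.0 * root / (alpha_hat + beta_hat)
    if a_ok:                      # model trusted for h but not for g
        return alpha_hat, 1.9
    if b_ok:                      # model trusted for g but not for h
        return beta_hat, 1.1
    return tau_k, 1.5             # neither estimate trusted: keep the penalty
```

For example, `safeguarded_update(4.0, 1.0, 0.9, 0.8, 7.0)` takes the first branch, while lowering either correlation below `eps_cor` switches to one of the fixed relaxation values.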

4.5 Applying convergence guarantee

Our convergence theory requires either Assumption 1 or Assumption 2 to be satisfied, which suggests that convergence is guaranteed under “bounded adaptivity” for both penalty and relaxation parameters. These conditions can be guaranteed by explicitly adding constraints to the stepsize choice in ARADMM.

To guarantee convergence, we simply replace the parameter updates (21) and (22) with

$$\hat{\tau}_{k+1}=\min\{\tau_{k+1},\,(1+C_{cg}/k^{2})\,\tau_{k}\},\qquad\hat{\gamma}_{k+1}=\min\{\gamma_{k+1},\,1+C_{cg}/k^{2}\}, \tag{23}$$

where $C_{cg}$ is some (large) constant. It is easily verified that the parameter sequence $(\hat{\tau}_{k},\hat{\gamma}_{k})$ satisfies Assumption 1. In practice, the update schemes (21) and (22) converge reliably without explicitly enforcing these conditions. We use a very large $C_{cg}$ such that the conditions are not triggered in the first few thousand iterations, and provide these constraints for theoretical interest.
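A minimal sketch of this bounded-adaptivity cap, which limits the growth of the penalty to a factor $1+C_{cg}/k^{2}$ and the relaxation parameter to $1+C_{cg}/k^{2}$, is given below. The function name and the default `c_cg=1e10` are assumptions; the text only says the constant is very large.

```python
def bounded_update(tau_new, gamma_new, tau_k, k, c_cg=1e10):
    """Cap a proposed (tau, gamma) pair at iteration k >= 1.

    tau may grow by at most a factor (1 + c_cg / k^2) relative to tau_k,
    and gamma is capped at 1 + c_cg / k^2. Decreases of tau are not limited.
    """
    growth = 1.0 + c_cg / k**2
    return min(tau_new, growth * tau_k), min(gamma_new, growth)
```

With a large constant the cap is inactive for small `k`, which matches the remark that the conditions are not triggered early on; with `c_cg=1.0` and `k=10`, a proposed jump from `tau_k=2.0` to `50.0` would be cut to about `2.02`.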

4.6 ARADMM algorithm

The complete adaptive relaxed ADMM (ARADMM) is shown in Algorithm 1. We suggest only updating the stepsize every $T_{f}=2$ iterations. We suggest a fixed safeguarding threshold $\epsilon^{\mbox{\scriptsize cor}}=0.2,$ which is used in all the experiments in Section 6. The overhead of the adaptive scheme is modest, requiring only a few inner product calculations.

5 Proofs of convergence theorems

We now prove that relaxed ADMM converges under Assumption 1 or 2. Let

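$$y=\begin{pmatrix}u\\ v\end{pmatrix}\in\mathbb{R}^{n+m},\quad z=\begin{pmatrix}u\\ v\\ \lambda\end{pmatrix}\in\mathbb{R}^{n+m+p}. \tag{24}$$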

We use $y_{k}=(u_{k},v_{k})^{T}$ and $z_{k}=(u_{k},v_{k},\lambda_{k})^{T}$ to denote iterates, and $y^{*}=(u^{*},v^{*})^{T}$ and $z^{*}=(u^{*},v^{*},\lambda^{*})^{T}$ denote optimal solutions. Set $\Delta z^{+}_{k}=(\Delta u^{+}_{k},\Delta v^{+}_{k},\Delta\lambda^{+}_{k}):=z_{k+1}-z_{k}$ , and $\Delta z^{*}_{k}=(\Delta u^{*}_{k},\Delta v^{*}_{k},\Delta\lambda^{*}_{k}):=z^{*}-z_{k}$ , and define

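$$f(y)=h(u)+g(v),\quad F(z)=\begin{pmatrix}-A^{T}\lambda\\ -B^{T}\lambda\\ Au+Bv-b\end{pmatrix}. \tag{25}$$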

Notice that $F(z)$ is monotone, which means $\forall z,z^{\prime},(z-z^{\prime})^{T}(F(z)-F(z^{\prime}))\geq 0$ .
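The monotonicity claim can be checked directly, taking $F(z)=(-A^{T}\lambda,\,-B^{T}\lambda,\,Au+Bv-b)$ with $z=(u,v,\lambda)$. The short argument below is a worked check added for clarity; the names $M$ and $q$ are introduced only for this illustration.

```latex
F(z) = Mz + q, \qquad
M = \begin{pmatrix} 0 & 0 & -A^{T} \\ 0 & 0 & -B^{T} \\ A & B & 0 \end{pmatrix},
\qquad
q = \begin{pmatrix} 0 \\ 0 \\ -b \end{pmatrix},
\qquad M^{T} = -M .

(z-z')^{T}\bigl(F(z)-F(z')\bigr) = (z-z')^{T} M\,(z-z') = 0 \;\geq\; 0 .
```

So $F$ is affine with a skew-symmetric linear part, and the monotonicity inequality holds with equality.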

Problem formulation (1) can be reformulated as a variational inequality (VI). The optimal solution $z^{*}$ satisfies

$$\forall z,\quad f(y)-f(y^{*})+(z-z^{*})^{T}F(z^{*})\geq 0. \tag{26}$$

Likewise, the ADMM iterates produced by steps (2) and (4) satisfy the variational inequalities

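$$\forall u,\quad h(u)-h(u_{k+1})+(u-u_{k+1})^{T}\left(\tau_{k}A^{T}(Au_{k+1}+Bv_{k}-b)-A^{T}\lambda_{k}\right)\geq 0, \tag{27}$$

$$\forall v,\quad g(v)-g(v_{k+1})+(v-v_{k+1})^{T}\left(\tau_{k}B^{T}(\tilde{u}_{k+1}+Bv_{k+1}-b)-B^{T}\lambda_{k}\right)\geq 0. \tag{28}$$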

Using the definitions of $y$ , $z$ , $f(y)$ , and $F(z)$ in (24, 25), $\lambda$ in (5), and $\tilde{u}$ in (3), VI (27) and (28) combine to yield

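$$f(y)-f(y_{k+1})+(z-z_{k+1})^{T}\left(F(z_{k+1})+\Omega(\Delta z_{k}^{+},\tau_{k},\gamma_{k})\right)\geq 0, \tag{29}$$

where

$$\Omega(\Delta z_{k}^{+},\tau_{k},\gamma_{k})=\begin{pmatrix}\frac{\gamma_{k}-1}{\gamma_{k}}A^{T}\Delta\lambda_{k}^{+}-\frac{\tau_{k}}{\gamma_{k}}A^{T}B\Delta v_{k}^{+}\\ 0\\ \frac{1}{\gamma_{k}\tau_{k}}\Delta\lambda_{k}^{+}-\frac{\gamma_{k}-1}{\gamma_{k}}B\Delta v_{k}^{+}\end{pmatrix}.$$

We then apply VI (26), (28), and (29) in order to prove the following lemmas for our contraction proof, which show that the difference between iterates decreases as the iterates approach the true solution. The remaining details of the proof are in the supplementary material.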

Lemma 1.

The iterates $z_{k}=(u_{k},v_{k},\lambda_{k})^{T}$ generated by ADMM satisfy

$$(B\Delta v_{k}^{+})^{T}\Delta\lambda_{k}^{+}\geq 0.$$

Lemma 2.

Let $\gamma_{k}\geq 1.$ The optimal solution $z^{*}$ and iterates $z_{k}$ generated by ADMM satisfy

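$$\frac{2-\gamma_{k}}{\gamma_{k}}\|\tau_{k}B\Delta v_{k}^{+}+\Delta\lambda_{k}^{+}\|^{2}\leq\gamma_{k}\left(\|\tau_{k}B\Delta v_{k}^{*}\|^{2}+\|\Delta\lambda_{k}^{*}\|^{2}\right)-(2-\gamma_{k})\left(\|\tau_{k}B\Delta v_{k+1}^{*}\|^{2}+\|\Delta\lambda_{k+1}^{*}\|^{2}\right).$$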

5.1 Convergence with adaptivity

We are now ready to state our main convergence results. The proof of Theorem 1 is shown here in full, and leverages Lemma 2 to produce a contraction argument. The proof of Theorem 2 is extremely similar, and is shown in the supplementary material.

Theorem 1.

Suppose Assumption 1 holds. Then, the iterates $z_{k}=(u_{k},v_{k},\lambda_{k})^{T}$ generated by ADMM satisfy

[TABLE]

Proof.

Assumption 1 implies

[TABLE]

If $\gamma_{k}<2$ as in Assumption 1, then Lemma 2 shows

[TABLE]

where (33) is used to get from (34) to (35). Accumulating inequality (35) from $k=0$ to $N$ shows

[TABLE]

Assumption 1 also implies $\prod_{t=1}^{\infty}(1+\eta_{t}^{2})<\infty$, and $\prod_{t=k+1}^{N}(1+\eta_{t}^{2})\frac{1}{\gamma_{k}}\geq\frac{1}{\gamma_{k}}>\nicefrac{{1}}{{2}}$. Then, the accumulated inequality above indicates $\sum_{k=0}^{\infty}\|\tau_{k}B\Delta v^{+}_{k}+\Delta\lambda^{+}_{k}\|^{2}<\infty,$ and

[TABLE]

Now, from Lemma 1, $(B\Delta v^{+}_{k})^{T}\Delta\lambda^{+}_{k}\geq 0,$ and so

[TABLE]

The residuals $r_{k},d_{k}$ in (6) satisfy

[TABLE]

from which we get

[TABLE]

∎

Similar methods can be used to prove the following about convergence under Assumption 2. The proof of the following theorem is given in the supplementary material.

Theorem 2.

Suppose Assumption 2 holds. Then, the iterates $z_{k}=(u_{k},v_{k},\lambda_{k})^{T}$ generated by ADMM satisfy

[TABLE]

6 Applications

We focus on the following statistical and image processing problems involving non-differentiable objectives: linear regression with elastic net regularization (EN), low-rank least squares (LRLS), quadratic programming (QP), consensus $\ell_{1}$ -regularized logistic regression, support vector machine (SVM), total variation image restoration (TVIR), and robust principal component analysis (RPCA). We study several vision benchmark datasets such as the extended Yale B face dataset [13], MNIST digital images [29], and CIFAR10 object images [28] (we use the first batch of CIFAR10, which contains $10000$ samples). We also use synthetic and benchmark datasets from [7, 57, 30, 43, 33, 21], which are obtained from the UCI repository and the LIBSVM page. The experimental setups for each problem are briefly described here, and the implementation details are provided in the supplementary material.

Linear regression with EN regularization

Elastic net (EN) is a modification of the $\ell_{1}$ -norm (or LASSO) regularizer that helps deal with highly correlated variables [57, 21], and requires solving

[TABLE]

where $\|\cdot\|_{1}$ denotes the $\ell_{1}$ -norm, $D$ is the data matrix, $c$ contains measurements, and $x$ is the vector of regression coefficients.

Low-rank least squares (LRLS)

The nuclear norm (the $\ell_{1}$ -norm of the matrix singular values) is a convex surrogate for matrix rank. ADMM has been applied to solve low rank least squares problems [55, 53]

[TABLE]

where $\|\cdot\|_{*}$ denotes the nuclear norm, $\|\cdot\|_{F}$ denotes the Frobenius norm, $D\in{\mathbb{R}}^{n\times m}$ is a data matrix, $C\in{\mathbb{R}}^{n\times d}$ contains measurements, and $X\in{\mathbb{R}}^{m\times d}$ contains variables.

ADMM is applied by splitting the regression term and the non-differentiable regularizer composed of the nuclear and Frobenius norms. LRLS has been used to formulate exemplar classifiers and discover visual subcategories [53].
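The regularizer step of this splitting can be evaluated by thresholding singular values. The sketch below assumes the regularizer has the form $\rho_{1}\|X\|_{*}+\frac{\rho_{2}}{2}\|X\|_{F}^{2}$. The weights `rho1`, `rho2`, the penalty `tau`, and the function name are hypothetical and are used only for illustration.

```python
import numpy as np

def prox_nuclear_frobenius(V, tau, rho1, rho2):
    """Return argmin_X rho1*||X||_* + (rho2/2)*||X||_F^2 + (tau/2)*||X - V||_F^2.

    Assumed form of the regularizer; computed by shrinking singular values.
    """
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    # Soft-threshold the singular values, then rescale for the Frobenius term.
    s_new = np.maximum(s - rho1 / tau, 0.0) * tau / (tau + rho2)
    return (U * s_new) @ Vt
```

Singular values below `rho1 / tau` are removed entirely, so the output typically has lower rank than the input.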

SVM and QP

The support vector machine (SVM) is one of the most successful binary classifiers for computer vision. The dual of the SVM is a QP problem,

[TABLE]

where $z$ is the SVM dual variable, $Q$ is the kernel matrix, $c$ is a vector of labels, $e$ is a vector of ones, and $C>0$ [3]. The canonical QP is also considered,

[TABLE]

Consensus $\ell_{1}$ -regularized logistic regression

ADMM has become an important tool for solving distributed optimization problems [2]. A typical problem is the consensus $\ell_{1}$ -regularized logistic regression

[TABLE]

where $x_{i}\in{\mathbb{R}}^{m}$ represents the local variable on the $i$ th distributed node, $z$ is the global variable, $n_{i}$ is the number of samples in the $i$ th block, $D_{j}\in{\mathbb{R}}^{m}$ is the $j$ th sample, and $c_{j}\in\{-1,+1\}$ is the corresponding label.

Unwrapped SVM

The unwrapped formulation of SVM [22], which can be used in distributed computing environments via “transpose reduction” tricks, applies ADMM to the primal form of SVM to solve

[TABLE]

where $D_{j}\in{\mathbb{R}}^{m}$ is the $j$ th sample of training data, and $c_{j}\in\{-1,1\}$ is the corresponding label. ADMM is applied by splitting the $\ell_{2}$ -norm regularizer and the non-differentiable hinge loss term.
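The hinge loss step of such a splitting is separable and has a closed form. The following is a minimal sketch, assuming the labels have already been absorbed so that the loss is $C\sum_{j}\max(1-z_{j},0)$ for an auxiliary variable $z$. The names `prox_hinge`, `tau`, and `C` are illustrative assumptions.

```python
import numpy as np

def prox_hinge(v, tau, C):
    """Return argmin_z C*sum(max(1 - z, 0)) + (tau/2)*||z - v||^2.

    Assumes labels are already folded into v (entries of the form c_j * D_j^T x).
    """
    # Entries with v >= 1 are unchanged; entries far below the margin move
    # up by C/tau; entries in between are pushed exactly onto the margin.
    return v + np.clip(1.0 - v, 0.0, C / tau)
```

For example, with `tau = C = 1`, an input of `0.5` is moved to the margin value `1.0`, while an input of `2.0` is left untouched.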

Total variation image denoising (TVID)

Total variation image denoising is often performed by solving [42]

[TABLE]

where $c$ represents the given noisy image, and $\nabla$ is the discrete gradient operator, which computes differences between adjacent image pixels. ADMM is applied by splitting the $\ell_{2}$ -norm term and the non-differentiable total variation term.
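As an illustration of the discrete gradient operator, the sketch below uses forward differences with periodic boundary conditions and an anisotropic ($\ell_{1}$) total variation. Both the boundary handling and the anisotropic form are assumptions made here for brevity, and the function names are hypothetical.

```python
import numpy as np

def discrete_gradient(img):
    """Forward differences between adjacent pixels (periodic boundary assumed)."""
    dx = np.roll(img, -1, axis=1) - img   # horizontal neighbor minus pixel
    dy = np.roll(img, -1, axis=0) - img   # vertical neighbor minus pixel
    return np.stack([dx, dy])

def anisotropic_tv(img):
    """Sum of absolute pixel differences, i.e. the l1-norm of the gradient."""
    return float(np.abs(discrete_gradient(img)).sum())
```

A constant image has zero total variation, while every intensity jump between neighboring pixels contributes its magnitude to the sum.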

RPCA

Robust principal component analysis (RPCA) has broad applications in computer vision and imaging [47, 37, 39]. RPCA recovers a low-rank matrix and a sparse matrix by solving

[TABLE]

where the nuclear norm $\|\cdot\|_{*}$ is used to obtain a low rank matrix $Z$ , and $\|\cdot\|_{1}$ is used to obtain a sparse error $E$ .

7 Experiments

The proposed ARADMM is implemented as shown in Algorithm 1. We also implemented vanilla ADMM, (non-adaptive) relaxed ADMM, ADMM with residual balancing (RB), and adaptive ADMM (AADMM) for comparison.
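For orientation, a non-adaptive relaxed ADMM baseline can be sketched on an elastic net problem as follows. This is a simplified illustration in the scaled-dual form with fixed `tau` and `gamma`, assuming the objective $\frac{1}{2}\|Dx-c\|^{2}+\rho_{1}\|x\|_{1}+\frac{\rho_{2}}{2}\|x\|^{2}$ and the splitting $u=v$. It is not the authors' implementation, and all names are hypothetical. Setting `gamma=1.0` gives vanilla ADMM.

```python
import numpy as np

def relaxed_admm_elastic_net(D, c, rho1, rho2, tau=0.1, gamma=1.5, iters=500):
    """Fixed-parameter relaxed ADMM sketch (scaled dual form, constraint u = v)."""
    m = D.shape[1]
    v = np.zeros(m)                      # variable carrying the regularizer
    w = np.zeros(m)                      # scaled dual variable
    lhs = D.T @ D + tau * np.eye(m)      # constant matrix of the u-update
    Dtc = D.T @ c
    for _ in range(iters):
        # u-update: least-squares term plus quadratic penalty.
        u = np.linalg.solve(lhs, Dtc + tau * (v - w))
        # Relaxation: blend the new u with the previous v.
        u_hat = gamma * u + (1.0 - gamma) * v
        # v-update: proximal step of the (assumed) elastic net regularizer.
        t = u_hat + w
        v = np.sign(t) * np.maximum(np.abs(t) - rho1 / tau, 0.0) * tau / (tau + rho2)
        # Dual update uses the relaxed iterate.
        w = w + u_hat - v
    return v
```

An adaptive variant would replace the fixed `tau` and `gamma` by values recomputed during the iteration; that logic is deliberately omitted from this sketch.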

The relaxation parameter for the non-adaptive relaxed ADMM is fixed at $\gamma_{k}\!=\!1.5$ as suggested in [6]. The parameters of RB and AADMM are selected as in [24, 2, 51]. The initial penalty $\tau_{0}\!=\!\nicefrac{{1}}{{10}}$ and initial relaxation $\gamma_{0}\!=\!1$ are used for all problems except the canonical QP problem, where initial parameters are set to the geometric mean of the maximum and minimum eigenvalues of matrix $Q$ , as proposed for quadratic problems in [40].
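The eigenvalue-based initialization for the canonical QP can be computed directly. The sketch below assumes $Q$ is symmetric positive definite, so that the geometric mean is real and nonzero. The function name `qp_initial_parameter` is hypothetical.

```python
import numpy as np

def qp_initial_parameter(Q):
    """Geometric mean of the largest and smallest eigenvalues of Q.

    Assumes Q is symmetric positive definite.
    """
    eigvals = np.linalg.eigvalsh(Q)      # ascending order for symmetric input
    return float(np.sqrt(eigvals[0] * eigvals[-1]))
```

For a diagonal matrix with entries 1 and 4, this returns 2.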

For each problem, the same randomly generated initial variables $v_{0},\lambda_{0}$ are used for ADMM and its variant methods. As suggested by [24, 51], the adaptivity of RB and AADMM is stopped after 1000 iterations to guarantee convergence.

7.1 Convergence results

Table 1 reports the convergence speed of ADMM and its variants for the applications described in Section 6. Additional experimental results, including a table with more test cases, convergence curves, and visual results of image restoration and robust PCA for face decomposition, are provided in the supplementary material. Relaxed ADMM often outperforms vanilla ADMM, but does not compete with adaptive methods like RB, AADMM and ARADMM. The proposed ARADMM performs best in all the test cases.

7.2 Sensitivity to initialization

We study the sensitivity of the different ADMM variants to the initial penalty ( $\tau_{0}$ ) and initial relaxation parameter ( $\gamma_{0}$ ). Fig. 1 presents iteration counts for a wide range of values of $\tau_{0},\gamma_{0}$ , for elastic net regression with synthetic datasets. In the left and center plots we fix one of $\tau_{0},\gamma_{0}$ and vary the other. The number of iterations needed for convergence is plotted as the algorithm parameters vary. In the right plot, we use a grid search to find the optimal $\tau_{0}$ for different values of $\gamma_{0}$ . Fig. 1 (left) shows that adaptive methods are relatively stable with respect to the initial penalty $\tau_{0}$ , while ARADMM outperforms RB and AADMM for all choices of initial $\tau_{0}$ . Fig. 1 (center) suggests that the relaxation $\gamma_{0}$ is generally less important than $\tau_{0}$ . When a bad value of $\tau$ is chosen, it is unlikely that a good choice of $\gamma$ can compensate. The proposed ARADMM that jointly adjusts $\tau,\gamma$ is generally better than simply adding the relaxation to the existing adaptive methods RB and AADMM.

Fig. 1 (right) shows the sensitivity to $\gamma$ when using a grid search to choose the optimal $\tau_{0}$ . This optimal $\tau_{0}$ significantly improves the performance of vanilla ADMM and relaxed ADMM (which use the same $\tau_{0}$ for all iterations). Even when using the optimal stepsize for the non-adaptive methods, ARADMM is superior to or competitive with the non-adaptive methods. Note that this experiment is meant to show a best-case scenario for the non-adaptive methods; in practice the user generally has no knowledge of the optimal value of $\tau.$ Adaptive methods achieve optimal or near-optimal performance without an expensive grid search.

7.3 Sensitivity to safeguarding

Finally, Fig. 2 presents iteration counts when applying ARADMM with various safeguarding correlation thresholds $\epsilon^{{\scriptsize\text{cor}}}$ . When $\epsilon^{{\scriptsize\text{cor}}}=0$ , the calculated adaptive parameters based on curvature estimations are always accepted, and when $\epsilon^{{\scriptsize\text{cor}}}\!=\!1$ the parameters are never changed. The proposed ARADMM method is insensitive to $\epsilon^{{\scriptsize\text{cor}}}$ and performs well for a wide range of $\epsilon^{{\scriptsize\text{cor}}}\in[0.1,\,0.4]$ for various applications, except for unwrapped SVM and RPCA. Though tuning such “hyper-parameters” may improve the performance of ARADMM for some applications, the fixed $\epsilon^{{\scriptsize\text{cor}}}=0.2$ performs well in all our experiments (seven applications and over fifty test cases; a full list is in the supplementary material). The proposed ARADMM is fully automated and performs well without parameter tuning.
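The role of the threshold can be illustrated with a simplified gate on a single correlation value. This is only a sketch of the accept-or-keep behavior described above, not the full safeguarding rule of Section 4.4. The functions `correlation` and `safeguard` and their arguments are hypothetical.

```python
import numpy as np

def correlation(a, b):
    """Cosine of the angle between two change vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def safeguard(current, estimate, cor, eps_cor=0.2):
    """Accept the new estimate only when the correlation exceeds eps_cor."""
    return estimate if cor > eps_cor else current
```

Because a correlation never exceeds 1, choosing `eps_cor=1.0` keeps the current parameter in every case, while `eps_cor=0.0` accepts every estimate with positive correlation.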

8 Conclusion

We have proposed an adaptive method for jointly tuning the penalty and relaxation parameters of relaxed ADMM without user oversight. We have analyzed adaptive relaxed ADMM schemes, and provided conditions under which convergence is guaranteed. Experiments on a wide range of machine learning, computer vision, and image processing benchmarks have demonstrated that the proposed adaptive method (often significantly) outperforms other ADMM variants without user oversight or parameter tuning. The new adaptive method improves the applicability of relaxed ADMM by facilitating fully automated solvers that exhibit fast convergence and are usable by non-expert users.

Acknowledgments

TG and ZX were supported by the US Office of Naval Research under grant N00014-17-1-2078 and by the US National Science Foundation (NSF) under grant CCF-1535902. MF was partially supported by the Fundação para a Ciência e Tecnologia, grant UID/EEA/5008/2013. XY was supported by the General Research Fund from Hong Kong Research Grants Council under grant HKBU-12313516. CS was supported in part by Xilinx Inc., and by the US NSF under grants ECCS-1408006, CCF-1535897, and CAREER CCF-1652065.

Bibliography

[1] J. Barzilai and J. Borwein. Two-point step size gradient methods. IMA J. Num. Analysis, 8:141–148, 1988.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. and Trends in Mach. Learning, 3:1–122, 2011.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[4] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
[5] D. Davis and W. Yin. Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. arXiv preprint arXiv:1407.5210, 2014.
[6] J. Eckstein and D. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
[7] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
[8] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2790–2797. IEEE, 2009.
