Adaptive Relaxed ADMM: Convergence Theory and Practical Implementation
Zheng Xu, Mário A. T. Figueiredo, Xiaoming Yuan, Christoph Studer, and Tom Goldstein

TL;DR
This paper introduces ARADMM, an adaptive version of relaxed ADMM that automatically tunes parameters for optimal performance, supported by convergence theory and demonstrated through practical applications.
Contribution
It proposes a novel adaptive relaxed ADMM method with automatic parameter tuning, backed by convergence analysis and empirical validation.
Findings
ARADMM achieves faster convergence in practice.
The method automatically adjusts parameters without user intervention.
Numerical experiments confirm the theoretical convergence and efficiency.
Abstract
Many modern computer vision and machine learning applications rely on solving difficult optimization problems that involve non-differentiable objective functions and constraints. The alternating direction method of multipliers (ADMM) is a widely used approach to solve such problems. Relaxed ADMM is a generalization of ADMM that often achieves better performance, but its efficiency depends strongly on algorithm parameters that must be chosen by an expert user. We propose an adaptive method that automatically tunes the key algorithm parameters to achieve optimal performance without user oversight. Inspired by recent work on adaptivity, the proposed adaptive relaxed ADMM (ARADMM) is derived by assuming a Barzilai-Borwein style linear gradient. A detailed convergence analysis of ARADMM is provided, and numerical results on several applications demonstrate fast practical convergence.
Table 1: Iterations (and runtime in seconds) for ADMM and its variants. An entry of the form n+ indicates no convergence within n iterations.

| Application | Dataset | Dimensions | Vanilla ADMM | Relaxed ADMM | Residual balance | Adaptive ADMM | ARADMM |
|---|---|---|---|---|---|---|---|
| Elastic net regression | Synthetic | 50 × 40 | 2000+(.642) | 2000+(.660) | 424(.144) | 102(.051) | 70(.026) |
| | MNIST | 60000 × 784 | 1225(29.4) | 816(19.9) | 94(2.28) | 41(.943) | 21(.549) |
| | CIFAR10 | 10000 × 3072 | 2000+(690) | 2000+(697) | 556(193) | 2000+(669) | 94(31.7) |
| | News20 | 19996 × 1355191 | 2000+(1.21e4) | 2000+(9.16e3) | 227(914) | 104(391) | 71(287) |
| | Rcv1 | 20242 × 47236 | 2000+(1.20e3) | 1823(802) | 196(79.1) | 104(35.7) | 64(26.0) |
| | Realsim | 72309 × 20958 | 2000+(4.26e3) | 2000+(4.33e3) | 341(355) | 152(125) | 107(88.2) |
| Low rank least squares | Synthetic | 1000 × 200 | 2000+(118) | 2000+(116) | 268(15.1) | 26(1.55) | 18(1.04) |
| | German | 1000 × 24 | 2000+(4.72) | 2000+(4.72) | 642(1.52) | 130(.334) | 52(.125) |
| | Spectf | 80 × 44 | 2000+(2.70) | 2000+(2.74) | 336(.455) | 162(.236) | 105(.150) |
| | MNIST | 60000 × 784 | 200+(1.86e3) | 200+(2.08e3) | 200+(3.29e3) | 200+(3.46e3) | 38(658) |
| | CIFAR10 | 10000 × 3072 | 200+(7.24e3) | 200+(1.33e4) | 53(1.60e3) | 8(208) | 6(156) |
| QP and dual SVM | Synthetic | 250 × 500 | 1224(11.5) | 823(7.49) | 626(5.93) | 170(1.57) | 100(.914) |
| | German | 1000 × 24 | 2000+(58.8) | 2000+(61.8) | 1592(45.0) | 1393(38.9) | 1238(34.9) |
| | Spectf | 80 × 44 | 2000+(.846) | 2000+(.777) | 169(.070) | 175(.086) | 53(.026) |
| Consensus logistic regression | Synthetic | 1000 × 25 | 590(9.93) | 391(6.97) | 70(1.23) | 35(.609) | 20(.355) |
| | German | 1000 × 24 | 2000+(34.3) | 2000+(66.6) | 151(2.60) | 35(.691) | 26(.580) |
| | Spectf | 80 × 44 | 1005(20.1) | 667(14.4) | 117(1.98) | 145(1.63) | 85(1.07) |
| | MNIST | 60000 × 784 | 200+(2.99e3) | 200+(3.47e3) | 200+(1.37e3) | 49(536) | 28(333) |
| | CIFAR10 | 10000 × 3072 | 200+(593) | 200+(2.08e3) | 200+(1.54e3) | 131(165) | 19(33.7) |
| Unwrapping SVM | Synthetic | 1000 × 25 | 2000+(1.13) | 1418(.844) | 2000+(1.16) | 355(.229) | 147(.094) |
| | German | 1000 × 24 | 753(1.88) | 560(1.37) | 2000+(4.98) | 572(1.44) | 213(.545) |
| | Spectf | 80 × 44 | 567(.203) | 367(.112) | 567(.185) | 207(.068) | 149(.052) |
| | MNIST | 60000 × 784 | 128(130) | 118(111) | 163(153) | 200+(217) | 67(71.0) |
| | CIFAR10 | 10000 × 3072 | 200+(512) | 200+(532) | 200+(516) | 89(285) | 57(143) |
| Image denoising | Barbara | 512 × 512 | 262(35.0) | 175(23.6) | 74(10.0) | 59(8.67) | 38(5.57) |
| | Cameraman | 256 × 256 | 311(8.96) | 208(5.89) | 82(2.29) | 88(2.76) | 35(1.08) |
| | Lena | 512 × 512 | 347(46.3) | 232(31.3) | 94(12.5) | 68(9.70) | 39(5.58) |
| Robust PCA | FaceSet1 | 64 × 1024 | 2000+(41.1) | 1507(30.3) | 560(11.1) | 561(11.9) | 267(5.65) |
| | FaceSet2 | 64 × 1024 | 2000+(41.1) | 2000+(41.4) | 263(5.54) | 388(9.00) | 188(4.02) |
| | FaceSet3 | 64 × 1024 | 2000+(39.4) | 1843(36.3) | 375(7.44) | 473(9.89) | 299(6.27) |
Zheng Xu1, Mário A. T. Figueiredo2, Xiaoming Yuan3, Christoph Studer4, and Tom Goldstein1
1Department of Computer Science, University of Maryland, College Park, MD
2Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal
3Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
4School of Electrical and Computer Engineering, Cornell University, Ithaca, NY
1 Introduction
Modern methods in computer vision and machine learning often require solving difficult optimization problems involving non-differentiable objective functions and constraints. Some popular applications include sparse models [48, 54, 8, 36], low-rank models [47, 23, 53, 31], and support vector machines (SVMs) [4, 3]. The alternating direction method of multipliers (ADMM) is one of the most prominent optimization tools to solve such problems, and tackles problems of the following form:
$$\min_{u \in \mathbb{R}^n,\, v \in \mathbb{R}^m} H(u) + G(v), \quad \text{subject to } Au + Bv = b. \tag{1}$$
Here, $H : \mathbb{R}^n \to \bar{\mathbb{R}}$ and $G : \mathbb{R}^m \to \bar{\mathbb{R}}$ are closed, proper, and convex functions, $A \in \mathbb{R}^{p \times n}$, $B \in \mathbb{R}^{p \times m}$, and $b \in \mathbb{R}^p$. ADMM was first introduced in [16] and [12], and has found applications in a variety of optimization problems in machine learning, image processing, computer vision, wireless communications, and many other areas [2, 21].
Relaxed ADMM is a popular practical variant of ADMM, and proceeds with the following steps:
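$$u_{k+1} = \arg\min_u\; H(u) + \langle \lambda_k, -Au \rangle + \frac{\tau_k}{2} \| b - Au - Bv_k \|_2^2 \tag{2}$$
$$\hat{u}_{k+1} = \gamma_k A u_{k+1} + (1 - \gamma_k)(b - B v_k) \tag{3}$$
$$v_{k+1} = \arg\min_v\; G(v) + \langle \lambda_k, -Bv \rangle + \frac{\tau_k}{2} \| b - \hat{u}_{k+1} - Bv \|_2^2 \tag{4}$$
$$\lambda_{k+1} = \lambda_k + \tau_k (b - \hat{u}_{k+1} - B v_{k+1}). \tag{5}$$
Here, $\lambda_k$ denotes the dual variables (Lagrange multipliers) on iteration $k$, and $(\tau_k, \gamma_k)$ are sequences of penalty and relaxation parameters. Relaxed ADMM coincides with the original non-relaxed version if $\gamma_k = 1$.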
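To make the four steps concrete, the following minimal sketch runs relaxed ADMM on a scalar toy problem, min 0.5(u - c)^2 + rho|v| subject to u - v = 0 (so A = 1, B = -1, b = 0). The toy problem, the function names, and the default values of `tau` and `gamma` are illustrative assumptions and are not taken from the paper; both subproblems happen to have closed-form solutions here.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x| (scalar soft-thresholding)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def relaxed_admm_scalar_lasso(c, rho, tau=1.0, gamma=1.5, iters=100):
    """Relaxed ADMM for min 0.5*(u - c)**2 + rho*|v|  s.t.  u - v = 0.

    The constraint is A*u + B*v = b with A = 1, B = -1, b = 0.
    tau (penalty) and gamma (relaxation) are held constant in this sketch.
    """
    u, v, lam = 0.0, 0.0, 0.0
    for _ in range(iters):
        # u-update: minimise 0.5*(u - c)^2 - lam*u + tau/2*(v - u)^2.
        u = (c + lam + tau * v) / (1.0 + tau)
        # Relaxation: u_hat = gamma*A*u + (1 - gamma)*(b - B*v).
        u_hat = gamma * u + (1.0 - gamma) * v
        # v-update: minimise rho*|v| + lam*v + tau/2*(v - u_hat)^2.
        v = soft_threshold(u_hat - lam / tau, rho / tau)
        # Dual update: lam + tau*(b - u_hat - B*v).
        lam = lam + tau * (v - u_hat)
    return u, v, lam
```

For c = 3 and rho = 1 the iterates approach the soft-thresholded value 2; setting `gamma=1.0` recovers the non-relaxed iteration on the same problem.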
Convergence of (relaxed) ADMM is guaranteed under fairly general assumptions [6, 25, 26, 10], if the penalty and relaxation parameters are held constant. However, the practical performance of ADMM depends strongly on the choice of these parameters, as well as on the problem being solved. Good penalty choices are known for certain ADMM formulations, such as strictly convex quadratic problems [40, 14], and for the gradient descent parameter in the “linearized” ADMM [32, 34].
Adaptive penalty methods (in which the penalty parameters are tuned automatically as the algorithm proceeds) achieve good performance without user oversight. For non-relaxed ADMM, the authors of [24] propose methods that modulate the penalty parameter so that the primal and dual residuals (i.e., derivatives of the Lagrangian with respect to primal and dual variables) are of approximately equal size. This “residual balancing” approach has been generalized to work with preconditioned variants of ADMM [20] and distributed ADMM [44]. In [51], a spectral penalty parameter method is proposed that uses the local curvature of the objective to achieve fast convergence. All of these methods are specific to (non-relaxed) vanilla ADMM, and do not apply to the more general case involving a relaxation parameter.
1.1 Overview & contributions
In this paper, we study adaptive parameter choices for the relaxed ADMM that jointly and automatically tune both the penalty parameter $\tau_k$ and the relaxation parameter $\gamma_k$. In Section 3, we address theoretical questions about the convergence of ADMM with non-constant penalty and relaxation parameters. In Section 4, we discuss practical methods for choosing these parameters. In Section 6, we apply the proposed ARADMM to several problems in machine learning, computer vision, and image processing. Finally, in Section 7, we compare ARADMM to other ADMM variants and examine the benefits of the proposed approach for real-world regression, classification, and image processing problems.
2 Related work
Sparse and low rank methods are widely used in computer vision [48, 54, 8, 47, 23, 36, 53, 31], machine learning [7, 57, 43, 9, 33], and image processing [42, 21]. ADMM has been extensively applied to solve such problems [2, 21, 51, 50], and has recently found applications in neural networks [56, 45], tensor decomposition [18, 35, 52], structure from motion [19], and other vision problems.
The $O(1/k)$ convergence rate of non-relaxed ADMM is established under mild conditions for convex problems [25, 26]. The $O(1/k^2)$ convergence rate is discussed in [17, 21, 27, 46], where at least one of the functions is assumed either strongly convex or smooth. For the general relaxed ADMM formulation, an $O(1/k)$ convergence rate is provided under mild conditions [10]. Linear convergence can be achieved with strong convexity assumptions [5, 38, 15]. All of these results assume constant parameters; it is considerably harder to prove convergence when the algorithm parameters are adaptive.
Fixed optimal parameters are discussed in the literature. For the specific case in which the objective is quadratic, a criterion is proposed in [40, 14]. The authors of [38] suggest a grid search and semidefinite programming based method to determine the optimal relaxation and penalty parameters. These methods, however, make strong assumptions about the objective and require knowledge of condition numbers.
Adaptive penalty methods are proposed to accelerate the practical convergence of non-relaxed ADMM [24, 51]. For the relaxation parameter, it has been suggested in [6] that over-relaxation ($\gamma_k > 1$) may accelerate convergence, and that it achieves faster convergence in a specific distributed computing application. The proposed ARADMM simultaneously adapts both the penalty and the relaxation parameter, thus being fully automated.
3 Convergence theory
We study conditions under which ADMM converges with adaptive penalty and relaxation parameters. Our approach utilizes the variational inequality (VI) methods put forward in [24, 25, 26]. Our results measure convergence using the primal and dual “residuals,” which are defined as
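$$r_k = b - Au_k - Bv_k \quad \text{and} \quad d_k = \tau_k A^T B (v_k - v_{k-1}). \tag{6}$$
It has been observed that these residuals approach zero as the algorithm approaches a true solution [2]. Typically, the iterative process is stopped if
$$\| r_k \|_2 \le \epsilon^{tol} \max\{ \| Au_k \|_2, \| Bv_k \|_2, \| b \|_2 \} \quad \text{and} \quad \| d_k \|_2 \le \epsilon^{tol} \| A^T \lambda_k \|_2, \tag{7}$$
where $\epsilon^{tol}$ is the stopping tolerance [2]. For this reason, it is important to know that the method converges in the sense that the residuals approach zero as $k \to \infty$.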
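The residual-based stopping test can be sketched as follows. This is a plain-Python illustration that assumes the usual definitions from [2], namely a primal residual b - Au - Bv and a dual residual tau * A^T B (v_k - v_{k-1}), each compared against a relative tolerance; the helper names and the default `tol` are hypothetical.

```python
import math


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def norm(x):
    return math.sqrt(sum(x_i * x_i for x_i in x))


def residuals(A, B, b, u, v, v_prev, tau):
    """Primal residual r = b - A u - B v and dual residual d = tau * A^T B (v - v_prev)."""
    Au, Bv = matvec(A, u), matvec(B, v)
    r = [b_i - au_i - bv_i for b_i, au_i, bv_i in zip(b, Au, Bv)]
    dv = [x - y for x, y in zip(v, v_prev)]
    d = [tau * x for x in matvec(transpose(A), matvec(B, dv))]
    return r, d


def should_stop(A, B, b, u, v, v_prev, lam, tau, tol=1e-3):
    """Relative stopping test on both residuals (tol is an assumed default)."""
    r, d = residuals(A, B, b, u, v, v_prev, tau)
    primal_scale = max(norm(matvec(A, u)), norm(matvec(B, v)), norm(b))
    dual_scale = norm(matvec(transpose(A), lam))
    return norm(r) <= tol * primal_scale and norm(d) <= tol * dual_scale
```

Scaling by the iterate norms makes the test invariant to rescaling of the problem data, which is why a relative rather than absolute tolerance is the common choice.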
In the sequel, we prove that relaxed ADMM converges in the residual sense, provided that the algorithm parameters satisfy one of the following two assumptions.
Assumption 1.
The relaxation sequence $\gamma_k$ and penalty sequence $\tau_k$ satisfy
[TABLE]
Assumption 2.
The relaxation sequence $\gamma_k$ and penalty sequence $\tau_k$ satisfy
[TABLE]
In Section 5, we prove adaptive relaxed ADMM converges if the algorithm parameters satisfy either Assumption 1 or Assumption 2. Before presenting the proof, we show how to choose the relaxation parameters that lead to efficient performance in practice.
4 ARADMM: Adaptive relaxed ADMM
Spectral stepsize selection methods for vanilla ADMM were discussed in [51]. Here, we modify the adaptive ADMM framework in two important ways. First, we discuss the selection of penalty parameters in the presence of the relaxation term. Second, we discuss adaptive methods also for automatically selecting the relaxation parameter.
The proposed method works by assuming a local linear model for the dual optimization problem, and then selecting an optimal stepsize under this assumption. A safeguarding method is adopted to ensure that bad stepsizes are not chosen in case these linearity assumptions fail to hold.
4.1 Dual interpretation of relaxed ADMM
We derive our adaptive stepsize rules by examining the close relationship between relaxed ADMM and the relaxed Douglas-Rachford Splitting (DRS) [6, 5, 15]. The dual of the general constrained problem (1) is
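$$\min_{\zeta \in \mathbb{R}^p}\; \underbrace{H^*(A^T \zeta) - \langle \zeta, b \rangle}_{\hat{H}(\zeta)} + \underbrace{G^*(B^T \zeta)}_{\hat{G}(\zeta)}, \tag{10}$$
with $F^*$ denoting the Fenchel conjugate of $F$, defined as $F^*(y) = \sup_x \langle x, y \rangle - F(x)$ [41].
The relaxed DRS algorithm solves (10) by generating two sequences, $(\zeta_k)$ and $(\hat{\zeta}_k)$, according to
$$0 \in \frac{\hat{\zeta}_{k+1} - \zeta_k}{\tau_k} + \partial \hat{H}(\hat{\zeta}_{k+1}) + \partial \hat{G}(\zeta_k), \tag{11}$$
$$0 \in \frac{\zeta_{k+1} - \zeta_k}{\tau_k} + \gamma_k \partial \hat{H}(\hat{\zeta}_{k+1}) - (1 - \gamma_k) \partial \hat{G}(\zeta_k) + \partial \hat{G}(\zeta_{k+1}), \tag{12}$$
where $\gamma_k$ is a relaxation parameter, and $\partial F(x)$ denotes the subdifferential of $F$ evaluated at $x$ [41]. Referring back to ADMM in (2)–(5), and defining $\hat{\lambda}_{k+1} = \lambda_k + \tau_k (b - A u_{k+1} - B v_k)$, the sequences $(\lambda_k)$ and $(\hat{\lambda}_k)$ satisfy the same conditions (11) and (12) as $(\zeta_k)$ and $(\hat{\zeta}_k)$, thus ADMM for the problem (1) is equivalent to DRS on its dual (10). A detailed proof of this is provided in the supplementary material.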
4.2 Spectral adaptive stepsize rule
Adaptive stepsize rules of the “spectral” type were originally proposed for simple gradient descent on smooth problems by Barzilai and Borwein [1], and have been found to dramatically outperform constant stepsizes in many applications [11, 49]. Spectral stepsize methods work by modeling the gradient of the objective as a linear function, and then selecting the optimal stepsize for this simplified linear model.
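Spectral methods were recently used to determine the penalty parameter for the non-relaxed ADMM in [51]. Inspired by that work, we derive spectral stepsize rules assuming a linear model/approximation for $\partial \hat{H}(\hat{\zeta})$ and $\partial \hat{G}(\zeta)$ at iteration $k$ given by
$$\partial \hat{H}(\hat{\zeta}) = \alpha_k \hat{\zeta} + \Psi_k \quad \text{and} \quad \partial \hat{G}(\zeta) = \beta_k \zeta + \Phi_k, \tag{13}$$
where $\alpha_k > 0$, $\beta_k > 0$ are local curvature estimates of $\hat{H}$ and $\hat{G}$, respectively, and $\Psi_k, \Phi_k \subset \mathbb{R}^p$. Once we obtain these curvature estimates, we will exploit the following simple proposition whose proof is given in the supplementary material.
Proposition 1.
Suppose the DRS steps (11)–(12) are applied to problem (10), where (omitting the iteration index $k$ to lighten the notation in what follows)
$$\partial \hat{H}(\hat{\zeta}) = \alpha \hat{\zeta} + \Psi \quad \text{and} \quad \partial \hat{G}(\zeta) = \beta \zeta + \Phi. \tag{14}$$
Then, the residual of $\hat{H}(\zeta_{k+1}) + \hat{G}(\zeta_{k+1})$ will be zero if $\tau$ and $\gamma$ are chosen to satisfy $\gamma = 1 + \frac{1 + \alpha \beta \tau^2}{(\alpha + \beta) \tau}$.
Our adaptive method works by fitting a linear model to the gradient (or subgradient) of our objective, and then using Proposition 1 to select an optimal stepsize pair that obtains zero residual on the model problem. For our convergence theory to hold, we need $\gamma \le 2$. For fixed values of $\alpha$ and $\beta$, the minimal value of $\gamma$ that is still optimal for the linear model occurs if we choose
$$\tau = \arg\min_{\tau} \frac{1 + \alpha \beta \tau^2}{(\alpha + \beta) \tau} = \frac{1}{\sqrt{\alpha \beta}}. \tag{15}$$
Note this is the same “optimal” penalty parameter proposed for non-relaxed ADMM in [51]. Under this choice of $\tau$, we then have the “optimal” relaxation parameter
$$\gamma = 1 + \frac{2 \sqrt{\alpha \beta}}{\alpha + \beta} \le 2. \tag{16}$$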
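The zero-residual claim can be checked numerically on a scalar instance of the linear model. The sketch below assumes the relaxed DRS inclusions take the form (zeta_hat - zeta)/tau + dH(zeta_hat) + dG(zeta) = 0 and (zeta_next - zeta)/tau + gamma*dH(zeta_hat) - (1 - gamma)*dG(zeta) + dG(zeta_next) = 0, with dH(z) = alpha*z + psi and dG(z) = beta*z + phi; the function names and test values are hypothetical.

```python
import math


def drs_step_linear_model(zeta, alpha, beta, psi, phi, tau, gamma):
    """One relaxed DRS step when both subgradients are scalar linear maps."""
    a = psi + phi
    # First inclusion solved for zeta_hat.
    zeta_hat = ((1.0 - beta * tau) * zeta - tau * a) / (1.0 + alpha * tau)
    # Second inclusion solved for zeta_next.
    numerator = (zeta * (1.0 / tau + (1.0 - gamma) * beta)
                 - gamma * alpha * zeta_hat - gamma * a)
    zeta_next = numerator / (1.0 / tau + beta)
    return zeta_hat, zeta_next


def optimal_parameters(alpha, beta):
    """tau = 1/sqrt(alpha*beta) and gamma = 1 + 2*sqrt(alpha*beta)/(alpha + beta)."""
    root = math.sqrt(alpha * beta)
    return 1.0 / root, 1.0 + 2.0 * root / (alpha + beta)


def model_residual(zeta, alpha, beta, psi, phi):
    """Gradient of the model objective; zero exactly at the model minimiser."""
    return (alpha + beta) * zeta + psi + phi
```

With alpha = 2 and beta = 0.5 the rule gives tau = 1 and gamma = 1.8, and a single step from any starting point lands on the model minimiser; keeping tau = 1 but using gamma = 1 leaves a nonzero residual.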
4.3 Estimation of stepsizes
We now propose a simple method for fitting a linear model to the dual objective terms so that the formulas in Section 4.2 can be used to obtain stepsizes. Once these linear models are formed, the optimal penalty parameter and relaxation term can be calculated by (15) and (16), thanks to the equivalence of relaxed ADMM and DRS.
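In what follows, we let $\hat{\alpha}_k = 1/\alpha_k$ and $\hat{\beta}_k = 1/\beta_k$ to simplify notation. The optimal stepsize choice is then written as $\tau_k = (\hat{\alpha}_k \hat{\beta}_k)^{1/2}$ and $\gamma_k = 1 + \frac{2 \sqrt{\hat{\alpha}_k \hat{\beta}_k}}{\hat{\alpha}_k + \hat{\beta}_k}$.
The estimation of $\hat{\alpha}_k$ and $\hat{\beta}_k$ for the dual components $\hat{H}(\hat{\lambda}_k)$ and $\hat{G}(\lambda_k)$ at the $k$-th iteration of primal ADMM has been described in [51]. It is easy to verify that the model parameters $\hat{\alpha}_k$ and $\hat{\beta}_k$ of relaxed ADMM can be estimated based on the results from iteration $k$ and an older iteration $k_0 < k$ in a similar way. If we define
$$\Delta \hat{\lambda}_k := \hat{\lambda}_k - \hat{\lambda}_{k_0} \quad \text{and} \quad \Delta \hat{H}_k := \partial \hat{H}(\hat{\lambda}_k) - \partial \hat{H}(\hat{\lambda}_{k_0}) = A (u_k - u_{k_0}), \tag{17}$$
then the parameter $\hat{\alpha}_k$ is obtained from the formula
$$\hat{\alpha}_k = \begin{cases} \hat{\alpha}_k^{MG} & \text{if } 2 \hat{\alpha}_k^{MG} > \hat{\alpha}_k^{SD}, \\ \hat{\alpha}_k^{SD} - \hat{\alpha}_k^{MG}/2 & \text{otherwise,} \end{cases} \tag{18}$$
$$\text{where} \quad \hat{\alpha}_k^{SD} = \frac{\langle \Delta \hat{\lambda}_k, \Delta \hat{\lambda}_k \rangle}{\langle \Delta \hat{H}_k, \Delta \hat{\lambda}_k \rangle} \quad \text{and} \quad \hat{\alpha}_k^{MG} = \frac{\langle \Delta \hat{H}_k, \Delta \hat{\lambda}_k \rangle}{\langle \Delta \hat{H}_k, \Delta \hat{H}_k \rangle}. \tag{19}$$
For a detailed derivation of these formulas, see [51].
The spectral stepsize $\hat{\beta}_k$ of $\hat{G}(\lambda_k)$ is similarly estimated with $\Delta \hat{G}_k = B (v_k - v_{k_0})$ and $\Delta \lambda_k = \lambda_k - \lambda_{k_0}$. It is important to note that $\hat{\alpha}_k$ and $\hat{\beta}_k$ are obtained from the iterates of ADMM alone, i.e., our scheme does not require the user to supply the dual problem.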
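A possible implementation of the curvature estimate is sketched below. It assumes the Barzilai-Borwein style construction used in [51]: a steepest-descent (SD) and a minimum-gradient (MG) estimate are formed from the difference of dual iterates and the difference of (sub)gradients, then combined by a hybrid rule. The function name, the `None` return for unusable estimates, and the exact hybrid rule are assumptions made for illustration.

```python
def dot(x, y):
    return sum(x_i * y_i for x_i, y_i in zip(x, y))


def spectral_stepsize(delta_lam, delta_grad):
    """Hybrid SD/MG estimate of the inverse curvature (a stepsize).

    delta_lam  : difference of dual iterates between iteration k and k0
    delta_grad : matching difference of (sub)gradients, e.g. A (u_k - u_k0)
    Returns None when the curvature estimate is not positive.
    """
    lg = dot(delta_lam, delta_grad)
    gg = dot(delta_grad, delta_grad)
    ll = dot(delta_lam, delta_lam)
    if lg <= 0.0 or gg == 0.0:
        return None
    sd = ll / lg  # steepest-descent estimate
    mg = lg / gg  # minimum-gradient estimate (never larger than sd)
    # Hybrid rule: prefer MG when the two estimates roughly agree.
    return mg if 2.0 * mg > sd else sd - 0.5 * mg
```

When the gradient really is linear with curvature 2, both estimates equal 0.5, so the rule returns the exact inverse curvature.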
4.4 Safeguarding
Spectral stepsize methods for simple gradient descent are paired with a backtracking line search to guarantee convergence in case the linear model assumptions break down and an unstable stepsize is produced. ADMM methods have no analog of backtracking. Rather, we adopt the correlation criterion proposed in [51] to test the validity of the local linear assumption, and only rely on the adaptive model when the assumptions are deemed valid. To this end, we define
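$$\alpha_k^{cor} = \frac{\langle \Delta \hat{H}_k, \Delta \hat{\lambda}_k \rangle}{\| \Delta \hat{H}_k \| \, \| \Delta \hat{\lambda}_k \|} \quad \text{and} \quad \beta_k^{cor} = \frac{\langle \Delta \hat{G}_k, \Delta \lambda_k \rangle}{\| \Delta \hat{G}_k \| \, \| \Delta \lambda_k \|}. \tag{20}$$
When the model assumptions (14) hold perfectly, the vectors $\Delta \hat{H}_k$ and $\Delta \hat{\lambda}_k$ should be highly correlated and we get $\alpha_k^{cor} = 1$. When $\alpha_k^{cor}$ or $\beta_k^{cor}$ is small, the model assumptions are invalid and the spectral stepsize may not be effective.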
The proposed method uses the following update rules
[TABLE]
[TABLE]
where $\epsilon^{cor}$ is a quality threshold for the curvature estimates, while $\hat{\alpha}_k$ and $\hat{\beta}_k$ are the spectral stepsizes estimated in Section 4.3. The update for $\tau_{k+1}$ only uses model parameters that have been accurately estimated. When the model is effective for $\hat{H}$ but not $\hat{G}$, we use a large $\gamma$ to make the $v$ update conservative relative to the $u$ update. When the model is effective for $\hat{G}$ but not $\hat{H}$, we use a small $\gamma$ to make the $v$ update aggressive relative to the $u$ update.
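The safeguarding logic can be sketched as a four-way case split on the two correlation scores. Everything numeric here is an illustrative assumption: the threshold default `eps_cor`, and the fallback relaxation values `gamma_large`, `gamma_small`, and `gamma_default`, are placeholders standing in for the "large" and "small" values the text refers to. Using a single reliable estimate as the penalty in the mixed cases is likewise an assumption, in the spirit of [51].

```python
import math


def correlation(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is zero."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)


def safeguarded_update(tau, alpha_hat, beta_hat, alpha_cor, beta_cor,
                       eps_cor=0.2, gamma_large=1.9, gamma_small=1.1,
                       gamma_default=1.5):
    """Return (tau_next, gamma_next) using only trusted curvature estimates."""
    alpha_ok = alpha_cor > eps_cor
    beta_ok = beta_cor > eps_cor
    if alpha_ok and beta_ok:
        root = math.sqrt(alpha_hat * beta_hat)
        return root, 1.0 + 2.0 * root / (alpha_hat + beta_hat)
    if alpha_ok:
        return alpha_hat, gamma_large   # only the H-hat model is trusted
    if beta_ok:
        return beta_hat, gamma_small    # only the G-hat model is trusted
    return tau, gamma_default           # keep the previous penalty
```

Treating a zero-length difference vector as uncorrelated avoids a division by zero when an iterate has stopped moving, in which case the previous parameters are simply retained.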
4.5 Applying convergence guarantee
Our convergence theory requires either Assumption 1 or Assumption 2 to be satisfied, which suggests that convergence is guaranteed under “bounded adaptivity” for both penalty and relaxation parameters. These conditions can be guaranteed by explicitly adding constraints to the stepsize choice in ARADMM.
To guarantee convergence, we simply replace the parameter updates (21) and (22) with
[TABLE]
where $C_{cg}$ is some (large) constant. It is easily verified that the parameter sequence $(\tau_k, \gamma_k)$ satisfies Assumption 1. In practice, the update schemes (21) and (22) converge reliably without explicitly enforcing these conditions. We use a very large $C_{cg}$ so that the conditions are not triggered in the first few thousand iterations, and provide these constraints for theoretical interest.
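One way such "bounded adaptivity" could be enforced is to cap how much the parameters may grow at iteration k by a factor whose excess over 1 is summable. The sketch below uses the cap 1 + C/k^2; this specific form, the function name, and the default constant are assumptions made for illustration and are not a transcription of the paper's rule.

```python
def bounded_update(tau_prev, tau_new, gamma_new, k, c_cg=1e10):
    """Clamp proposed parameters so growth at iteration k is at most 1 + c_cg/k^2.

    tau_prev  : penalty used at the previous iteration
    tau_new   : penalty proposed by the adaptive rule
    gamma_new : relaxation proposed by the adaptive rule
    k         : current iteration index (k >= 1)
    """
    growth = 1.0 + c_cg / float(k * k)
    return min(tau_new, growth * tau_prev), min(gamma_new, growth)
```

With a very large constant the clamp is inactive in early iterations, matching the remark that the constraint matters for the theory rather than for observed behavior.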
4.6 ARADMM algorithm
The complete adaptive relaxed ADMM (ARADMM) is shown in Algorithm 1. We suggest only updating the stepsize every $T_f$ iterations. We suggest a fixed safeguarding threshold $\epsilon^{cor}$, which is used in all the experiments in Section 6. The overhead of the adaptive scheme is modest, requiring only a few inner product calculations.
5 Proofs of convergence theorems
We now prove that relaxed ADMM converges under Assumption 1 or 2. Let
[TABLE]
We use and to denote iterates, and and denote optimal solutions. Set , and , and define
[TABLE]
Notice that is monotone, which means .
Problem formulation (1) can be reformulated as a variational inequality (VI). The optimal solution satisfies
[TABLE]
Likewise, the ADMM iterates produced by steps (2) and (4) satisfy the variational inequalities
[TABLE]
Using the definitions of , , , and in (24, 25), in (5), and in (3), VI (27) and (28) combine to yield
[TABLE]
We then apply VI (26), (28), and (29) in order to prove the following lemmas for our contraction proof, which show that the difference between iterates decreases as the iterates approach the true solution. The remaining details of the proof are in the supplementary material.
Lemma 1.
The iterates generated by ADMM satisfy
[TABLE]
Lemma 2.
Let The optimal solution and iterates generated by ADMM satisfy
[TABLE]
5.1 Convergence with adaptivity
We are now ready to state our main convergence results. The proof of Theorem 1 is shown here in full, and leverages Lemma 2 to produce a contraction argument. The proof of Theorem 2 is extremely similar, and is shown in the supplementary material.
Theorem 1.
Suppose Assumption 1 holds. Then, the iterates generated by ADMM satisfy
[TABLE]
Proof.
Assumption 1 implies
[TABLE]
If as in Assumption 1, then Lemma 2 shows
[TABLE]
where (33) is used to get from (34) to (35). Accumulating inequality (35) from to shows
[TABLE]
Assumption 1 also implies , and . Then, (5.1) indicates and
[TABLE]
Now, from Lemma 1, and so
[TABLE]
The residuals in (6) satisfy
[TABLE]
from which we get
[TABLE]
[TABLE]
∎
Similar methods can be used to prove the following about convergence under Assumption 2. The proof of the following theorem is given in the supplementary material.
Theorem 2.
Suppose Assumption 2 holds. Then, the iterates generated by ADMM satisfy
[TABLE]
6 Applications
We focus on the following statistical and image processing problems involving non-differentiable objectives: linear regression with elastic net regularization (EN), low-rank least squares (LRLS), quadratic programming (QP), consensus $\ell_1$-regularized logistic regression, support vector machine (SVM), total variation image restoration (TVIR), and robust principal component analysis (RPCA). We study several vision benchmark datasets such as the extended Yale B face dataset [13], MNIST digital images [29], and CIFAR10 object images [28] (we use the first batch of CIFAR10, which contains 10000 samples). We also use synthetic and benchmark datasets from [7, 57, 30, 43, 33, 21], which are obtained from the UCI repository and the LIBSVM page. The experimental setups for each problem are briefly described here, and the implementation details are provided in the supplementary material.
Linear regression with EN regularization
Elastic net (EN) is a modification of the $\ell_1$-norm (or LASSO) regularizer that helps deal with highly correlated variables [57, 21], and requires solving
[TABLE]
where denotes the -norm, is the data matrix, contains measurements, and is the vector of regression coefficient.
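When ADMM is applied to an elastic net problem, the regularizer is typically handled through its proximal operator, which has a closed form. The sketch below assumes the regularizer is written as rho1*||x||_1 + (rho2/2)*||x||_2^2 and that the ADMM penalty is tau; the weight names `rho1` and `rho2` and the function names are hypothetical, since the corresponding formula is not reproduced above.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x| (scalar soft-thresholding)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def prox_elastic_net(z, rho1, rho2, tau):
    """Elementwise argmin_x rho1*|x| + rho2/2*x^2 + tau/2*(x - z)^2.

    The l1 term contributes a soft-threshold at rho1/tau and the
    quadratic term an extra shrinkage by 1/(1 + rho2/tau).
    """
    return [soft_threshold(z_i, rho1 / tau) / (1.0 + rho2 / tau) for z_i in z]
```

Entries with magnitude below rho1/tau are set exactly to zero, which is what produces sparse regression coefficients.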
Low-rank least squares (LRLS)
The nuclear norm (the $\ell_1$-norm of the matrix singular values) is a convex surrogate for matrix rank. ADMM has been applied to solve low rank least squares problems [55, 53]
[TABLE]
where denotes the nuclear norm, denotes the Frobenius norm, is a data matrix, contains measurements, and contains variables.
ADMM is applied by splitting the regression term and the non-differentiable regularizer composed of nuclear and Frobenius norm. LRLS has been used to formulate exemplar classifiers and discover visual subcategories [53].
SVM and QP
Support vector machine (SVM) is one of the most successful binary classifiers for computer vision. The dual of the SVM is a QP problem,
[TABLE]
where is the SVM dual variable, is the kernel matrix, is a vector of labels, is a vector of ones, and [3]. The canonical QP is also considered,
[TABLE]
Consensus $\ell_1$-regularized logistic regression
ADMM has become an important tool for solving distributed optimization problems [2]. A typical problem is the consensus $\ell_1$-regularized logistic regression
[TABLE]
where represents the local variable on the th distributed node, is the global variable, is the number of samples in the th block, is the th sample, and is the corresponding label.
Unwrapped SVM
The unwrapped formulation of SVM [22], which can be used in distributed computing environments via “transpose reduction” tricks, applies ADMM to the primal form of SVM to solve
[TABLE]
where is the th sample of training data, and is the corresponding label. ADMM is applied by splitting the $\ell_2$-norm regularizer and the non-differentiable hinge loss term.
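After splitting, the hinge loss term is usually handled by an elementwise proximal step with a closed form. The sketch below assumes the hinge loss is written as C*max(1 - z, 0) on margin variables z (labels already folded into the margins) and that tau is the ADMM penalty; the names `C`, `w`, and `prox_hinge` are hypothetical.

```python
def prox_hinge(w, C, tau):
    """Elementwise argmin_z C*max(1 - z, 0) + tau/2*(z - w)^2.

    Three regimes per entry:
      w >= 1            : loss inactive, keep w
      w <= 1 - C/tau    : loss active, shift by C/tau
      otherwise         : land exactly on the kink z = 1
    """
    shift = C / tau
    out = []
    for w_i in w:
        if w_i >= 1.0:
            out.append(w_i)
        elif w_i <= 1.0 - shift:
            out.append(w_i + shift)
        else:
            out.append(1.0)
    return out
```

Because the step is separable across samples, its cost is linear in the number of training points.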
Total variation image denoising (TVID)
Total variation image denoising is often performed by solving [42]
[TABLE]
where represents a given noisy image, and is the discrete gradient operator, which computes differences between adjacent image pixels. ADMM is applied by splitting the $\ell_2$-norm term and the non-differentiable total variation term.
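The role of the discrete gradient operator can be illustrated on a 1-D signal, where it reduces to forward differences between neighboring samples. This is a simplified sketch: the 1-D setting, the weight name `rho`, and the function names are assumptions, and image denoising would apply the same differences along both image axes.

```python
def forward_differences(x):
    """Discrete gradient of a 1-D signal: differences of adjacent samples."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def total_variation(x):
    """l1-norm of the discrete gradient."""
    return sum(abs(g) for g in forward_differences(x))


def tv_denoising_objective(x, noisy, rho):
    """0.5*||x - noisy||_2^2 + rho*TV(x) for 1-D signals of equal length."""
    fidelity = 0.5 * sum((x_i - c_i) ** 2 for x_i, c_i in zip(x, noisy))
    return fidelity + rho * total_variation(x)
```

A piecewise-constant signal has small total variation, so the penalty favors flat regions while still permitting sharp jumps.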
RPCA
Robust principal component analysis (RPCA) has broad applications in computer vision and imaging [47, 37, 39]. RPCA recovers a low-rank matrix and a sparse matrix by solving
[TABLE]
where the nuclear norm is used to obtain a low rank matrix , and is used to obtain a sparse error .
7 Experiments
The proposed ARADMM is implemented as shown in Algorithm 1. We also implemented vanilla ADMM, (non-adaptive) relaxed ADMM, ADMM with residual balancing (RB), and adaptive ADMM (AADMM) for comparison.
The relaxation parameter for the non-adaptive relaxed ADMM is fixed at as suggested in [6]. The parameters of RB and AADMM are selected as in [24, 2, 51]. The initial penalty and initial relaxation are used for all problems except the canonical QP problem, where initial parameters are set to the geometric mean of the maximum and minimum eigenvalues of matrix , as proposed for quadratic problems in [40].
For each problem, the same randomly generated initial variables are used for ADMM and its variant methods. As suggested by [24, 51], the adaptivity of RB and AADMM is stopped after 1000 iterations to guarantee convergence.
7.1 Convergence results
Table 1 reports the convergence speed of ADMM and its variants for the applications described in Section 6. More experimental results including the table of more test cases, the convergence curves, and visual results of image restoration and robust PCA for face decomposition are provided in the supplementary material. Relaxed ADMM often outperforms vanilla ADMM, but does not compete with adaptive methods like RB, AADMM and ARADMM. The proposed ARADMM performs best in all the test cases.
7.2 Sensitivity to initialization
We study the sensitivity of the different ADMM variants to the initial penalty ($\tau_0$) and initial relaxation parameter ($\gamma_0$). Fig. 1 presents iteration counts for a wide range of values of $\tau_0$ and $\gamma_0$, for elastic net regression with synthetic datasets. In the left and center plots we fix one of $\tau_0$, $\gamma_0$ and vary the other. The number of iterations needed to converge is plotted as the algorithm parameters vary. In the right plot, we use a grid search to find the optimal $\tau_0$ for different values of $\gamma_0$. Fig. 1 (left) shows that adaptive methods are relatively stable with respect to the initial penalty $\tau_0$, while ARADMM outperforms RB and AADMM in all choices of initial $\tau_0$. Fig. 1 (middle) suggests that the relaxation $\gamma_0$ is generally less important than $\tau_0$. When a bad value of $\tau_0$ is chosen, it is unlikely that a good choice of $\gamma_0$ can compensate. The proposed ARADMM that jointly adjusts $(\tau_k, \gamma_k)$ is generally better than simply adding the relaxation to the existing adaptive methods RB and AADMM.
Fig. 1 (right) shows the sensitivity to $\gamma_0$ when using a grid search to choose the optimal $\tau_0$. This optimal $\tau_0$ significantly improves the performance of vanilla ADMM and relaxed ADMM (which use the same $\tau$ for all iterations). Even when using the optimal stepsize for the non-adaptive methods, ARADMM is superior to or competitive with the non-adaptive methods. Note that this experiment is meant to show a best-case scenario for the non-adaptive methods; in practice the user generally has no knowledge of the optimal value of $\tau_0$. Adaptive methods achieve optimal or near-optimal performance without an expensive grid search.
7.3 Sensitivity to safeguarding
Finally, Fig. 2 presents iteration counts when applying ARADMM with various safeguarding correlation thresholds $\epsilon^{cor}$. When $\epsilon^{cor} = 0$, the calculated adaptive parameters based on curvature estimations are always accepted, and when $\epsilon^{cor} = 1$ the parameters are never changed. The proposed ARADMM method is insensitive to $\epsilon^{cor}$ and performs well for a wide range of $\epsilon^{cor}$ for various applications, except for unwrapping SVM and RPCA. Though tuning such “hyper-parameters” may improve the performance of ARADMM for some applications, the fixed $\epsilon^{cor}$ performs well in all our experiments (seven applications and over fifty test cases; a full list is in the supplementary material). The proposed ARADMM is fully automated and performs well without parameter tuning.
8 Conclusion
We have proposed an adaptive method for jointly tuning the penalty and relaxation parameters of relaxed ADMM without user oversight. We have analyzed adaptive relaxed ADMM schemes, and provided conditions for which convergence is guaranteed. Experiments on a wide range of machine learning, computer vision, and image processing benchmarks have demonstrated that the proposed adaptive method (often significantly) outperforms other ADMM variants without user oversight or parameter tuning. The new adaptive method improves the applicability of relaxed ADMM by facilitating fully automated solvers that exhibit fast convergence and are usable by non-expert users.
Acknowledgments
TG and ZX were supported by the US Office of Naval Research under grant N00014-17-1-2078 and by the US National Science Foundation (NSF) under grant CCF-1535902. MF was partially supported by the Fundação para a Ciência e Tecnologia, grant UID/EEA/5008/2013. XY was supported by the General Research Fund from Hong Kong Research Grants Council under grant HKBU-12313516. CS was supported in part by Xilinx Inc., and by the US NSF under grants ECCS-1408006, CCF-1535897, and CAREER CCF-1652065.
References
- [1] J. Barzilai and J. Borwein. Two-point step size gradient methods. IMA J. Num. Analysis, 8:141–148, 1988.
- [2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. and Trends in Mach. Learning, 3:1–122, 2011.
- [3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
- [4] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- [5] D. Davis and W. Yin. Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. arXiv preprint arXiv:1407.5210, 2014.
- [6] J. Eckstein and D. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
- [7] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
- [8] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2790–2797. IEEE, 2009.
