Accelerated Primal-dual Scheme for a Class of Stochastic Nonconvex-concave Saddle Point Problems
Morteza Boroun, Zeinab Alizadeh, Afrooz Jalilzadeh

TL;DR
This paper introduces a novel single-loop accelerated primal-dual algorithm for stochastic nonconvex-concave saddle point problems, achieving improved convergence rates and addressing slow convergence issues of existing methods.
Contribution
It proposes the first single-loop accelerated primal-dual method with new convergence rate results for a class of nonconvex saddle point problems satisfying the Polyak-Łojasiewicz condition.
Findings
Achieves a stochastic convergence rate of O(ε^{-4}) for ε-gap solutions.
Improves to an O(ε^{-2}) rate in deterministic settings.
Addresses slow convergence and multi-loop issues of prior algorithms.
| Dataset | SPDM | SPDHG | SMP |
|---|---|---|---|
| Colon-cancer | 3.75e-4 | 1.51e-2 | 2.70e-2 |
| Leukemia | 2.61e-4 | 6.99e-3 | 1.18e-2 |
Morteza Boroun*, Zeinab Alizadeh*, Afrooz Jalilzadeh
Department of Systems and Industrial Engineering, The University of Arizona, Tucson, AZ, USA. {morteza, zalizadeh, afrooz}@arizona.edu
Abstract
Stochastic nonconvex-concave min-max saddle point problems appear in many machine learning and control problems including distributionally robust optimization, generative adversarial networks, and adversarial learning. In this paper, we consider a class of nonconvex saddle point problems where the objective function satisfies the Polyak-Łojasiewicz condition with respect to the minimization variable and is concave with respect to the maximization variable. The existing methods for solving nonconvex-concave saddle point problems often suffer from slow convergence and/or contain multiple loops. Our main contribution lies in proposing a novel single-loop accelerated primal-dual algorithm with new convergence rate results appearing for the first time in the literature, to the best of our knowledge. In particular, in the stochastic regime, we demonstrate a convergence rate of $\mathcal{O}(\epsilon^{-4})$ to find an $\epsilon$-gap solution, which can be improved to $\mathcal{O}(\epsilon^{-2})$ in the deterministic setting.
1 Introduction
In this paper, we consider the following min-max saddle point (SP) game:
[TABLE]
where , , , is a random vector, is potentially nonconvex for any and satisfies the Polyak-Łojasiewicz (PL) condition (see Definition 1), is concave for any and is convex and possibly nonsmooth. Our goal is to develop an algorithm to find a first-order stationary point of this SP problem.
Recent emerging applications in machine learning and control have further stimulated a surge of interest in these problems. Examples that can be formulated as (1) include generative adversarial networks (GANs) [GBC16], fair classification [NSH+19], communications [ABR21, BKR19], and wireless systems [CL11b, FP09]. Convex-concave saddle point problems have been extensively studied in the literature [CP16, HA21]. However, recent applications in machine learning and control may involve nonconvexity. One class of nonconvex-concave min-max problems arises when the objective function satisfies the PL condition, which is the class we study in this paper. Next, we provide two examples that can be formulated as problem (1) and satisfy the PL condition.
Example 1 (Generative adversarial imitation learning).
One practical example of a PL-game is generative adversarial imitation learning of linear quadratic regulators (LQR). Imitation learning techniques aim to mimic human behavior by observing an expert demonstrating a given task [HGEJ17]. Generative adversarial imitation learning (GAIL), studied in [HE16], solves imitation learning via min-max optimization. Let represent the choice of the policy, represent the expert policy, and let the cost parameter and the expected cumulative cost for a given policy be denoted by and , respectively. The problem of GAIL for LQR can be formulated [CHCW19] as where , , , and . It is shown in [NSH+19] that satisfies the PL condition. This problem is a special case of (1), for , where denotes the indicator function of set .
Example 2 (Distributionally robust optimization).
Define , where is a loss function, possibly nonconvex, and . Distributionally robust optimization (DRO) studies worst-case performance under uncertainty to find solutions with some specific confidence level [ND16]. DRO can be formulated as where represents the uncertainty set, e.g., is an uncertainty set considered in [ND16] and denotes the divergence measure between two sets of probability measures and . As shown in [GYYY20], DRO for deep learning with the ReLU activation function satisfies the PL condition in an -neighborhood around a randomly initialized point. This problem is a special case of (1), for .
One natural way to solve problem (1) is to take two simultaneous or sequential steps, one reducing the objective function for a given and one increasing the objective function for a given . One of the most famous algorithms for solving such problems is gradient descent-ascent (GDA) [NO09]. It has been discovered that such a naive approach leads to poor performance and may even diverge for simple problems. One way to resolve this issue is by adding a momentum in terms of the gradient of the objective function. Although this approach leads to an optimal convergence rate result [HA21, Zha21], it may not be directly applicable in the nonconvex-concave setting. Therefore, we aim to develop a novel primal-dual algorithm with acceleration in the primal update as well as a new momentum in the dual update.
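The divergence of naive GDA can be illustrated on the textbook bilinear game $f(x,y)=xy$, which is not taken from this paper and serves only as an illustration. The following is a minimal Python sketch, assuming simultaneous updates with a constant stepsize `eta` (a hypothetical name). Each step multiplies the squared distance to the saddle point $(0,0)$ by exactly $1+\eta^{2}$, so the iterates spiral outward for any positive stepsize.

```python
import math

def simultaneous_gda(x, y, eta, steps):
    """Simultaneous gradient descent-ascent on the toy game f(x, y) = x * y."""
    for _ in range(steps):
        gx, gy = y, x                      # df/dx = y, df/dy = x
        x, y = x - eta * gx, y + eta * gy  # descent in x, ascent in y
    return x, y

x0, y0, eta, steps = 1.0, 1.0, 0.1, 100
xT, yT = simultaneous_gda(x0, y0, eta, steps)

# Ratio of final to initial distance from the saddle point (0, 0).
growth = math.hypot(xT, yT) / math.hypot(x0, y0)
```

Here `growth` matches the closed form $(1+\eta^{2})^{\text{steps}/2}$, which is larger than one, so shrinking the stepsize slows the divergence but does not remove it.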
1.1 Related Works
Nonconvex-concave SP problem. Various algorithms have been proposed for solving nonconvex-concave SP problems due to their applicability in many modern machine learning problems. The existing methods can be categorized into two types: multi-loop and single-loop. In multi-loop algorithms [KM21, OLR21] one variable is updated in a few consecutive iterations until a certain condition is satisfied before another variable gets updated. Such methods are often difficult to implement in practice because the termination of the inner loop has a high impact on the overall complexity of such algorithms, and selecting a conservative criterion may lead to a high computational cost while an inadequate number of inner iterations may lead to poor performance. Therefore, there have been some recent efforts [LTHC20, ZXSL20, XZXL22] to design and analyze single-loop algorithms to solve nonconvex-concave problems. In particular, a convergence rate of has been obtained for the aforementioned single-loop algorithms. Authors in [ZXSL20] were able to improve the rate to for a special case of nonconvex-concave problem, i.e., , where is a probability simplex. There are also several studies [RLLY18, LJJ20, ZAG22] in the stochastic regime. See Table 1 for more details.
PL condition. Rate results for nonconvex-concave problems can be improved for a class of problems where the objective function satisfies the PL condition. Recently, nonconvex-PL SP problems have been studied in [NSH+19, ALD21] and [YOLH22] assuming that the objective satisfies the one-sided PL condition. Multi-loop algorithms [NSH+19, ALD21] find an $\epsilon$-first-order stationary point of the problem within iterations, where denotes up to a logarithmic factor. The same rate result has been achieved in [FRM+21] and [YOLH22] for single-loop schemes. More recently, to guarantee global convergence, Yang et al. [YKH20] proposed an alternating gradient descent ascent algorithm with a linear convergence rate to solve SP problems where the objective satisfies the two-sided PL condition. Moreover, the convergence rate of has been shown for the stochastic regime under the two-sided PL condition. Subsequently, Guo et al. [GYYY20] improved the dependency of the convergence rate on the condition number (the ratio of the smoothness parameter to the PL constant).
1.2 Contributions
The existing methods for solving nonconvex-concave SP problems often suffer from slow convergence and/or contain multiple loops. Our main contribution lies in proposing a novel single-loop accelerated primal-dual algorithm with convergence rate results for PL-games appearing for the first time in the literature, to the best of our knowledge. Our main contributions are summarized as follows: (i) We propose an accelerated primal-dual scheme to solve problem (1). Our main idea lies in designing a novel algorithm by combining an accelerated step in the primal variable with a dual step involving a momentum in terms of the gradient of the objective function. (ii) Under a stochastic setting, using an acceleration where mini-batch sample gradients are utilized, our method achieves an oracle complexity (number of sample gradient calls) of $\mathcal{O}(\epsilon^{-4})$. (iii) Under a deterministic regime, we demonstrate a convergence guarantee of $\mathcal{O}(\epsilon^{-2})$ to find an $\epsilon$-stationary solution. This is the best-known rate for SP problems satisfying the one-sided PL condition, to the best of our knowledge.
2 Preliminaries
First, we define some important notation.
Notation. denotes the Euclidean vector norm, i.e., . denotes the proximal operator with respect to at , i.e., . is used to denote the expectation of a random variable . We define . Given the mini-batch samples and , we let and . We define the $\sigma$-algebras and .
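To make the proximal operator concrete, the following Python sketch implements two closed forms, assuming the standard definition $\operatorname{prox}_{h}(v)=\arg\min_{u}\{h(u)+\tfrac{1}{2}\|u-v\|^{2}\}$. The function names and the two choices of $h$ (the indicator of a box and a scaled $\ell_1$ norm) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def prox_box_indicator(v, lo, hi):
    """prox of the indicator of {u : lo <= u <= hi}: Euclidean projection onto the box."""
    return np.clip(v, lo, hi)

def prox_l1(v, t):
    """prox of h(u) = t * ||u||_1: componentwise soft-thresholding at level t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

v = np.array([-2.0, 0.3, 1.5])
projected = prox_box_indicator(v, -1.0, 1.0)  # clips entries into [-1, 1]
shrunk = prox_l1(v, 0.5)                      # shrinks each entry toward zero by 0.5
```

The first case is the one relevant to Example 1, where the nonsmooth term is an indicator function, so the proximal step reduces to a projection.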
Now we briefly highlight a few aspects of the PL condition [Pol63] that differentiate it from convexity and make it a more relevant and appealing setting for many machine learning applications. For the unconstrained minimization problem $\min_{x} f(x)$, we say that a function $f$ satisfies the PL inequality if, for some $\mu>0$, $\tfrac{1}{2}\|\nabla f(x)\|^{2}\geq\mu(f(x)-f(x^{*}))$ for all $x$. To verify the PL condition, we need access to the value of the objective function and the norm of the gradient, which is often tractable and can be estimated from a sub-sample of the data. However, for verifying convexity, one needs to estimate the minimum eigenvalue of the Hessian matrix. Moreover, the norm of the gradient is much more resilient to perturbation of the objective function than the smallest eigenvalue of the Hessian [BBM18].
The PL condition does not require strong convexity or even convexity of the objective function. It has been shown to hold for different classes of problems; for instance, conditions like the restricted secant inequality [ZY13] and one-point convexity [AZ18] are special cases of the PL condition. Problems satisfying such conditions include dictionary learning [AGMM15], neural networks [LY17] and phase retrieval [CC15], to name a few. In this paper, we consider a min-max SP problem and we assume that the objective function satisfies the one-sided PL inequality.
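A quick numerical check shows that a function can be nonconvex and still satisfy the PL inequality. The sketch below uses $f(x)=x^{2}+3\sin^{2}(x)$, a standard example from the PL literature that does not appear in this paper, with the constant $\mu=1/32$ taken as an assumption that the grid check confirms on $[-10,10]$. All names are illustrative.

```python
import numpy as np

def f(x):
    return x**2 + 3.0 * np.sin(x)**2

def grad_f(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)   # d/dx of 3 sin^2(x) is 3 sin(2x)

def hess_f(x):
    return 2.0 + 6.0 * np.cos(2.0 * x)

mu = 1.0 / 32.0
f_star = 0.0                                  # minimum value, attained at x = 0
xs = np.linspace(-10.0, 10.0, 20001)

# PL inequality: 0.5 * |f'(x)|^2 - mu * (f(x) - f*) should be nonnegative.
pl_gap = 0.5 * grad_f(xs)**2 - mu * (f(xs) - f_star)
pl_holds_on_grid = bool(np.all(pl_gap >= -1e-12))

# Negative curvature at x = pi/2 shows that f is not convex.
nonconvex = bool(hess_f(np.pi / 2.0) < 0.0)
```

The check only needs function values and gradient norms, which is the practical advantage over verifying convexity noted above.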
Definition 1.
A continuously differentiable function satisfies the one-sided PL condition if there exists a constant such that for all where .
Now we state our main assumptions.
Assumption 1.
(i) The solution set of problem (1) is nonempty; (ii) Function is convex and possibly nonsmooth; (iii) is continuously differentiable, satisfies the one-sided PL condition, and is concave for any .
Assumption 2.
is Lipschitz continuous, i.e., there exist and such that Moreover, is linear in terms of .
Note that Assumption 2 implies that
[TABLE]
Under the stochastic setting, we assume that the sample gradients can be generated by satisfying the following standard conditions.
Assumption 3.
Each component function has unbiased stochastic gradients with bounded variance:
[TABLE]
3 Primal-Dual Method with Momentum
In this section, we propose a primal-dual algorithm with momentum (PDM) for deterministic PL-concave problems. The details of the method can be seen in Algorithm 1. Then, we introduce stochastic PDM (SPDM) for the stochastic setting (see Algorithm 2).
Algorithm 1 consists of single-loop primal-dual steps. After initialization of parameters, at each iteration , a proximal gradient ascent step for the variable is taken in the direction of with an additive momentum term . Such a momentum is an algorithmic approach to gain acceleration for solving PL-concave problems. Finally, after computing the gradient at , two gradient descent steps for the variable are taken to generate and , which are then combined by a convex combination in the next iteration.
Remark 1.
If we let $\lambda_{k}=\alpha_{k}\gamma_{k}$, then the primal step in Algorithm 1 is similar to one of the variants of Nesterov’s acceleration (see [Nes03] and [GL16]). Moreover, when it can be shown that and which is similar to a gradient descent step for the minimization variable.
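The listing of Algorithm 1 is not reproduced in this text, so the following Python sketch is only one possible reading of the prose description: a convex combination of two primal sequences, a proximal ascent step with a momentum term, and two gradient descent steps with stepsizes `gamma` and `lam`. The momentum term (a difference of consecutive dual gradients), the weight $\alpha_k = 2/(k+1)$, the parameter names, and the toy problem $\tfrac{1}{2}\|x\|^{2}+x^{\top}y-\tfrac{1}{2}\|y\|^{2}$ are all assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np

def pdm_sketch(grad_x, grad_y, prox_h, x0, y0, gamma, lam, sigma,
               theta=1.0, iters=300):
    """Single-loop primal-dual iteration with dual momentum (illustrative sketch)."""
    x = np.array(x0, dtype=float)
    x_tilde = x.copy()
    y = np.array(y0, dtype=float)
    gy_prev = grad_y(x, y)
    for k in range(1, iters + 1):
        alpha = 2.0 / (k + 1)                    # assumed averaging weight
        z = (1.0 - alpha) * x_tilde + alpha * x  # convex combination of primal iterates
        gy = grad_y(z, y)
        q = gy - gy_prev                         # assumed momentum: gradient difference
        y = prox_h(y + sigma * (gy + theta * q), sigma)  # proximal ascent step
        gx = grad_x(z, y)
        x = x - gamma * gx                       # first descent step
        x_tilde = z - lam * gx                   # second descent step
        gy_prev = gy
    return x, x_tilde, y

# Toy problem (assumption): smooth part 0.5*||x||^2 + x^T y, nonsmooth part h(y) = 0.5*||y||^2.
toy_grad_x = lambda x, y: x + y
toy_grad_y = lambda x, y: x
toy_prox_h = lambda v, s: v / (1.0 + s)          # prox of s * 0.5 * ||.||^2

x_fin, x_tilde_fin, y_fin = pdm_sketch(toy_grad_x, toy_grad_y, toy_prox_h,
                                       [1.0], [1.0], gamma=0.1, lam=0.1, sigma=0.1)
```

With `lam` equal to `gamma` the two primal sequences coincide, which agrees with the identity for $\tilde{x}_{k+1}-x_{k+1}$ used in the appendix, and the primal update reduces to a plain gradient descent step. On this toy problem the iterates approach the saddle point at the origin.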
For the stochastic setting, SPDM is proposed in Algorithm 2, where the main steps of the algorithm are similar to those of Algorithm 1. The main difference is that instead of computing the exact gradient, we estimate the gradient of the function by drawing mini-batch samples and in Step 4.
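The mini-batch estimate can be sketched as a plain average of per-sample gradients. The Python snippet below is a minimal illustration, not the paper's Step 4: the helper `minibatch_gradient`, the toy loss $\tfrac{1}{2}(x-\xi)^{2}$ with $\xi\sim N(0,1)$, and the batch sizes are all assumptions. For i.i.d. samples with unbiased gradients, the average stays unbiased and its variance shrinks in proportion to one over the batch size.

```python
import numpy as np

def minibatch_gradient(sample_grad, x, samples):
    """Average of per-sample gradients; unbiased whenever each sample gradient is."""
    return np.mean([sample_grad(x, xi) for xi in samples], axis=0)

rng = np.random.default_rng(0)
sample_grad = lambda x, xi: x - xi    # gradient in x of 0.5 * (x - xi)^2

x = 2.0                               # exact gradient of E[0.5 * (x - xi)^2] is x - E[xi] = 2
small_batch = minibatch_gradient(sample_grad, x, rng.normal(size=10))
large_batch = minibatch_gradient(sample_grad, x, rng.normal(size=100_000))
exact_on_pair = minibatch_gradient(sample_grad, x, [1.0, 3.0])
```

The large-batch estimate lands close to the exact gradient, while the small-batch estimate is noticeably noisier; this trade-off between batch size and accuracy is what the sample complexity bound accounts for.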
4 Convergence Analysis
In this section, we study the convergence properties of Algorithms 1 and 2 for stochastic (and also deterministic) settings. All related proofs are provided in the appendix. Our goal is to find a first-order stationary point of problem (1). For a given positive $\epsilon$, we define a point as an $\epsilon$-stationary solution of problem (1) if and for some .
For our analysis, for all , define as:
[TABLE]
where
Remark 2.
By choosing , and for any , from the definition of , one can show that (see [GL16]).
Now we establish the convergence rate of SPDM for solving stochastic PL-concave SP problem (1). In Algorithm 2, to estimate the gradient of the function, we draw mini-batch samples and at each iteration, where .
Theorem 1.
Let be generated by Algorithm 2 and suppose Assumptions 1, 2 and 3 hold. Moreover, let , and for any and . Then, there exists an iteration such that is an $\epsilon$-stationary point of problem (1), which can be obtained within $\mathcal{O}(\epsilon^{-4})$ evaluations of sample gradients.
Consider function in problem (1) to be deterministic, i.e., exact gradients and are available. We show that the convergence rate can be improved to $\mathcal{O}(\epsilon^{-2})$.
Theorem 2.
Let be generated by Algorithm 1 and suppose Assumptions 1 and 2 hold. Choosing parameters as in Theorem 1, there exists an iteration such that is an $\epsilon$-stationary point, which can be obtained within $\mathcal{O}(\epsilon^{-2})$ evaluations of the gradients.
The proof for the deterministic setting, i.e., Theorem 2, is similar to that of Theorem 1, by letting .
5 Numerical Results
Generative Adversarial Imitation Learning. In this section, we implement our method to solve the GAIL problem described in Example 1. The code utilized in our experiment was adapted from an existing implementation developed by [YKH20]. To validate the efficiency of the proposed scheme, we compare the PDM algorithm with alternating gradient descent ascent (AGDA) [YKH20], Smoothed-GDA [ZXSL20], and AGP [XZXL22]. The optimal control problem for LQR can be formulated as follows [CHCW19]:
[TABLE]
where , are both positive definite matrices, , , , is a control, is a state, is a policy, and is a given initial distribution. In the infinite-horizon setting with a stochastic initial state , the optimal control input can be written as a linear function where is the policy and does not depend on . We denote the expected cumulative cost in (4) by , where . To estimate the expected cumulative cost, we sample initial points and estimate using sample average:
In GAIL for LQR, the goal is to learn the cost function parameters and from the expert after the trajectories induced by an expert policy are observed. Hence, the min-max formulation of the imitation learning problem is where , is a regularization term that we added so that the problem becomes strongly concave and the AGDA scheme can be applied (see [YKH20]). Moreover, is the feasible set of the cost parameters. We assume is convex and there exist positive constants and such that for any we have We generate three different data sets for different choices of and and we set , and . We choose , and -4. The exact gradient of the problem in compact form has been established in [FGKM18]. non-accelerated scheme (AGDA).
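The sample-average estimate of the expected cumulative cost mentioned above can be sketched in Python as follows. This is a hedged illustration rather than the experiment code: it assumes a discrete-time system $x_{t+1}=Ax_t+Bu_t$ with the linear policy $u_t=-Kx_t$ (the sign convention is an assumption), a stabilizing $K$, and hypothetical helper names. Under these assumptions the cost from an initial state $x_0$ is $x_0^{\top}P_Kx_0$, where $P_K$ solves a discrete Lyapunov equation.

```python
import numpy as np

def lqr_cost_matrix(A, B, Q, R, K):
    """Solve P = Q + K^T R K + (A - B K)^T P (A - B K) for a stabilizing K."""
    A_cl = A - B @ K                  # closed-loop dynamics under u = -K x
    Q_K = Q + K.T @ R @ K             # per-step cost matrix in terms of the state
    n = A.shape[0]
    # Row-major vectorization: vec(A_cl^T P A_cl) = kron(A_cl^T, A_cl^T) @ vec(P).
    lhs = np.eye(n * n) - np.kron(A_cl.T, A_cl.T)
    return np.linalg.solve(lhs, Q_K.reshape(-1)).reshape(n, n)

def sample_average_cost(P, x0_samples):
    """Average of x0^T P x0 over sampled initial states (one state per row)."""
    return float(np.mean(np.einsum("ij,jk,ik->i", x0_samples, P, x0_samples)))

# Scalar toy system (assumption): A = B = Q = R = 1 and K = 0.5.
A = np.array([[1.0]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
K = np.array([[0.5]])
P = lqr_cost_matrix(A, B, Q, R, K)
cost = sample_average_cost(P, np.array([[2.0], [0.0]]))
```

For the scalar toy system the closed-loop gain is 0.5, so $P = 1.25/(1-0.25) = 5/3$, and averaging over the two initial states 2 and 0 gives a cost of $10/3$.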
In Figure 1 (a) and (b), we compare the performance of our proposed method (PDM) with AGDA [YKH20], Smoothed-GDA [ZXSL20], and AGP [XZXL22]. We set the same stepsizes for all the methods to ensure fairness in our experiment. Other parameters for the competing methods are selected as suggested in their papers. In Figure 1 (c) and (d), we compare PDM with its stochastic variant (SPDM) by running both algorithms for the same amount of time. As can be seen, SPDM outperforms PDM and its superiority is more evident as becomes larger.
Distributionally robust optimization. Consider the following DRO problem.
[TABLE]
where , and . We compare our method with the stochastic accelerated primal-dual method proposed in [Zha21] (SPDHG) and stochastic mirror prox [JNT11] (SMP). We use the real datasets colon-cancer (n=62, m=2000) and leukemia (n=38, m=7129) from the LIBSVM library [CL11a]. Note that in these datasets the number of features is larger than the number of samples; therefore, computing is cheap while can be costly. Hence, we use an unbiased estimator with a batch size of 10 for all the methods. We run all algorithms for 300 seconds. The performance of the methods is depicted in Figure 2. Table 2 summarizes the performance of our algorithm and the competing methods in terms of the gap function. Our scheme outperforms the other algorithms, which matches the theoretical result. In fact, PDM has convergence rate of and the other two methods have a convergence rate of .
6 Concluding Remarks
In this paper, we proposed an accelerated primal-dual scheme for solving a class of nonconvex-concave problems where the objective function satisfies the PL condition, for both deterministic and stochastic settings. By combining an accelerated step in the minimization variable with an update involving a momentum in terms of the gradient of the objective function for the maximization variable, we obtained a convergence rate of $\mathcal{O}(\epsilon^{-4})$ and $\mathcal{O}(\epsilon^{-2})$ for the stochastic and deterministic problems, respectively. To the best of our knowledge, this is the first work that proposes a primal-dual scheme with momentum to solve PL-concave minimax problems.
There are different interesting directions for future work: (i) investigating a distributed variant of the proposed scheme over a network of agents; (ii) considering a more general setting of nonconvex-concave SP problems and developing a projection-free algorithm.
APPENDIX
In our analysis, we use the following technical lemma.
Lemma 1.
Given arbitrary sequences and $\{\bar{\alpha}_{k}\}_{k\geq 0}\subset\mathbb{R}^{++}$, let be a sequence such that and $v_{k+1}=v_{k}+\tfrac{\bar{\sigma}_{k}}{\bar{\alpha}_{k}}$. Then, for all and ,
[TABLE]
To prove the convergence rate, we use the following lemma (proof is similar to Lemma 3 in [GL16]).
Lemma 2.
For any given and , such that , and let for some such that and for some . If , for some , then and .
To facilitate the analysis, we define some notations.
Definition 2.
Let , , and . Moreover, we define , , and , where , and $\bar{E}_{k}^{x}\triangleq\tfrac{L_{xx}\Gamma_{k-1}(1-\alpha_{k})^{2}}{2}\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}},w_{\tau}^{T}\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})$.
In the next lemma, we provide a one-step analysis to obtain a bound for the norm of and progress of the dual iterates. This is the main building block of our convergence analysis in Theorem 1.
Lemma 3.
Let be generated by Algorithm 2 and suppose Assumptions 1-3 hold. Moreover, let , , , and for any and . Then, the following holds:
[TABLE]
Proof.
Define . From Assumption 2 and step 3 of Algorithm 2, the following can be obtained,
[TABLE]
Define , and using (2) and step 8 of Algorithm 2, one can obtain
[TABLE]
Define Combining (Proof.) and (Proof.):
[TABLE]
where we used . By steps 3, 8 and 9 of Algorithm 2 one can obtain If we divide both sides of the above equality by , sum over , and use the definition of , we obtain $\tilde{x}_{k+1}-x_{k+1}=\Gamma_{k}\sum_{\tau=0}^{k}\left(\tfrac{\gamma_{\tau}-\lambda_{\tau}}{\Gamma_{\tau}}\right)(\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})+w_{\tau})$. Using the above equality, Jensen’s inequality, and the fact that we obtain
[TABLE]
Using (Proof.) in (Proof.), one can obtain the following,
[TABLE]
Using Definition 2, summing both sides of (Proof.) over , and using the definition of in (3), we obtain the following
[TABLE]
From (Proof.) and Definition (1), one can obtain
[TABLE]
Adding to both sides:
[TABLE]
Using concavity of over , one can obtain
[TABLE]
Let us define , , , define , and . From optimality condition of step 6 in Algorithm 2, letting , one can obtain Multiplying both sides by and summing over , we obtain,
[TABLE]
Now, we simplify the inner products involving in (Proof.) and (13) using the definition of and .
[TABLE]
Moreover, using Young’s inequality, and step 8 in Algorithm 2, one can obtain
[TABLE]
Summing (13) and (Proof.), using (15) and (Proof.), we get,
[TABLE]
where . From the Cauchy-Schwarz inequality, using Lemma 1 where we choose , and defining , the following holds
[TABLE]
for some . Hence, using (Proof.) in (Proof.) and rearranging terms, one can obtain the following,
[TABLE]
Choosing the parameters such that , , , , , and for any , one can show that in (Proof.) Term (A) and Term (B). Therefore, choosing , the left-hand side (LHS) of (Proof.) can be bounded from below by $\big(\sum_{k=0}^{T-1}\min\{\tfrac{\gamma_{k}C_{k}}{4},\tfrac{\beta_{k}}{4\sigma_{k}}\}\big)\big(\|\nabla_{x}\mathcal{L}(z_{k^{*}},y_{k^{*}})\|^{2}+\|y_{k^{*}+1}-y_{k^{*}}\|^{2}\big)$. Moreover, letting be an arbitrary saddle point solution of (1), choosing , using the fact that and (15), one can obtain:
[TABLE]
where and we used . ∎
Now, we are ready to prove Theorem 1 and establish the convergence rate results.
Proof of Theorem 1. From (5), we have that
[TABLE]
Taking conditional expectation, one can show that , and . Hence, we obtain:
[TABLE]
Moreover, from the steps of Algorithm 2, . Using steps 8 and 9, one can show that . Hence . Invoking Lemma 2, we conclude that is an -stationary point of problem (1). To achieve an -stationary point, we set the right-hand side of (APPENDIX) equal to , which implies that . Hence, the total number of sample gradient evaluations is , since we chose . ∎
- [ABR21] Zeeshan Akhtar, Amrit Singh Bedi, and Ketan Rajawat. Conservative stochastic optimization: $O(t^{-1/2})$ optimality gap with zero constraint violation. In 2021 American Control Conference (ACC), pages 2224–2229. IEEE, 2021.
- [AGMM15] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Conference on Learning Theory, pages 113–149. PMLR, 2015.
- [ALD21] Sotirios-Konstantinos Anagnostidis, Aurelien Lucchi, and Youssef Diouane. Direct-search for a class of stochastic min-max problems. In International Conference on Artificial Intelligence and Statistics, pages 3772–3780. PMLR, 2021.
- [AZ18] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. Advances in Neural Information Processing Systems, 31, 2018.
- [BBM18] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
- [BKR19] Amrit Singh Bedi, Alec Koppel, and Ketan Rajawat. Asynchronous online learning in multi-agent systems with proximity constraints. IEEE Transactions on Signal and Information Processing over Networks, 5(3):479–494, 2019.
- [CC15] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Advances in Neural Information Processing Systems, 28, 2015.
- [CHCW19] Qi Cai, Mingyi Hong, Yongxin Chen, and Zhaoran Wang. On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674, 2019.
