Accelerated Primal-dual Scheme for a Class of Stochastic Nonconvex-concave Saddle Point Problems

Morteza Boroun; Zeinab Alizadeh; Afrooz Jalilzadeh

arXiv:2303.00211·math.OC·September 12, 2023·ACC

TL;DR

This paper introduces a novel single-loop accelerated primal-dual algorithm for stochastic nonconvex-concave saddle point problems, achieving improved convergence rates and addressing slow convergence issues of existing methods.

Contribution

It proposes the first single-loop accelerated primal-dual method with new convergence rate results for a class of nonconvex saddle point problems satisfying the Polyak-Łojasiewicz condition.

Findings

01

Achieves a stochastic convergence rate of O(ε^{-4}) for ε-gap solutions.

02

Improves to an O(ε^{-2}) rate in deterministic settings.

03

Addresses slow convergence and multi-loop issues of prior algorithms.

Abstract

Stochastic nonconvex-concave min-max saddle point problems appear in many machine learning and control problems including distributionally robust optimization, generative adversarial networks, and adversarial learning. In this paper, we consider a class of nonconvex saddle point problems where the objective function satisfies the Polyak-{\L}ojasiewicz condition with respect to the minimization variable and it is concave with respect to the maximization variable. The existing methods for solving nonconvex-concave saddle point problems often suffer from slow convergence and/or contain multiple loops. Our main contribution lies in proposing a novel single-loop accelerated primal-dual algorithm with new convergence rate results appearing for the first time in the literature, to the best of our knowledge. In particular, in the stochastic regime, we demonstrate a convergence rate of $\mathcal…

Tables

Table 1: Comparison of complexity between some of the main existing methods for solving the SP problem

References	Problem	Complexity (det.)	Complexity (stoch.)	# of loops
[HA21, Zha21]	SC-C	$\mathcal{O}(\epsilon^{-0.5})$	$\mathcal{O}(\epsilon^{-1})$	Single
[CP16, JNT11, Zha21]	C-C	$\mathcal{O}(\epsilon^{-1})$	$\mathcal{O}(\epsilon^{-2})$	Single
[RLLY18]	NC-C	$\mathcal{O}(\epsilon^{-6})$	$\tilde{\mathcal{O}}(\epsilon^{-6})$	Double
[LJJ20]	NC-C	$\mathcal{O}(\epsilon^{-6})$	$\mathcal{O}(\epsilon^{-8})$	Single
[ZAG22]	NC-C	–	$\mathcal{O}(\epsilon^{-6})$	Double
[FRM⁺21]	NC-PL	$\tilde{\mathcal{O}}(\epsilon^{-2})$	–	Single
This paper	PL-C	$\mathcal{O}(\epsilon^{-2})$	$\mathcal{O}(\epsilon^{-4})$	Single

Table 2: Comparison of the gap function $\sup_{(x,y)\in\mathcal{Z}}\left\{\Phi(x_{T},y)-\Phi(x,y_{T})\right\}$ for different methods

Dataset	SPDM	SPDHG	SMP
Colon-cancer	3.75e-4	1.51e-2	2.70e-2
Leukemia	2.61e-4	6.99e-3	1.18e-2

Equations

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\ \Phi(x,y)\triangleq\mathcal{L}(x,y)-h(y),$$

$$\mathcal{L}(x,y)-\mathcal{L}(\bar{x},y)-\langle\nabla_{x}\mathcal{L}(\bar{x},y),x-\bar{x}\rangle\leq\tfrac{L_{xx}}{2}\|x-\bar{x}\|^{2}.$$

$$\mathbb{E}[\nabla_{x}\mathcal{L}(x,y;\xi)\mid\mathcal{F}_{k}]=\nabla_{x}\mathcal{L}(x,y),\quad\mathbb{E}[\nabla_{y}\mathcal{L}(x,y;\xi)\mid\mathcal{H}_{k}]=\nabla_{y}\mathcal{L}(x,y),$$

$$\mathbb{E}[\|\nabla_{x}\mathcal{L}(x,y;\xi)-\nabla_{x}\mathcal{L}(x,y)\|^{2}]\leq\nu_{x}^{2},$$

$$\mathbb{E}[\|\nabla_{y}\mathcal{L}(x,y;\xi)-\nabla_{y}\mathcal{L}(x,y)\|^{2}]\leq\nu_{y}^{2}.$$

$$C_{k}\triangleq 1-L_{xx}\gamma_{k}-\frac{L_{xx}(\gamma_{k}-\lambda_{k})^{2}}{2\alpha_{k}\Gamma_{k}\gamma_{k}}\Big(\sum_{\tau=k}^{T-1}\Gamma_{\tau}\Big)\geq 0,$$

$$\begin{aligned}
\underset{\pi_{t}}{\text{minimize}}\quad&\mathbb{E}\Big[\sum_{t=0}^{\infty}x_{t}^{\top}Qx_{t}+u_{t}^{\top}Ru_{t}\Big]\\
\text{subject to}\quad&x_{t+1}=Ax_{t}+Bu_{t},\quad u_{t}=\pi_{t}(x_{t}),\quad x_{0}\sim\mathbb{D}_{0},
\end{aligned}$$

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\ \sum_{i=1}^{n}y_{i}\log\big(1+\exp(-b_{i}a_{i}^{T}x)\big),$$

$$\langle\bar{\sigma}_{k},x-v_{k}\rangle\leq\tfrac{\bar{\alpha}_{k}}{2}\|x-v_{k}\|^{2}-\tfrac{\bar{\alpha}_{k}}{2}\|x-v_{k+1}\|^{2}+\tfrac{1}{2\bar{\alpha}_{k}}\|\bar{\sigma}_{k}\|^{2}.$$

$$\begin{aligned}
&\|\nabla_{x}\mathcal{L}(z_{k^{*}},y_{k^{*}})\|^{2}+\|y_{k^{*}+1}-y_{k^{*}}\|^{2}\\
&\leq\tfrac{1}{TD}\Big[\mathcal{L}(x_{0},y^{*})-\mathcal{L}(x_{T},y^{*})+\tfrac{3\beta_{0}}{4\sigma_{0}}\|y^{*}-y_{0}\|^{2}+\tfrac{L_{xy}^{2}\gamma_{0}^{2}}{2\beta_{0}\tau_{0}}\|\nabla_{x}\mathcal{L}(z_{0},y_{0})\|^{2}\\
&\qquad+\sum_{k=0}^{T-1}\Big(\tfrac{1}{\bar{\alpha}_{k}}\|\beta_{k}u_{k}^{3}+u_{k}^{1}-u_{k}^{2}\|^{2}+\Xi_{k}+\zeta_{k}+U_{k}\Big)\Big].
\end{aligned}$$

$$\|\Delta_{k}\|=L_{xx}\|x_{k}-(1-\alpha_{k})\tilde{x}_{k}-\alpha_{k}x_{k}\|=L_{xx}(1-\alpha_{k})\|\tilde{x}_{k}-x_{k}\|.$$

$$\begin{aligned}
\mathcal{L}(x_{k+1},y_{k+1})
&\leq\mathcal{L}(x_{k},y_{k+1})+\langle\nabla_{x}\mathcal{L}(x_{k},y_{k+1}),x_{k+1}-x_{k}\rangle+\tfrac{L_{xx}}{2}\|x_{k+1}-x_{k}\|^{2}\\
&=\mathcal{L}(x_{k},y_{k+1})+\big\langle\Delta_{k}+\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1}),\,-\gamma_{k}(\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})+w_{k})\big\rangle\\
&\quad+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})+w_{k}\|^{2}\\
&\leq\mathcal{L}(x_{k},y_{k+1})-\gamma_{k}\big(1-\tfrac{L_{xx}\gamma_{k}}{2}\big)\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}+\gamma_{k}\|\Delta_{k}\|\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|\\
&\quad-\gamma_{k}\langle w_{k},\nabla_{x}\mathcal{L}(x_{k},y_{k+1})\rangle+L_{xx}\gamma_{k}^{2}\langle w_{k},\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\rangle+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|w_{k}\|^{2}.
\end{aligned}$$

$$\begin{aligned}
\mathcal{L}(x_{k+1},y_{k+1})
&\leq\mathcal{L}(x_{k},y_{k+1})-\gamma_{k}\big(1-\tfrac{L_{xx}\gamma_{k}}{2}\big)\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}\\
&\quad+L_{xx}(1-\alpha_{k})\gamma_{k}\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|\|\tilde{x}_{k}-x_{k}\|+E_{k}^{x}+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|w_{k}\|^{2}\\
&\leq\mathcal{L}(x_{k},y_{k+1})-\gamma_{k}\big(1-\tfrac{L_{xx}\gamma_{k}}{2}\big)\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}\\
&\quad+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}+\tfrac{L_{xx}(1-\alpha_{k})^{2}}{2}\|\tilde{x}_{k}-x_{k}\|^{2}+E_{k}^{x}+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|w_{k}\|^{2}\\
&=\mathcal{L}(x_{k},y_{k+1})-\gamma_{k}(1-L_{xx}\gamma_{k})\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}\\
&\quad+\tfrac{L_{xx}(1-\alpha_{k})^{2}}{2}\|\tilde{x}_{k}-x_{k}\|^{2}+E_{k}^{x}+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|w_{k}\|^{2},
\end{aligned}$$

$$\begin{aligned}
\|\tilde{x}_{k+1}-x_{k+1}\|^{2}
&=\Big\|\Gamma_{k}\sum_{\tau=0}^{k}\tfrac{\alpha_{\tau}}{\Gamma_{\tau}}\Big[\Big(\tfrac{\gamma_{\tau}-\lambda_{\tau}}{\alpha_{\tau}}\Big)(\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})+w_{\tau})\Big]\Big\|^{2}\\
&\leq\Gamma_{k}\sum_{\tau=0}^{k}\tfrac{\alpha_{\tau}}{\Gamma_{\tau}}\Big\|\Big(\tfrac{\gamma_{\tau}-\lambda_{\tau}}{\alpha_{\tau}}\Big)(\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})+w_{\tau})\Big\|^{2}\\
&=\Gamma_{k}\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}\|\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})+w_{\tau}\|^{2}.
\end{aligned}$$

$$\begin{aligned}
\mathcal{L}(x_{k+1},y_{k+1})
&\leq\mathcal{L}(x_{k},y_{k+1})-\gamma_{k}(1-L_{xx}\gamma_{k})\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}\\
&\quad+E_{k}^{x}+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|w_{k}\|^{2}+\tfrac{L_{xx}\Gamma_{k-1}(1-\alpha_{k})^{2}}{2}\times\sum_{\tau=0}^{k-1}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}\|\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})+w_{\tau}\|^{2}\\
&\leq\mathcal{L}(x_{k},y_{k+1})-\gamma_{k}(1-L_{xx}\gamma_{k})\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}+E_{k}^{x}+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|w_{k}\|^{2}\\
&\quad+\tfrac{L_{xx}\Gamma_{k}}{2}\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}\|\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})\|^{2}\\
&\quad+\tfrac{L_{xx}\Gamma_{k}}{2}\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}\|w_{\tau}\|^{2}+\tfrac{L_{xx}\Gamma_{k-1}(1-\alpha_{k})^{2}}{2}\times\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}w_{\tau}^{T}\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1}).
\end{aligned}$$

$$\begin{aligned}
&\sum_{k=0}^{T-1}\big(\mathcal{L}(x_{k+1},y_{k+1})-\mathcal{L}(x_{k},y_{k+1})\big)\\
&\leq-\sum_{k=0}^{T-1}\gamma_{k}(1-L_{xx}\gamma_{k})\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}+\sum_{k=0}^{T-1}\tfrac{L_{xx}\Gamma_{k}}{2}\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}\|\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})\|^{2}\\
&\quad+\sum_{k=0}^{T-1}(\Xi_{k}+\zeta_{k})\\
&=\tfrac{L_{xx}}{2}\sum_{k=0}^{T-1}\tfrac{(\gamma_{k}-\lambda_{k})^{2}}{\Gamma_{k}\alpha_{k}}\Big(\sum_{\tau=k}^{T-1}\Gamma_{\tau}\Big)\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}-\sum_{k=0}^{T-1}\gamma_{k}C_{k}\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}+\sum_{k=0}^{T-1}(\Xi_{k}+\zeta_{k}).
\end{aligned}$$

$$\begin{aligned}
&\sum_{k=0}^{T-1}\big(\mathcal{L}(x_{k+1},y_{k+1})-\mathcal{L}(x_{k},y_{k+1})\big)\\
&\leq-\sum_{k=0}^{T-1}\gamma_{k}C_{k}\mu\big(\mathcal{L}(z_{k+1},y_{k+1})-\mathcal{L}(x^{\ast}(y_{k+1}),y_{k+1})\big)-\sum_{k=0}^{T-1}\tfrac{\gamma_{k}C_{k}}{2}\|\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\|^{2}\\
&\quad+\sum_{k=0}^{T-1}(\Xi_{k}+\zeta_{k}).
\end{aligned}$$

$$\sum_{k=0}^{T-1}\big(\mathcal{L}(x_{k+1},y)-\mathcal{L}(x_{k},y)\big)$$

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods

Full text

Accelerated Primal-dual Scheme for a Class of Stochastic Nonconvex-concave Saddle Point Problems

Morteza Boroun, Zeinab Alizadeh, Afrooz Jalilzadeh

Department of Systems and Industrial Engineering, The University of Arizona, Tucson, AZ, USA. {morteza, zalizadeh, afrooz}@arizona.edu

Abstract

Stochastic nonconvex-concave min-max saddle point problems appear in many machine learning and control problems including distributionally robust optimization, generative adversarial networks, and adversarial learning. In this paper, we consider a class of nonconvex saddle point problems where the objective function satisfies the Polyak-Łojasiewicz condition with respect to the minimization variable and it is concave with respect to the maximization variable. The existing methods for solving nonconvex-concave saddle point problems often suffer from slow convergence and/or contain multiple loops. Our main contribution lies in proposing a novel single-loop accelerated primal-dual algorithm with new convergence rate results appearing for the first time in the literature, to the best of our knowledge. In particular, in the stochastic regime, we demonstrate a convergence rate of $\mathcal{O}(\epsilon^{-4})$ to find an $\epsilon$ -gap solution which can be improved to $\mathcal{O}(\epsilon^{-2})$ in deterministic setting.

1 Introduction

In this paper, we consider the following min-max saddle point (SP) game:

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\ \Phi(x,y)\triangleq\mathcal{L}(x,y)-h(y),\tag{1}$$

where $\mathcal{X}=\mathbb{R}^{n}$, $\mathcal{Y}=\mathbb{R}^{m}$, $\mathcal{L}(x,y)=\mathbb{E}[\mathcal{L}(x,y;\xi)]$, $\xi$ is a random vector, $\mathcal{L}(\cdot,y)$ is potentially nonconvex for any $y\in\mathcal{Y}$ and satisfies the Polyak-Łojasiewicz (PL) condition (see Definition 1), $\mathcal{L}(x,\cdot)$ is concave for any $x\in\mathcal{X}$, and $h(\cdot)$ is convex and possibly nonsmooth. Our goal is to develop an algorithm to find a first-order stationary point of this SP problem.

Recent emerging applications in machine learning and control have further stimulated a surge of interest in these problems. Examples that can be formulated as (1) include generative adversarial networks (GANs) [GBC16], fair classification [NSH⁺19], communications [ABR21, BKR19], and wireless systems [CL11b, FP09]. Convex-concave saddle point problems have been extensively studied in the literature [CP16, HA21]. However, recent applications in machine learning and control may involve nonconvexity. One class of nonconvex-concave min-max problems arises when the objective function satisfies the PL condition, and this is the class we study in this paper. Next, we provide two examples that can be formulated as problem (1) and satisfy the PL condition.

Example 1 (Generative adversarial imitation learning).

One practical example of a PL-game is generative adversarial imitation learning of linear quadratic regulators (LQR). Imitation learning techniques aim to mimic human behavior by observing an expert demonstrating a given task [HGEJ17]. Generative adversarial imitation learning (GAIL), studied in [HE16], solves imitation learning via min-max optimization. Let $K$ represent the choice of the policy and $K_{E}$ the expert policy, and let the cost parameter and the expected cumulative cost for a given policy $K$ be denoted by $\theta=(Q,R)$ and $C(K,\theta)$, respectively. The problem of GAIL for LQR can be formulated [CHCW19] as $\min_{K}\max_{\theta\in\Theta}m(K,\theta),$ where $m(K,\theta)=C(K,\theta)-C(K_{E},\theta)$, $Q\in\mathbb{R}^{d\times d}$, $R\in\mathbb{R}^{k\times k}$, and $\Theta\triangleq\{(Q,R)\mid\alpha_{Q}I\preceq Q\preceq\beta_{Q}I,\alpha_{R}I\preceq R\preceq\beta_{R}I\}$. It is known that $m$ satisfies the PL condition in $K$ [NSH⁺19]. This problem is a special case of (1) with $h(\theta)=\mathbb{I}_{\Theta}(\theta)$, where $\mathbb{I}_{\Theta}$ denotes the indicator function of the set $\Theta$.

Example 2 (Distributionally robust optimization).

Define $\ell_{i}(x)=\ell(x,\xi_{i})$, where $\ell:\mathcal{X}\times\Omega\to\mathbb{R}$ is a possibly nonconvex loss function and $\Omega=\{\xi_{1},\ldots,\xi_{n}\}$. Distributionally robust optimization (DRO) studies worst-case performance under uncertainty to find solutions with some specific confidence level [ND16]. DRO can be formulated as $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\sum_{i=1}^{n}y_{i}\ell_{i}(x),$ where $\mathcal{Y}$ represents the uncertainty set, e.g., $\mathcal{Y}=\{y\in\mathbb{R}^{m}_{+}\mid y\geq\delta/n,\ V(y,\tfrac{1}{n}\mathbf{1}_{n})\leq\rho\}$ is an uncertainty set considered in [ND16] and $V(Q,P)$ denotes the divergence measure between two sets of probability measures $Q$ and $P$. As shown in [GYYY20], DRO for deep learning with the ReLU activation function satisfies the PL condition in an $\epsilon$-neighborhood around a randomly initialized point. This problem is a special case of (1) with $h(y)=\mathbb{I}_{\mathcal{Y}}(y)$.

One natural way to solve problem (1) is to take two simultaneous or sequential steps: one that reduces the objective function $\Phi(\cdot,y)$ for a given $y$, and one that increases $\Phi(x,\cdot)$ for a given $x$. One of the most famous algorithms of this kind is gradient descent-ascent (GDA) [NO09]. It has been observed that such a naive approach leads to poor performance and may even diverge for simple problems. One way to resolve this issue is to add a momentum term based on the gradient of the objective function. Although this approach leads to an optimal convergence rate [HA21, Zha21], it may not be directly applicable in the nonconvex-concave setting. Therefore, we aim to develop a novel primal-dual algorithm with acceleration in the primal update as well as a new momentum in the dual update.
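The divergence of naive simultaneous GDA can be seen on a toy bilinear problem. The sketch below is an illustration under assumptions that are not taken from the paper: the objective $\Phi(x,y)=xy$, the step size `eta`, and the function name `simultaneous_gda` are all chosen here for the example. Each simultaneous step multiplies the distance to the saddle point $(0,0)$ by $\sqrt{1+\eta^{2}}$, so the iterates spiral outward instead of converging.

```python
import math


def simultaneous_gda(x0, y0, eta, num_steps):
    """Run simultaneous gradient descent-ascent on Phi(x, y) = x * y.

    Illustrative toy problem (not from the paper); the saddle point is (0, 0).
    """
    x, y = x0, y0
    for _ in range(num_steps):
        grad_x = y  # d/dx (x * y)
        grad_y = x  # d/dy (x * y)
        # Both variables are updated from the same (old) iterate.
        x, y = x - eta * grad_x, y + eta * grad_y
    return x, y


eta = 0.1
num_steps = 100
x_final, y_final = simultaneous_gda(1.0, 1.0, eta=eta, num_steps=num_steps)

start_norm = math.hypot(1.0, 1.0)
final_norm = math.hypot(x_final, y_final)
# Each step scales the squared norm by (1 + eta^2), so the norm grows geometrically.
predicted_norm = start_norm * (1.0 + eta ** 2) ** (num_steps / 2)
print(final_norm, predicted_norm)
```

The growth factor does not depend on the starting point, so shrinking `eta` only slows the divergence; it does not remove it.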

1.1 Related Works

Nonconvex-concave SP problem. Various algorithms have been proposed for solving nonconvex-concave SP problems due to their applicability in many modern machine learning problems. The existing methods can be categorized into two types: multi-loop and single-loop. In multi-loop algorithms [KM21, OLR21] one variable is updated in a few consecutive iterations until a certain condition is satisfied before another variable gets updated. Such methods are often difficult to implement in practice because the termination of the inner loop has a high impact on the overall complexity of such algorithms, and selecting a conservative criterion may lead to a high computational cost while an inadequate number of inner iterations may lead to poor performance. Therefore, there have been some recent efforts [LTHC20, ZXSL20, XZXL22] to design and analyze single-loop algorithms to solve nonconvex-concave problems. In particular, a convergence rate of $\mathcal{O}(\epsilon^{-4})$ has been obtained for the aforementioned single-loop algorithms. Authors in [ZXSL20] were able to improve the rate to $\mathcal{O}(\epsilon^{-2})$ for a special case of nonconvex-concave problem, i.e., $\min_{x}\max_{y\in Y}f(x)^{T}y$ , where $Y$ is a probability simplex. There are also several studies [RLLY18, LJJ20, ZAG22] in the stochastic regime. See Table 1 for more details.

PL condition. Rate results for nonconvex-concave problems can be improved for a class of problems where the objective function satisfies the PL condition. Recently, nonconvex-PL SP problems have been studied in [NSH⁺19, ALD21] and [YOLH22] assuming that the objective satisfies a one-sided PL condition. Multi-loop algorithms [NSH⁺19, ALD21] find an $\epsilon$-first-order stationary point of the problem within $\tilde{\mathcal{O}}(\epsilon^{-2})$ iterations, where $\tilde{\mathcal{O}}(\cdot)$ denotes $\mathcal{O}(\cdot)$ up to a logarithmic factor. The same rate has been achieved in [FRM⁺21] and [YOLH22] for single-loop schemes. More recently, to guarantee global convergence, Yang et al. [YKH20] proposed an alternating gradient descent ascent algorithm with a linear convergence rate for SP problems where the objective satisfies a two-sided PL condition. Moreover, a convergence rate of $\mathcal{O}(\epsilon^{-1})$ has been shown for the stochastic regime under the two-sided PL condition. Subsequently, Guo et al. [GYYY20] improved the dependency of the convergence rate on the condition number (the ratio of the smoothness parameter to the PL constant).

1.2 Contributions

The existing methods for solving nonconvex-concave SP problems often suffer from slow convergence and/or contain multiple loops. We propose a novel single-loop accelerated primal-dual algorithm with convergence rate results for PL-games that, to the best of our knowledge, appear for the first time in the literature. Our main contributions are summarized as follows: (i) We propose an accelerated primal-dual scheme to solve problem (1). The main idea is to combine an accelerated step in the primal variable with a dual step involving a momentum in terms of the gradient of the objective function. (ii) In the stochastic setting, using acceleration with mini-batch sample gradients, our method achieves an oracle complexity (number of sample-gradient calls) of $\mathcal{O}(\epsilon^{-4})$. (iii) In the deterministic regime, we demonstrate a convergence guarantee of $\mathcal{O}(\epsilon^{-2})$ to find an $\epsilon$-stationary solution. To the best of our knowledge, this is the best-known rate for SP problems satisfying a one-sided PL condition.

2 Preliminaries

First we define some important notations.

Notations. $\|x\|$ denotes the Euclidean vector norm, i.e., $\|x\|=\sqrt{x^{T}x}$. $\mbox{prox}_{g}(y)$ denotes the proximal operator with respect to $g$ at $y$, i.e., $\mbox{prox}_{g}(y)\triangleq\mbox{argmin}_{x}\{\tfrac{1}{2}\|x-y\|^{2}+g(x)\}$. $\mathbb{E}[x]$ is used to denote the expectation of a random variable $x$. We define $x^{*}(y)\triangleq\mbox{argmin}_{x}\mathcal{L}(x,y)$. Given the mini-batch samples $\mathcal{U}=\{\xi^{i}\}_{i=1}^{b}$ and $\mathcal{V}=\{\bar{\xi}^{i}\}_{i=1}^{b}$, we let $\nabla_{x}\mathcal{L}_{\mathcal{U}}(x,y)={1\over b}\sum_{i=1}^{b}\nabla_{x}\mathcal{L}(x,y;\xi^{i})$ and $\nabla_{y}\mathcal{L}_{\mathcal{V}}(x,y)={1\over b}\sum_{i=1}^{b}\nabla_{y}\mathcal{L}(x,y;\bar{\xi}^{i})$. We define the $\sigma$-algebras $\mathcal{H}_{k}=\{\mathcal{U}_{1},\mathcal{V}_{1},\mathcal{U}_{2},\mathcal{V}_{2},\ldots,\mathcal{U}_{k-1},\mathcal{V}_{k-1}\}$ and $\mathcal{F}_{k}=\{\mathcal{H}_{k}\cup\mathcal{V}_{k}\}$.
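Since both motivating examples take $h$ to be the indicator function of a set, the proximal operator reduces in that case to a Euclidean projection. The following sketch illustrates this for a box constraint; the box bounds `lower` and `upper` and the helper name `prox_box_indicator` are assumptions made for this example and are not objects defined in the paper.

```python
import numpy as np


def prox_box_indicator(y, lower, upper):
    """prox of h = indicator of the box [lower, upper]^n, evaluated at y.

    argmin_x { 0.5 * ||x - y||^2 + h(x) } is the projection of y onto the box,
    which is a componentwise clip.
    """
    return np.clip(y, lower, upper)


y = np.array([2.0, -3.0, 0.5])
p = prox_box_indicator(y, lower=-1.0, upper=1.0)
print(p)

# Sanity check of the argmin definition against random feasible points:
# no point inside the box should achieve a smaller objective value than p.
rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(1000, 3))
best_candidate_value = np.min(0.5 * np.sum((candidates - y) ** 2, axis=1))
prox_value = 0.5 * np.sum((p - y) ** 2)
```

For other choices of $h$ (for example, the constrained sets $\Theta$ and $\mathcal{Y}$ in the examples above) the projection is generally not a simple clip and needs its own routine.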

Now we briefly highlight a few aspects of the PL condition [Pol63] that differentiate it from convexity and make it a more relevant and appealing setting for many machine learning applications. For the unconstrained minimization problem $\min_{x\in\mathbb{R}^{n}}f(x)$, we say that a function satisfies the PL inequality if for some $\mu>0$, ${1\over 2}\|\nabla f(x)\|^{2}\geq\mu(f(x)-f(x^{*}))$ for all $x\in\mathbb{R}^{n}$. To verify the PL condition, we need access to the value of the objective function and the norm of the gradient, which are often tractable and can be estimated from a sub-sample of the data. In contrast, verifying convexity requires estimating the minimum eigenvalue of the Hessian matrix. Moreover, the norm of the gradient is much more resilient to perturbations of the objective function than the smallest eigenvalue of the Hessian [BBM18].

The PL condition does not require strong convexity or even convexity of the objective function. It has been shown to hold for different classes of problems; for instance, conditions like the restricted secant inequality [ZY13] and one-point convexity [AZ18] are special cases of the PL condition. Problems satisfying such conditions include dictionary learning [AGMM15], neural networks [LY17] and phase retrieval [CC15], to name a few. In this paper, we consider a min-max SP problem and we assume that the objective function satisfies a one-sided PL inequality.

Definition 1.

A continuously differentiable function $\mathcal{L}(x,y)$ satisfies the one-sided PL condition if there exists a constant $\mu>0$ such that ${1\over 2}\|\nabla_{x}\mathcal{L}(x,y)\|^{2}\geq\mu(\mathcal{L}(x,y)-\mathcal{L}(x^{*}(y),y)),$ for all $x\in\mathcal{X},y\in\mathcal{Y},$ where $\mathcal{L}(x^{*}(y),y)=\min_{x}\mathcal{L}(x,y)$.
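To make the PL inequality concrete, the sketch below checks it numerically for the one-dimensional function $f(x)=x^{2}+3\sin^{2}(x)$, a commonly cited example of a nonconvex function satisfying the PL inequality. The function, the constant `mu = 1/32`, and the grid are assumptions of this illustration rather than material from the paper, and a grid check is evidence, not a proof.

```python
import numpy as np


def f(x):
    """Nonconvex test function x^2 + 3 sin^2(x); its global minimum is f(0) = 0."""
    return x ** 2 + 3.0 * np.sin(x) ** 2


def grad_f(x):
    """Derivative of f: 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x)."""
    return 2.0 * x + 3.0 * np.sin(2.0 * x)


def second_derivative_f(x):
    """Second derivative of f: 2 + 6 cos(2x)."""
    return 2.0 + 6.0 * np.cos(2.0 * x)


mu = 1.0 / 32.0  # assumed PL constant for this example
f_star = 0.0
grid = np.linspace(-10.0, 10.0, 20001)

# PL inequality: 0.5 * |f'(x)|^2 >= mu * (f(x) - f*) at every grid point.
pl_slack = 0.5 * grad_f(grid) ** 2 - mu * (f(grid) - f_star)
pl_holds_on_grid = bool(np.all(pl_slack >= -1e-12))

# Nonconvexity: the second derivative is negative somewhere, e.g. at x = pi / 2.
is_nonconvex = bool(second_derivative_f(np.pi / 2.0) < 0.0)
print(pl_holds_on_grid, is_nonconvex)
```

In the one-sided setting of Definition 1, the same check would be applied to $x\mapsto\mathcal{L}(x,y)$ for each fixed $y$, with one constant $\mu$ valid for all $y$.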

Now we state our main assumptions.

Assumption 1.

(i) The solution set of problem (1) is nonempty; (ii) Function $h(y)$ is convex and possibly nonsmooth; (iii) $\mathcal{L}(x,y)$ is continuously differentiable satisfying one-sided PL condition and $\mathcal{L}(x,\cdot)$ is concave for any $x\in\mathcal{X}$ .

Assumption 2.

$\nabla_{x}\mathcal{L}$ is Lipschitz continuous, i.e., there exist $L_{xx}\geq 0$ and $L_{xy}\geq 0$ such that $\|\nabla_{x}\mathcal{L}(x,y)-\nabla_{x}\mathcal{L}(\bar{x},\bar{y})\|\leq{L_{xx}}\|x-\bar{x}\|+{L_{xy}}\|y-\bar{y}\|.$ Moreover, $\mathcal{L}(x,y)$ is linear in terms of $y$.

Note that Assumption 2 implies that

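$$\mathcal{L}(x,y)-\mathcal{L}(\bar{x},y)-\langle\nabla_{x}\mathcal{L}(\bar{x},y),x-\bar{x}\rangle\leq\tfrac{L_{xx}}{2}\|x-\bar{x}\|^{2}.$$

In the stochastic setting, we assume that the sample gradients satisfy the following standard conditions.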

Assumption 3.

Each component function $\mathcal{L}(x,y;\xi)$ has unbiased stochastic gradients with bounded variance:

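$$\mathbb{E}[\nabla_{x}\mathcal{L}(x,y;\xi)\mid\mathcal{F}_{k}]=\nabla_{x}\mathcal{L}(x,y),\quad\mathbb{E}[\nabla_{y}\mathcal{L}(x,y;\xi)\mid\mathcal{H}_{k}]=\nabla_{y}\mathcal{L}(x,y),$$

$$\mathbb{E}[\|\nabla_{x}\mathcal{L}(x,y;\xi)-\nabla_{x}\mathcal{L}(x,y)\|^{2}]\leq\nu_{x}^{2},\quad\mathbb{E}[\|\nabla_{y}\mathcal{L}(x,y;\xi)-\nabla_{y}\mathcal{L}(x,y)\|^{2}]\leq\nu_{y}^{2}.$$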

3 Primal-Dual Method with Momentum

In this section, we propose a primal-dual algorithm with momentum (PDM) for deterministic PL-concave problems. The details of the method can be seen in Algorithm 1. Then, we introduce stochastic PDM (SPDM) for the stochastic setting (see Algorithm 2).

Algorithm 1 consists of single-loop primal-dual steps. After initialization of the parameters, at each iteration $k\geq 0$, a proximal gradient ascent step for the variable $y$ is taken in the direction of $\nabla_{y}\mathcal{L}$ with an additive momentum term $q_{k}$. Such a momentum is an algorithmic approach to gain acceleration for solving PL-concave problems. Finally, after computing the gradient $\nabla_{x}\mathcal{L}$ at $(z_{k+1},y_{k+1})$, two gradient descent steps for the variable $x$ are taken to generate $x_{k+1}$ and $\tilde{x}_{k+1}$, which are then merged through a convex combination in the next iteration.

Remark 1.

If we let $\lambda_{k}=\alpha_{k}\gamma_{k}$, then the primal step in Algorithm 1 is similar to one of the variants of Nesterov's acceleration (see [Nes03] and [GL16]). Moreover, when $\lambda_{k}=\gamma_{k}$ it can be shown that $z_{k+1}=x_{k}$ and $x_{k+1}=\tilde{x}_{k+1}$, which is similar to a gradient descent step for the minimization variable.
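Algorithm 1 itself is not reproduced in this text, so the following is only a sketch of the primal updates as they can be inferred from the description above: $z_{k+1}$ as a convex combination of $\tilde{x}_{k}$ and $x_{k}$, followed by two gradient steps with step sizes $\gamma_{k}$ and $\lambda_{k}$. The exact update formulas, the function name `primal_step`, and the quadratic test objective are assumptions of this illustration. Under these assumptions, the code checks the second claim of Remark 1 for $\lambda_{k}=\gamma_{k}$ and $\tilde{x}_{0}=x_{0}$.

```python
import numpy as np


def primal_step(x, x_tilde, grad_fn, alpha, gamma, lam):
    """One assumed primal update: returns (z_next, x_next, x_tilde_next).

    Hypothetical reconstruction, not a verbatim copy of Algorithm 1:
      z_next       = (1 - alpha) * x_tilde + alpha * x
      x_next       = x - gamma * grad_fn(z_next)
      x_tilde_next = z_next - lam * grad_fn(z_next)
    """
    z_next = (1.0 - alpha) * x_tilde + alpha * x
    g = grad_fn(z_next)
    x_next = x - gamma * g
    x_tilde_next = z_next - lam * g
    return z_next, x_next, x_tilde_next


# Toy smooth objective 0.5 * x^T A x with y held fixed (illustration only).
A = np.array([[2.0, 0.5], [0.5, 1.0]])


def grad_fn(x):
    return A @ x


x = np.array([1.0, -2.0])
x_tilde = x.copy()  # both sequences start from the same point
gamma = lam = 0.1   # the case lambda_k = gamma_k from Remark 1

z_matches_x = True
x_matches_x_tilde = True
for k in range(20):
    alpha = 2.0 / (k + 2)  # any alpha in (0, 1] works for this check
    z_next, x_next, x_tilde_next = primal_step(x, x_tilde, grad_fn, alpha, gamma, lam)
    z_matches_x = z_matches_x and bool(np.allclose(z_next, x))
    x_matches_x_tilde = x_matches_x_tilde and bool(np.allclose(x_next, x_tilde_next))
    x, x_tilde = x_next, x_tilde_next

print(z_matches_x, x_matches_x_tilde)
```

With $\lambda_{k}\neq\gamma_{k}$ the two sequences separate, and the gap $\tilde{x}_{k+1}-x_{k+1}$ accumulates the terms $(\gamma_{\tau}-\lambda_{\tau})$ times the gradient, which is the quantity the convergence analysis has to control.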

For the stochastic setting, SPDM is proposed in Algorithm 2, where the main steps of the algorithm are similar to those of Algorithm 1. The main difference is that instead of computing the exact gradient, we estimate the gradient of the function by drawing mini-batch samples $\mathcal{U}_{k}$ and $\mathcal{V}_{k}$ in Step 4.

4 Convergence Analysis

In this section, we study the convergence properties of Algorithms 1 and 2 for stochastic (and also deterministic) settings. All related proofs are provided in the appendix. Our goal is to find a first-order stationary point of problem (1). For a given positive $\epsilon$, we define a point $(x,y)$ as an $\epsilon$-stationary solution of problem (1) if $\|\nabla_{x}\Phi(x,y)\|\leq\epsilon$ and $\nabla_{y}\mathcal{L}(x,y)\in\partial h(y)+\mathcal{B}(0,r\epsilon)$ for some $r>0$.

For our analysis, for all $k\in\{0,\ldots,T-1\}$, define $C_{k}$ as:

$$C_{k}\triangleq 1-L_{xx}\gamma_{k}-\frac{L_{xx}(\gamma_{k}-\lambda_{k})^{2}}{2\alpha_{k}\Gamma_{k}\gamma_{k}}\Big(\sum_{\tau=k}^{T-1}\Gamma_{\tau}\Big)\geq 0,$$

where $\Gamma_{k}\triangleq\begin{cases}1&k=0\\ (1-\alpha_{k})\Gamma_{k-1}&k\geq 1\\ \end{cases}.$

Remark 2.

By choosing $\alpha_{k}=\tfrac{2}{k+1}$ , $\lambda_{k}=\tfrac{1}{2L_{xx}}$ and $\gamma_{k}\in[\lambda_{k},(1+\alpha_{k}/4)\lambda_{k}]$ for any $k\geq 0$ , from definition of $C_{k}$ , one can show that $C_{k}\geq 11/32$ (see [GL16]).
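The bound in Remark 2 can be checked numerically. Note that with the indexing as written ($\alpha_{k}=\tfrac{2}{k+1}$ for $k\geq 0$), $\alpha_{1}=1$ would give $\Gamma_{1}=0$ and leave $C_{k}$ undefined, so the sketch below assumes the shifted choice $\alpha_{k}=\tfrac{2}{k+2}$, which appears to correspond to the 1-based indexing of [GL16]. That shift, the function name `compute_C`, and the values of `T` and `L_xx` are assumptions of this example.

```python
import numpy as np


def compute_C(T, L_xx, gamma_scale=1.0):
    """Return the array (C_0, ..., C_{T-1}) for the step sizes of Remark 2.

    Assumption of this sketch: alpha_k = 2 / (k + 2) for k = 0, ..., T-1, so that
    alpha_0 = 1 and Gamma_k > 0 for every k.
    gamma_scale in [0, 1] picks gamma_k = (1 + gamma_scale * alpha_k / 4) * lambda_k.
    """
    k = np.arange(T)
    alpha = 2.0 / (k + 2.0)
    lam = np.full(T, 1.0 / (2.0 * L_xx))
    gamma = (1.0 + gamma_scale * alpha / 4.0) * lam

    # Gamma_0 = 1 and Gamma_k = (1 - alpha_k) * Gamma_{k-1} for k >= 1.
    Gamma = np.ones(T)
    for i in range(1, T):
        Gamma[i] = (1.0 - alpha[i]) * Gamma[i - 1]

    # tail_sum[k] = sum_{tau=k}^{T-1} Gamma_tau.
    tail_sum = np.cumsum(Gamma[::-1])[::-1]

    penalty = L_xx * (gamma - lam) ** 2 / (2.0 * alpha * Gamma * gamma) * tail_sum
    return 1.0 - L_xx * gamma - penalty


C_upper = compute_C(T=200, L_xx=3.0, gamma_scale=1.0)  # gamma_k at the top of its interval
C_lower = compute_C(T=200, L_xx=3.0, gamma_scale=0.0)  # gamma_k = lambda_k
print(C_upper.min(), C_lower.min())
```

When $\gamma_{k}=\lambda_{k}$ the last term of $C_{k}$ vanishes and $C_{k}=1-L_{xx}\lambda_{k}=\tfrac{1}{2}$; the upper end of the interval for $\gamma_{k}$ is the case in which the bound $11/32$ is tightest.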

Now we establish the convergence rate of SPDM for solving stochastic PL-concave SP problem (1). In Algorithm 2, to estimate the gradient of the function, we draw mini-batch samples $\mathcal{U}_{k}$ and $\mathcal{V}_{k}$ at each iteration, where $|\mathcal{U}_{k}|=|\mathcal{V}_{k}|=b$ .

Theorem 1.

Let $\{x_{k},y_{k},z_{k}\}_{{k}\geq 0}$ be generated by Algorithm 2 and suppose Assumptions 1, 2 and 3 hold. Moreover, let $\sigma_{k}=\tfrac{\mu}{36L^{2}_{xy}}$, $\alpha_{k}=\tfrac{2}{k+1}$, $\lambda_{k}=\tfrac{1}{2L_{xx}}$ and $\gamma_{k}\in[\lambda_{k},(1+\alpha_{k}/4)\lambda_{k}]$ for any $k\geq 0$ and $b=T$. Then, there exists an iteration $k\in\{0,\ldots,T\}$ such that $(z_{k},y_{k})$ is an $\epsilon$-stationary point of problem (1), which can be obtained within $\mathcal{O}(\epsilon^{-4})$ evaluations of sample gradients.
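As a rough accounting of where the $\mathcal{O}(\epsilon^{-4})$ figure comes from, assume (as the deterministic rate in Theorem 2 suggests) that $T=\mathcal{O}(\epsilon^{-2})$ iterations suffice, and that each iteration draws the two mini-batches $\mathcal{U}_{k}$ and $\mathcal{V}_{k}$ of size $b=T$. This is a sketch of the counting argument under those assumptions, not a substitute for the proof in the appendix:

$$\underbrace{T}_{\text{iterations}}\times\underbrace{2b}_{\text{sample gradients per iteration}}=2T^{2}=\mathcal{O}\big((\epsilon^{-2})^{2}\big)=\mathcal{O}(\epsilon^{-4}).$$

The choice $b=T$ makes the mini-batch variance $\nu^{2}/b$ decay at the same $1/T$ rate as the deterministic part of the bound, which is why the batch size grows with the iteration budget.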

Consider function $\mathcal{L}$ in problem (1) to be deterministic, i.e. exact gradients $\nabla_{x}\mathcal{L}$ and $\nabla_{y}\mathcal{L}$ are available. We show that the convergence rate can be improved to $\mathcal{O}(\epsilon^{-2})$ .

Theorem 2.

Let $\{x_{k},y_{k},z_{k}\}_{{k}\geq 0}$ be generated by Algorithm 1 and suppose Assumptions 1 and 2 hold. Choosing the parameters as in Theorem 1, there exists an iteration $k\in\{0,\ldots,T\}$ such that $(z_{k},y_{k})$ is an $\epsilon$-stationary point, which can be obtained within $\mathcal{O}(\epsilon^{-2})$ evaluations of the gradients.

The proof for the deterministic setting, i.e., Theorem 2, is similar to that of Theorem 1, by letting $\nu_{x}=\nu_{y}=0$.

5 Numerical Results

Generative Adversarial Imitation Learning. In this section, we implement our method to solve the GAIL problem described in Example 1. The code used in our experiment was adapted from an existing implementation developed by [YKH20]. To validate the efficiency of the proposed scheme, we compare the PDM algorithm with alternating gradient descent ascent (AGDA) [YKH20], Smoothed-GDA [ZXSL20], and AGP [XZXL22]. The optimal control problem for LQR can be formulated as follows [CHCW19]:

$$\begin{aligned}
\underset{\pi_{t}}{\text{minimize}}\quad&\mathbb{E}\Big[\sum_{t=0}^{\infty}x_{t}^{\top}Qx_{t}+u_{t}^{\top}Ru_{t}\Big]\\
\text{subject to}\quad&x_{t+1}=Ax_{t}+Bu_{t},\quad u_{t}=\pi_{t}(x_{t}),\quad x_{0}\sim\mathbb{D}_{0},
\end{aligned}\tag{4}$$

where $Q\in\mathbb{R}^{d\times d}$, $R\in\mathbb{R}^{k\times k}$ are both positive definite matrices, $A\in\mathbb{R}^{d\times d}$, $B\in\mathbb{R}^{d\times k}$, $u_{t}\in\mathbb{R}^{k}$ is a control, $x_{t}\in\mathbb{R}^{d}$ is a state, $\pi_{t}$ is a policy, and $\mathbb{D}_{0}$ is a given initial distribution. In the infinite-horizon setting with a stochastic initial state $x_{0}\sim\mathbb{D}_{0}$, the optimal control input can be written as a linear function $u_{t}=-K^{*}x_{t}$, where $K^{*}\in\mathbb{R}^{k\times d}$ is the policy and does not depend on $t$. We denote the expected cumulative cost in (4) by $C(K,\theta)$, where $\theta=(Q,R)$. To estimate the expected cumulative cost, we sample $n$ initial points $x_{0}^{(1)},\ldots,x_{0}^{(n)}$ and estimate $C(K,\theta)$ using the sample average: $C_{n}(K;\theta):={1\over n}\sum_{i=1}^{n}\left[\sum_{t=0}^{\infty}x_{t}^{\top}Qx_{t}+u_{t}^{\top}Ru_{t}\right]_{x_{0}=x_{0}^{(i)}}.$
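The sample-average cost $C_{n}(K;\theta)$ can be approximated by simulating the closed-loop system from each sampled initial state. The sketch below is an illustration only: the infinite sum is truncated at a finite `horizon`, and the function name `lqr_sample_cost` and the scalar test system are assumptions of this example rather than the implementation used in the experiments.

```python
import numpy as np


def lqr_sample_cost(K, Q, R, A, B, x0_samples, horizon=500):
    """Sample-average estimate of C_n(K; theta) with u_t = -K x_t.

    The infinite sum is truncated at `horizon` steps (an assumption of this sketch);
    this is only meaningful when the closed loop A - B K is stable.
    """
    total = 0.0
    for x0 in x0_samples:
        x = np.asarray(x0, dtype=float)
        for _ in range(horizon):
            u = -K @ x
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u
    return total / len(x0_samples)


# Scalar check: A = 0.5, B = 1, K = 0 gives x_t = 0.5^t x_0,
# so the cost for one initial state is Q x_0^2 / (1 - 0.25).
A = np.array([[0.5]])
B = np.array([[1.0]])
Q = np.array([[2.0]])
R = np.array([[1.0]])
K = np.array([[0.0]])
x0_samples = [np.array([1.0]), np.array([3.0])]

estimate = lqr_sample_cost(K, Q, R, A, B, x0_samples)
closed_form = np.mean([2.0 * x0[0] ** 2 / (1.0 - 0.25) for x0 in x0_samples])
print(estimate, closed_form)
```

In practice the truncation horizon would be chosen from the spectral radius of $A-BK$; for an unstable closed loop the true cost is infinite and the truncated sum is not a meaningful estimate.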

In GAIL for LQR, the goal is to learn the cost function parameters $Q$ and $R$ from the expert after the trajectories induced by an expert policy $K_{E}$ are observed. Hence, the min-max formulation of the imitation learning problem is $\min_{K}\max_{\theta\in\Theta}\ m_{n}(K,\theta),$ where $m_{n}(K,\theta)=C_{n}(K,\theta)-C_{n}(K_{E},\theta)-\phi(\theta)$ and $\phi$ is a regularization term that we add so that the problem becomes strongly concave and the AGDA scheme can be applied (see [YKH20]). Moreover, $\Theta$ is the feasible set of the cost parameters. We assume $\Theta$ is convex and there exist positive constants $\alpha_{Q},\beta_{Q},\alpha_{R}$ and $\beta_{R}$ such that for any $(Q,R)\in\Theta$ we have $\alpha_{Q}I\preceq Q\preceq\beta_{Q}I,\ \alpha_{R}I\preceq R\preceq\beta_{R}I.$ We generate three different data sets for different choices of $d$ and $k$ and we set $n=100$, $\alpha_{Q}=\alpha_{R}=0.1$ and $\beta_{Q}=\beta_{R}=100$. We choose $\alpha_{k}=\tfrac{2}{k+1}$, $\sigma_{k}=0.4$ and $\lambda_{k}=\gamma_{k}=2\times 10^{-4}$. The exact gradient of the problem in compact form has been established in [FGKM18].

In Figure 1 (a) and (b), we compare the performance of our proposed method (PDM) with AGDA [YKH20], Smoothed-GDA [ZXSL20], and AGP [XZXL22]. We set the same step sizes for all the methods to ensure fairness in our experiment. Other parameters for the competing methods are selected as suggested in their papers. In Figure 1 (c) and (d), we compare PDM with its stochastic variant (SPDM) by running both algorithms for the same amount of time. As can be seen, SPDM outperforms PDM, and its superiority is more evident as $n$ becomes larger.

Distributionally robust optimization. Consider the following DRO problem.

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\ \sum_{i=1}^{n}y_{i}\log\big(1+\exp(-b_{i}a_{i}^{T}x)\big),$$

where $\mathcal{Y}=\{y\in\mathbb{R}^{m}_{+}\mid y\geq\delta/n,\ \tfrac{1}{2}\|ny-\mathbf{1}_{n}\|\leq\rho\}$, $\delta=1/100$ and $\rho=50$. We compare our method with the stochastic accelerated primal-dual method proposed in [Zha21] (SPDHG) and stochastic mirror prox [JNT11] (SMP). We use the real datasets colon-cancer (n=62, m=2000) and leukemia (n=38, m=7129) from the LIBSVM library [CL11a]. Note that in these datasets the number of features is larger than the number of samples; therefore, computing $\nabla_{y}\mathcal{L}$ is cheap while $\nabla_{x}\mathcal{L}$ can be costly. Hence, we use an unbiased estimator $\nabla_{x}\mathcal{L}_{\mathcal{U}}$ with a batch size of 10 for all the methods. We run all algorithms for 300 seconds. The performance of the methods is depicted in Figure 2. Table 2 summarizes the performance of our algorithm and the competing methods in terms of the gap function. Our scheme outperforms the other algorithms, which matches the theoretical result. In fact, PDM has a convergence rate of $\mathcal{O}(1/k)$ and the other two methods have a convergence rate of $\mathcal{O}(1/\sqrt{k})$.
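The remark that $\nabla_{y}\mathcal{L}$ is cheap while $\nabla_{x}\mathcal{L}$ is costly can be read off the form of the objective: it is linear in $y$, so its $y$-gradient is just the vector of per-sample losses. The sketch below evaluates the weighted logistic objective and both gradients on small synthetic data; the function names, the data sizes, and the uniform weights are assumptions of this illustration (the LIBSVM datasets are not loaded here).

```python
import numpy as np


def dro_objective(x, y, features, labels):
    """Weighted logistic loss sum_i y_i * log(1 + exp(-b_i * a_i^T x))."""
    margins = labels * (features @ x)
    losses = np.logaddexp(0.0, -margins)  # numerically stable log(1 + exp(-m))
    return y @ losses


def dro_gradients(x, y, features, labels):
    """Return (grad_x, grad_y) of the objective above."""
    margins = labels * (features @ x)
    losses = np.logaddexp(0.0, -margins)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), computed stably.
    dloss_dmargin = -np.exp(-np.logaddexp(0.0, margins))
    grad_x = features.T @ (y * dloss_dmargin * labels)
    grad_y = losses  # the objective is linear in y
    return grad_x, grad_y


rng = np.random.default_rng(0)
n_samples, n_features = 5, 8  # small synthetic sizes, not the LIBSVM data
features = rng.normal(size=(n_samples, n_features))
labels = rng.choice([-1.0, 1.0], size=n_samples)
x = rng.normal(size=n_features)
y = np.full(n_samples, 1.0 / n_samples)

grad_x, grad_y = dro_gradients(x, y, features, labels)

# Central finite differences as an independent check of grad_x.
step = 1e-6
fd_grad_x = np.array([
    (dro_objective(x + step * e, y, features, labels)
     - dro_objective(x - step * e, y, features, labels)) / (2.0 * step)
    for e in np.eye(n_features)
])
print(np.max(np.abs(grad_x - fd_grad_x)))
```

A mini-batch estimator of `grad_x` would restrict the matrix-vector products to a sampled subset of rows and rescale accordingly; the projection onto $\mathcal{Y}$ needed for the dual step is not shown here.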

6 Concluding Remarks

In this paper, we proposed an accelerated primal-dual scheme, for both deterministic and stochastic settings, to solve a class of nonconvex-concave problems where the objective function satisfies the PL condition. By combining an accelerated step in the minimization variable with an update involving a momentum in terms of the gradient of the objective function for the maximization variable, we obtained a convergence rate of $\mathcal{O}(\epsilon^{-4})$ and $\mathcal{O}(\epsilon^{-2})$ for the stochastic and deterministic problems, respectively. To the best of our knowledge, this is the first work to propose a primal-dual scheme with momentum for solving PL-concave minimax problems.

There are several interesting directions for future work: (i) investigating a distributed variant of the proposed scheme over a network of agents; (ii) considering a more general setting of nonconvex-concave SP problems and developing a projection-free algorithm.

APPENDIX

In our analysis, we use the following technical lemma.

Lemma 1.

Given arbitrary sequences $\{\bar{\sigma}_{k}\}_{k\geq 0}\subset\mathbb{R}^{n}$ and $\{\bar{\alpha}_{k}\}_{k\geq 0}\subset\mathbb{R}_{++}$, let $\{v_{k}\}_{k\geq 0}$ be a sequence such that $v_{0}\in\mathbb{R}^{n}$ and $v_{k+1}=v_{k}+\tfrac{\bar{\sigma}_{k}}{\bar{\alpha}_{k}}$. Then, for all $k\geq 0$ and $x\in\mathbb{R}^{n}$,

[TABLE]

To prove the convergence rate, we use the following lemma (the proof is similar to that of Lemma 3 in [GL16]).

Lemma 2.

For any given $z,\bar{z}\in\mathbb{R}^{n}$ and $c>0$ , such that $\|z-\bar{z}\|\leq c\epsilon$ , and $y\in\mathbb{R}^{m}$ let $\bar{y}\triangleq\mbox{prox}_{\sigma,h}(y+\sigma(\nabla_{y}\mathcal{L}(\bar{z},y)+q+u))$ for some $q,u\in\mathbb{R}^{m}$ such that $\|q\|\leq\ell\|\nabla_{x}\mathcal{L}(x,y)\|$ and $\|u\|^{2}\leq\nu_{y}^{2}/b$ for some $\ell,\nu_{y},b>0$ . If $\|\nabla_{x}\mathcal{L}(z,y)\|^{2}+\|\bar{y}-y\|^{2}\leq\epsilon^{2}$ , for some $\epsilon>0$ , then $\|\nabla_{x}\mathcal{L}(z,y)\|\leq\epsilon$ and $\nabla_{y}\mathcal{L}(z,y)\in h(\bar{y})+\mathcal{B}(0,(1/\sigma+\ell+cL_{yx})\epsilon+\nu_{y}/\sqrt{b})$ .

To facilitate the analysis, we introduce some notation.

Definition 2.

Let $u_{k}^{1}\triangleq\nabla_{y}\mathcal{L}_{\mathcal{V}_{k}}(x_{k},y_{k})-\nabla_{y}\mathcal{L}(x_{k},y_{k})$, $u_{k}^{2}\triangleq\nabla_{y}\mathcal{L}_{\mathcal{V}_{k}}(x_{k-1},y_{k})-\nabla_{y}\mathcal{L}(x_{k-1},y_{k})$, and $u_{k}^{3}\triangleq\nabla_{y}\mathcal{L}_{\mathcal{V}_{k}}(z_{k+1},y_{k})-\nabla_{y}\mathcal{L}(z_{k+1},y_{k})$. Moreover, we define $\zeta_{k}\triangleq\tfrac{L_{xx}\Gamma_{k}}{2}\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}\|w_{\tau}\|^{2}$, $U_{k}\triangleq\langle\beta_{k}u_{k}^{3}+u_{k}^{1}-u_{k}^{2},y_{k}+v_{k}\rangle$, and $\Xi_{k}\triangleq E_{k}^{x}+\bar{E}_{k}^{x}+\tfrac{L_{xx}\gamma_{k}^{2}}{2}\|w_{k}\|^{2}$, where $E_{k}^{x}\triangleq-\gamma_{k}\left\langle w_{k},\nabla_{x}\mathcal{L}(x_{k},y_{k+1})\right\rangle+L_{xx}\gamma_{k}^{2}\left\langle w_{k},\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\right\rangle$ and $\bar{E}_{k}^{x}\triangleq\tfrac{L_{xx}\Gamma_{k-1}(1-\alpha_{k})^{2}}{2}\sum_{\tau=0}^{k}\tfrac{(\gamma_{\tau}-\lambda_{\tau})^{2}}{\Gamma_{\tau}\alpha_{\tau}}w_{\tau}^{T}\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})$.

In the next lemma, we provide a one-step analysis to obtain a bound for the norm of $\nabla_{x}\mathcal{L}$ and progress of the dual iterates. This is the main building block of our convergence analysis in Theorem 1.

Lemma 3.

Let $\{x_{k},y_{k},z_{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 2 and suppose Assumptions 1-3 hold. Moreover, let $\beta_{k}\triangleq\gamma_{k}C_{k}\mu$, $D\triangleq\min\{\tfrac{\gamma_{k}C_{k}}{4},\tfrac{\beta_{k}}{4\sigma_{k}}\}$, $\sigma_{k}=\tfrac{\mu}{36L^{2}_{xy}}$, $\alpha_{k}=\tfrac{2}{k+1}$, $\lambda_{k}=\tfrac{1}{2L_{xx}}$, and $\gamma_{k}\in[\lambda_{k},(1+\alpha_{k}/4)\lambda_{k}]$ for any $k\geq 0$, and let $b=T$. Then, the following holds:

[TABLE]

Proof.

Define $\Delta_{k}\triangleq\nabla_{x}\mathcal{L}{(x_{k},y_{k+1})}-\nabla_{x}\mathcal{L}{(z_{k+1},y_{k+1})}$ . From Assumption 2 and step 3 of Algorithm 2, the following can be obtained,

[TABLE]

Define $w_{k}\triangleq\nabla_{x}\mathcal{L}_{\mathcal{U}_{k}}(z_{k+1},y_{k+1})-\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})$. Using (2) and step 8 of Algorithm 2, one can obtain

[TABLE]

Define $E_{k}^{x}\triangleq-\gamma_{k}\left\langle w_{k},\nabla_{x}\mathcal{L}(x_{k},y_{k+1})\right\rangle+L_{xx}\gamma_{k}^{2}\left\langle w_{k},\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})\right\rangle.$ Combining the two inequalities above:

[TABLE]

where we used $ab\leq\tfrac{a^{2}+b^{2}}{2}$. By steps 3, 8 and 9 of Algorithm 2, one can obtain $\tilde{x}_{k+1}-x_{k+1}=(1-\alpha_{k})\tilde{x}_{k}+\alpha_{k}x_{k}-\lambda_{k}(\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})+w_{k})-[x_{k}-\gamma_{k}(\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})+w_{k})]=(1-\alpha_{k})(\tilde{x}_{k}-x_{k})+(\gamma_{k}-\lambda_{k})(\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})+w_{k}).$ Dividing both sides of the above equality by $\Gamma_{k}$, summing over $k$, and using the definition of $\Gamma_{k}$, we obtain $\tilde{x}_{k+1}-x_{k+1}=\Gamma_{k}\sum_{\tau=0}^{k}\left(\tfrac{\gamma_{\tau}-\lambda_{\tau}}{\Gamma_{\tau}}\right)(\nabla_{x}\mathcal{L}(z_{\tau+1},y_{\tau+1})+w_{\tau}).$ Using the above equality, Jensen's inequality, and the fact that $\sum_{\tau=0}^{k}\tfrac{\alpha_{\tau}}{\Gamma_{\tau}}=\tfrac{\alpha_{0}}{\Gamma_{0}}+\sum_{\tau=1}^{k}\tfrac{1}{\Gamma_{\tau}}\left(1-\tfrac{\Gamma_{\tau}}{\Gamma_{\tau-1}}\right)=\tfrac{1}{\Gamma_{0}}+\sum_{\tau=1}^{k}\left(\tfrac{1}{\Gamma_{\tau}}-\tfrac{1}{\Gamma_{\tau-1}}\right)=\tfrac{1}{\Gamma_{k}},$ we obtain

[TABLE]
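The closed form for $\tilde{x}_{k+1}-x_{k+1}$ and the telescoping identity $\sum_{\tau=0}^{k}\alpha_{\tau}/\Gamma_{\tau}=1/\Gamma_{k}$ used in this step can be sanity-checked numerically. The following is a minimal sketch under illustrative assumptions that are not taken from the paper: the schedule is indexed as $\alpha_{0}=1$, $\alpha_{k}=2/(k+2)$ for $k\geq 1$, with $\Gamma_{0}=1$ and $\Gamma_{k}=(1-\alpha_{k})\Gamma_{k-1}$ so that every $\Gamma_{k}$ is nonzero, and random vectors stand in for the stochastic gradients $\nabla_{x}\mathcal{L}(z_{k+1},y_{k+1})+w_{k}$.

```python
import numpy as np

def alpha(k):
    # Assumed indexing (hypothetical): alpha_0 = 1, alpha_k = 2/(k+2) for k >= 1.
    return 1.0 if k == 0 else 2.0 / (k + 2)

def check_identities(T=25, dim=3, lam=0.5, seed=0):
    """Run the three-sequence recursion and compare it with the closed forms."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    x_tilde = x.copy()                       # start with x_tilde_0 = x_0
    Gamma = 1.0
    weighted_sum = np.zeros(dim)             # sum of (gamma - lambda) / Gamma * r
    alpha_over_Gamma = 0.0                   # sum of alpha / Gamma
    err_diff, err_sum = 0.0, 0.0
    for k in range(T):
        a = alpha(k)
        Gamma = 1.0 if k == 0 else (1.0 - a) * Gamma
        gam = lam * (1.0 + rng.uniform(0.0, a / 4))  # gamma in [lam, (1 + a/4) lam]
        r = rng.standard_normal(dim)         # stand-in for gradient plus noise
        z = (1.0 - a) * x_tilde + a * x      # step 3 (as used in the proof)
        x_new = x - gam * r                  # step 8
        x_tilde_new = z - lam * r            # step 9
        weighted_sum += (gam - lam) / Gamma * r
        alpha_over_Gamma += a / Gamma
        closed_form = Gamma * weighted_sum
        err_diff = max(err_diff, np.linalg.norm((x_tilde_new - x_new) - closed_form))
        err_sum = max(err_sum, abs(alpha_over_Gamma * Gamma - 1.0))
        x, x_tilde = x_new, x_tilde_new
    return err_diff, err_sum

err_diff, err_sum = check_identities()
print("max deviation from closed form:", err_diff)
print("max deviation of Gamma_k * sum(alpha/Gamma) from 1:", err_sum)
```

Both deviations should be at the level of floating-point round-off, which is what the algebra in the proof predicts.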

Using (Proof.) in (Proof.), one can obtain the following,

[TABLE]

Using Definition 2, summing both sides of the inequality above over $k$, and using the definition of $C_{k}$ in (3), we obtain the following

[TABLE]

From the inequality above and Definition 1, one can obtain

[TABLE]

Adding $\sum_{k=0}^{T-1}(\mathcal{L}(x_{k+1},y)-\mathcal{L}(x_{k},y))$ to both sides:

[TABLE]

Using concavity of $\mathcal{L}$ over $y$ , one can obtain

[TABLE]

Let us define $u_{k}^{1}=\nabla_{y}\mathcal{L}_{\mathcal{V}_{k}}(x_{k},y_{k})-\nabla_{y}\mathcal{L}(x_{k},y_{k})$, $u_{k}^{2}=\nabla_{y}\mathcal{L}_{\mathcal{V}_{k}}(x_{k-1},y_{k})-\nabla_{y}\mathcal{L}(x_{k-1},y_{k})$, $u_{k}^{3}=\nabla_{y}\mathcal{L}_{\mathcal{V}_{k}}(z_{k+1},y_{k})-\nabla_{y}\mathcal{L}(z_{k+1},y_{k})$, $\bar{p}_{k}=\nabla_{y}\mathcal{L}(z_{k+1},y_{k})$, and $\bar{q}_{k}=\tfrac{1}{\beta_{k}}(\nabla_{y}\mathcal{L}(x_{k},y_{k})-\nabla_{y}\mathcal{L}(x_{k-1},y_{k}))$. From the optimality condition of step 6 in Algorithm 2, letting $s_{k}=\bar{p}_{k}+\bar{q}_{k}+u_{k}^{1}+u_{k}^{2}+u_{k}^{3}$, one can obtain $h(y_{k+1})-\langle s_{k},y_{k+1}-y\rangle\leq h(y)+\tfrac{1}{2\sigma_{k}}[\|y-y_{k}\|^{2}-\|y-y_{k+1}\|^{2}-\|y_{k+1}-y_{k}\|^{2}].$ Multiplying both sides by $\beta_{k}=\gamma_{k}C_{k}\mu$ and summing over $k$, we obtain

[TABLE]
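The three-point inequality obtained from the optimality condition of the dual step holds for any proximal update of the form $y_{k+1}=\mbox{prox}_{\sigma,h}(y_{k}+\sigma s_{k})$ with convex $h$. The following is a minimal numerical check, assuming for illustration that $h=\|\cdot\|_{1}$ (so the prox is soft-thresholding) and that $y_{k}$, $s_{k}$ and the comparison point $y$ are random vectors; these choices are hypothetical and not the paper's setup.

```python
import numpy as np

def prox_l1(v, sigma):
    """Proximal map of sigma * ||.||_1, i.e. soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - sigma, 0.0)

def three_point_gap(y_k, s, y, sigma):
    """Return RHS - LHS of the three-point inequality; it should be >= 0."""
    y_next = prox_l1(y_k + sigma * s, sigma)   # ascent-type prox step
    h = lambda u: np.abs(u).sum()              # assumed regularizer h = l1 norm
    lhs = h(y_next) - s @ (y_next - y)
    rhs = h(y) + (np.sum((y - y_k) ** 2)
                  - np.sum((y - y_next) ** 2)
                  - np.sum((y_next - y_k) ** 2)) / (2.0 * sigma)
    return rhs - lhs

rng = np.random.default_rng(1)
gaps = []
for _ in range(1000):
    y_k, s, y = rng.standard_normal((3, 6))    # three random vectors in R^6
    gaps.append(three_point_gap(y_k, s, y, sigma=0.4))
min_gap = min(gaps)
print("smallest RHS - LHS over random trials:", min_gap)
```

The inequality follows from the $1/\sigma$-strong convexity of the prox subproblem, so the smallest gap should be nonnegative up to round-off.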

Now, we simplify the inner products appearing in the inequality above and in (13) using the definitions of $\bar{p}_{k}$ and $\bar{q}_{k}$.

[TABLE]

Moreover, using Young’s inequality, and step 8 in Algorithm 2, one can obtain

[TABLE]

Summing (13) and (Proof.), using (15) and (Proof.), we get,

[TABLE]

where $\gamma_{-1}=\gamma_{0}$. From the Cauchy-Schwarz inequality, using Lemma 1 with $v_{0}=y_{0}$, and defining $U_{k}\triangleq\langle\beta_{k}u_{k}^{3}+u_{k}^{1}-u_{k}^{2},y_{k}+v_{k}\rangle$, the following holds

[TABLE]

for some $\bar{\alpha}_{k}\geq 0$ . Hence, using (Proof.) in (Proof.) and rearranging terms, one can obtain the following,

[TABLE]

Choosing the parameters such that $\sigma_{k}\leq\tfrac{\mu^{2}}{216L^{2}_{xy}}$, $\alpha_{k}=\tfrac{2}{k+1}$, $\lambda_{k}=\tfrac{1}{2L_{xx}}$, $\tau_{k}=\tfrac{9L^{2}_{xy}}{\mu}$, $\bar{\alpha}_{k}=\tfrac{\beta_{k}}{4\sigma_{k}}$, and $\gamma_{k}\in[\lambda_{k},(1+\alpha_{k}/4)\lambda_{k}]$ for any $k\geq 0$, one can show that in the inequality above Term (A) $\leq-\tfrac{\gamma_{k}C_{k}}{4}$ and Term (B) $\leq-\tfrac{\beta_{k}}{4\sigma_{k}}$. Therefore, choosing $k^{*}=\mbox{argmin}_{k}\{\|\nabla_{x}\mathcal{L}(z_{k},y_{k})\|^{2}+\|y_{k+1}-y_{k}\|^{2}\}$, the left-hand side (LHS) of that inequality can be bounded from below by $\big(\sum_{k=0}^{T-1}\min\{\tfrac{\gamma_{k}C_{k}}{4},\tfrac{\beta_{k}}{4\sigma_{k}}\}\big)\big(\|\nabla_{x}\mathcal{L}(z_{k^{*}},y_{k^{*}})\|^{2}+\|y_{k^{*}+1}-y_{k^{*}}\|^{2}\big)$. Moreover, letting $(x^{*},y^{*})$ be an arbitrary saddle point solution of (1), choosing $y=y^{*}$, and using the fact that $\mathcal{L}(x^{*}(y_{k+1}),y_{k+1})\leq\mathcal{L}(x^{*},y_{k+1})$ and (15), one can obtain:

[TABLE]

where $D\triangleq{\min\{{\tfrac{{\gamma_{k}C_{k}}}{4},\tfrac{\beta_{k}}{{4\sigma_{k}}}}}\}$ and we used ${\sum_{k=0}^{T-1}D}=TD$ . ∎

Now, we are ready to prove Theorem 1 and establish the convergence rate results.

Proof of Theorem 1. From (5), we have that

[TABLE]

Taking conditional expectation, one can show that $\mathbb{E}[C\mid\mathcal{H}_{k}]\leq\tfrac{9\nu^{2}_{y}}{T}$, $\mathbb{E}[\Xi_{k}\mid\mathcal{F}_{k}]=\tfrac{L_{xx}\gamma^{2}_{k}\nu^{2}_{x}}{2T}$, $\mathbb{E}[\zeta_{k}\mid\mathcal{F}_{k}]\leq\tfrac{L_{xx}\lambda^{2}_{k}\nu^{2}_{x}}{32T}$, and $\mathbb{E}[U_{k}\mid\mathcal{H}_{k}]=0$. Hence, we obtain:

[TABLE]
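The $1/T$ factors in these expectations reflect the batch size $b=T$: averaging $b$ independent zero-mean noise samples reduces the expected squared error by a factor of $b$. The following is a minimal Monte Carlo sketch of that scaling, assuming i.i.d. Gaussian gradient noise with $\mathbb{E}\|\xi\|^{2}=\nu^{2}$; the noise model and the function name `minibatch_error_sq` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def minibatch_error_sq(b, nu=2.0, dim=5, trials=4000, seed=0):
    """Estimate E||(1/b) * sum_i xi_i||^2 for i.i.d. noise with E||xi||^2 = nu^2."""
    rng = np.random.default_rng(seed)
    # Each coordinate has variance nu^2 / dim, so E||xi||^2 = nu^2.
    noise = rng.standard_normal((trials, b, dim)) * (nu / np.sqrt(dim))
    u = noise.mean(axis=1)                   # mini-batch average of the noise
    return float(np.mean(np.sum(u ** 2, axis=1)))

for b in (1, 10, 100):
    print(f"b = {b:3d}: estimated E||u||^2 = {minibatch_error_sq(b):.4f}, "
          f"nu^2 / b = {4.0 / b:.4f}")
```

With $b=T$ the noise contribution per iteration is of order $\nu^{2}/T$, so its sum over $T$ iterations stays bounded, which is what the parameter choice in the analysis exploits.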

Moreover, from the steps of Algorithm 2, $\|z_{k+1}-z_{k}\|\leq\lambda_{k-1}\|r_{k-1}\|+\|x_{k}-\tilde{x}_{k}\|$. Using steps 8 and 9, one can show that $\|x_{k}-\tilde{x}_{k}\|\leq\mathcal{O}(\tfrac{1}{(T+1)\sqrt{T}})$. Hence $\|z_{k+1}-z_{k}\|\leq\mathcal{O}(\sqrt{\epsilon})$. Invoking Lemma 2, we conclude that $(z_{k^{*}},y_{k^{*}})$ is an $\epsilon$-stationary point of problem (1). To achieve an $\epsilon$-stationary point, we set the right-hand side of the bound above equal to $\epsilon^{2}$, which implies that $T=\mathcal{O}(\epsilon^{-2})$. Hence, the total number of sample gradient evaluations is $\sum_{k=0}^{T-1}b=T^{2}=\mathcal{O}(\epsilon^{-4})$, since we chose $b=T$. ∎
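The sample-complexity count at the end of the proof is simple arithmetic: $T=\mathcal{O}(\epsilon^{-2})$ iterations with batch size $b=T$ give $T^{2}=\mathcal{O}(\epsilon^{-4})$ sampled gradients. The following is a minimal sketch of that calculation, where the constant `c` hidden in the $\mathcal{O}(\cdot)$ notation is a hypothetical placeholder set to 1.

```python
import math

def sample_complexity(eps, c=1.0):
    """Return (T, total samples) for T = ceil(c / eps^2) and batch size b = T."""
    T = math.ceil(c / eps ** 2)   # iterations needed, up to the unknown constant c
    b = T                         # batch size chosen in the analysis
    return T, T * b               # sum_{k=0}^{T-1} b = T^2

for eps in (0.5, 0.25, 0.125):
    T, total = sample_complexity(eps)
    print(f"eps = {eps}: T = {T}, total sampled gradients = {total}")
```

Halving $\epsilon$ multiplies the iteration count by 4 and the total number of sampled gradients by 16, consistent with the $\epsilon^{-2}$ and $\epsilon^{-4}$ rates.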

Bibliography

[ABR21] Zeeshan Akhtar, Amrit Singh Bedi, and Ketan Rajawat. Conservative stochastic optimization: $O(t^{-1/2})$ optimality gap with zero constraint violation. In 2021 American Control Conference (ACC), pages 2224–2229. IEEE, 2021.
[AGMM15] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Conference on Learning Theory, pages 113–149. PMLR, 2015.
[ALD21] Sotirios-Konstantinos Anagnostidis, Aurelien Lucchi, and Youssef Diouane. Direct-search for a class of stochastic min-max problems. In International Conference on Artificial Intelligence and Statistics, pages 3772–3780. PMLR, 2021.
[AZ18] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. Advances in Neural Information Processing Systems, 31, 2018.
[BBM18] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
[BKR19] Amrit Singh Bedi, Alec Koppel, and Ketan Rajawat. Asynchronous online learning in multi-agent systems with proximity constraints. IEEE Transactions on Signal and Information Processing over Networks, 5(3):479–494, 2019.
[CC15] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Advances in Neural Information Processing Systems, 28, 2015.
[CHCW19] Qi Cai, Mingyi Hong, Yongxin Chen, and Zhaoran Wang. On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674, 2019.
