Stochastic Primal-Dual Algorithms with Faster Convergence than $O(1/\sqrt{T})$ for Problems without Bilinear Structure

Yan Yan; Yi Xu; Qihang Lin; Lijun Zhang; Tianbao Yang

arXiv:1904.10112·cs.LG·December 20, 2019

TL;DR

This paper introduces new stochastic primal-dual algorithms that achieve faster convergence rates than the traditional $O(1/\sqrt{T})$ for convex-concave problems without requiring bilinear structure, applicable to robust learning and AUC maximization.

Contribution

The paper develops and analyzes stochastic primal-dual algorithms with a mixture of stochastic and deterministic updates, achieving improved convergence rates for non-bilinear convex-concave problems.

Findings

1. Achieves an $O(1/T)$ convergence rate under certain conditions.

2. Applicable to problems with weak strong convexity and strong concavity.

3. Effective in robust model learning and empirical AUC maximization.

Abstract

Previous studies on stochastic primal-dual algorithms for solving min-max problems with faster convergence rely heavily on the bilinear structure of the problem, which restricts their applicability to a narrow range of problems. The main contribution of this paper is the design and analysis of new stochastic primal-dual algorithms that use a mixture of stochastic gradient updates and a logarithmic number of deterministic dual updates for solving a family of convex-concave problems with no bilinear structure assumed. Faster convergence rates than $O(1/\sqrt{T})$, with $T$ being the number of stochastic gradient updates, are established under some mild conditions on the involved functions of the primal and the dual variable. For example, for a family of problems that enjoy a weak strong convexity in terms of the primal variable and have a strongly concave function of the dual variable, the…

Tables

Table 1: Data statistics.

Datasets	#Examples	#Features
w8a	49,749	300
rcv1	20,242	47,236
a9a	32,561	123
real-sim	72,309	20,958
covtype	581,012	54
URL	2,396,130	3,231,961

Equations

$$\min_{x\in X}\max_{y\in\text{dom}(\phi^{*})}\; y^{\top}\ell(x)-\phi^{*}(y)+g(x)$$

$$\min_{x\in X}P(x):=\phi(\ell(x))+g(x).$$

$$\min_{x\in X}\frac{1}{n}\sum_{i=1}^{n}\phi_{i}(a_{i}^{\top}x+b_{i})+g(x),$$

$$\min_{x\in X}\max_{y\in\Delta_{n}}\sum_{i=1}^{n}y_{i}\ell_{i}(x)-V(y,y_{0})+g(x),$$

$$\mathcal{A}(x)=\arg\max_{y\in\text{dom}(\phi^{*})}\; y^{\top}\ell(x)-\phi^{*}(y),$$

$$\min_{x\in X}\big\{P(x)=\cdots\big\}$$

$$f(x_{1})\geq f(x_{2})+\partial f(x_{2})^{\top}(x_{1}-x_{2})+\frac{\lambda}{2}\|x_{1}-x_{2}\|^{2},$$

$$f(x_{1})\geq f(x_{2})+\partial f(x_{2})^{\top}(x_{1}-x_{2})+\frac{\lambda}{2}\|x_{1}-x_{2}\|^{p}.$$

$$h^{*}(y)=\max_{x}\; y^{\top}x-h(x).$$

$$\text{dist}(x,X^{*})\leq c\,(P(x)-P^{*})^{1/2},$$

$$\text{dist}(x,X^{*})\leq c\,(P(x)-P^{*})^{\theta},$$

$$\nabla_{x,t}^{\top}(x_{t}-x)\leq\frac{\|x_{t}-x\|^{2}-\|x_{t+1}-x\|^{2}}{2\eta_{x}}+\frac{\eta_{x}M^{2}}{2}$$

$$\nabla_{y,t}^{\top}(y-y_{t})\leq\frac{\|y_{t}-y\|^{2}-\|y_{t+1}-y\|^{2}}{2\eta_{y}}+\frac{\eta_{y}B^{2}}{2}.$$

$$\mathrm{E}[f(x_{t})-f(x)]\leq\cdots$$

$$\mathrm{E}[f(x_{t},y_{t})-f(x,y_{t})]\leq\cdots$$

$$\mathrm{E}[f(x_{t},y)-f(x_{t},y_{t})]\leq\cdots$$

$$\mathrm{E}[(f(\hat{x}_{T},y)-f(x,\hat{y}_{T}))]\leq\cdots$$

$$\mathrm{E}\Big[\max_{y\in Y}f(\bar{x}_{T},y)-f(x^{*},\bar{y}_{T})\Big]\leq\cdots$$

$$\mathrm{E}[P(x_{0}^{(s+1)})]-P^{*}\leq\epsilon_{s}.$$

$$\mathrm{E}[f(\bar{x}_{s},\hat{y}(\bar{x}_{s}))-f(x^{*},\bar{y}_{s})]\leq\frac{\mathrm{E}[\|x^{*}-x_{0}^{(s)}\|^{2}]}{\eta_{x,s}T_{s}}+\frac{\mathrm{E}[\|\hat{y}(\bar{x}_{s})-y_{0}^{(s)}\|^{2}]}{\eta_{y,s}T_{s}}+\frac{5\eta_{x,s}M^{2}}{2}+\frac{5\eta_{y,s}B^{2}}{2}.$$

$$\mathrm{E}[P(\bar{x}_{s})-P^{*}]\leq\cdots$$

$$\mathrm{E}[\|x^{*}-x_{0}^{(s)}\|^{2}]\leq\mathrm{E}\bigg[\frac{2}{\mu}(P(x_{0}^{(s)})-P^{*})\bigg]\leq\frac{2\epsilon_{s-1}}{\mu}.$$

$$\|\hat{y}(\bar{x}_{s})-y_{0}^{(s)}\|^{2}=\cdots$$

$$\mathrm{E}[\|\hat{y}(\bar{x}_{s})-y_{0}^{(s)}\|^{2}]\leq\cdots$$

$$\mathrm{E}[P(\bar{x}_{S})-P^{*}]\leq\epsilon_{S}=\epsilon.$$

$$T\geq\cdots$$

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Statistical Methods and Inference

Full text

Stochastic Primal-Dual Algorithms with Faster Convergence than $O(1/\sqrt{T})$ for Problems without Bilinear Structure

Yan Yan (1), Yi Xu (1), Qihang Lin (2), Lijun Zhang (3), Tianbao Yang (1)

(1) Department of Computer Science, The University of Iowa, Iowa City, IA 52242

(2) Department of Management Sciences, University of Iowa, Iowa City, IA 52242

(3) National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Abstract

Previous studies on stochastic primal-dual algorithms for solving min-max problems with faster convergence rely heavily on the bilinear structure of the problem, which restricts their applicability to a narrow range of problems. The main contribution of this paper is the design and analysis of new stochastic primal-dual algorithms that use a mixture of stochastic gradient updates and a logarithmic number of deterministic dual updates for solving a family of convex-concave problems with no bilinear structure assumed. Faster convergence rates than $O(1/\sqrt{T})$, with $T$ being the number of stochastic gradient updates, are established under some mild conditions on the involved functions of the primal and the dual variable. For example, for a family of problems that enjoy a weak strong convexity in terms of the primal variable and have a strongly concave function of the dual variable, the convergence rate of the proposed algorithm is $O(1/T)$. We also investigate the effectiveness of the proposed algorithms for learning robust models and empirical AUC maximization.

1 Introduction

This paper is motivated by solving the following convex-concave problem:

$$\min_{x\in X}\max_{y\in\text{dom}(\phi^{*})}\; y^{\top}\ell(x)-\phi^{*}(y)+g(x) \tag{1}$$

where $X\subseteq\mathbb{R}^{d}$ is a closed convex set, $\ell(x)=(\ell_{1}(x),\ldots,\ell_{n}(x))^{\top}:X\rightarrow\mathbb{R}^{n}$ is a lower-semicontinuous mapping whose component function $\ell_{i}(\mathbf{x})$ is lower-semicontinuous and convex, $\phi^{*}(y):\text{dom}(\phi^{*})\rightarrow\mathbb{R}$ is a convex function whose convex conjugate is denoted by $\phi$ , and $g(\mathbf{x}):X\rightarrow\mathbb{R}$ is a lower-semicontinuous convex function. To ensure the convexity of the problem, it is assumed that $\text{dom}(\phi^{*})\subseteq\mathbb{R}_{+}^{n}$ if $\ell(x)$ is not an affine function. By using the convex conjugate $\phi^{*}$ , the problem (1) is equivalent to the following convex minimization problem:

$$\min_{x\in X}P(x):=\phi(\ell(x))+g(x). \tag{2}$$

A particular family of min-max problem (1) and its minimization form (2) that has been considered extensively in the literature (Zhang and Lin, 2015; Yu et al., 2015; Tan et al., 2018; Shalev-Shwartz and Zhang, 2013; Lin et al., 2014) is that $\ell(x)=Ax+b$ is an affine function and $\phi(s)=\sum_{i=1}^{n}\phi_{i}(s_{i})$ for $s\in\mathbb{R}^{n}$ is decomposable. In this case, the problem (2) is known as (regularized) empirical risk minimization problem in machine learning:

$$\min_{x\in X}\frac{1}{n}\sum_{i=1}^{n}\phi_{i}(a_{i}^{\top}x+b_{i})+g(x),$$

where $a_{i}$ is the $i$-th row of $A$ and $b_{i}$ is the $i$-th element of $b$.

However, stochastic optimization algorithms with fast convergence rates are still under-explored for a more challenging family of problems of (1) and (2) where $\ell(x)$ is not necessarily an affine or smooth function and $\phi$ is not necessarily decomposable. It is our goal to design new stochastic primal-dual algorithms for solving these problems with a fast convergence rate. A key motivating example of the considered problem is to solve a distributionally robust optimization problem:

$$\min_{x\in X}\max_{y\in\Delta_{n}}\sum_{i=1}^{n}y_{i}\ell_{i}(x)-V(y,y_{0})+g(x),$$

where $\Delta_{n}=\{y\in\mathbb{R}^{n};y_{i}\geq 0,\sum_{i}y_{i}=1\}$ is a simplex, and $V(y,y_{0})$ denotes a divergence measure (e.g., $\phi$ -divergence) between two sets of probabilities $y$ and $y_{0}$ . In machine learning with $\ell_{i}(x)$ denoting the loss of a model $x$ on the $i$ -th example, the above problem corresponds to robust risk minimization paradigm, which can achieve variance-based regularization for learning a predictive model from $n$ examples (Namkoong and Duchi, 2017). Other examples of the considered challenging problems can be found in robust learning from multiple perturbed distributions (Chen et al., 2017a) in which $\ell_{i}(x)$ corresponds to the loss from the $i$ -th perturbed distribution, and minimizing non-decomposable loss functions (Fan et al., 2017; Dekel and Singer, 2006).

With stochastic (sub)-gradients computed for $x$ and $y$, one can employ the conventional primal-dual stochastic gradient method or its variants (Nemirovski et al., 2009; Juditsky et al., 2011) for solving the problem (1). Under appropriate basic assumptions, one can derive the standard $O(1/\sqrt{T})$ convergence rate, with $T$ being the number of stochastic updates. However, $O(1/\sqrt{T})$ is known as a slow convergence rate, and it is always desirable to design optimization algorithms with faster convergence. Nonetheless, to the best of our knowledge, stochastic primal-dual algorithms with a fast convergence rate of $O(1/T)$ in terms of minimizing $P(x)$ remain unknown in general, even under the strong convexity of $\phi^{*}$ and $g$. In contrast, if $\phi$ is decomposable and $P$ is strongly convex, the standard stochastic gradient method for solving (2) with an appropriate step size scheme has a convergence rate of $O(1/T)$ (Hazan et al., 2007; Hazan and Kale, 2011a). A direct extension of the algorithms and analysis for stochastic strongly convex minimization to stochastic convex-concave optimization does not give a satisfactory $O(1/T)$ convergence rate. (Footnote 1: One may obtain a dimensionality-dependent convergence rate of $O(n/T)$ by following the conventional analysis, but it is not the standard dimensionality-independent rate that we aim to achieve.) It is still an open problem whether there exists a stochastic primal-dual algorithm that, by solving the convex-concave problem (1), enjoys a fast rate of $O(1/T)$ in terms of minimizing $P(x)$.

The major contribution of this paper is to fill this gap by developing stochastic primal-dual algorithms for solving (1) such that they enjoy a faster convergence than $O(1/\sqrt{T})$ in terms of the primal objective gap. In particular, under the assumptions that $\nabla\phi$ is Lipschitz continuous, $\ell_{i}(x)$ are Lipschitz continuous and the minimization problem (2) satisfies the strong convexity condition, the proposed algorithms enjoy an iteration complexity of $O(1/\epsilon)$ for finding a solution $x$ such that $\mathrm{E}[P(x)-\min_{x\in X}P(x)]\leq\epsilon$ , which corresponds to a faster convergence rate of $O(1/T)$ . The key difference of the proposed algorithms from the traditional stochastic primal-dual algorithm is that it is required to compute a logarithmic number of deterministic updates for $y$ in the following form:

$$\mathcal{A}(x)=\arg\max_{y\in\text{dom}(\phi^{*})}\; y^{\top}\ell(x)-\phi^{*}(y),$$

which can usually be solved in $O(n)$ time. It is worth noting that $\mathcal{A}(x)=\nabla\phi(\ell(x))$ (see Appendix A). When $n$ is moderate, the proposed algorithms could converge faster than the traditional primal-dual stochastic gradient method. It is also important to note that we do not assume the proximal mappings of $\phi^{*}$ and $g$ can be easily computed. Instead, our algorithms only require (stochastic) sub-gradients of $\phi^{*}$ and $g$, which makes them applicable and efficient for solving more challenging problems where $g$ is an empirical sum of individual functions.
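To make the deterministic dual update concrete, the following sketch works through one instance of the distributionally robust problem. It assumes, purely for illustration, that the divergence is $V(y,y_{0})=\lambda\,\mathrm{KL}(y\|y_{0})$; this choice, the value of $\lambda$, and all function names are hypothetical and are not taken from the paper. Under that assumption the maximizer over the simplex has the closed form $y_{i}\propto y_{0,i}\exp(\ell_{i}(x)/\lambda)$, which costs $O(n)$, and it coincides with the gradient of $\phi(s)=\lambda\log\sum_{i}y_{0,i}\exp(s_{i}/\lambda)$.

```python
import numpy as np

def dual_update(losses, y0, lam):
    """Closed-form A(x) when V(y, y0) = lam * KL(y || y0) on the simplex (assumed divergence)."""
    z = losses / lam
    z = z - z.max()                 # shift for numerical stability; cancels after normalizing
    w = y0 * np.exp(z)
    return w / w.sum()              # O(n) work

def phi(s, y0, lam):
    """phi(s) = max_y y^T s - lam * KL(y || y0) = lam * log(sum_i y0_i * exp(s_i / lam))."""
    m = s.max()
    return m + lam * np.log(np.sum(y0 * np.exp((s - m) / lam)))

def numerical_grad(fun, s, eps=1e-6):
    """Central finite differences, used only to check A(x) = grad phi(l(x))."""
    g = np.zeros_like(s)
    for i in range(s.size):
        e = np.zeros_like(s)
        e[i] = eps
        g[i] = (fun(s + e) - fun(s - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, lam = 5, 0.7
losses = rng.normal(size=n)         # stands in for l(x) at some fixed x
y0 = np.full(n, 1.0 / n)
y = dual_update(losses, y0, lam)
grad = numerical_grad(lambda s: phi(s, y0, lam), losses)
```

Comparing `y` with `grad` is a numerical check of the identity $\mathcal{A}(x)=\nabla\phi(\ell(x))$ for this particular choice of $\phi^{*}$.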

In addition, the proposed algorithms and theories can be easily extended to the case that $\nabla\phi$ is Hölder continuous and the minimization problem (2) satisfies a more general local error bound condition as defined later, with intermediate faster rates established.

2 Related Work

The stochastic primal-dual gradient method and its variants were first analyzed by Nemirovski et al. (2009) for solving a more general problem $\min_{x\in X}\max_{y\in Y}\mathrm{E}_{\xi}[f(x,y;\xi)]$. Under the standard bounded stochastic (sub)-gradient assumption, a convergence rate of $O(1/\sqrt{T})$ was established for the primal-dual gap, which implies a convergence rate of $O(1/\sqrt{T})$ for minimizing the primal objective $P(x)=\max_{y\in Y}\mathrm{E}_{\xi}[f(x,y;\xi)]$. Later, a couple of studies aimed to strengthen this convergence rate by leveraging the smoothness of $f(x,y;\xi)$ or of the involved function when the objective function has a special structure (Juditsky et al., 2011; Chen et al., 2014, 2017b). However, the worst-case convergence rate of these later algorithms is still dominated by $O(1/\sqrt{T})$. Without a smoothness assumption on $\ell(x)$ or a bilinear structure, these later algorithms are not directly applicable to solving (1). In addition, Frank-Wolfe algorithms are analyzed for saddle point problems in (Gidel et al., 2016), which could also achieve a convergence rate of $O(1/\sqrt{T})$ in terms of the primal-dual gap under the smoothness condition.

Recently, several algorithms with faster convergence have emerged for solving (1) by leveraging the bilinear structure and the strong convexity of $\phi^{*}$ and $g$. For example, Zhang and Lin (2015) proposed a stochastic primal-dual coordinate (SPDC) method for solving (3) under the condition that $\ell(x)=Ax$ is of bilinear structure and $\phi^{*}$ is strongly convex. When $g$ is also a strongly convex function, SPDC enjoys linear convergence for the primal-dual gap. Other variants of SPDC have been considered in (Yu et al., 2015; Tan et al., 2018) for solving (1) with bilinear structure. Palaniappan and Bach (2016) proposed stochastic variance reduction methods for solving a family of saddle-point problems. When applied to (1), they require that $\ell(x)$ be either an affine function or a smooth mapping. If additionally $g$ and $\phi^{*}$ are strongly convex, their algorithms also enjoy linear convergence for finding a solution that is $\epsilon$-close to the optimal solution in squared Euclidean distance. Du and Hu (2018) established a similar linear convergence of a primal-dual SVRG algorithm for solving (1) when $\ell=Ax$ is an affine function with a full column rank for $A$, $g$ is smooth, and $\phi^{*}$ is smooth and strongly convex, which are stronger assumptions than ours. All of these algorithms except (Du and Hu, 2018) also need to compute the proximal mapping of $\phi_{i}^{*}$ and $g$ at each iteration. In contrast, the present work is complementary to these studies and aims to solve a more challenging family of problems. In particular, the proposed algorithms do not require the bilinear structure or the smoothness of $\ell$, and the smoothness and strong convexity of $\phi^{*}$ and $g$ are also not necessary. In addition, we do not assume that $g$ and $\phi^{*}$ have an efficient proximal mapping.

Several recent studies have been devoted to stochastic AUC optimization based on a min-max formulation that has a bilinear structure (Liu et al., 2018a; Natole et al., 2018), aiming to derive a faster convergence rate of $O(1/T)$. The differences from the present work are that (i) the analysis of Liu et al. (2018a) is restricted to the online setting for AUC optimization; (ii) Natole et al. (2018) only prove a convergence rate of $O(1/T)$ in terms of the squared distance of the found primal solution to the optimal solution under the strong convexity of the regularizer on the primal variable, which is weaker than our results on the convergence of the primal objective gap. To the best of our knowledge, the present work is the first one that establishes a convergence rate of $O(1/T)$ in terms of minimizing $P(x)$ for the proposed stochastic primal-dual methods by solving a general convex-concave problem (1) without bilinear structure or a smoothness assumption on $\ell(x)$ under (weakly local) strong convexity.

Restart schemes have recently been considered for obtaining improved convergence rates under some conditions. In (Roulet and d’Aspremont, 2017), a restart scheme is analyzed for smooth convex problems under the sharpness and Hölder continuity conditions. In (Dvurechensky et al., 2018), a universal algorithm is proposed for variational inequalities under a Hölder continuity condition where the Hölder parameters are unknown. Stochastic algorithms are proposed for strongly convex stochastic composite problems in (Ghadimi and Lan, 2012, 2013).

Finally, we would like to mention that our algorithms and techniques share many similarities with those proposed in (Xu et al., 2017) for solving stochastic convex minimization problems under the local error bound condition. However, their algorithms are not directly applicable to the convex-concave problem (1) or to the problem (2) with a non-decomposable function $\phi$. The novelty of this work is the design and analysis of new algorithms that can leverage the weak local strong convexity, or the more general local error bound condition, of the primal minimization problem (2) through solving the convex-concave problem (1) to enjoy faster convergence.

3 Preliminaries

Recall that the problem of interest:

[TABLE]

where $Y=\text{dom}(\phi^{*})$ . Let $X^{*}$ denote the optimal set of the primal variable for the above problem, $P^{*}$ denote the optimal primal objective value and $x^{*}=\arg\min_{z\in X^{*}}||x-z||$ is the optimal solution closest to $x$ , where $\|\cdot\|$ denotes the Euclidean norm.

Let $\Pi_{\Omega}[\cdot]$ denote the projection onto the set $\Omega$. Let $\mathcal{L}_{\epsilon}:=\{x\in X:P(x)-P^{*}=\epsilon\}$ and $\mathcal{S}_{\epsilon}:=\{x\in X:P(x)-P^{*}\leq\epsilon\}$ denote the $\epsilon$-level set and the $\epsilon$-sublevel set of the primal problem, respectively. A function $f(x):X\rightarrow\mathbb{R}$ is $L$-smooth if it is differentiable and its gradient is $L$-Lipschitz continuous, i.e., $\|\nabla f(x_{1})-\nabla f(x_{2})\|\leq L\|x_{1}-x_{2}\|,\forall x_{1},x_{2}\in X$. A differentiable function $f$ is said to have an $(L,v)$-Hölder continuous gradient with $v\in(0,1]$ iff $\|\nabla f(x_{1})-\nabla f(x_{2})\|\leq L\|x_{1}-x_{2}\|^{v}$. When $v=1$, a Hölder continuous gradient reduces to a Lipschitz continuous gradient. A function $f$ is called $\lambda$-strongly convex if there exists $\lambda>0$ such that for any $x_{1},x_{2}\in X$

$$f(x_{1})\geq f(x_{2})+\partial f(x_{2})^{\top}(x_{1}-x_{2})+\frac{\lambda}{2}\|x_{1}-x_{2}\|^{2},$$

where $\partial f(x)$ denotes any subgradient of $f$ at $x$. A more general notion is uniform convexity: $f$ is uniformly convex with degree $p\geq 2$ if there exists $\lambda>0$ such that for any $x_{1},x_{2}\in X$

$$f(x_{1})\geq f(x_{2})+\partial f(x_{2})^{\top}(x_{1}-x_{2})+\frac{\lambda}{2}\|x_{1}-x_{2}\|^{p}.$$

For analysis of the proposed algorithms, we need a few basic notions about convex conjugate. For an extended real-valued convex function $h:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{\infty,-\infty\}$ , the convex conjugate of $h$ is defined as

$$h^{*}(y)=\max_{x}\; y^{\top}x-h(x).$$

The convex conjugate of $h^{*}$ is $h$ . Due to the convex duality, if $h^{*}$ is $\lambda$ -strongly convex then $h$ is differentiable and is $(1/\lambda)$ -smooth. More generally, if $h^{*}$ is $p$ -uniformly convex then $h$ is differentiable and its gradient is $(L,v)$ -Hölder continuous where $v=\frac{1}{p-1}$ , $L=(\frac{1}{\lambda})^{v}$ (Nesterov, 2015).
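As a quick sanity check of this duality, consider the simplest strongly convex case (an illustrative choice, not an example from the paper): a quadratic $h^{*}$ with modulus $\lambda$, for which $p=2$, $v=1$ and $L=1/\lambda$.

```latex
h^{*}(y) = \frac{\lambda}{2}\|y\|^{2}
\;\Longrightarrow\;
h(x) = \max_{y}\; y^{\top}x - \frac{\lambda}{2}\|y\|^{2} = \frac{1}{2\lambda}\|x\|^{2},
\qquad
\nabla h(x) = \frac{x}{\lambda},
\qquad
\|\nabla h(x_{1}) - \nabla h(x_{2})\| = \frac{1}{\lambda}\|x_{1}-x_{2}\|.
```

The maximum is attained at $y=x/\lambda$, so the gradient of $h$ is exactly $(1/\lambda)$-Lipschitz, matching the stated constant.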

One of the conditions that allows us to derive a fast rate of $O(1/T)$ for a stochastic algorithm is that both $g$ and $\phi^{*}$ are strongly convex, which implies that $f(x,y)$ is strongly convex in terms of $x$ and strongly concave in terms of $y$ . One might regard this as a trivial task given the $O(1/T)$ result for stochastic strongly convex minimization where a stochastic gradient is available for the objective function to be minimized (Hazan et al., 2007; Hazan and Kale, 2011a). However, the analysis for stochastic strongly convex minimization is not directly applicable to stochastic primal-dual algorithms, as briefly explained later as we present our results.

Moreover, the strong convexity of $g$ can be relaxed to a weak strong convexity of $P$ to derive a similar order of convergence rate, i.e., for any $x\in X$ , we have

$$\text{dist}(x,X^{*})\leq c\,(P(x)-P^{*})^{1/2},$$

where $dist(x,X^{*})=\min_{z\in X^{*}}\|z-x\|_{2}$ is the distance between $x$ and the optimal set $X^{*}$ . More generally, we can consider a setting in which $P$ satisfies a local error bound (or local growth) condition as defined below.

Definition 1.

A function $P(x)$ is said to satisfy the local error bound (LEB) condition if for any $x\in\mathcal{S}_{\epsilon}$,

$$\text{dist}(x,X^{*})\leq c\,(P(x)-P^{*})^{\theta},$$

where $c>0$ is a constant, and $\theta\in[0,1]$ is a parameter.

This condition was recently studied in (Yang and Lin, 2018) for developing a faster subgradient method than the standard subgradient method, and was later considered in (Xu et al., 2017) for stochastic convex optimization. A global version of the above condition (known as the global error bound condition) has a long history in mathematical programming (Pang, 1997). However, exploiting this condition for developing stochastic primal-dual algorithms seems to be new. When $\theta=1/2$, the above condition is also referred to as weakly local strong convexity. When $\theta=0$, it can capture general convex functions as long as $\text{dist}(x,X^{*})$ is upper bounded for $x\in\mathcal{S}_{\epsilon}$, which is true if $X^{*}$ is compact or $X$ is compact.
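Three one-dimensional examples on $X=\mathbb{R}$, each with $X^{*}=\{0\}$ and $P^{*}=0$, may help fix the roles of $c$ and $\theta$. They are illustrative choices rather than examples from the paper.

```latex
P(x) = x^{2}:
  \quad \mathrm{dist}(x,X^{*}) = |x| = (P(x)-P^{*})^{1/2}
  \quad\Rightarrow\quad \theta = \tfrac{1}{2},\; c = 1;
\\[4pt]
P(x) = \tfrac{\mu}{2}x^{2}:
  \quad \mathrm{dist}(x,X^{*}) = |x| = \sqrt{\tfrac{2}{\mu}}\,(P(x)-P^{*})^{1/2}
  \quad\Rightarrow\quad \theta = \tfrac{1}{2},\; c = \sqrt{\tfrac{2}{\mu}};
\\[4pt]
P(x) = |x|:
  \quad \mathrm{dist}(x,X^{*}) = |x| = P(x)-P^{*}
  \quad\Rightarrow\quad \theta = 1,\; c = 1.
```

The second case is the one-dimensional version of the bound $\|x-x^{*}\|^{2}\leq\frac{2}{\mu}(P(x)-P^{*})$ implied by $\mu$-strong convexity, while the third shows that a sharp, non-smooth objective attains the largest exponent $\theta=1$.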

In parallel with the relaxed condition on $P$, we can also relax the smoothness condition on $\phi$ (or the strong convexity condition on $\phi^{*}$) to a Hölder continuous gradient condition on $\phi$ (or a uniform convexity condition on $\phi^{*}$). Under the local error bound condition of $P$ and the Hölder continuous gradient condition of $\phi$, we are able to develop stochastic primal-dual algorithms with intermediate complexity depending on $\theta$ and $v$, which varies from $O(1/\epsilon^{2})$ to $O(\log(1/\epsilon))$.

Formally, we will develop stochastic primal-dual algorithms for solving (3) under the following assumptions.

Assumption 1.

For Problem (3), we assume

(1) There exist $x_{0}\in X$ and $\epsilon_{0}>0$ such that $P(x_{0})-P^{*}\leq\epsilon_{0}$;

(2) Let $\nabla_{x}f(x,y;\xi)$ and $\nabla_{y}f(x,y;\xi)$ denote the stochastic subgradients of $f(x,y)$ w.r.t. $x$ and $y$, respectively. There exist constants $M\geq 0$ and $B\geq 0$ such that $\|\nabla_{x}f(x,y;\xi)\|\leq M$ and $\|\nabla_{y}f(x,y;\xi)\|\leq B$;

(3) $\phi^{*}(\cdot)$ is $p$-uniformly convex with $\lambda_{\phi}>0$ such that $\phi$ has an $(L,v)$-Hölder continuous gradient, where $v=\frac{1}{p-1}$ and $L=(1/\lambda_{\phi})^{v}$;

(4) $\ell(x)$ is $G$-Lipschitz continuous for $x\in X$;

(5) One of the following conditions holds: (i) $P(x)$ is $\mu$-strongly convex; (ii) $P(x)$ satisfies the LEB condition for $c>0$ and $\theta\in(0,1]$.

Remark. Assumption 1 (1) assumes that there is a lower bound of $P^{*}$, which is usually satisfied in machine learning problems. Assumption 1 (2) is a common assumption in existing stochastic methods. Note that we do not assume $g$ and $\phi^{*}$ have an efficient proximal mapping. Instead, we only require a stochastic subgradient of $g$ and $\phi^{*}$. Assumption 1 (3) is a general condition which unifies both smooth and non-smooth assumptions on $\phi$. When $v=1$, $\phi(\cdot)$ satisfies the classical smoothness condition with parameter $L$. When $v=0$, it is the classical non-smooth assumption on the boundedness of the subgradients. We will state our convergence results in terms of $v$ and $L$ instead of $p$ and $\lambda_{\phi}$. Assumption 1 (4) on the Lipschitz continuity of $\ell(x)$ is more general than assuming a bilinear form $\ell(x)=Ax+b$. Finally, we note that assuming the strong convexity of $P(x)$ allows us to develop a stochastic primal-dual algorithm with simpler updates.

4 Main Results

In this section, we present our main results for solving (3). Our development is divided into three parts. First, we present a stochastic primal-dual algorithm and its convergence result when the primal objective function $P(x)$ is strongly convex and $\phi^{*}$ is also strongly convex. Then we extend the result to a more general case, i.e., $P(x)$ satisfies the LEB condition and $\phi^{*}$ is uniformly convex. Lastly, we propose an adaptive variant with the same order of convergence when the value of the parameter $c$ in the LEB condition is unknown, which is also useful for tackling problems without knowing the value of $\theta$. In all cases, we assume $P(x_{0})-P^{*}\leq\epsilon_{0}$.

4.1 Restarted Stochastic Primal-Dual Algorithm for Strongly Convex $P$

The detailed updates of the proposed stochastic algorithm for strongly convex $P$ are presented in Algorithm 1, which we refer to as the restarted stochastic primal-dual algorithm, or RSPD$^{\text{sc}}$ for short. The algorithm is based on a restarting idea that has been widely used in existing studies (Hazan and Kale, 2011b; Ghadimi and Lan, 2013; Xu et al., 2017; Yang and Lin, 2018). It runs epoch-wise and has two loops. Steps 3-7 are the standard updates of the stochastic primal-dual subgradient method (Nemirovski et al., 2009). However, the key difference from these previous studies is that the restarted solution for the dual variable $y$ for the next epoch $s+1$ is computed based on the averaged primal variable for the $s$-th epoch. It is this step that exploits the strong convexity of $\phi^{*}$, which together with the restarting scheme allows us to exploit the strong convexity of $P$ to derive a fast convergence rate of $O(1/T)$, with $T$ being the total number of iterations.
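The epoch structure described above can be sketched on a toy problem. Everything specific below is an assumption made for illustration and is not taken from Algorithm 1: the losses $\ell_{i}(x)=|a_{i}^{\top}x-b_{i}|$, the choice $\phi^{*}(y)=\frac{\lambda}{2}\|y-y_{\text{ref}}\|^{2}$ on the simplex, $g(x)=\frac{\mu}{2}\|x\|^{2}$ on a Euclidean ball, the sampling scheme for the stochastic subgradients, and the schedule that doubles the epoch length and halves the step sizes. Only the restart rule (averaged primal iterate, then a deterministic dual update at that point) follows the description in the text.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def losses(x, A, b):
    return np.abs(A @ x - b)                       # toy Lipschitz, non-smooth losses

def dual_restart(x, A, b, y_ref, lam):
    """Deterministic update A(x) for phi*(y) = lam/2 * ||y - y_ref||^2 on the simplex."""
    return project_simplex(y_ref + losses(x, A, b) / lam)

def primal_objective(x, A, b, y_ref, lam, mu):
    y = dual_restart(x, A, b, y_ref, lam)
    return y @ losses(x, A, b) - 0.5 * lam * np.sum((y - y_ref) ** 2) + 0.5 * mu * (x @ x)

def rspd_sketch(A, b, y_ref, lam, mu, epochs, T1, eta_x, eta_y, radius, rng):
    n, d = A.shape
    x0 = np.zeros(d)
    y0 = dual_restart(x0, A, b, y_ref, lam)
    T = T1
    for _ in range(epochs):
        x, y = x0.copy(), y0.copy()
        x_sum = np.zeros(d)
        for _ in range(T):
            x_sum += x                             # average x_0, ..., x_{T-1}
            i = rng.choice(n, p=y)                 # i ~ y gives an unbiased primal subgradient
            gx = np.sign(A[i] @ x - b[i]) * A[i] + mu * x
            j = rng.integers(n)                    # uniform j gives an unbiased dual gradient
            gy = -lam * (y - y_ref)
            gy[j] += n * abs(A[j] @ x - b[j])
            x = x - eta_x * gx                     # primal descent step
            norm = np.linalg.norm(x)
            if norm > radius:
                x *= radius / norm                 # projection onto the ball X
            y = project_simplex(y + eta_y * gy)    # dual ascent step
        x0 = x_sum / T                             # restart primal at the epoch average
        y0 = dual_restart(x0, A, b, y_ref, lam)    # restart dual deterministically
        eta_x, eta_y, T = eta_x / 2, eta_y / 2, 2 * T   # assumed schedule
    return x0

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = A @ np.ones(5)
y_ref = np.full(20, 1.0 / 20)
x_hat = rspd_sketch(A, b, y_ref, lam=1.0, mu=0.1, epochs=4, T1=200,
                    eta_x=0.1, eta_y=0.01, radius=10.0, rng=rng)
p_start = primal_objective(np.zeros(5), A, b, y_ref, 1.0, 0.1)
p_end = primal_objective(x_hat, A, b, y_ref, 1.0, 0.1)
```

Note that the only $O(n)$ work per epoch is the call to `dual_restart`, so the number of deterministic dual updates equals the number of epochs.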

Below, we briefly discuss the path for proving the fast convergence rate of RSPD. We first show why the standard analysis for strongly convex minimization cannot be generalized to the stochastic convex-concave problem to derive the fast convergence rate of $O(1/T)$. Let $\nabla_{x,t}=\nabla_{x}f(x_{t},y_{t};\xi_{t})$ and similarly for $\nabla_{y,t}$. A standard convergence analysis for the inner loop (steps 3-6) of Algorithm 1 usually starts from the following inequalities.

Lemma 1.

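For the updates in Steps 4 and 5, omitting the subscript $s$, the following holds for any $x\in X$, $y\in Y$:

$$\nabla_{x,t}^{\top}(x_{t}-x)\leq\frac{\|x_{t}-x\|^{2}-\|x_{t+1}-x\|^{2}}{2\eta_{x}}+\frac{\eta_{x}M^{2}}{2} \tag{8}$$

$$\nabla_{y,t}^{\top}(y-y_{t})\leq\frac{\|y_{t}-y\|^{2}-\|y_{t+1}-y\|^{2}}{2\eta_{y}}+\frac{\eta_{y}B^{2}}{2}. \tag{9}$$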

For stochastic strongly convex minimization problems in which $y$ is absent in the above inequalities, one can take expectation over (8) and then apply the $\lambda$ -strong convexity of $f(x)$ to get the following inequality

[TABLE]

Based on the above inequalities for all $t=1,\ldots,T$ , one can design a particular scheme of step size $\eta_{x,t}=1/(\lambda t)$ that allows us to derive $\widetilde{O}(1/T)$ convergence rate. However, such analysis cannot be extended to the primal-dual case.
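The effect of the step size $\eta_{x,t}=1/(\lambda t)$ in the pure minimization setting can be seen on a toy quadratic. The sketch below assumes $f(x)=\frac{\lambda}{2}\|x-x^{\text{opt}}\|^{2}$ with additive Gaussian gradient noise; the objective, the noise model, and all names are hypothetical. For this particular objective the recursion collapses to $x_{T+1}-x^{\text{opt}}=-\frac{1}{\lambda T}\sum_{t=1}^{T}\xi_{t}$, so the expected objective gap decays in proportion to $1/T$.

```python
import numpy as np

def sgd_strongly_convex(lam, x_opt, T, sigma, rng):
    """SGD on f(x) = lam/2 * ||x - x_opt||^2 with step size 1 / (lam * t)."""
    x = np.zeros_like(x_opt)
    for t in range(1, T + 1):
        grad = lam * (x - x_opt) + sigma * rng.normal(size=x.shape)  # noisy gradient
        x = x - grad / (lam * t)
    return x

def mean_gap(T, runs=200, lam=2.0, sigma=1.0, seed=0):
    """Average objective gap f(x_T) - f(x_opt) over independent runs."""
    rng = np.random.default_rng(seed)
    x_opt = np.ones(3)
    gaps = []
    for _ in range(runs):
        x = sgd_strongly_convex(lam, x_opt, T, sigma, rng)
        gaps.append(0.5 * lam * np.sum((x - x_opt) ** 2))
    return float(np.mean(gaps))

gap_100 = mean_gap(100)
gap_1000 = mean_gap(1000)
```

Increasing the horizon tenfold should shrink the averaged gap by roughly a factor of ten, up to sampling noise, which is the $1/T$ behaviour that the primal-dual analysis cannot reproduce directly.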

A naive approach would be taking expectation for both (8) and (9) for a fixed $x,y$ and applying the $\lambda_{x}$ -strong convexity (resp. $\lambda_{y}$ -strong concavity) of $f(x,y)$ in terms of $x$ (resp. $y$ ), which yields the following inequalities

[TABLE]

It is notable that in deriving the above inequalities, $x$ and $y$ have to be independent of $\xi_{1},\ldots,\xi_{T}$ .

By adding the above inequalities together and applying the same analysis for the R.H.S with $\eta_{x,t}=1/(\lambda_{x}t)$ and $\eta_{y,t}=1/(\lambda_{y}t)$ , we can obtain the following inequalities for any fixed $y\in Y$ and $x\in X$ independent of $\xi_{1},\ldots,\xi_{T}$ :

[TABLE]

where $\hat{x}_{T}=\sum_{t=0}^{T-1}x_{t}/T$ and $\hat{y}_{T}=\sum_{t=0}^{T-1}y_{t}/T$. However, the above inequality does not imply convergence of the standard primal-dual gap $\max_{x\in X,y\in Y}(f(\hat{x}_{T},y)-f(x,\hat{y}_{T}))$ or even of the primal objective gap $P(\hat{x}_{T})-\min_{x\in X}P(x)$. The main obstacle is that we cannot set $y=\arg\max_{y\in Y}f(\hat{x}_{T},y)$, since this would make $y$ depend on $\xi_{1},\ldots,\xi_{T}$ and hence invalidate the analysis in expectation. It is worth noting that, following (Gidel et al., 2016), one could upper bound the primal-dual gap of $(\hat{x}_{T},\hat{y}_{T})$ by $\max_{y\in Y}f(\hat{x}_{T},y)-\min_{x\in X}f(x,\hat{y}_{T})\leq\sqrt{2}P_{\mathcal{L}}\sqrt{f(\hat{x}_{T},y^{*})-f(x^{*},\hat{y}_{T})}$ (see Equations (5), (13) and (14) therein), where $P_{\mathcal{L}}$ can be upper bounded by a constant and $y^{*}\in\arg\max_{y\in Y}f(x^{*},y)$. Even if one sets $x=x^{*}$ and $y=y^{*}$ in (10), the convergence rate of the primal-dual gap is only $O(\sqrt{\log(T)/T})$, which is not what we pursue.

Another approach that gets around the issue introduced by taking the expectation is to use high probability analysis. To this end, one can use concentration inequalities to bound the martingale difference sequences $\sum_{t=1}^{T}(\nabla_{x}f(x_{t},y_{t};\xi_{t})-\nabla_{x}f(x_{t},y_{t}))^{\top}(x_{t}-x)$ and $\sum_{t=1}^{T}(\nabla_{y}f(x_{t},y_{t};\xi_{t})-\nabla_{y}f(x_{t},y_{t}))^{\top}(y-y_{t})$ for a fixed $x$ and $y$ (Kakade and Tewari, 2008). However, in order to bound the primal objective gap $P(\hat{x}_{T})-P^{*}$, one has to bound the latter martingale difference sequence for any possible $y\in Y$ so that one can get $P(\hat{x}_{T})$ from $\max_{y\in Y}f(\hat{x}_{T},y)$. A standard approach for achieving this high probability bound is to use a covering number argument for the set $Y$. However, this will inevitably introduce dependence on the dimensionality of $y$. For example, an $\epsilon$-cover of a bounded ball of radius $R$ in $\mathbb{R}^{n}$ has cardinality $O((R/\epsilon)^{n})$, and an $\epsilon$-cover of a simplex in $\mathbb{R}^{n}$ has cardinality $O((1/\epsilon)^{n-1})$.

To tackle the aforementioned challenges for both the expectation analysis and the high probability analysis, we develop a different analysis for the proposed RSPD algorithm in order to achieve a faster convergence rate of $O(1/T)$ without explicit dependence on the dimensionality of $y$ . In this subsection, we focus on the convergence result in expectation, which will be extended to high probability convergence in the next subsection. Our expectation analysis is built on the following lemma, which is used to derive the $O(1/\sqrt{T})$ convergence rate in the literature (Nemirovski et al., 2009).

Lemma 2.

*Let Lines 4 and 5 of Algorithm 1 run for $T$ iterations with fixed step sizes $\eta_{x}$ and $\eta_{y}$ . Then*

[TABLE]

where $\bar{x}_{T}=\sum_{t=0}^{T-1}x_{t}/T$ , $\bar{y}_{T}=\sum_{t=0}^{T-1}y_{t}/T$ , $\hat{y}_{T}=\arg\max_{y\in Y}f(\bar{x}_{T},y)$ and $x^{*}\in X^{*}$ .

Remark: A nice property of the above result is that the max over $y$ in the L.H.S is taken before expectation.

Nevertheless, a simple approach of setting the step sizes as $O(1/\sqrt{T})$ still yields only a convergence rate of $O(1/\sqrt{T})$ , assuming the size of $Y$ is bounded (Nemirovski et al., 2009). The proposed RSPD algorithm has a special design for computing the restarted solutions and setting the step sizes, which together allow us to achieve an $O(1/T)$ convergence rate as stated in the following theorem. The key idea is that by using $y_{0}^{(s+1)}=\mathcal{A}(x_{0}^{(s+1)})$ as a restarted point for the dual variable, we are able to connect $\|\hat{y}_{T}-y_{0}\|$ to $P(x_{0}^{(s)})-P^{*}$ by using the strong convexity of $P$ and of $\phi^{*}$ . The convergence result of RSPD ${}^{\text{sc}}$ is presented below.

Theorem 2.

Suppose that Assumption 1 holds with $v=1$ and $P(x)$ is $\mu$ -strongly convex. By setting $S=\lceil\log(\frac{\epsilon_{0}}{\epsilon})\rceil$ and $T_{1}=\frac{\max\{405M^{2},810L^{2}G^{2}B^{2}\}}{\mu\epsilon_{0}}$ , Algorithm 1 guarantees that $\mathrm{E}[P({\bar{x}}_{S})-P^{*}]\leq\epsilon$ . The total number of iterations is $O(\frac{1}{\mu\epsilon})$ .

Remark. The equivalent convergence rate of the above result is $O(1/(\mu T))$ given a total number of iterations $T$ . This matches the state-of-the-art convergence result for stochastic strongly convex minimization (Hazan and Kale, 2011b). Our algorithm can be applied to solving (2) for non-decomposable $\phi$ . In contrast to the standard stochastic primal-dual subgradient method, the additional computational overhead in RSPD ${}^{\text{sc}}$ is introduced by computing the restarted points $y_{0}^{(s+1)}=\mathcal{A}(x_{0}^{(s+1)})$ . However, such computation only happens a logarithmic number of times, i.e., $O(\log(1/\epsilon))$ times. We defer the discussion on the total time complexity of RSPD to the next section for some particular applications.

Proof.

To prove Theorem 2, we first need Lemma 2. Its proof will be given in Appendix B.

Let $\epsilon_{s}=\frac{\epsilon_{s-1}}{2}$ . By the setting of Algorithm 1, we know $\eta_{x,s+1}=\frac{2\epsilon_{s}}{45M^{2}}$ , $\eta_{y,s+1}=\frac{2\epsilon_{s}}{45B^{2}}$ , and $x_{0}^{(s+1)}={\bar{x}}_{s}=\frac{1}{T_{s}}\sum_{t=1}^{T_{s}}x_{t}^{(s)}$ for $s=0,1,\dots$ . We will show $\mathrm{E}[P(x_{0}^{(s+1)})]-P^{*}\leq\epsilon_{s}$ by induction for $s=0,1,\dots$ . It is easy to verify $\mathrm{E}[P(x_{0}^{(1)})]-P^{*}\leq\epsilon_{0}$ for a sufficiently large $\epsilon_{0}$ according to Assumption 1. Next, we need to show that, conditional on $\mathrm{E}[P(x_{0}^{(s)})]-P^{*}\leq\epsilon_{s-1}$ , we have

[TABLE]

Consider the update of the $s$ -th stage. By Lemma 2 applied to the $s$ -th stage, we have

[TABLE]

Since $P({\bar{x}}_{s})=f({\bar{x}}_{s},\hat{y}({\bar{x}}_{s}))$ and $P(x^{*})=\max_{y\in Y}f(x^{*},y)\geq f(x^{*},{\bar{y}}_{s})$ , we have

[TABLE]

For the first term on the RHS of (12), by the strong convexity of $P(x)$ and the condition $\mathrm{E}[P(x_{0}^{(s)})]-P^{*}\leq\epsilon_{s-1}$ we have

[TABLE]

For the second term on the RHS of (12),

[TABLE]

where the first equality is due to the setup of the algorithm and Lemma 5, and the second equality is due to the smoothness of $\phi(\cdot)$ ( $v=1$ ). Since $P(x)$ is strongly convex with parameter $\mu>0$ , its optimal solution $x_{*}$ is unique, and we have

[TABLE]

Then the inequality (12) becomes

[TABLE]

By the setting of $\eta_{x,s}=\frac{2\epsilon_{s-1}}{45M^{2}}$ , $\eta_{y,s}=\frac{2\epsilon_{s-1}}{45B^{2}}$ and $T_{s}=\frac{\max\{405M^{2},810L^{2}G^{2}B^{2}\}}{\mu\epsilon_{s-1}}$ , we know $\frac{\frac{4L^{2}G^{2}}{\mu}}{\eta_{y,s}T_{s}}\leq\frac{1}{9}$ , and then

[TABLE]

Therefore, by induction, after running $S=\lceil\log(\frac{\epsilon_{0}}{\epsilon})\rceil$ stages, we have

[TABLE]

The total iteration complexity is $\sum_{s=1}^{S}T_{s}=O(\frac{1}{\mu\epsilon})$ . ∎
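The stage-wise parameters used in this proof can be tabulated with a short script. This is only a sketch: the function name `rspd_sc_schedule` and the dictionary layout are hypothetical, the logarithm in $S$ is assumed to be base 2 (matching the halving $\epsilon_{s}=\epsilon_{s-1}/2$ ), and $T_{s}$ is rounded up to an integer.

```python
import math


def rspd_sc_schedule(eps0, eps, mu, M, B, L, G):
    """Per-stage parameters following the proof of Theorem 2.

    Assumption: log is taken base 2, consistent with eps_s = eps_{s-1} / 2.
    """
    S = math.ceil(math.log2(eps0 / eps))
    C = max(405 * M**2, 810 * L**2 * G**2 * B**2)
    stages = []
    eps_prev = eps0  # plays the role of eps_{s-1} for the current stage s
    for s in range(1, S + 1):
        stages.append({
            "stage": s,
            "T": math.ceil(C / (mu * eps_prev)),   # T_s
            "eta_x": 2 * eps_prev / (45 * M**2),   # eta_{x,s}
            "eta_y": 2 * eps_prev / (45 * B**2),   # eta_{y,s}
            "target": eps_prev / 2,                # eps_s
        })
        eps_prev /= 2
    return stages


if __name__ == "__main__":
    for st in rspd_sc_schedule(1.0, 0.01, 0.5, 1.0, 1.0, 1.0, 1.0):
        print(st)
```

Because $T_{s}$ doubles while $\epsilon_{s-1}$ halves, the sum $\sum_{s}T_{s}$ is a geometric series dominated by its last term, which is how the $O(1/(\mu\epsilon))$ total arises.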

4.2 RSPD Algorithm under the LEB condition

In the previous subsection, we introduced the RSPD ${}^{\text{sc}}$ algorithm for solving problem (1) when the objective function $P(x)$ is strongly convex and $\phi(\cdot)$ is $L$ -smooth. However, these conditions are sometimes too strong for many machine learning problems. In this subsection, we relax these strong conditions by assuming that $P(x)$ satisfies the LEB condition (7) and $\phi(\cdot)$ has $(L,v)$ -Hölder continuous gradient with $v\in[0,1]$ . We develop a different variant of RSPD that also has a high probability convergence guarantee.

Denote by $\mathcal{B}_{x}(x_{0},R)=\{x\in X:\|x-x_{0}\|\leq R\}$ a ball centered at $x_{0}$ with radius $R$ intersected with $X$ , and similarly by $\mathcal{B}_{y}(y_{0},R)=\{y\in Y:\|y-y_{0}\|\leq R\}$ a ball centered at $y_{0}$ with radius $R$ intersected with $Y$ . The second variant of the RSPD algorithm for solving problem (1) is summarized in Algorithm 2. It is similar to the RSPD ${}^{\text{sc}}$ algorithm, except that the iterates are projected onto bounded balls centered at the initial solutions of each epoch. This complication of the updates is introduced for the purpose of the high probability analysis, and it also allows us to tackle problems that satisfy the LEB condition with $\theta>1/2$ . After each epoch, the proposed RSPD algorithm reduces the radius of the Euclidean ball. This ball shrinkage technique is not new and has already been used in the Epoch-SGD method (Hazan and Kale, 2011b) for high probability bound analysis. We set the same initial radius for the primal variable $x$ and the dual variable $y$ in the RSPD algorithm for convenience of analysis. One can use different values, and the same order of convergence will be obtained by changing the analysis slightly. Another feature of RSPD that differs from RSPD ${}^{\text{sc}}$ is that RSPD uses a constant number of iterations in the inner loop in order to accommodate the local error bound condition.

We summarize the theoretical result of Algorithm 2 with a high probability bound in the following theorem.

Theorem 3.

*Suppose that Assumption 1 holds and $P(x)$ obeys the LEB condition (7). Given $\delta\in(0,1)$ , let $S=\lceil\log(\frac{\epsilon_{0}}{\epsilon})\rceil$ , $\tilde{\delta}=\delta/S$ , $R_{1}=O(\frac{c\epsilon_{0}}{\epsilon^{1-\theta}})$ and*

[TABLE]

Algorithm 2 guarantees that $P({\bar{x}}_{S})-P^{*}\leq 2\epsilon$ with at least probability $1-\delta$ . The total number of iterations is $\widetilde{O}(\frac{1}{\epsilon^{2(1-v\theta)}})$ , where $\widetilde{O}$ suppresses a logarithmic factor.

Remark. When $v\theta>0$ , RSPD enjoys a faster convergence rate than $O(1/\sqrt{T})$ . When $v=1$ (i.e., $\phi(\cdot)$ is smooth), if $\theta=\frac{1}{2}$ (e.g., $P(x)$ is (weakly) strongly convex), then RSPD enjoys an iteration complexity of $O(\log(1/\epsilon)/\epsilon)$ , which is worse only by a logarithmic factor than the convergence result in expectation in Theorem 2 for strongly convex $P$ . When $v=0$ or $\theta=0$ (i.e., $\phi$ is non-differentiable with no Hölder continuous gradient or $P$ does not obey the error bound condition), the convergence rate reduces to the standard $\widetilde{O}(1/\sqrt{T})$ .
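The cases discussed in this remark all follow from the exponent $2(1-v\theta)$ in Theorem 3. As a small illustration (the helper name `rspd_complexity_exponent` is hypothetical), the exponent can be evaluated for the regimes mentioned above:

```python
def rspd_complexity_exponent(v, theta):
    """Exponent p in the iteration complexity ~O(1 / eps^p) of Theorem 3.

    v is the Hoelder exponent of the gradient of phi, theta the LEB exponent.
    """
    if not (0.0 <= v <= 1.0 and 0.0 <= theta <= 1.0):
        raise ValueError("v and theta must both lie in [0, 1]")
    return 2.0 * (1.0 - v * theta)


if __name__ == "__main__":
    # Smooth phi with theta = 1/2: exponent 1, i.e. ~O(1/eps).
    print(rspd_complexity_exponent(1.0, 0.5))
    # v = 0 or theta = 0: exponent 2, i.e. the standard ~O(1/eps^2).
    print(rspd_complexity_exponent(0.0, 0.5))
    print(rspd_complexity_exponent(1.0, 0.0))
```

An exponent $p>0$ corresponds to an error decaying like $T^{-1/p}$ up to logarithmic factors, so $p=2$ recovers $\widetilde{O}(1/\sqrt{T})$ and $p=1$ gives $\widetilde{O}(1/T)$ .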

Proof.

To prove Theorem 3, we first present the following two lemmas. The first one presents Azuma’s inequality, which handles martingale difference sequences. The second one analyzes the behaviour of the update within a stage of Algorithm 2. The proof of Lemma 4 is in Appendix C.

Lemma 3.

(Azuma’s inequality) Let $X_{1},...,X_{T}$ be a martingale difference sequence. Suppose that $|X_{t}|\leq b$ . Then for $\delta>0$ we have

[TABLE]
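As a numerical sanity check, the sketch below simulates a simple martingale difference sequence and counts how often the sum exceeds the Azuma-type bound. It assumes the lemma takes the standard form $\sum_{t=1}^{T}X_{t}\leq b\sqrt{2T\log(1/\delta)}$ with probability at least $1-\delta$ , which is consistent with the deviation terms $(e)$ and $(f)$ used later in the proof; the function names and the choice of independent $\pm b$ increments are illustrative only.

```python
import math
import random


def azuma_bound(b, T, delta):
    """Standard Azuma-Hoeffding deviation bound b * sqrt(2 T log(1/delta))."""
    return b * math.sqrt(2.0 * T * math.log(1.0 / delta))


def empirical_violation_rate(b, T, delta, trials, seed=0):
    """Fraction of simulated runs whose sum exceeds the bound.

    Increments are independent +b / -b coin flips, a special case of a
    bounded martingale difference sequence.
    """
    rng = random.Random(seed)
    bound = azuma_bound(b, T, delta)
    violations = 0
    for _ in range(trials):
        total = sum(b if rng.random() < 0.5 else -b for _ in range(T))
        if total > bound:
            violations += 1
    return violations / trials


if __name__ == "__main__":
    print(azuma_bound(1.0, 200, 0.1))
    print(empirical_violation_rate(1.0, 200, 0.1, trials=2000))
```

The observed violation frequency should stay below $\delta$ ; for coin flips the bound is typically quite loose.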

Lemma 4.

Let Lines 4, 5, and 6 of Algorithm 2 run for $T$ iterations with fixed step sizes $\eta_{x}$ and $\eta_{y}$ , starting from $x_{0}$ and $y_{0}$ . Then with probability at least $1-\tilde{\delta}$ , where $\tilde{\delta}\in(0,1)$ , we have

[TABLE]

where $\bar{x}_{T}=\sum_{t=0}^{T-1}x_{t}/T$ , $\bar{y}_{T}=\sum_{t=0}^{T-1}y_{t}/T$ , $\hat{y}_{T}=\arg\max_{y\in Y\cap\mathcal{B}(y_{0},R_{y})}f(\bar{x}_{T},y)$ and any fixed $x\in X\cap\mathcal{B}(x_{0},R_{x})$ .

Now we proceed to the proof of Theorem 3. Let $\epsilon_{s}=\frac{\epsilon_{s-1}}{2}$ . By the setting of Algorithm 2, we know $R_{s+1}=\frac{R_{1}}{2^{s}}\geq\frac{c\epsilon_{s}}{\epsilon^{1-\theta}}$ , $\eta_{x,s+1}=\frac{\epsilon_{s}}{40M^{2}}$ , $\eta_{y,s+1}=\frac{\epsilon_{s}}{40B^{2}}$ , and $x_{0}^{(s+1)}={\bar{x}}_{s}=\frac{1}{T}\sum_{t=1}^{T}x_{t}^{(s)}$ for $s=0,1,\dots$ . We will show $P(x_{0}^{(s+1)})-P^{*}\leq\epsilon_{s}+\epsilon$ by induction for $s=0,1,\dots$ with a high probability. It is easy to verify $P(x_{0}^{(1)})-P^{*}\leq\epsilon_{0}+\epsilon$ for a sufficiently large $\epsilon_{0}$ according to Assumption 1. Next, we need to show that, conditional on $P(x_{0}^{(s)})-P^{*}\leq\epsilon_{s-1}+\epsilon$ , we have

[TABLE]

with a high probability.

Consider the update of the $s$ -th stage. Define $\hat{y}({\bar{x}}_{s})=\arg\max_{y\in Y}f({\bar{x}}_{s},y)$ and $x_{0,\epsilon}^{(s),\dagger}=\arg\min_{x\in\mathcal{S}_{\epsilon}}\|x-x_{0}^{(s)}\|$ . We would like to show that both $\|x_{0,\epsilon}^{(s),\dagger}-x_{0}^{(s)}\|\leq R_{x}$ and $\|\hat{y}({\bar{x}}_{s})-y_{0}^{(s)}\|\leq R_{y}$ always hold, so that we are able to plug $x=x_{0,\epsilon}^{(s),\dagger}$ and $y=\hat{y}({\bar{x}}_{s})$ into (4) in Lemma 4. To this end, we have for $x_{0,\epsilon}^{(s),\dagger}$ ,

[TABLE]

where the first inequality is due to Lemma 4 in (Yang and Lin, 2018), the second inequality is due to (7) and the third inequality is due to $x_{0,\epsilon}^{(s),\dagger}\in\mathcal{S}_{\epsilon}$ .

For $\hat{y}({\bar{x}}_{s})$ , we have

[TABLE]

where the first equality is due to the set up of the algorithm and Lemma 5, the first inequality is due to the $(L,v)$ -Hölder continuous gradients of $\phi$ (Assumption 1 (3)), the second inequality is due to $G$ -Lipschitz continuity of $\ell$ (Assumption 1 (4)), and the last equality is due to the setting of $R_{y,s}=LG^{v}R_{x,s}^{v}$ .
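The chain of steps just described can be written out as follows. This is a hedged reconstruction rather than the original display: it assumes that Lemma 5 gives $\mathcal{A}(x)=\nabla\phi(\ell(x))$ , that $\hat{y}({\bar{x}}_{s})=\mathcal{A}({\bar{x}}_{s})$ and $y_{0}^{(s)}=\mathcal{A}(x_{0}^{(s)})$ , and that ${\bar{x}}_{s}$ lies in the ball of radius $R_{x,s}$ around $x_{0}^{(s)}$ because it is an average of projected iterates.

```latex
\|\hat{y}(\bar{x}_{s})-y_{0}^{(s)}\|
  = \|\nabla\phi(\ell(\bar{x}_{s}))-\nabla\phi(\ell(x_{0}^{(s)}))\|   % setup of the algorithm and Lemma 5
  \le L\,\|\ell(\bar{x}_{s})-\ell(x_{0}^{(s)})\|^{v}                   % (L,v)-Hoelder continuous gradient of \phi
  \le L\,G^{v}\,R_{x,s}^{v}                                            % G-Lipschitz \ell and \|\bar{x}_{s}-x_{0}^{(s)}\|\le R_{x,s}
  = R_{y,s} .                                                          % choice R_{y,s} = L G^{v} R_{x,s}^{v}
```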

By showing that $\|x_{0,\epsilon}^{(s),\dagger}-x_{0}^{(s)}\|\leq R_{x}$ and $\|\hat{y}({\bar{x}}_{s})-y_{0}^{(s)}\|\leq R_{y}$ , we then plug in $x=x_{0,\epsilon}^{(s),\dagger}$ and $y=\hat{y}({\bar{x}}_{s})$ into (4) in Lemma 4 as follows

[TABLE]

Finally, we would like to show $P(\bar{x}_{s})-P(x_{0,\epsilon}^{(s),\dagger})\leq\epsilon_{s}=\frac{\epsilon_{s-1}}{2}$ by properly setting the values of $T$ , $\eta_{x,s}$ , $\eta_{y,s}$ , $R_{x,s}$ and $R_{y,s}$ .

First, to make $\frac{5\eta_{x,s}M^{2}}{2}=\frac{\epsilon_{s-1}}{16}$ in term $(c)$ and $\frac{5\eta_{y,s}B^{2}}{2}=\frac{\epsilon_{s-1}}{16}$ in term $(d)$ , we set $\eta_{x,s}=\frac{\epsilon_{s-1}}{40M^{2}}$ and $\eta_{y,s}=\frac{\epsilon_{s-1}}{40B^{2}}$ , respectively. Recalling that $\epsilon_{s}=\frac{\epsilon_{s-1}}{2}$ , this requires $\eta_{x,s+1}=\frac{\eta_{x,s}}{2}$ and $\eta_{y,s+1}=\frac{\eta_{y,s}}{2}$ , as in Lines 5 and 6 of Algorithm 2. Next, we plug $\eta_{x,s}$ and $\eta_{y,s}$ into terms $(a)$ and $(b)$ . By setting $T\geq\max\{\frac{320M^{2}R_{x,s}^{2}}{\epsilon_{s-1}^{2}},\frac{320B^{2}L^{2}G^{2v}R_{x,s}^{2v}}{\epsilon_{s-1}^{2}}\}$ , we have

[TABLE]

Then, for $(e)$ , by setting $T\geq\frac{8192\log(\frac{1}{\tilde{\delta}})M^{2}R_{x,s}^{2}}{\epsilon_{s-1}^{2}}$ , we have $\frac{4MR_{x,s}\sqrt{2\log\frac{1}{\tilde{\delta}}}}{\sqrt{T}}\leq\frac{\epsilon_{s-1}}{16}$ . Last, for $(f)$ , by setting $T\geq\frac{8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2v}R_{x,s}^{2v}}{\epsilon_{s-1}^{2}}$ , we have $\frac{4BLG^{v}R_{x,s}^{v}\sqrt{2\log\frac{1}{\tilde{\delta}}}}{\sqrt{T}}\leq\frac{\epsilon_{s-1}}{16}$ .
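The requirements on $T$ collected so far can be checked numerically. The sketch below (function names are hypothetical) computes the smallest integer $T$ satisfying the four lower bounds stated above, using $R_{y,s}=LG^{v}R_{x,s}^{v}$ , and evaluates the two deviation terms $(e)$ and $(f)$ , each of which should then be at most $\epsilon_{s-1}/16$ .

```python
import math


def inner_iterations(eps_prev, R_x, delta_tilde, M, B, L, G, v):
    """Smallest integer T meeting the four lower bounds on T in the proof.

    eps_prev stands for eps_{s-1}; R_y = L * G^v * R_x^v as in Algorithm 2.
    """
    R_y = L * G**v * R_x**v
    log_term = math.log(1.0 / delta_tilde)
    T = max(
        320 * M**2 * R_x**2,
        320 * B**2 * R_y**2,
        8192 * log_term * M**2 * R_x**2,
        8192 * log_term * B**2 * R_y**2,
    ) / eps_prev**2
    return math.ceil(T), R_y


def deviation_terms(T, R_x, R_y, delta_tilde, M, B):
    """Terms (e) and (f): 4 M R_x sqrt(2 log(1/d)) / sqrt(T) and the y analogue."""
    root = math.sqrt(2.0 * math.log(1.0 / delta_tilde)) / math.sqrt(T)
    return 4 * M * R_x * root, 4 * B * R_y * root


if __name__ == "__main__":
    T, R_y = inner_iterations(0.5, 2.0, 0.01, 1.0, 1.0, 1.0, 1.0, 0.5)
    e, f = deviation_terms(T, 2.0, R_y, 0.01, 1.0, 1.0)
    print(T, e, f, 0.5 / 16)
```

The constant 8192 is exactly what makes $4MR_{x,s}\sqrt{2\log(1/\tilde{\delta})}/\sqrt{T}\leq\epsilon_{s-1}/16$ hold: squaring gives $T\geq 64^{2}\cdot 2\log(1/\tilde{\delta})M^{2}R_{x,s}^{2}/\epsilon_{s-1}^{2}$ .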

Therefore, we have

[TABLE]

i.e.,

[TABLE]

By induction, after running $S=\lceil\log(\frac{\epsilon_{0}}{\epsilon})\rceil$ stages, with probability $(1-\tilde{\delta})^{S}\geq 1-S\tilde{\delta}$ , we have

[TABLE]

where we set $\tilde{\delta}=\delta/S$ . Considering the requirements from (4.2), for $T$ , we have

[TABLE]

Recall that $v\in[0,1]$ , $R_{x,1}\geq\frac{c\epsilon_{0}}{\epsilon^{1-\theta}}$ , $R_{x,s}=\frac{R_{x,1}}{2^{s-1}}$ and $\epsilon_{s-1}=\frac{\epsilon_{0}}{2^{s-1}}$ . On one hand, we have

[TABLE]

On the other hand, for $s\leq\lfloor\log(\frac{\epsilon_{0}}{\epsilon})\rfloor$ , we have

[TABLE]

The above terms show that $T$ would not change as $s$ changes. Provided $R_{x,1}=O(\frac{c\epsilon_{0}}{\epsilon^{1-\theta}})$ and $R_{x,1}\geq\frac{c\epsilon_{0}}{\epsilon^{1-\theta}}$ , we have the total number of iterations is at most $ST=O\bigg{(}\frac{\lceil\log(\frac{\epsilon_{0}}{\epsilon})\rceil\lceil\log({S}/{\delta})\rceil}{\epsilon^{2(1-v\theta)}}\bigg{)}=\widetilde{O}\bigg{(}\frac{1}{\epsilon^{2(1-v\theta)}}\bigg{)}$ .
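The scaling behind this bound can be made explicit. The following is a worked sketch under the stated choices $R_{x,s}=R_{x,1}/2^{s-1}$ , $\epsilon_{s-1}=\epsilon_{0}/2^{s-1}$ and $R_{x,1}=c\epsilon_{0}/\epsilon^{1-\theta}$ , assuming the logarithm is base 2 so that $2^{s-1}\leq\epsilon_{0}/\epsilon$ for every stage considered; it is meant as a consistency check, not as the original display.

```latex
% Terms involving R_{x,s}^{2}: the ratio is the same at every stage.
\frac{R_{x,s}^{2}}{\epsilon_{s-1}^{2}}
  = \frac{R_{x,1}^{2}}{\epsilon_{0}^{2}}
  = \frac{c^{2}}{\epsilon^{2(1-\theta)}} .
% Terms involving R_{x,s}^{2v}: the ratio grows with s but stays bounded.
\frac{R_{x,s}^{2v}}{\epsilon_{s-1}^{2}}
  = \frac{R_{x,1}^{2v}}{\epsilon_{0}^{2}}\,2^{2(1-v)(s-1)}
  \le \frac{R_{x,1}^{2v}}{\epsilon_{0}^{2}}\Bigl(\frac{\epsilon_{0}}{\epsilon}\Bigr)^{2(1-v)}
  = \frac{c^{2v}}{\epsilon^{2(1-v\theta)}} .
```

Since $v\theta\leq\theta$ , for $\epsilon\leq 1$ both ratios are of order at most $1/\epsilon^{2(1-v\theta)}$ , which matches the per-stage $T$ in the total iteration count above.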

∎

4.3 Adaptive Variants of RSPD

When setting the initial value of the radius $R_{1}$ (as well as the value of $T$ ) in Algorithm 2, one needs to know $c$ , $\theta$ and $\epsilon$ (setting $R_{1}\geq\frac{c\epsilon_{0}}{\epsilon^{1-\theta}}$ ), which may not be feasible in practice. Below, we introduce an adaptive variant of Algorithm 2 that finds an $\epsilon$ -optimal solution without knowing either $c$ or $\theta$ and $\epsilon$ to initialize the algorithm, assuming $v=1$ . The developments in this section are mostly direct extensions of techniques introduced in (Xu et al., 2017; Yang and Lin, 2018).

The idea of tackling unknown $c$ is similar to the grid search: starting from a guess of $c$ for setting $R_{1},T$ to run RSPD and then restarting RSPD using a larger $c$ (increased by a constant factor) or equivalently a larger $R_{1},T$ . However, in order to not waste the updates for using a smaller $c$ and also remove the dependence on $\epsilon$ for setting $R_{1},T$ , we equivalently increase $R_{1}$ and $T$ in a way that depends on $\theta$ such that a similar convergence rate can be still established. The details are presented in Algorithm 3. The following theorem gives convergence result of Algorithm 3. Its proof is in Appendix D.
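The growth of $R_{1}$ and $T$ across successive calls can be sketched as follows. Algorithm 3 itself is not reproduced here; the factors $2^{1-\theta}$ for the radius and $2^{2(1-\theta)}$ for the number of inner iterations are taken from the proof in Appendix D (where $R_{1}^{(2)}=R_{1}^{(1)}2^{1-\theta}$ and $T_{2}=T_{1}2^{2(1-\theta)}$ ), and extending them to every call $k$ is an assumption of this sketch. The function name `arspd_schedule` is hypothetical.

```python
import math


def arspd_schedule(R1, T1, theta, K):
    """Initial radius and inner-loop length for each of K calls of RSPD.

    Assumes the radius grows by 2^(1 - theta) and T by 2^(2 (1 - theta))
    from one call to the next, as in the first two calls of Appendix D.
    """
    calls = []
    for k in range(1, K + 1):
        radius = R1 * 2.0 ** ((k - 1) * (1.0 - theta))
        iters = math.ceil(T1 * 2.0 ** (2 * (k - 1) * (1.0 - theta)))
        calls.append({"call": k, "R1": radius, "T": iters})
    return calls


if __name__ == "__main__":
    for call in arspd_schedule(R1=1.0, T1=100, theta=0.5, K=3):
        print(call)
```

With $\theta=0$ (the setting used when $\theta$ is unknown) the radius doubles and $T$ quadruples at each call, while $\theta$ close to 1 leaves both nearly unchanged.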

Theorem 4.

Suppose that Assumption 1 holds with $v=1$ , and there exists ${\hat{\epsilon}}_{1}\in(\epsilon,\epsilon_{0}/2]$ such that the initial value $R_{1}^{(1)}$ satisfies $R_{1}^{(1)}=\frac{c\epsilon_{0}}{{\hat{\epsilon}}_{1}^{1-\theta}}$ and the error bound condition holds on $\mathcal{S}_{{\hat{\epsilon}}_{1}}$ with $c>0,\theta\in(0,1)$ . For any $\delta\in(0,1)$ , $\epsilon\leq\epsilon_{0}/4$ , let $\hat{\delta}=\frac{\delta}{S(S+1)},S=\lceil\log_{2}(\frac{\epsilon_{0}}{\epsilon})\rceil$ , $\kappa=1$ , and $T_{1}=\max\bigg{\{}320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg{\}}\cdot\frac{(R_{1}^{(1)})^{2}}{\epsilon_{0}^{2}}.$ After at most $K=\lceil\log(\frac{\hat{\epsilon}_{1}}{\epsilon})\rceil+1$ calls of RSPD, Algorithm 3 guarantees that $P(x^{(K)})-P(x^{*})\leq 2\epsilon$ with probability $1-\delta$ with an iteration complexity of $\widetilde{O}(\log(1/\delta)/\epsilon^{2(1-\theta)})$ .

Remark: The requirement on the local error bound condition in the above theorem seems slightly stronger than requiring it to hold on $\mathcal{S}_{\epsilon}$ . However, for a convex function it has been shown that a local error bound condition implies an error bound condition on any compact set with the same $\theta$ but a possibly different $c$ (Bolte et al., 2015). The above theorem and Algorithm 3 do not cover the case $\theta=1$ . This can be easily resolved by setting $R_{1}=\hat{c}_{1}\epsilon_{0}$ according to an initial guess of $c$ , and then doubling $\hat{c}_{1}$ (or equivalently $R_{1}$ ) and rerunning RSPD. It is easy to see that after $\log(c/\hat{c}_{1})$ rounds the estimated value of $c$ will become larger than the true $c$ and the convergence theory in the previous subsection will apply. As a result, the total iteration complexity is only amplified by a factor of $\log(c/\hat{c}_{1})$ .

Finally, we can show that even if $\theta$ is unknown, by setting $\theta=0$ in Algorithm 3, we can still prove an improved convergence. Let $B_{\epsilon}=\max_{v\in\mathcal{L}_{\epsilon}}\min_{z\in X^{*}}||v-z||$ be the maximum distance between the points in the $\epsilon$ -level set $\mathcal{L}_{\epsilon}$ and the optimal set $X^{*}$ . Proof of the following theorem is similar to the one of Theorem 4 (in Appendix D) with slight modification.

Theorem 5.

Suppose that Assumption 1 (1 $\sim$ 4) holds with $v=1$ , and $R^{(1)}_{1}$ is sufficiently large such that there exists ${\hat{\epsilon}}_{1}\in[\epsilon,\frac{\epsilon_{0}}{2}]$ and $R^{(1)}_{1}=\frac{B_{{\hat{\epsilon}}_{1}}\epsilon_{0}}{{\hat{\epsilon}}_{1}}$ . Given $\delta\in(0,1)$ , let $\theta=0$ , $\hat{\delta}=\frac{\delta}{S(S+1)}$ , $S=\lceil\log_{2}(\frac{\epsilon_{0}}{\epsilon})\rceil$ , $T_{1}=\max\bigg{\{}320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg{\}}\cdot\frac{(R_{1}^{(1)})^{2}}{\epsilon_{0}^{2}},$ and $\kappa=1$ . After at most $K=\lceil\log(\frac{\hat{\epsilon}_{1}}{\epsilon})\rceil+1$ calls of RSPD, Algorithm 3 guarantees that $P(x^{(K)})-P(x^{*})\leq 2\epsilon$ with probability $1-\delta$ with an iteration complexity of $\widetilde{O}(\log(\frac{1}{\delta})B_{{\hat{\epsilon}}_{1}}^{2}/\epsilon^{2})$ .

Remark: This iteration complexity is still an improved one compared with that in (Nemirovski et al., 2009), reducing the dependence on the size of $X$ and $Y$ to the $B_{{\hat{\epsilon}}_{1}}$ .

5 Applications and Experiments

In this section, we investigate the effectiveness of our algorithms on two applications, i.e., distributionally robust optimization (DRO) and area under receiver operating characteristic curve (AUC) maximization. We perform DRO experiments on four benchmark datasets, a9a, real-sim, rcv1 and w8a. AUC experiments are performed on a9a, real-sim, covtype and URL. Table 1 shows the statistics of the six datasets used.

DRO. First, we consider solving the DRO (4) for binary classification as mentioned in the Introduction. We use the square distance for $V$ that was studied in (Namkoong and Duchi, 2017), i.e., $V(y,\mathbf{1}/n)=\frac{\lambda_{1}}{2}\|ny-\mathbf{1}\|_{2}^{2}$ . For the loss function, we consider the non-smooth hinge loss $\ell_{i}(x)=\max\{0,1-b_{i}x^{\top}a_{i}\}$ , where $a_{i}\in\mathbb{R}^{d}$ denotes the feature vector and $b_{i}\in\{1,-1\}$ denotes the label. We also include a regularizer $g(x)$ on the model parameter $x$ . Using different regularizers will give different properties for the primal objective function. For example, if $g(x)=\frac{\lambda_{2}}{2}\|x\|_{2}^{2}$ , then the primal objective function $P(x)$ is obviously a strongly convex function. If $g(x)=\lambda_{2}\|x\|_{1}$ , then we can prove that the primal objective function $P(x)$ is a piecewise quadratic convex function, which satisfies the LEB condition with $\theta=1/2$ . The proof is given in Appendix E. We report the result of RSPD ${}^{\text{sc}}$ for solving the problem with $g(x)=\frac{\lambda_{2}}{2}\|x\|_{2}^{2}$ here.

We compare with the baseline called Bandit Mirror Descent (BMD) algorithm considered in (Namkoong and Duchi, 2016), which has a convergence rate of $O(1/\sqrt{T})$ . The stochastic gradients are computed in the same way as in (Namkoong and Duchi, 2016). Computing the restarted dual solution $y^{(s+1)}_{0}=\mathcal{A}(x^{(s+1)}_{0})$ takes $O(nd)$ time complexity, and each update for the primal variable and the dual variable takes $O(d)$ and $O(n)$ , respectively. Therefore, the total time complexity of RSPD for finding an $\epsilon$ -optimal solution is $O(nd\log(1/\epsilon)+\frac{n+d}{\epsilon})$ . In contrast, the time complexity of BMD is $O((n+d)/\epsilon^{2})$ .
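For the squared-distance penalty $V(y,\mathbf{1}/n)=\frac{\lambda_{1}}{2}\|ny-\mathbf{1}\|_{2}^{2}$ used here, one possible way to compute the restarted dual solution is a Euclidean projection onto the simplex. This is a sketch under the assumption that $\mathcal{A}(x)=\arg\max_{y\in\Delta_{n}}\sum_{i}y_{i}\ell_{i}(x)-\frac{\lambda_{1}}{2}\|ny-\mathbf{1}\|_{2}^{2}$ ; since the penalty equals $\frac{n^{2}\lambda_{1}}{2}\|y-\mathbf{1}/n\|_{2}^{2}$ , the maximizer is the projection of $\mathbf{1}/n+\ell(x)/(n^{2}\lambda_{1})$ onto $\Delta_{n}$ . The function names are hypothetical, the projection is the standard sort-based routine, and the paper does not state which routine the authors used.

```python
def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            tau = t  # the last index satisfying the condition sets the threshold
    return [max(vi - tau, 0.0) for vi in v]


def dro_dual_restart(losses, lam1):
    """Maximizer over the simplex of sum_i y_i * loss_i - lam1/2 * ||n y - 1||^2.

    losses holds the per-example losses ell_i(x) at the restart point x;
    computing them is where the O(nd) cost mentioned above comes from.
    """
    n = len(losses)
    center = [1.0 / n + li / (n * n * lam1) for li in losses]
    return project_simplex(center)


if __name__ == "__main__":
    print(dro_dual_restart([1.0, 1.0, 1.0, 1.0], lam1=0.25))
    print(dro_dual_restart([0.0, 0.0, 10.0], lam1=1.0 / 3.0))
```

Equal losses give the uniform distribution, while a single dominant loss pushes all the weight onto that example when $\lambda_{1}$ is small.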

We conduct experiments on four datasets from the libsvm website using $\ell_{2}$ regularization for $g(x)$ . The regularization parameters are set to $\lambda_{1}=\lambda_{2}=\frac{1}{n}$ for all datasets. The initial step sizes of all algorithms are tuned in the range of $\{10^{-5:1:3}\}$ . All algorithms start with the same initial solutions $y_{0}=\frac{\textbf{1}}{n}$ and $x_{0}=\mathbf{0}$ . In implementing RSPD ${}^{\text{sc}}$ , we start with an initial $T=10^{4}$ , which is increased by a factor of $2$ at each epoch. The results of the objective gap against the number of gradients and against CPU time are shown in Figure 1 and Figure 2, respectively. It is clear that the proposed algorithm converges much faster than the baseline algorithm BMD.

AUC Maximization. Next, we consider empirical AUC maximization by solving the min-max saddle-point formulation proposed by (Ying et al., 2016):

[TABLE]

where $\mathbf{x}_{i}\in\mathbb{R}^{d},z_{i}\in\{1,-1\}$ denote the feature-label pairs of a training example, $F(\mathbf{w},a,b,\alpha;(\mathbf{x},z))=(1-p)(\mathbf{w}^{\top}\mathbf{x}-a)^{2}I_{[z=1]}+p(\mathbf{w}^{\top}\mathbf{x}-b)^{2}I_{[z=-1]}-p(1-p)\alpha^{2}+2(1+\alpha)(p\mathbf{w}^{\top}\mathbf{x}I_{[z=-1]}-(1-p)\mathbf{w}^{\top}\mathbf{x}I_{[z=1]})$ , $p$ is the proportion of positive examples, and $I_{[\cdot]}$ is the indicator function. Let ${\bf{v}}=[\mathbf{w}^{\top},a,b]^{\top}\in\mathbb{R}^{d+2}$ . In order to achieve good AUC performance, we add a ball constraint on $\mathbf{w}$ . Bounds on $(a,b)$ can be derived similarly to (Ying et al., 2016). If we use an $\ell_{1}$ ball $\|\mathbf{v}\|_{1}\leq B$ , it was shown in (Liu et al., 2018a) that the primal objective function satisfies the LEB with $\theta=1/2$ . If we use an $\ell_{2}$ ball constraint $\|\mathbf{v}\|_{2}\leq B$ , under a mild condition that $\min_{\mathbf{v}\in\mathbb{R}^{d+2}}P(\mathbf{v})<\min_{\|\mathbf{v}\|_{2}\leq B}P(\mathbf{v})$ it was shown that a LEB with $\theta=1/2$ is satisfied (Liu et al., 2018b). Then the iteration complexity of RSPD is given by $\widetilde{O}(1/\epsilon)$ . Since the dual variable is one-dimensional, computing the restarted dual solution $y^{(s+1)}_{0}$ takes $O(d)$ time, given that the averaged feature vectors for the positive and negative examples are precomputed. Hence, when LEB with $\theta=1/2$ is satisfied, the total time complexity of RSPD or ARSPD is $\widetilde{O}(d\log(1/\epsilon)+d/\epsilon)$ . We also note that SPDC (Zhang and Lin, 2015) is applicable to the AUC task, but it does not give a linear rate for the considered AUC problem, because there is no strong convexity in the primal variable as required for achieving a linear rate. If a small strongly convex regularizer is added on the primal variable, the total time complexity of SPDC is $O(nd^{2}+d^{2}/\sqrt{\epsilon})$ , since every iteration needs to solve a linear system (i.e., the proximal mapping of the quadratic part of the primal variable), where $n$ is the sample size. Here, we report the results of the proposed adaptive algorithm for the problem with an $\ell_{2}$ ball constraint and an $\ell_{1}$ ball constraint, respectively.
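To illustrate why the dual restart costs only $O(d)$ , the sketch below evaluates $F$ as defined above and computes the maximizing $\alpha$ of the empirical average of $F$ in closed form. Several points are assumptions of this sketch: the objective is taken to be the sample average of $F$ , $p$ is the empirical fraction of positives, and $\alpha$ is treated as unconstrained (a bounded dual domain would require an additional clipping step). Under these assumptions, setting the derivative in $\alpha$ to zero gives $\alpha=\mathbf{w}^{\top}(\bar{\mathbf{x}}_{-}-\bar{\mathbf{x}}_{+})$ , where $\bar{\mathbf{x}}_{\pm}$ are the class-wise feature means. All function names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def auc_F(w, a, b, alpha, x, z, p):
    """The per-example function F(w, a, b, alpha; (x, z)) from the text."""
    s = dot(w, x)
    is_pos = 1.0 if z == 1 else 0.0
    is_neg = 1.0 - is_pos
    return ((1 - p) * (s - a) ** 2 * is_pos
            + p * (s - b) ** 2 * is_neg
            - p * (1 - p) * alpha ** 2
            + 2 * (1 + alpha) * (p * s * is_neg - (1 - p) * s * is_pos))


def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def restart_alpha(w, X, Z):
    """Unconstrained maximizer in alpha of the empirical average of F.

    Once the two class means are precomputed this costs O(d) per restart.
    """
    pos_mean = mean_vector([x for x, z in zip(X, Z) if z == 1])
    neg_mean = mean_vector([x for x, z in zip(X, Z) if z == -1])
    return dot(w, neg_mean) - dot(w, pos_mean)


def empirical_objective(w, a, b, alpha, X, Z):
    p = sum(1 for z in Z if z == 1) / len(Z)
    return sum(auc_F(w, a, b, alpha, x, z, p) for x, z in zip(X, Z)) / len(Z)


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    Z = [1, -1, -1, 1]
    w = [0.5, -1.0]
    print(restart_alpha(w, X, Z))
```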

Since the function $F$ is smooth in terms of $\mathbf{v}$ and $\alpha$ , we include more applicable baselines for comparison. In particular, we compare with four algorithms, i.e., PDSG (Nemirovski et al., 2009), SPAM (Natole et al., 2018), SMP (Juditsky et al., 2011) and primal-dual SVRG (Palaniappan and Bach, 2016). For primal-dual SVRG, we directly use the formulation of AUC proposed in the paper and conduct the experiment using the code provided by the authors (code derived from https://sites.google.com/site/pbalamuru/home/sagsaddle-code). SPAM is an algorithm proposed particularly for stochastic AUC maximization. SMP and SVRG utilize the smoothness of the objective function. The complexity of PDSG and SMP for finding an $\epsilon$ -stationary solution is given by $O(d/\epsilon^{2})$ . Note that both SPAM and SVRG require strong convexity of the objective function in the primal variable. To this end, we add an $\ell_{2}$ regularizer, i.e., $\frac{\lambda}{2}||\mathbf{w}||_{2}^{2}$ with a small value of $\lambda=\Theta(\epsilon)$ . These two algorithms have a total time complexity for finding a solution $\mathbf{v}$ such that $\|\mathbf{v}-\mathbf{v}_{*}\|^{2}\leq\epsilon$ given by $\widetilde{O}(d/\epsilon^{2})$ and $\widetilde{O}(nd+nd/\epsilon)$ , respectively. We can see that all baseline algorithms have worse time complexity than RSPD, especially the primal-dual SVRG algorithm.

In the $\ell_{2}$ ball setting, we fix $B=10$ and $\lambda=10^{-4}$ on all datasets. In the $\ell_{1}$ ball setting, we set $B=100$ on a9a, covtype and URL, and $B=1000$ on real-sim. The initial step sizes of all algorithms are tuned in the range of $\{10^{-5:1:3}\}$ . For ARSPD, we set $S=5$ and simply set $\theta=0$ , pretending that we do not know the true value of $\theta$ , and tune $\kappa\in\{0.25,0.5,0.75,1\}$ . The initial solutions of all algorithms are set to $\mathbf{0}$ . For the $\ell_{2}$ ball setting, the convergence curves of AUC on four datasets against the number of gradients and CPU time are shown in Figure 3 and Figure 4, including two large-scale datasets covtype and URL, on which SVRG is too slow to be plotted. For the $\ell_{1}$ ball setting, the convergence curves of AUC against the number of gradients and CPU time are shown in Figure 5 and Figure 6. We can see that the overall performance of ARSPD is the best among all algorithms.

6 Conclusion

In this paper, we have proposed novel stochastic primal-dual algorithms for solving convex-concave problems with no bilinear structure assumed, which employ a mixture of stochastic gradient updates and deterministic dual updates. A fast convergence rate of $O(1/T)$ was achieved under strong convexity on the primal and dual variables. In addition, we designed variants for more general problems without strong convexity that achieve adaptive rates. Empirical results verify the effectiveness of our algorithms.

Appendix A A Lemma Regarding $\mathcal{A}(x)$

Lemma 5.

Let $\mathcal{A}(x)=\arg\max_{y\in\text{dom}(\phi^{*})}y^{\top}\ell(x)-\phi^{*}(y)$ , where $\phi^{*}$ is the convex conjugate of a differentiable function $\phi$ , then

[TABLE]

Proof.

Let $\hat{y}=\mathcal{A}(x)$ , then we know

[TABLE]

Since $\phi$ is differentiable, by using Lemma 11.4 in (Cesa-Bianchi and Lugosi, 2006) we have

[TABLE]

That is

[TABLE]

∎

Appendix B Proof of Lemma 2

For simplicity of presentation, we use the notations $\Delta_{x}^{t}=\nabla_{x}f(x_{t},y_{t};\xi_{t})$ , $\Delta_{y}^{t}=\nabla_{y}f(x_{t},y_{t};\xi_{t})$ , $\partial_{x}^{t}=\nabla_{x}f(x_{t},y_{t})$ and $\partial_{y}^{t}=\nabla_{y}f(x_{t},y_{t})$ . To prove Lemma 2, we leverage the following two update approaches:

[TABLE]

where $x_{0}={\tilde{x}}_{0}$ and $y_{0}={\tilde{y}}_{0}$ . The first two updates are identical to Line 4 and Line 5 in Algorithm 1. This can be verified easily. Take the first one as example:

[TABLE]

Let $\psi(x)=x^{\top}u+\frac{1}{2\gamma}||x-v||^{2}$ with $x^{\prime}=\arg\min_{x\in X}\psi(x)$ , which includes the four update approaches in (17) as special cases. By using the strong convexity of $\psi(x)$ and the first order optimality condition ( $\nabla\psi(x^{\prime})^{\top}(x-x^{\prime})\geq 0$ ), for any $x$ , we have

[TABLE]

which implies

[TABLE]

Then

[TABLE]

Applying the above result to the updates in (17), we have

[TABLE]

Adding the above four inequalities, we have

[TABLE]

where the last inequality uses the facts that $||\Delta_{x}^{t}||\leq M$ , $||\partial_{x}^{t}||\leq M$ , $||\Delta_{y}^{t}||\leq B$ and $||\partial_{y}^{t}||\leq B$ . Then we combine the LHS and RHS by summing up $t=0,...,T-1$ :

[TABLE]

By Jensen’s inequality, we have

[TABLE]

where $\bar{x}_{T}=\sum_{t=0}^{T-1}x_{t}/T$ , $\bar{y}_{T}=\sum_{t=0}^{T-1}y_{t}/T$ . Let $\hat{y}_{T}=\arg\max_{y\in Y}f(\bar{x}_{T},y)$ and $x_{*}\in X^{*}$ , we get

[TABLE]

Then we complete the proof by taking the expectation on both sides of the above inequality and using the fact that $\mathrm{E}[(x_{t}-{\tilde{x}}_{t})^{\top}(\partial_{x}^{t}-\Delta_{x}^{t})+(y_{t}-{\tilde{y}}_{t})^{\top}(\partial_{y}^{t}-\Delta_{y}^{t})]=0$ .

Appendix C Proof of Lemma 4

For simplicity of presentation, we use the notations $\Delta_{x}^{t}=\nabla_{x}f(x_{t},y_{t};\xi_{t})$ , $\Delta_{y}^{t}=\nabla_{y}f(x_{t},y_{t};\xi_{t})$ , $\partial_{x}^{t}=\nabla_{x}f(x_{t},y_{t})$ and $\partial_{y}^{t}=\nabla_{y}f(x_{t},y_{t})$ .

To prove Lemma 4, we would leverage the following two update approaches:

[TABLE]

where $x_{0}={\tilde{x}}_{0}$ and $y_{0}={\tilde{y}}_{0}$ . The first two lines are identical to Line 5 and 6 in Algorithm 2. This can be verified easily. Take the first one as example:

[TABLE]

Let us define $\psi(x)=x^{\top}u+\frac{1}{2\gamma}||x-v||^{2}$ with $x^{\prime}=\arg\min_{x\in X}\psi(x)$ , which includes the four update approaches in (28) as special cases. By using the strong convexity of $\psi(x)$ and the first order optimality condition ( $\nabla\psi(x^{\prime})^{\top}(x-x^{\prime})\geq 0$ ), for any $x$ , we have

[TABLE]

which implies

[TABLE]

Then

[TABLE]

Applying the above result to the updates in (28) (treating $u$ above as $\Delta_{x}^{t}$ , $\Delta_{y}^{t}$ , $\partial_{x}^{t}-\Delta_{x}^{t}$ , $\partial_{y}^{t}-\Delta_{y}^{t}$ , respectively), we have

[TABLE]

Adding the above four inequalities, we have

[TABLE]

where the last inequality uses the facts that $||\Delta_{x}^{t}||\leq M$ , $||\partial_{x}^{t}||\leq M$ , $||\Delta_{y}^{t}||\leq B$ and $||\partial_{y}^{t}||\leq B$ . Then we combine the LHS and RHS by summing up $t=0,...,T-1$ :

[TABLE]

By Jensen’s inequality, we have

[TABLE]

where $\bar{x}_{T}=\sum_{t=0}^{T-1}x_{t}/T$ , $\bar{y}_{T}=\sum_{t=0}^{T-1}y_{t}/T$ . Let $\hat{y}_{T}=\arg\max_{y\in Y\cap\mathcal{B}(y_{0},R_{y})}f(\bar{x}_{T},y)$ and any fixed $x\in X\cap\mathcal{B}(x_{0},R_{x})$ , we get

[TABLE]

Then we employ Azuma’s inequality (Lemma 3) to upper bound the last term with a high probability. Let $V_{t}=({\tilde{x}}_{t}-x_{t})^{\top}(\partial_{x}^{t}-\Delta_{x}^{t})+({\tilde{y}}_{t}-y_{t})^{\top}(\partial_{y}^{t}-\Delta_{y}^{t})$ be a martingale difference sequence. We have

[TABLE]

where the first inequality is due to the triangle inequality, the second inequality is due to Cauchy–Schwarz inequality, the third inequality is due to Assumption 1 (2), and the last inequality is due to $\tilde{x}_{t},x_{t}\in X\cap\mathcal{B}(x_{0},R_{x})$ , $\tilde{y}_{t},y_{t}\in Y\cap\mathcal{B}(y_{0},R_{y})$ . Therefore, by Azuma’s inequality with probability at least $1-\tilde{\delta}$ , we have for any $x\in X\cap\mathcal{B}(x_{0},R_{x})$

[TABLE]

Appendix D Proof of Theorem 4 (Theorem 5)

The proof is similar to the proof of Theorem 3 in (Xu et al., 2017). For completeness, we include it here. The proof of Theorem 5 can be also obtained by a slight change of the following proof.

Proof.

Based on the proof of Theorem 3, since $v=1$ and by the settings of $S=\lceil\log_{2}(\frac{\epsilon_{0}}{\epsilon})\rceil\geq\lceil\log_{2}(\frac{\epsilon_{0}}{\hat{\epsilon}_{1}})\rceil$ , $R_{1}^{(1)}=\frac{c\epsilon_{0}}{\hat{\epsilon}_{1}^{1-\theta}}$ , $T_{1}=\max\bigg{\{}320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg{\}}\cdot\frac{(R_{1}^{(1)})^{2}}{\epsilon_{0}^{2}},$ it can be shown that

[TABLE]

with probability at least $1-\frac{\delta}{S+1}$. Next, by running RSPD with initial $x^{(1)}$ satisfying (37) and the settings of $S=\lceil\log_{2}(\frac{\epsilon_{0}}{\epsilon})\rceil\geq\lceil\log_{2}(\frac{2\hat{\epsilon}_{1}}{\hat{\epsilon}_{1}/2})\rceil$, $R_{1}^{(2)}=\frac{c\epsilon_{0}}{(\hat{\epsilon}_{1}/2)^{1-\theta}}\geq\frac{c2\hat{\epsilon}_{1}}{(\hat{\epsilon}_{1}/2)^{1-\theta}}$, and $T_{2}=T_{1}\cdot 2^{2(1-\theta)}=\max\bigg\{320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg\}\cdot\frac{(R_{1}^{(2)})^{2}}{\epsilon_{0}^{2}}$, Theorem 3 ensures that with probability at least $(1-\delta/(S+1))^{2}$,

[TABLE]

By continuing this process with $K=\lceil\log_{2}(\hat{\epsilon}_{1}/\epsilon)\rceil+1$ , we can show that

[TABLE]

with a probability at least $(1-\delta/(S+1))^{K}\geq 1-\delta\frac{K}{S+1}\geq 1-\delta$ . The total number of iterations for $K$ calls of RSPD can be bounded by

[TABLE]

∎
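The bookkeeping in this proof can be illustrated with a short Python sketch. It assumes that the per-call iteration counts keep growing by the factor $2^{2(1-\theta)}$ shown above for $T_{2}=T_{1}\cdot 2^{2(1-\theta)}$, so that $T_{k}=T_{1}\cdot 2^{2(1-\theta)(k-1)}$, and it checks the step $(1-\delta/(S+1))^{K}\geq 1-\delta K/(S+1)$ (Bernoulli's inequality) numerically. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def stage_lengths(T1, theta, K):
    """Iteration counts T_1, ..., T_K, assuming T_k = T_1 * 2^(2(1-theta)(k-1))."""
    growth = 2.0 ** (2.0 * (1.0 - theta))
    return [T1 * growth ** k for k in range(K)]


def success_lower_bounds(delta, S, K):
    """Return ((1 - delta/(S+1))^K, 1 - delta*K/(S+1)).

    The first value is the product bound over K calls; the second is the
    Bernoulli-inequality lower bound used in the proof.
    """
    exact = (1.0 - delta / (S + 1)) ** K
    bernoulli = 1.0 - delta * K / (S + 1)
    return exact, bernoulli


if __name__ == "__main__":
    lengths = stage_lengths(T1=100, theta=0.5, K=3)
    print(lengths, sum(lengths))
    print(success_lower_bounds(delta=0.1, S=9, K=5))
```

With $\theta=1/2$ the stage lengths double at each call, so the total is dominated by the last call; with $\theta=1$ every call has the same length. The final bound $1-\delta K/(S+1)\geq 1-\delta$ requires $K\leq S+1$.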

Appendix E Piecewise Quadratic Function of Distributionally Robust Optimization

We show that the $\ell_{1}$-regularized DRO function is convex and piecewise quadratic, so it satisfies the LEB condition with $\theta=1/2$. First we present the following proposition.

Proposition 1.

(Proposition 2.3 (Rockafellar, 1987)) Let $\rho_{V,Q}(s)=\sup_{v\in V}\{s^{\top}v-\frac{1}{2}v^{\top}Qv\}$, where $Q$ is symmetric and positive semidefinite. Then $\rho_{V,Q}(s)$ is lower semicontinuous, convex and piecewise linear-quadratic. Its effective domain $L=\{s\,|\,\rho_{V,Q}(s)<\infty\}$ is a nonempty convex polyhedron that can be decomposed into finitely many polyhedral convex sets, on each of which $\rho_{V,Q}$ is quadratic or linear.
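A one-dimensional instance makes the piecewise structure concrete. The sketch below takes the hypothetical choice $V=[0,1]$ and a scalar $Q=q>0$ (neither is from the paper), for which $\rho_{V,Q}(s)=0$ if $s\leq 0$, $s^{2}/(2q)$ if $0\leq s\leq q$, and $s-q/2$ if $s\geq q$, and compares this closed form with a brute-force maximization over a grid.

```python
def rho_closed_form(s, q):
    """rho(s) = sup_{v in [0, 1]} (s*v - 0.5*q*v^2) for scalar q > 0."""
    if s <= 0.0:
        return 0.0                 # maximizer v = 0: constant piece
    if s <= q:
        return s * s / (2.0 * q)   # maximizer v = s/q: quadratic piece
    return s - q / 2.0             # maximizer v = 1: linear piece


def rho_grid(s, q, num=10001):
    """Brute-force approximation of the same supremum on a uniform grid."""
    best = float("-inf")
    for i in range(num):
        v = i / (num - 1)
        best = max(best, s * v - 0.5 * q * v * v)
    return best


if __name__ == "__main__":
    for s in (-1.0, 0.5, 1.0, 2.0, 3.0):
        print(s, rho_closed_form(s, 2.0), rho_grid(s, 2.0))
```

The three branches correspond to the three polyhedral pieces of the domain, $(-\infty,0]$, $[0,q]$ and $[q,\infty)$, on which the function is constant, quadratic and linear respectively.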

We can rewrite DRO as $\max_{y\in\Delta_{n}}\sum_{i=1}^{n}y_{i}\ell_{i}(x)-\frac{\lambda_{1}}{2}\|ny-\mathbf{1}\|^{2}=\max_{y\in\Delta_{n}}\sum_{i=1}^{n}y_{i}(\ell_{i}(x)+n\lambda_{1})-\frac{n^{2}\lambda_{1}}{2}y^{\top}\mathbf{I}y-\frac{n\lambda_{1}}{2}$, where the constant is $-\frac{n\lambda_{1}}{2}$ because $\|\mathbf{1}\|^{2}=n$. By the above proposition with $V=\Delta_{n}$ and $Q=n^{2}\lambda_{1}\mathbf{I}$, this is piecewise linear-quadratic in $\ell(x)+n\lambda_{1}\mathbf{1}$. If $\ell(x)$ is piecewise linear, the composition of the piecewise linear and piecewise quadratic functions is piecewise quadratic.
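The expansion of the penalty term behind this rewrite can be checked numerically. The following sketch compares $-\frac{\lambda_{1}}{2}\|ny-\mathbf{1}\|^{2}$ with its expanded form $n\lambda_{1}\mathbf{1}^{\top}y-\frac{n^{2}\lambda_{1}}{2}y^{\top}y-\frac{n\lambda_{1}}{2}$, where the constant comes from $\|\mathbf{1}\|^{2}=n$. The function names and test values are illustrative assumptions.

```python
def penalty_direct(y, lam):
    """-(lam / 2) * ||n*y - 1||^2, computed directly."""
    n = len(y)
    return -0.5 * lam * sum((n * yi - 1.0) ** 2 for yi in y)


def penalty_expanded(y, lam):
    """Same quantity after expanding the square into linear, quadratic, constant parts."""
    n = len(y)
    linear = n * lam * sum(y)                                   # n*lam * 1^T y
    quadratic = -0.5 * n * n * lam * sum(yi * yi for yi in y)   # -(n^2 lam / 2) y^T y
    constant = -0.5 * n * lam                                   # -(lam / 2) * ||1||^2
    return linear + quadratic + constant


if __name__ == "__main__":
    y = [0.2, 0.3, 0.5]  # a point on the simplex
    print(penalty_direct(y, 0.7), penalty_expanded(y, 0.7))
```

The identity holds for any vector $y$, not only for points on the simplex; the constant does not affect the maximizer and therefore plays no role in the piecewise quadratic structure.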

Bibliography


1. Bolte et al. (2015). Jerome Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.
2. Cesa-Bianchi and Lugosi (2006). N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
3. Chen et al. (2017a). Robert S. Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems 30 (NIPS), pages 4705–4714, 2017.
4. Chen et al. (2014). Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014. doi:10.1137/130919362.
5. Chen et al. (2017b). Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Accelerated schemes for a class of variational inequalities. Mathematical Programming, 165(1):113–149, Sep 2017.
6. Dekel and Singer (2006). Ofer Dekel and Yoram Singer. Support vector machines on a budget. In NIPS, pages 345–352, 2006.
7. Du and Hu (2018). Simon S. Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. CoRR, abs/1802.01504, 2018.
8. Dvurechensky et al. (2018). Pavel Dvurechensky, Alexander Gasnikov, Fedor Stonyakin, and Alexander Titov. Generalized mirror prox: Solving variational inequalities with monotone operator, inexact oracle, and unknown Hölder parameters. arXiv preprint arXiv:1806.05140, 2018.
