Stochastic Primal-Dual Algorithms with Faster Convergence than $O(1/\sqrt{T})$ for Problems without Bilinear Structure
Yan Yan, Yi Xu, Qihang Lin, Lijun Zhang, Tianbao Yang

TL;DR
This paper introduces new stochastic primal-dual algorithms that achieve faster convergence rates than the traditional $O(1/\sqrt{T})$ for convex-concave problems without requiring bilinear structure, applicable to robust learning and AUC maximization.
Contribution
The paper develops and analyzes stochastic primal-dual algorithms with a mixture of stochastic and deterministic updates, achieving improved convergence rates for non-bilinear convex-concave problems.
Findings
Achieves $O(1/T)$ convergence rate under certain conditions.
Applicable to problems with weak strong convexity and strong concavity.
Effective in robust model learning and empirical AUC maximization.
| Datasets | #Examples | #Features |
|---|---|---|
| w8a | 49,749 | 300 |
| rcv1 | 20,242 | 47,236 |
| a9a | 32,561 | 123 |
| real-sim | 72,309 | 20,958 |
| covtype | 581,012 | 54 |
| URL | 2,396,130 | 3,231,961 |
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Statistical Methods and Inference
Yan Yan (1), Yi Xu (1), Qihang Lin (2), Lijun Zhang (3), Tianbao Yang (1)
(1) Department of Computer Science, The University of Iowa, Iowa City, IA 52242
(2) Department of Management Sciences, University of Iowa, Iowa City, IA 52242
(3) National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Abstract
Previous studies on stochastic primal-dual algorithms for solving min-max problems with faster convergence heavily rely on the bilinear structure of the problem, which restricts their applicability to a narrow range of problems. The main contribution of this paper is the design and analysis of new stochastic primal-dual algorithms that use a mixture of stochastic gradient updates and a logarithmic number of deterministic dual updates for solving a family of convex-concave problems with no bilinear structure assumed. Convergence rates faster than $O(1/\sqrt{T})$, with $T$ being the number of stochastic gradient updates, are established under some mild conditions on the involved functions of the primal and the dual variable. For example, for a family of problems that enjoy weak strong convexity in terms of the primal variable and have a strongly concave function of the dual variable, the convergence rate of the proposed algorithm is $O(1/T)$. We also investigate the effectiveness of the proposed algorithms for learning robust models and empirical AUC maximization.
1 Introduction
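This paper is motivated by solving the following convex-concave problem:
$$\min_{x\in\mathcal{X}}\max_{y\in\text{dom}(\phi^*)}\; y^{\top}\ell(x)-\phi^*(y)+g(x), \qquad (1)$$
where $\mathcal{X}\subseteq\mathbb{R}^d$ is a closed convex set, $\ell(x)=(\ell_1(x),\ldots,\ell_n(x))^{\top}$ is a lower-semicontinuous mapping whose component function $\ell_i$ is lower-semicontinuous and convex, $\phi$ is a convex function whose convex conjugate is denoted by $\phi^*$, and $g$ is a lower-semicontinuous convex function. To ensure the convexity of the problem, it is assumed that $\text{dom}(\phi^*)\subseteq\mathbb{R}^n_+$ if $\ell$ is not an affine function. By using the convex conjugate $\phi^*$, the problem (1) is equivalent to the following convex minimization problem:
$$\min_{x\in\mathcal{X}}\; P(x):=\phi(\ell(x))+g(x). \qquad (2)$$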
A particular family of min-max problem (1) and its minimization form (2) that has been considered extensively in the literature (Zhang and Lin, 2015; Yu et al., 2015; Tan et al., 2018; Shalev-Shwartz and Zhang, 2013; Lin et al., 2014) is that is an affine function and for is decomposable. In this case, the problem (2) is known as (regularized) empirical risk minimization problem in machine learning:
[TABLE]
where is the -th row of and is the i-th element of .
However, stochastic optimization algorithms with fast convergence rates are still under-explored for a more challenging family of problems of (1) and (2) where $\ell$ is not necessarily an affine or smooth function and $\phi$ is not necessarily decomposable. It is our goal to design new stochastic primal-dual algorithms for solving these problems with a fast convergence rate. A key motivating example of the considered problem is the distributionally robust optimization problem:
$$\min_{x\in\mathcal{X}}\max_{y\in\Delta_n}\; \sum_{i=1}^{n} y_i\,\ell_i(x)-V(y,\mathbf{1}/n)+g(x),$$
where $\Delta_n=\{y\in\mathbb{R}^n : y_i\ge 0,\ \sum_{i=1}^{n}y_i=1\}$ is a simplex, and $V(y,\mathbf{1}/n)$ denotes a divergence measure (e.g., $\phi$-divergence) between two sets of probabilities $y$ and $\mathbf{1}/n$. In machine learning, with $\ell_i(x)$ denoting the loss of a model $x$ on the $i$-th example, the above problem corresponds to the robust risk minimization paradigm, which can achieve variance-based regularization for learning a predictive model from $n$ examples (Namkoong and Duchi, 2017). Other examples of the considered challenging problems can be found in robust learning from multiple perturbed distributions (Chen et al., 2017a), in which $\ell_i(x)$ corresponds to the loss from the $i$-th perturbed distribution, and in minimizing non-decomposable loss functions (Fan et al., 2017; Dekel and Singer, 2006).
With stochastic (sub)-gradients computed for the primal and the dual variable, one can employ the conventional primal-dual stochastic gradient method or its variants (Nemirovski et al., 2009; Juditsky et al., 2011) for solving the problem (1). Under appropriate basic assumptions, one can derive the standard $O(1/\sqrt{T})$ convergence rate, with $T$ being the number of stochastic updates. However, $O(1/\sqrt{T})$ is known to be a slow convergence rate, and it is always desirable to design optimization algorithms with faster convergence. Nonetheless, to the best of our knowledge, stochastic primal-dual algorithms with a fast convergence rate of $O(1/T)$ in terms of minimizing the primal objective remain unknown in general, even under the strong convexity of $\phi^*$ and $g$. In contrast, if the objective in (2) is decomposable and strongly convex, the standard stochastic gradient method for solving (2) with an appropriate step size scheme has a convergence rate of $O(1/T)$ (Hazan et al., 2007; Hazan and Kale, 2011a). A direct extension of the algorithms and analysis for stochastic strongly convex minimization to stochastic convex-concave optimization does not give a satisfactory convergence rate. (Footnote 1: One may obtain a dimensionality-dependent convergence rate by following conventional analysis, but it is not the standard dimensionality-independent rate that we aim to achieve.) It is still an open problem whether there exists a stochastic primal-dual algorithm for solving the convex-concave problem (1) that enjoys a fast rate of $O(1/T)$ in terms of minimizing the primal objective.
The major contribution of this paper is to fill this gap by developing stochastic primal-dual algorithms for solving (1) such that they enjoy a faster convergence than $O(1/\sqrt{T})$ in terms of the primal objective gap. In particular, under the assumptions that $\nabla\phi$ is Lipschitz continuous, $\ell_i$ are Lipschitz continuous and the minimization problem (2) satisfies the strong convexity condition, the proposed algorithms enjoy an iteration complexity of $O(1/\epsilon)$ for finding an $\epsilon$-optimal solution in terms of the primal objective gap, which corresponds to a faster convergence rate of $O(1/T)$. The key difference of the proposed algorithms from the traditional stochastic primal-dual algorithm is that they are required to compute a logarithmic number of deterministic updates for the dual variable $y$, for a given primal solution $\tilde{x}$, in the following form:
$$\tilde{y}=\arg\max_{y\in\text{dom}(\phi^*)}\; y^{\top}\ell(\tilde{x})-\phi^*(y),$$
which can be usually solved in time complexity. It would be worth noting that (See Appendix A). When is a moderate number, the proposed algorithms could converge faster than the traditional primal-dual stochastic gradient method. It is also important to note that we do not assume the proximal mapping of and can be easily computed. Instead, our algorithms only require (stochastic) sub-gradients of and , which make them applicable and efficient for solving more challenging problems where is an empirical sum of individual functions.
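As an illustration of such a deterministic dual update, consider the special case $\phi(u)=\log\sum_i\exp(u_i)$, whose conjugate $\phi^*$ is the negative entropy on the simplex. This choice is an assumption made only for this example and is not taken from the experiments of the paper. In that case the maximizer is the softmax of the loss vector, it coincides with $\nabla\phi$, and it costs one pass over the $n$ coordinates. A minimal sketch, in which all function names are hypothetical:

```python
import math
import random

def logsumexp(u):
    """phi(u) = log(sum_i exp(u_i)), computed in a numerically stable way."""
    m = max(u)
    return m + math.log(sum(math.exp(ui - m) for ui in u))

def dual_update(u):
    """Maximizer over the simplex of y.u - sum_i y_i log y_i, i.e. softmax(u)."""
    m = max(u)
    z = [math.exp(ui - m) for ui in u]
    total = sum(z)
    return [zi / total for zi in z]

def dual_objective(y, u):
    """y.u - phi*(y) with phi*(y) = sum_i y_i log y_i (0 log 0 treated as 0)."""
    linear = sum(yi * ui for yi, ui in zip(y, u))
    entropy_term = sum(yi * math.log(yi) for yi in y if yi > 0)
    return linear - entropy_term

def random_simplex_point(n, rng):
    """A random point of the simplex, used only as a comparison candidate."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(5)]   # stands in for the loss vector
y_tilde = dual_update(u)
best_random = max(dual_objective(random_simplex_point(5, rng), u)
                  for _ in range(1000))
```

The value of the dual objective at `y_tilde` equals `logsumexp(u)`, which is the conjugate identity behind the equivalence of (1) and (2), and no randomly drawn simplex point exceeds it.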
In addition, the proposed algorithms and theories can be easily extended to the case that is Hölder continuous and the minimization problem (2) satisfies a more general local error bound condition as defined later, with intermediate faster rates established.
2 Related Work
The stochastic primal-dual gradient method and its variants were first analyzed by Nemirovski et al. (2009) for solving a more general min-max problem. Under the standard bounded stochastic (sub)-gradient assumption, a convergence rate of $O(1/\sqrt{T})$ was established for the primal-dual gap, which implies a convergence rate of $O(1/\sqrt{T})$ for minimizing the primal objective. Later, a couple of studies aimed to strengthen this convergence rate by leveraging the smoothness of the objective or of the involved function when the objective function has a special structure (Juditsky et al., 2011; Chen et al., 2014, 2017b). However, the worst-case convergence rate of these later algorithms is still dominated by $O(1/\sqrt{T})$. Without a smoothness assumption on $\ell$ or a bilinear structure, these later algorithms are not directly applicable to solving (1). In addition, Frank-Wolfe algorithms are analyzed for saddle point problems in (Gidel et al., 2016), where convergence in terms of the primal-dual gap is also established under the smoothness condition.
Recently, several algorithms with faster convergence have emerged for solving (1) by leveraging the bilinear structure and the strong convexity of $\phi^*$ and $g$. For example, Zhang and Lin (2015) proposed a stochastic primal-dual coordinate (SPDC) method for solving (3) under the condition that the problem has a bilinear structure and one of the two functions is strongly convex. When the other function is also strongly convex, SPDC enjoys linear convergence for the primal-dual gap. Other variants of SPDC have been considered in (Yu et al., 2015; Tan et al., 2018) for solving (1) with bilinear structure. Palaniappan and Bach (2016) proposed stochastic variance reduction methods for solving a family of saddle-point problems. When applied to (1), they require that $\ell$ be either an affine function or a smooth mapping. If additionally $\phi^*$ and $g$ are strongly convex, their algorithms also enjoy linear convergence for finding a solution that is $\epsilon$-close to the optimal solution in squared Euclidean distance. Du and Hu (2018) established a similar linear convergence of a primal-dual SVRG algorithm for solving (1) when $\ell$ is an affine function whose matrix has full column rank, $g$ is smooth, and $\phi^*$ is smooth and strongly convex, which are stronger assumptions than ours. All of these algorithms except (Du and Hu, 2018) also need to compute the proximal mapping of $\phi^*$ and $g$ at each iteration. In contrast, the present work is complementary to these studies, aiming to solve a more challenging family of problems. In particular, the proposed algorithms do not require the bilinear structure or the smoothness of $\ell$, and the smoothness and strong convexity of $\phi^*$ and $g$ are also not necessary. In addition, we do not assume that $\phi^*$ and $g$ have an efficient proximal mapping.
Several recent studies have been devoted to stochastic AUC optimization based on a min-max formulation that has a bilinear structure (Liu et al., 2018a; Natole et al., 2018), aiming to derive a faster convergence rate of $O(1/T)$. The differences from the present work are that (i) the analysis of Liu et al. (2018a) is restricted to the online setting for AUC optimization; (ii) Natole et al. (2018) only prove a convergence rate of $O(1/T)$ in terms of the squared distance of the found primal solution to the optimal solution under the strong convexity of the regularizer on the primal variable, which is weaker than our results on the convergence of the primal objective gap. To the best of our knowledge, the present work is the first one that establishes a convergence rate of $O(1/T)$ in terms of minimizing the primal objective for the proposed stochastic primal-dual methods by solving a general convex-concave problem (1) without bilinear structure or smoothness assumption on $\ell$ under (weakly local) strong convexity.
Restart schemes have recently been considered to obtain improved convergence rates under some conditions. In (Roulet and d’Aspremont, 2017), a restart scheme is analyzed for smooth convex problems under the sharpness and Hölder continuity conditions. In (Dvurechensky et al., 2018), a universal algorithm is proposed for variational inequalities under the Hölder continuity condition where the Hölder parameters are unknown. Stochastic algorithms are proposed for strongly convex stochastic composite problems in (Ghadimi and Lan, 2012, 2013).
Finally, we would like to mention that our algorithms and techniques share many similarities with those proposed in (Xu et al., 2017) for solving stochastic convex minimization problems under the local error bound condition. However, their algorithms are not directly applicable to the convex-concave problem (1) or the problem (2) with a non-decomposable function $\phi$. The novelty of this work is the design and analysis of new algorithms that can leverage the weak local strong convexity or the more general local error bound condition of the primal minimization problem (2) through solving the convex-concave problem (1) to enjoy faster convergence.
3 Preliminaries
Recall that the problem of interest:
[TABLE]
where . Let denote the optimal set of the primal variable for the above problem, denote the optimal primal objective value and is the optimal solution closest to , where denotes the Euclidean norm.
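Let $\Pi_{\Omega}[\cdot]$ denote the projection onto the set $\Omega$. Denote by $\mathcal{L}_\epsilon$ and $\mathcal{S}_\epsilon$ the $\epsilon$-level set and $\epsilon$-sublevel set of the primal problem, respectively. A function $h$ is $L$-smooth if it is differentiable and its gradient is $L$-Lipschitz continuous, i.e., $\|\nabla h(x_1)-\nabla h(x_2)\|\le L\|x_1-x_2\|$. A differentiable function $h$ is said to have an $(L,v)$-Hölder continuous gradient iff $\|\nabla h(x_1)-\nabla h(x_2)\|\le L\|x_1-x_2\|^{v}$. When $v=1$, the Hölder continuous gradient condition reduces to the Lipschitz continuous gradient condition. A function $h$ is called $\mu$-strongly convex if for any $x_1, x_2$ there exists $\mu>0$ such that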
[TABLE]
where denotes any subgradient of at . A more general definition is the uniform convexity. is uniformly convex with degree if for any there exists such that
[TABLE]
For the analysis of the proposed algorithms, we need a few basic notions about the convex conjugate. For an extended real-valued convex function $h$, the convex conjugate of $h$ is defined as
$$h^*(y)=\sup_{x}\;\{\,y^{\top}x-h(x)\,\}.$$
The convex conjugate of is . Due to the convex duality, if is -strongly convex then is differentiable and is -smooth. More generally, if is -uniformly convex then is differentiable and its gradient is -Hölder continuous where , (Nesterov, 2015).
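A worked instance of this duality between strong convexity and smoothness, using a generic quadratic chosen only for illustration (the symbols $h$ and $\mu$ are not tied to any particular function in the paper): take $h^*(y)=\frac{\mu}{2}\|y\|^2$, which is $\mu$-strongly convex. Then

$$h(u)=\sup_{y}\Big\{u^{\top}y-\frac{\mu}{2}\|y\|^{2}\Big\}=\frac{1}{2\mu}\|u\|^{2},\qquad \nabla h(u)=\frac{u}{\mu},\qquad \|\nabla h(u_1)-\nabla h(u_2)\|=\frac{1}{\mu}\|u_1-u_2\|,$$

where the supremum is attained at $y=u/\mu$. Hence $h$ is differentiable and $(1/\mu)$-smooth, and the maximizer $y=u/\mu$ is exactly $\nabla h(u)$.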
One of the conditions that allows us to derive a fast rate of for a stochastic algorithm is that both and are strongly convex, which implies that is strongly convex in terms of and strongly concave in terms of . One might regard this as a trivial task given the result for stochastic strongly convex minimization where a stochastic gradient is available for the objective function to be minimized (Hazan et al., 2007; Hazan and Kale, 2011a). However, the analysis for stochastic strongly convex minimization is not directly applicable to stochastic primal-dual algorithms, as briefly explained later as we present our results.
Moreover, the strong convexity of can be relaxed to a weak strong convexity of to derive a similar order of convergence rate, i.e., for any , we have
[TABLE]
where is the distance between and the optimal set . More generally, we can consider a setting in which satisfies a local error bound (or local growth) condition as defined below.
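Definition 1.
A function $P(x)$ is said to satisfy the local error bound (LEB) condition if for any $x\in\mathcal{S}_\epsilon$,
$$\text{dist}(x,\mathcal{X}_*)\le c\,\big(P(x)-P_*\big)^{\theta}, \qquad (7)$$
where $\text{dist}(x,\mathcal{X}_*)$ is the distance from $x$ to the optimal set, $P_*$ is the optimal primal objective value, $\mathcal{S}_\epsilon$ is the $\epsilon$-sublevel set, $c>0$ is a constant, and $\theta$ is a parameter.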
This condition was recently studied in (Yang and Lin, 2018) for developing a faster subgradient method than the standard subgradient method, and was later considered in (Xu et al., 2017) for stochastic convex optimization. A global version of the above condition (known as the global error bound condition) has a long history in mathematical programming (Pang, 1997). However, exploiting this condition for developing stochastic primal-dual algorithms seems to be new. When $\theta=1/2$, the above condition is also referred to as weakly local strong convexity. When $\theta=0$, it can capture general convex functions as long as the distance to the optimal set is upper bounded on the $\epsilon$-sublevel set, which is true if the optimal set is compact or the domain is compact.
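To make the role of the exponent concrete, here are three one-dimensional instances, assuming the error bound takes the form $\text{dist}(x,\mathcal{X}_*)\le c\,(P(x)-P_*)^{\theta}$ used in (Yang and Lin, 2018). The functions are chosen only for illustration; in each case $\mathcal{X}_*=\{0\}$, $P_*=0$ and $c=1$:

$$P(x)=|x|:\ \ |x|=(P(x))^{1}\ \Rightarrow\ \theta=1;\qquad P(x)=x^{2}:\ \ |x|=(P(x))^{1/2}\ \Rightarrow\ \theta=\tfrac12;\qquad P(x)=x^{4}:\ \ |x|=(P(x))^{1/4}\ \Rightarrow\ \theta=\tfrac14.$$

With a smooth $\phi$ (that is, $v=1$), the iteration complexity exponent $2(1-v\theta)$ established later in this section evaluates to $1$ for $\theta=1/2$ and to $1.5$ for $\theta=1/4$, so a flatter growth around the optimum translates into a slower, but still improved, rate.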
In parallel with the relaxed condition on $P$, we can also relax the smoothness condition on $\phi$ or the strong convexity condition on $\phi^*$ to a Hölder continuous gradient condition on $\phi$ or a uniform convexity condition on $\phi^*$. Under the local error bound condition of $P$ and the Hölder continuous gradient condition of $\phi$, we are able to develop stochastic primal-dual algorithms with intermediate complexity depending on $\theta$ and $v$, which varies from $\widetilde{O}(1/\epsilon^{2})$ to $\widetilde{O}(1/\epsilon)$.
Formally, we will develop stochastic primal-dual algorithms for solving (3) under the following assumptions.
Assumption 1.
For Problem (3), we assume
(1) There exist and such that ;
(2) Let and denote the stochastic subgradient of w.r.t. and , respectively. There exist constants and such that and ;
(3) is -uniformly convex with such that has -Hölder continuous gradient where and ;
(4) is -Lipschitz continuous for ;
(5) One of the following conditions holds: (i) is -strongly convex; (ii) satisfies the LEB condition for and .
Remark. Assumption 1 (1) assumes that there is a lower bound of $P$, which is usually satisfied in machine learning problems. Assumption 1 (2) is a common assumption usually made in existing stochastic methods. Note that we do not assume that $\phi^*$ and $g$ have efficient proximal mappings. Instead, we only require stochastic subgradients with respect to the primal and the dual variable. Assumption 1 (3) is a general condition which unifies both smooth and non-smooth assumptions on $\phi$. When $v=1$, $\phi$ satisfies the classical smoothness condition with parameter $L$. When $v=0$, it is the classical non-smooth assumption on the boundedness of the subgradients. We will state our convergence results in terms of $L$ and $v$ instead of the uniform convexity parameters of $\phi^*$. Assumption 1 (4) on the Lipschitz continuity of $\ell$ is more general than assuming a bilinear form. Finally, we note that assuming the strong convexity of $P$ allows us to develop a stochastic primal-dual algorithm with simpler updates.
4 Main Results
In this section, we will present our main results for solving (3). Our development is divided into three parts. First, we present a stochastic primal-dual algorithm and its convergence result when the primal objective function is strongly convex and is also strongly convex. Then we extend the result into a more general case, i.e., satisfying LEB condition and is uniformly convex. Lastly, we propose an adaptive variant with the same order of convergence result when the value of parameter in LEB condition is unknown, which is also useful for tackling problems without knowing the value of . For both cases, we assume .
4.1 Restarted Stochastic Primal-Dual Algorithm for Strongly Convex
The detailed updates of the proposed stochastic algorithm for strongly convex $P$ are presented in Algorithm 1, to which we refer as the restarted stochastic primal-dual algorithm, or RSPD for short. The algorithm is based on a restarting idea that has been used widely in existing studies (Hazan and Kale, 2011b; Ghadimi and Lan, 2013; Xu et al., 2017; Yang and Lin, 2018). It runs epoch-wise and has two loops. Steps 3-7 are the standard updates of the stochastic primal-dual subgradient method (Nemirovski et al., 2009). However, the key difference from these previous studies is that the restarted solution for the dual variable for the next epoch is computed based on the averaged primal variable of the current epoch. It is this step that exploits the strong convexity of $\phi^*$, which together with the restarting scheme allows us to exploit the strong convexity of $P$ to derive a fast convergence rate of $O(1/T)$, with $T$ being the total number of iterations.
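The epoch structure described above can be sketched on a toy robust least-squares instance. Everything specific in this sketch is an assumption made for illustration rather than the setting of Algorithm 1: the losses $\ell_i(x)=\frac12(a_i^{\top}x-b_i)^2$, the dual penalty $\phi^*(y)=\frac{\lambda}{2}\|y-\mathbf{1}/n\|^2$ on the simplex, the regularizer $g(x)=\frac{\mu}{2}\|x\|^2$, the unconstrained primal domain, the sampling schemes, and the schedule that halves the step size and doubles the epoch length. All function names are hypothetical.

```python
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    tau, csum = 0.0, 0.0
    for j, uj in enumerate(sorted(v, reverse=True), start=1):
        csum += uj
        if uj - (csum - 1.0) / j > 0:
            tau = (csum - 1.0) / j
    return [max(vi - tau, 0.0) for vi in v]

def losses(A, b, x):
    """ell_i(x) = 0.5 * (a_i . x - b_i)^2, convex in x."""
    return [0.5 * (dot(a, x) - bi) ** 2 for a, bi in zip(A, b)]

def dual_restart(ell, lam):
    """Deterministic update: argmax over the simplex of y.ell - (lam/2)||y - 1/n||^2."""
    n = len(ell)
    return project_simplex([1.0 / n + li / lam for li in ell])

def primal_objective(A, b, x, lam, mu):
    """P(x) = max_y f(x, y), evaluated through the closed-form maximizer."""
    ell = losses(A, b, x)
    n = len(ell)
    y = dual_restart(ell, lam)
    penalty = 0.5 * lam * sum((yi - 1.0 / n) ** 2 for yi in y)
    return dot(y, ell) - penalty + 0.5 * mu * dot(x, x)

def rspd(A, b, x0, lam, mu, eta1, T1, K, seed=0):
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = list(x0)
    y = dual_restart(losses(A, b, x), lam)
    eta, T = eta1, T1
    for _ in range(K):                        # outer loop: epochs
        x_sum = [0.0] * d
        for _ in range(T):                    # inner loop: stochastic updates
            # Primal: sampling i ~ y is unbiased for sum_i y_i * grad ell_i(x).
            i = rng.choices(range(n), weights=y)[0]
            r_i = dot(A[i], x) - b[i]
            gx = [r_i * A[i][k] + mu * x[k] for k in range(d)]
            # Dual: a uniform index j is unbiased for ell(x) - lam * (y - 1/n).
            j = rng.randrange(n)
            gy = [-lam * (yk - 1.0 / n) for yk in y]
            gy[j] += n * 0.5 * (dot(A[j], x) - b[j]) ** 2
            x = [x[k] - eta * gx[k] for k in range(d)]   # descent step, X = R^d
            y = project_simplex([yk + eta * g for yk, g in zip(y, gy)])  # ascent
            for k in range(d):
                x_sum[k] += x[k]
        x = [s / T for s in x_sum]                # restart primal at epoch average
        y = dual_restart(losses(A, b, x), lam)    # restart dual deterministically
        eta, T = eta / 2.0, 2 * T                 # assumed schedule
    return x, y

data_rng = random.Random(1)
n, d = 20, 3
x_true = [1.0, -1.0, 0.5]
A = [[data_rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
b = [dot(a, x_true) for a in A]
x0 = [0.0] * d
x_hat, y_hat = rspd(A, b, x0, lam=1.0, mu=0.1, eta1=0.05, T1=200, K=5)
p_start = primal_objective(A, b, x0, 1.0, 0.1)
p_end = primal_objective(A, b, x_hat, 1.0, 0.1)
```

The only deterministic work is the call to `dual_restart` at the end of each epoch, so it is executed `K` times in total, while all other updates use a single sampled loss.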
Below, we briefly discuss the path for proving the fast convergence rate of RSPD. We first show why the standard analysis for strongly convex minimization cannot be generalized to the stochastic convex-concave problem to derive the fast convergence rate of $O(1/T)$. Let and similarly for . A standard convergence analysis for the inner loop (steps 3-6) of Algorithm 1 usually starts from the following inequalities.
Lemma 1.
For the updates in Steps 4 and 5, omitting the subscript , the following holds for any
[TABLE]
For stochastic strongly convex minimization problems in which is absent in the above inequalities, one can take expectation over (8) and then apply the -strong convexity of to get the following inequality
[TABLE]
Based on the above inequalities for all iterations, one can design a particular scheme of step size that allows us to derive an $O(1/T)$ convergence rate. However, such analysis cannot be extended to the primal-dual case.
A naive approach would be taking expectation for both (8) and (9) for a fixed and applying the -strong convexity (resp. -strong concavity) of in terms of (resp. ), which yields the following inequalities
[TABLE]
[TABLE]
It is notable that in deriving the above inequalities, and have to be independent of .
By adding the above inequalities together and applying the same analysis for the R.H.S with and , we can obtain the following inequalities for any fixed and independent of :
[TABLE]
where and . However, the above inequality does not imply the convergence for the standard definition of primal-dual gap of or even the primal objective gap . The main obstacle is that we cannot set which will make depend on and hence make the expectional analysis fail. It would be worth noting that following (Gidel et al., 2016), one could derive the upper bound of primal-dual gap of by (see Equation (5), (13) and (14) therein), where can be upper bounded by a constant and . Even if one sets and in (10), the convergence rate of primal-dual gap is only of , which is not what we pursue.
Another approach that gets around the issue introduced by taking the expectation is to use high probability analysis. To this end, one can use concentration inequalities to bound the martingale difference sequences for a fixed primal and dual point (Kakade and Tewari, 2008). However, in order to bound the primal objective gap, one has to bound the latter martingale difference sequence for any possible value of the dual variable, so that the maximum over the dual variable can be taken. A standard approach for achieving this high probability bound is to use a covering number argument for the dual domain. However, this will inevitably introduce dependence on the dimensionality of the dual variable, since the cardinality of a cover of a bounded ball or of a simplex grows exponentially with the dimension.
To tackle the aforementioned challenges for both the analysis in expectation and the high probability analysis, we develop a different analysis for the proposed RSPD algorithm in order to achieve a faster convergence rate of $O(1/T)$ without explicit dependence on the dimensionality of the dual variable. In this subsection, we focus on the convergence result in expectation, which will be extended to high probability convergence in the next subsection. Our analysis in expectation is built on the following lemma, which is used to derive the $O(1/\sqrt{T})$ convergence rate in the literature (Nemirovski et al., 2009).
Lemma 2.
Let Lines 4 and 5 of Algorithm 1 run for iterations with a fixed step size and . Then
[TABLE]
where , , and .
Remark: A nice property of the above result is that the max over in the L.H.S is taken before expectation.
Nevertheless, a simple approach for setting the step size still yields a convergence rate of $O(1/\sqrt{T})$ by assuming the size of the domain is bounded (Nemirovski et al., 2009). The proposed RSPD algorithm has a special design for computing the restarted solutions and setting the step sizes, which together allow us to achieve an $O(1/T)$ convergence rate as stated in the following theorem. The key idea is that by using the dual solution computed from the averaged primal solution as a restarted point for the dual variable, we are able to control the dual distance term through the primal one by using the strong convexity of $\phi^*$ and of $P$. The convergence result of RSPD is presented below.
Theorem 2.
Suppose that Assumption 1 holds with and being -strongly convex. By setting and , then Algorithm 1 guarantees that . The total number of iterations is .
Remark. The equivalent convergence rate of the above result is $O(1/T)$ given a total number of iterations $T$. This matches the state-of-the-art convergence result for stochastic strongly convex minimization (Hazan and Kale, 2011b). Our algorithm can be applied to solving (2) for non-decomposable $\phi$. In contrast to the standard stochastic primal-dual subgradient method, the additional computational overhead in RSPD is introduced by computing the restarted dual points. However, such computation only happens a logarithmic number of times. We defer the discussion on the total time complexity of RSPD to the next section for some particular applications.
Proof.
To prove Theorem 2, we first need Lemma 2. Its proof will be given in Appendix B.
Let , by the setting of Algorithm 1, we know , , and for . We will show by induction for . It is easy to verify for a sufficiently large according to Assumption 1. Next, we need to show that conditional on , then we have
[TABLE]
Consider the update of the current stage. By applying Lemma 2 to this stage, we have
[TABLE]
Since and , we have
[TABLE]
For the first term on the RHS of (12), by the strong convexity of and the condition we have
[TABLE]
For the second term on the RHS of (12),
[TABLE]
where the first equality is due to the setup of the algorithm and Lemma 5, and the second equality holds because $\phi$ is smooth. Since $P$ is strongly convex, its optimal solution is unique, and we have
[TABLE]
Then the inequality (12) becomes
[TABLE]
By the setting of , and , we know , then
[TABLE]
Therefore, by induction, after running stages, we have
[TABLE]
The total iteration complexity is . ∎
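The way a geometric epoch schedule adds up to an $O(1/\epsilon)$ total can be checked with a short calculation. The schedule below (epoch lengths doubling from $T_1$, the target gap halving each stage, and $K=\lceil\log_2(\epsilon_0/\epsilon)\rceil$ stages) is an assumption made for illustration, since the exact parameter settings of the theorem are not reproduced in this text:

$$\sum_{k=1}^{K}T_k=\sum_{k=1}^{K}2^{k-1}T_1=(2^{K}-1)\,T_1<2^{K}\,T_1\le\frac{2\epsilon_0}{\epsilon}\,T_1,$$

where the last step uses $K<\log_2(\epsilon_0/\epsilon)+1$. If $T_1$ does not depend on $\epsilon$, the total is $O(1/\epsilon)$, while the number of restarts $K$ is only logarithmic in $1/\epsilon$.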
4.2 RSPD Algorithm under the LEB condition
In the previous subsection, we introduced the RSPD algorithm for solving problem (1) when the objective function $P$ is strongly convex and $\phi$ is $L$-smooth. However, these conditions are sometimes too strong for many machine learning problems. In this subsection, we relax these strong conditions by assuming that $P$ satisfies the LEB condition (7) and $\phi$ has an $(L,v)$-Hölder continuous gradient. We will develop a different variant of RSPD that also has a high probability convergence guarantee.
We consider a ball centered at a given primal point with a given radius, intersected with the primal domain, and similarly a ball centered at a given dual point with a given radius, intersected with the dual domain. The second variant of the RSPD algorithm for solving problem (1) is summarized in Algorithm 2, which is similar to Algorithm 1 except that the iterates are projected onto bounded balls centered at the initial solutions of each epoch. This complication in the updates is introduced for the purpose of high-probability analysis, and it also allows us to tackle problems that satisfy the LEB condition. After each epoch, the algorithm reduces the radius of the Euclidean ball. It is notable that this ball shrinkage technique is not new and has already been used in the Epoch-SGD method (Hazan and Kale, 2011b) for high probability bound analysis. We set the same initial radius for the primal variable and the dual variable for the convenience of analysis. However, one can use different values, and the same order of convergence result will be obtained by changing the analysis slightly. Another feature of Algorithm 2 that differs from Algorithm 1 is that it uses a constant number of iterations in the inner loop in order to accommodate the local error bound condition.
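The extra projection step can be illustrated as follows. This is a minimal sketch that assumes the underlying domain is the whole space, so that only the ball constraint is active; with a constrained domain the projection onto the intersection would be needed instead. The function name is hypothetical.

```python
import math

def project_ball(x, center, radius):
    """Euclidean projection of x onto the ball {z : ||z - center|| <= radius}."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm <= radius:
        return list(x)                  # already inside: nothing to do
    scale = radius / norm               # pull back onto the sphere
    return [ci + scale * v for ci, v in zip(center, diff)]

center = [0.0, 0.0]
outside = project_ball([3.0, 4.0], center, 1.0)   # lands on the unit circle
inside = project_ball([0.3, 0.4], center, 1.0)    # returned unchanged
```

Within an epoch the center stays fixed at the epoch's initial solution, so every iterate remains within the current radius of it; shrinking the radius between epochs then confines later iterates to a smaller neighborhood.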
We summarize the theoretical result of Algorithm 2 with a high probability bound in the following theorem.
Theorem 3.
Suppose that Assumption 1 holds and $P$ obeys the LEB condition (7). Given , let , , and
[TABLE]
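Algorithm 2 guarantees an $\epsilon$-optimal primal objective gap with probability at least $1-\delta$. The total number of iterations is $\widetilde{O}\big(1/\epsilon^{2(1-v\theta)}\big)$, where $\widetilde{O}(\cdot)$ suppresses a logarithmic factor.
Remark. When $v\theta>0$, RSPD enjoys an improved iteration complexity compared with $\widetilde{O}(1/\epsilon^{2})$. When $v=1$ (i.e., $\phi$ is smooth), if $\theta=1/2$ (e.g., $P$ is (weakly) strongly convex), then RSPD enjoys an iteration complexity of $\widetilde{O}(1/\epsilon)$, which is only worse by a logarithmic factor than the convergence result in expectation in Theorem 2 for strongly convex $P$. When $v=0$ or $\theta=0$ (i.e., $\phi$ is non-differentiable with no Hölder continuous gradient, or $P$ does not obey the error bound condition), the convergence rate reduces to the standard $O(1/\sqrt{T})$ rate up to a logarithmic factor.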
Proof.
To prove Theorem 3, we first present the following two lemmas. The first one states Azuma’s inequality, which handles martingale difference sequences. The second one analyzes the behaviour of the updates within a stage of Algorithm 2. The proof of Lemma 4 is in Appendix C.
Lemma 3.
(Azuma’s inequality) Let be the martingale difference sequence. Suppose that . Then for we have
[TABLE]
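A quick numerical sanity check of this type of bound can be run on a synthetic martingale. The sketch below uses the standard Azuma-Hoeffding form, in which martingale differences bounded by $c$ satisfy $\sum_{t=1}^{T}X_t\le c\sqrt{2T\log(1/\delta)}$ with probability at least $1-\delta$; this textbook form, the synthetic sequence and all names are assumptions made for illustration and may differ in constants from the statement used in the paper.

```python
import math
import random

def martingale_sum(T, rng):
    """Sum of T martingale differences X_t = s_t * e_t with |X_t| <= 1.

    The scale s_t depends only on the past, and e_t is a fresh fair sign,
    so each X_t has zero conditional mean.
    """
    total = 0.0
    for _ in range(T):
        scale = 1.0 if total <= 0 else 0.5
        sign = 1.0 if rng.random() < 0.5 else -1.0
        total += scale * sign
    return total

T, delta, trials = 100, 0.05, 2000
threshold = math.sqrt(2.0 * T * math.log(1.0 / delta))   # bound with c = 1
rng = random.Random(0)
exceed = sum(martingale_sum(T, rng) > threshold for _ in range(trials))
empirical = exceed / trials   # observed frequency of violating the bound
```

The observed violation frequency `empirical` should stay below `delta`; the bound is typically loose because it holds for every bounded martingale difference sequence, not only this one.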
Lemma 4.
Let Lines 4, 5, and 6 of Algorithm 2 run for iterations with a fixed step size and starting from and . Then with probability at least where , we have
[TABLE]
where , , and any fixed .
Now we proceed to the proof of Theorem 3. Let , by the setting of Algorithm 2, we know , , , and for . We will show by induction for with high probability. It is easy to verify for a sufficiently large according to Assumption 1. Next, we need to show that conditional on , we have
[TABLE]
with a high probability.
Consider the update of the -th stage. Define and . We would like to show that both and always hold, so that we are able to plug and into (4) in Lemma 4. To this end, we have for ,
[TABLE]
where the first inequality is due to Lemma 4 in (Yang and Lin, 2018), the second inequality is due to (7) and the third inequality is due to .
For , we have
[TABLE]
where the first equality is due to the set up of the algorithm and Lemma 5, the first inequality is due to the -Hölder continuous gradients of (Assumption 1 (3)), the second inequality is due to -Lipschitz continuity of (Assumption 1 (4)), and the last equality is due to the setting of .
By showing that and , we then plug in and into (4) in Lemma 4 as follows
[TABLE]
Finally, we would like to show by properly setting the values of , , , and .
First, to make in term and in term , we have and , respectively. Recalling that , this requires and , as in Line 5 and 6 of Algorithm 2. Next, we can plug and into term and . By setting , we have
[TABLE]
Then, for , by setting , we have . Last, for , by setting , we have .
Therefore, we have
[TABLE]
i.e.,
[TABLE]
By induction, after running stages, with probability , we have
[TABLE]
where we set . Considering the requirements from (4.2), for , we have
[TABLE]
Recall that , , and . On one hand, we have
[TABLE]
On the other hand, for , we have
[TABLE]
The above terms show that would not change as changes. Provided and , the total number of iterations is at most
$$ST=O\bigg(\frac{\lceil\log(\frac{\epsilon_{0}}{\epsilon})\rceil\,\lceil\log(S/\delta)\rceil}{\epsilon^{2(1-v\theta)}}\bigg)=\widetilde{O}\bigg(\frac{1}{\epsilon^{2(1-v\theta)}}\bigg).$$
∎
4.3 Adaptive Variants of RSPD
When setting the initial value of the radius (as well as the value of ) in Algorithm 2, one needs to know , and (setting ), which may not be feasible in practice. Below, we introduce an adaptive variant of Algorithm 2 to find the $\epsilon$-optimal solution without knowing either or and to initiate the algorithm under that . The developments in this section are mostly direct extensions of techniques introduced in (Xu et al., 2017; Yang and Lin, 2018).
The idea of tackling unknown is similar to the grid search: starting from a guess of for setting to run RSPD and then restarting RSPD using a larger (increased by a constant factor) or equivalently a larger . However, in order to not waste the updates for using a smaller and also remove the dependence on for setting , we equivalently increase and in a way that depends on such that a similar convergence rate can be still established. The details are presented in Algorithm 3. The following theorem gives convergence result of Algorithm 3. Its proof is in Appendix D.
Theorem 4.
Suppose that Assumption 1 holds with , and there exists such that the initial value satisfies and the error bound condition holds on with . For any , , let , , and $T_{1}=\max\bigg\{320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg\}\cdot\frac{(R_{1}^{(1)})^{2}}{\epsilon_{0}^{2}}$. After at most calls of RSPD, Algorithm 3 guarantees that with probability with an iteration complexity of .
Remark: The requirement on the local error bound condition of the above theorem seems slightly stronger than requiring that it holds on . However, for a convex function it has been shown that a local error bound condition implies an error bound condition on any compact set with the same but possibly different (Bolte et al., 2015). The above theorem and Algorithm 3 do not cover the case . This case can be resolved by setting according to an initial guess of , and then doubling or and rerunning RSPD. It is easy to see that after times the estimated value of will become larger than the true and the convergence theory in the previous subsection will apply. As a result, the total iteration complexity is only amplified by a factor of .
Finally, we can show that even if is unknown, by setting in Algorithm 3, we can still prove an improved convergence. Let be the maximum distance between the points in the -level set and the optimal set . The proof of the following theorem is similar to that of Theorem 4 (in Appendix D) with a slight modification.
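The doubling argument in this remark can be illustrated with a small sketch: starting from an initial guess and doubling it on each rerun, the number of reruns needed before the guess reaches the true value grows only logarithmically in their ratio. The function name and the stopping rule (stop once the guess is at least the true value) are assumptions for illustration; in practice the true value is unknown and the loop stands in for successive reruns of RSPD.

```python
import math


def doublings_needed(initial_guess, true_value):
    """Count how many doublings bring initial_guess up to true_value.

    Hypothetical helper: it only illustrates the logarithmic overhead of
    the guess-and-double strategy, not the RSPD runs themselves.
    """
    if initial_guess <= 0 or true_value <= 0:
        raise ValueError("both values must be positive")
    guess, count = initial_guess, 0
    while guess < true_value:
        guess *= 2.0
        count += 1
    return count


if __name__ == "__main__":
    for ratio in (8.0, 9.0, 1000.0):
        # Matches ceil(log2(ratio)) for a unit initial guess.
        print(ratio, doublings_needed(1.0, ratio), math.ceil(math.log2(ratio)))
```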
Theorem 5.
Suppose that Assumption 1 (14) holds with , and is sufficiently large such that there exists and . Given , let , , , $T_{1}=\max\bigg\{320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg\}\cdot\frac{(R_{1}^{(1)})^{2}}{\epsilon_{0}^{2}}$, and . After at most calls of RSPD, Algorithm 3 guarantees that with probability with an iteration complexity of .
Remark: This iteration complexity is still an improved one compared with that in (Nemirovski et al., 2009), reducing the dependence on the size of and to the .
5 Applications and Experiments
In this section, we investigate the effectiveness of our algorithms on two applications, i.e., distributionally robust optimization (DRO) and area under the receiver operating characteristic curve (AUC) maximization. We perform DRO experiments on four benchmark datasets: a9a, real-sim, rcv1 and w8a. AUC experiments are performed on a9a, real-sim, covtype and URL. Table 1 shows the statistics of the six datasets used.
DRO. First, we consider solving the DRO (4) for binary classification as mentioned in the Introduction. We use the square distance for that was studied in (Namkoong and Duchi, 2017), i.e., . For the loss function, we consider the non-smooth hinge loss , where denotes the feature vector and denotes the label. We also include a regularizer on the model parameter . Using different regularizers will give different properties for the primal objective function. For example, if , then the primal objective function is obviously a strongly convex function. If , then we can prove that the primal objective function is a piecewise quadratic convex function, which satisfies the LEB condition with . The proof is given in Appendix E. We report the result of RSPD for solving the problem with here.
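As a concrete reference for the loss used in this paragraph, the sketch below computes the hinge loss, one of its subgradients with respect to the model parameter, and the weighted loss for a fixed distribution over examples. It assumes the standard form $\max(0, 1 - y\,w^{\top}x)$ with labels in $\{-1, +1\}$; the function names, the plain-list data layout, and the omission of the distance term and the regularizer are illustrative assumptions rather than the authors' implementation.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def hinge_loss(w, x, y):
    """Standard hinge loss max(0, 1 - y * <w, x>) (assumed form)."""
    return max(0.0, 1.0 - y * dot(w, x))


def hinge_subgradient(w, x, y):
    """One subgradient of the hinge loss in w: -y*x if the margin is < 1, else 0."""
    if y * dot(w, x) < 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)


def weighted_loss(p, w, X, Y):
    """Loss term sum_i p_i * loss_i(w) for a fixed weight vector p on the examples.

    The distance penalty on p and the regularizer on w are omitted here.
    """
    return sum(pi * hinge_loss(w, x, y) for pi, x, y in zip(p, X, Y))


if __name__ == "__main__":
    X = [[0.5, 0.0], [2.0, 0.0]]
    Y = [1, 1]
    w = [1.0, 0.0]
    print(weighted_loss([0.5, 0.5], w, X, Y))  # uniform weights
    print(weighted_loss([1.0, 0.0], w, X, Y))  # all weight on the harder example
```

Shifting weight toward examples with larger loss increases the weighted loss, which is the effect the dual variable exploits in the robust formulation.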
We compare with the baseline called Bandit Mirror Descent (BMD) algorithm considered in (Namkoong and Duchi, 2016), which has a convergence rate of . The stochastic gradients are computed in the same way as in (Namkoong and Duchi, 2016). Computing the restarted dual solution takes time complexity, and each update for the primal variable and the dual variable takes and , respectively. Therefore, the total time complexity of RSPD for finding an -optimal solution is . In contrast, the time complexity of BMD is .
We conduct experiments on four datasets from the LIBSVM website using regularization for . The regularizer parameters are set to be for all datasets. The initial step sizes of all algorithms are tuned in the range of . All algorithms start with the same initial solutions with and . In implementing RSPD, we start with an initial increased by a factor of at each epoch. The results of objective gap against the number of gradients and against CPU time are shown in Figure 1 and Figure 2, respectively. It is clear that the proposed algorithm converges much faster than the baseline algorithm BMD.
AUC Maximization. Next, we consider empirical AUC maximization by solving the min-max saddle-point formulation proposed by (Ying et al., 2016):
[TABLE]
where denotes the feature-label pair of a training example, , is the percentage of positive examples, and is the indicator function. Let . In order to achieve good AUC performance, we add a ball constraint on . Bounds on can be derived similarly to (Ying et al., 2016). If we use the ball , it was shown in (Liu et al., 2018a) that the primal objective function satisfies the LEB with . If we use the ball constraint , under a mild condition that it was shown that a LEB with is satisfied (Liu et al., 2018b). Then the iteration complexity of RSPD is given by . Since the dual variable is one-dimensional, computing the restarted dual solution takes complexity given that the averaged feature vectors for the positive and negative examples are precomputed. Hence, when LEB with is satisfied, the total time complexity of RSPD or ARSPD is . We also note that SPDC (Zhang and Lin, 2015) is applicable in the AUC task, but it does not give a linear rate for the considered AUC problem, because there is no strong convexity in the primal variable as required for achieving a linear rate. With a small strongly convex regularizer added on the primal variable, its total time complexity is since every iteration needs to solve a linear system (i.e., the proximal mapping of the quadratic part of the primal variable), where is the sample size. Here, we report the results of the proposed adaptive algorithm for the problem with an ball constraint and an ball, respectively.
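The remark that the restarted dual solution is cheap once the class-wise mean feature vectors are precomputed can be sketched as follows. The displayed objective is not reproduced in this copy, so the sketch assumes that the part of the objective depending on the dual variable $\alpha$ has the form $2(1+\alpha)\big(p\,w^{\top}x\,\mathbb{I}[y=-1]-(1-p)\,w^{\top}x\,\mathbb{I}[y=1]\big)-p(1-p)\alpha^{2}$, as in the saddle-point formulation of (Ying et al., 2016). Under that assumption the maximizer is $\alpha = w^{\top}(\bar{x}_{-}-\bar{x}_{+})$. All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def class_means(X, y):
    """Mean feature vectors of positive (y = +1) and negative (y = -1) examples."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    return mean(pos), mean(neg)


def restarted_alpha(w, mean_pos, mean_neg):
    """Closed-form dual maximizer under the assumed objective: <w, mean_neg - mean_pos>."""
    return dot(w, mean_neg) - dot(w, mean_pos)


def alpha_terms(alpha, w, X, y):
    """Empirical average of the alpha-dependent terms of the assumed objective."""
    n = len(y)
    p = sum(1 for label in y if label == 1) / n  # fraction of positive examples
    total = 0.0
    for x, label in zip(X, y):
        score = dot(w, x)
        signed = p * score if label == -1 else -(1.0 - p) * score
        total += 2.0 * (1.0 + alpha) * signed - p * (1.0 - p) * alpha ** 2
    return total / n


if __name__ == "__main__":
    X = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
    y = [1, 1, -1]
    w = [1.0, 2.0]
    mean_pos, mean_neg = class_means(X, y)
    alpha = restarted_alpha(w, mean_pos, mean_neg)
    print(alpha, alpha_terms(alpha, w, X, y))
```

Once the two mean vectors are stored, each call to `restarted_alpha` costs two inner products, so the cost is linear in the feature dimension and independent of the number of examples.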
Since the function is smooth in terms of and , we include more applicable baselines for comparison. In particular, we compare with four algorithms, i.e., PDSG (Nemirovski et al., 2009), SPAM (Natole et al., 2018), SMP (Juditsky et al., 2011) and primal-dual SVRG (Palaniappan and Bach, 2016). For primal-dual SVRG, we directly use the formulation of AUC proposed in the paper and conduct the experiment using the code provided by the authors (code derived from https://sites.google.com/site/pbalamuru/home/sagsaddle-code). SPAM is an algorithm proposed particularly for stochastic AUC maximization. SMP and SVRG utilize the smoothness of the objective function. The complexity of PDSG and SMP for finding an -stationary solution is given by . Note that both SPAM and SVRG require strong convexity of the objective function in the primal variable. To this end, we add an regularizer, i.e., with a small value of . These two algorithms have a total time complexity for finding a solution such that given by and , respectively. We can see that all baseline algorithms have worse time complexity than RSPD, especially the primal-dual SVRG algorithm.
In the ball setting, we fix and on all datasets. In the ball setting, we set on a9a, covtype and URL, and on real-sim. The initial step sizes of all algorithms are tuned in the range of . For ARSPD, we set and simply set pretending that we do not know the value of true and tune . The initial solutions of all algorithms are set to . For the ball setting, the convergence curves of AUC on four datasets against the number of gradients and CPU time are shown in Figure 3 and Figure 4, including two large-scale datasets covtype and URL, on which SVRG is too slow to be plotted. For the ball setting, the convergence curves of AUC against the number of gradients and CPU time are shown in Figure 5 and Figure 6. We can see that the overall performance of ARSPD is the best among all algorithms.
6 Conclusion
In this paper, we have proposed novel stochastic primal-dual algorithms for solving convex-concave problems with no bilinear structure assumed, which employ a mixture of stochastic gradient updates and deterministic dual updates. A fast convergence rate of was achieved under strong convexity on the primal and dual variables. In addition, we designed variants for more general problems without strong convexity, achieving adaptive rates. Empirical results verify the effectiveness of our algorithms.
Appendix A A Lemma Regarding
Lemma 5.
Let , where is the convex conjugate of a differentiable function , then
[TABLE]
Proof.
Let , then we know
[TABLE]
Since is differentiable, and then by using Lemma 11.4 in (Cesa-Bianchi and Lugosi, 2006) we have
[TABLE]
That is
[TABLE]
∎
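The statement of the lemma is not reproduced in this copy. Assuming it is an instance of the standard conjugate identity, namely that for a differentiable convex $f$ the maximizer of $y\,u - f^{*}(y)$ over $y$ equals $f'(u)$, the following sketch checks that identity numerically for the one-dimensional quadratic $f(u) = \tfrac{a}{2}u^{2}$, whose conjugate is $f^{*}(y) = y^{2}/(2a)$. The choice of $f$, the search interval, and the function names are assumptions for illustration.

```python
def argmax_concave(func, lo, hi, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def conjugate_maximizer(u, a=2.0):
    """Numerically maximize y*u - f*(y) for f(u) = (a/2) u^2, f*(y) = y^2 / (2a)."""
    return argmax_concave(lambda y: y * u - y * y / (2.0 * a), -100.0, 100.0)


def gradient_of_f(u, a=2.0):
    """Derivative of f(u) = (a/2) u^2."""
    return a * u


if __name__ == "__main__":
    for u in (-3.0, 0.0, 1.5):
        print(u, conjugate_maximizer(u), gradient_of_f(u))
```

For each test point the numerical maximizer agrees with $f'(u)$ up to the search tolerance, which is what the conjugate identity predicts.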
Appendix B Proof of Lemma 2
For simplicity of presentation, we use the notations , , and . To prove Lemma 2, we would leverage the following two update approaches:
[TABLE]
where and . The first two updates are identical to Line 4 and Line 5 in Algorithm 1. This can be verified easily. Take the first one as an example:
[TABLE]
Let with , which includes the four update approaches in (17) as special cases. By using the strong convexity of and the first order optimality condition (), for any , we have
[TABLE]
which implies
[TABLE]
Then
[TABLE]
Applying the above result to the updates in (17), we have
[TABLE]
Adding the above four inequalities, we have
[TABLE]
where the last inequality uses the facts that , , and . Then we combine the LHS and RHS by summing up :
[TABLE]
By Jensen’s inequality, we have
[TABLE]
where , . Let and , we get
[TABLE]
Then we complete the proof by taking the expectation on both sides of the above inequality and using the fact that .
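The step in the proof above that combines strong convexity with the first-order optimality condition is, in its standard form, the three-point inequality for a projected (proximal) step. The displayed inequalities are not reproduced in this copy, so the sketch below assumes the usual Euclidean version: if $u^{+} = \arg\min_{u \in U} \gamma g\,u + \tfrac{1}{2}(u-u_{0})^{2}$, then $\gamma g\,(u^{+}-u) \le \tfrac{1}{2}(u_{0}-u)^{2} - \tfrac{1}{2}(u^{+}-u)^{2} - \tfrac{1}{2}(u^{+}-u_{0})^{2}$ for every $u \in U$. It checks this in one dimension with $U$ an interval; all names are hypothetical.

```python
def prox_step(u0, g, gamma, lo, hi):
    """Projected gradient step: argmin over [lo, hi] of gamma*g*u + 0.5*(u - u0)^2."""
    return min(hi, max(lo, u0 - gamma * g))


def three_point_gap(u0, g, gamma, lo, hi, u):
    """Right-hand side minus left-hand side of the three-point inequality at u.

    A nonnegative value means the inequality holds at the comparison point u.
    """
    u_plus = prox_step(u0, g, gamma, lo, hi)
    lhs = gamma * g * (u_plus - u)
    rhs = 0.5 * (u0 - u) ** 2 - 0.5 * (u_plus - u) ** 2 - 0.5 * (u_plus - u0) ** 2
    return rhs - lhs


if __name__ == "__main__":
    lo, hi = -1.0, 1.0
    grid = [lo + (hi - lo) * k / 20.0 for k in range(21)]
    for g in (-5.0, 0.3, 4.0):
        worst = min(three_point_gap(0.2, g, 0.5, lo, hi, u) for u in grid)
        print(g, worst)
```

When the step stays in the interior of the interval the inequality holds with equality, and when the projection is active the gap is strictly positive for interior comparison points.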
Appendix C Proof of Lemma 4
For simplicity of presentation, we use the notations , , and .
To prove Lemma 4, we would leverage the following two update approaches:
[TABLE]
where and . The first two lines are identical to Lines 5 and 6 in Algorithm 2. This can be verified easily. Take the first one as an example:
[TABLE]
Let us define with , which includes the four update approaches in (28) as special cases. By using the strong convexity of and the first order optimality condition (), for any , we have
[TABLE]
which implies
[TABLE]
Then
[TABLE]
Applying the above result to the updates in (28) (treating above as , , , , respectively), we have
[TABLE]
Adding the above four inequalities, we have
[TABLE]
where the last inequality uses the facts that , , and . Then we combine the LHS and RHS by summing up :
[TABLE]
By Jensen’s inequality, we have
[TABLE]
where , . Let and any fixed , we get
[TABLE]
Then we employ Azuma’s inequality (Lemma 3) to upper bound the last term with high probability. Let be a martingale difference sequence. We have
[TABLE]
where the first inequality is due to the triangle inequality, the second inequality is due to the Cauchy–Schwarz inequality, the third inequality is due to Assumption 1 (2), and the last inequality is due to , . Therefore, by Azuma’s inequality, with probability at least , we have for any
[TABLE]
Appendix D Proof of Theorem 4 (Theorem 5)
The proof is similar to the proof of Theorem 3 in (Xu et al., 2017). For completeness, we include it here. The proof of Theorem 5 can also be obtained by a slight change of the following proof.
Proof.
Based on the proof of Theorem 3, since and by the settings of , , $T_{1}=\max\bigg\{320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg\}\cdot\frac{(R_{1}^{(1)})^{2}}{\epsilon_{0}^{2}}$, it can be shown that
[TABLE]
with a probability . Next, by running RSPD with initial satisfying (37) and the settings of , , and $T_{2}=T_{1}\cdot 2^{2(1-\theta)}=\max\bigg\{320M^{2},320B^{2}L^{2}G^{2},8192\log(\frac{1}{\tilde{\delta}})M^{2},8192\log(\frac{1}{\tilde{\delta}})B^{2}L^{2}G^{2}\bigg\}\cdot\frac{(R_{1}^{(2)})^{2}}{\epsilon_{0}^{2}}$, Theorem 3 ensures that with a probability at least ,
[TABLE]
By continuing this process with , we can show that
[TABLE]
with a probability at least . The total number of iterations for calls of RSPD can be bounded by
[TABLE]
∎
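The per-call iteration counts used in this proof can be sketched numerically. The first function evaluates the stated expression for $T_{1}$; the second extrapolates the stated relation $T_{2}=T_{1}\cdot 2^{2(1-\theta)}$ to later calls as $T_{k}=T_{1}\cdot 2^{2(k-1)(1-\theta)}$, which is an assumption based on "continuing this process". Function and argument names are hypothetical, and rounding up to an integer is an illustrative choice.

```python
import math


def first_call_length(M, B, L, G, R1, eps0, delta_tilde):
    """Evaluate T_1 = max{320 M^2, 320 B^2 L^2 G^2, 8192 log(1/delta) M^2,
    8192 log(1/delta) B^2 L^2 G^2} * R1^2 / eps0^2."""
    log_term = math.log(1.0 / delta_tilde)
    m_sq = M ** 2
    blg_sq = (B * L * G) ** 2
    leading = max(320.0 * m_sq, 320.0 * blg_sq,
                  8192.0 * log_term * m_sq, 8192.0 * log_term * blg_sq)
    return leading * R1 ** 2 / eps0 ** 2


def call_lengths(T1, theta, num_calls):
    """Iterations per RSPD call, assuming T_k = T_1 * 2^(2 (k - 1) (1 - theta))."""
    growth = 2.0 ** (2.0 * (1.0 - theta))
    return [math.ceil(T1 * growth ** k) for k in range(num_calls)]


if __name__ == "__main__":
    T1 = first_call_length(M=1.0, B=1.0, L=1.0, G=1.0, R1=1.0, eps0=1.0, delta_tilde=0.05)
    print(call_lengths(T1, theta=0.5, num_calls=4))
```

Under this schedule the per-call budget stays constant when $\theta = 1$ and grows by a factor of four per call when $\theta = 0$.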
Appendix E Piecewise Quadratic Function of Distributionally Robust Optimization
We would like to prove that the regularized DRO function is convex and piecewise quadratic, so that it satisfies the LEB condition with . First we present the following proposition.
Proposition 1.
(Proposition 2.3 (Rockafellar, 1987)) Let where is symmetric and positive semidefinite, and is lower semicontinuous, convex and piecewise linear-quadratic. Its effective domain is a nonempty convex polyhedron that can be decomposed into finitely many polyhedral convex sets, on each of which is quadratic or linear.
We can rewrite DRO as , which is piecewise linear-quadratic in $\big(\ell(x)+n\lambda_{1}\mathbf{1}\big)$ according to the above proposition. If is piecewise linear, the composition of the piecewise linear and piecewise quadratic functions is piecewise quadratic.
References
- [1] Bolte et al. (2015) Jerome Bolte, Trong Phong Nguyen, Juan Peypouquet, and Bruce Suter. From error bounds to the complexity of first-order descent methods for convex functions. CoRR, abs/1510.08234, 2015.
- [2] Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- [3] Chen et al. (2017a) Robert S. Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems 30 (NIPS), pages 4705–4714, 2017a.
- [4] Chen et al. (2014) Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014. doi:10.1137/130919362.
- [5] Chen et al. (2017b) Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Accelerated schemes for a class of variational inequalities. Mathematical Programming, 165(1):113–149, Sep 2017b.
- [6] Dekel and Singer (2006) Ofer Dekel and Yoram Singer. Support vector machines on a budget. In NIPS, pages 345–352, 2006.
- [7] Du and Hu (2018) Simon S. Du and Wei Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. CoRR, abs/1802.01504, 2018.
- [8] Dvurechensky et al. (2018) Pavel Dvurechensky, Alexander Gasnikov, Fedor Stonyakin, and Alexander Titov. Generalized mirror prox: Solving variational inequalities with monotone operator, inexact oracle, and unknown Hölder parameters. arXiv preprint arXiv:1806.05140, 2018.
