Acceleration of SVRG and Katyusha X by Inexact Preconditioning

Yanli Liu; Fei Feng; and Wotao Yin

arXiv:1905.09734·math.OC·May 24, 2019·ICML


TL;DR

This paper introduces an inexact preconditioning technique to accelerate SVRG and Katyusha X algorithms, achieving faster convergence and practical speedups in empirical risk minimization tasks.

Contribution

It proposes a novel inexact preconditioning approach with fixed preconditioners that enhances convergence of SVRG and Katyusha X without increasing memory requirements.

Findings

1. Achieves better iteration and gradient complexity.
2. Provides theoretical convergence guarantees.
3. Demonstrates 8x iteration and 7x runtime speedups in experiments.

Abstract

Empirical risk minimization is an important class of optimization problems with many popular machine learning applications, and stochastic variance reduction methods are popular choices for solving them. Among these methods, SVRG and Katyusha X (a Nesterov accelerated SVRG) achieve fast convergence without substantial memory requirement. In this paper, we propose to accelerate these two algorithms by inexact preconditioning. The proposed methods employ fixed preconditioners; although the subproblem in each epoch becomes harder, it suffices to apply a fixed number of simple subroutines to solve it inexactly, without losing the overall convergence. As a result, this inexact preconditioning strategy gives provably better iteration complexity and gradient complexity than SVRG and Katyusha X. We also allow each function in the finite sum to be nonconvex while the sum…

Equations

$$F(x)=f(x)+\psi(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)+\psi(x),$$

$$w_{t+1}=w_{t}-\eta H_{k}\tilde{\nabla}_{t},$$

$$f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L_{f}}{2}\|y-x\|^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$

$$f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L^{M}_{f}}{2}\|y-x\|_{M}^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$

$$f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\sigma_{f}}{2}\|y-x\|^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$

$$f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\sigma^{M}_{f}}{2}\|y-x\|_{M}^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$

$$\|\nabla f(x)-\nabla f(y)\|_{M^{-1}}\equiv\|x-y\|_{Q},$$

$$\partial\phi(x)=\{v\in\mathbb{R}^{d}\mid\phi(z)\geq\phi(x)+\langle v,z-x\rangle\ \ \forall z\in\mathbb{R}^{d}\}.$$

$$\mathbf{prox}^{M}_{\eta\psi}(x)=\operatorname*{arg\,min}_{y\in\mathbb{R}^{d}}\Big\{\psi(y)+\frac{1}{2\eta}\|x-y\|_{M}^{2}\Big\}.$$

$$\mathbf{prox}_{\eta\psi}(x)=\operatorname{sign}(x)\max\{|x|-\eta\lambda,0\}.$$

$$w_{t+1}=\operatorname*{arg\,min}_{y\in\mathbb{R}^{d}}\Big\{\psi(y)+\frac{1}{2\eta}\|y-w_{t}\|_{M}^{2}+\langle\tilde{\nabla}_{t},y\rangle\Big\}.$$

$$w_{t+1}=\mathbf{prox}_{\eta\psi}(w_{t}-\eta\tilde{\nabla}_{t}),$$

$$\operatorname*{arg\,min}_{y\in\mathbb{R}^{d}}\Big\{\psi(y)+\frac{1}{2\eta}\|y-w_{t}\|_{M}^{2}+\langle\tilde{\nabla}_{t},y\rangle\Big\}.$$

$$w_{t+1}^{i+1}=\mathbf{prox}_{\gamma\psi}\Big(w_{t+1}^{i}-\frac{\gamma}{\eta}M(w_{t+1}^{i}-w_{t})-\gamma\tilde{\nabla}_{t}\Big).$$

$$\min_{y}\Psi(y)=h_{1}(y)+h_{2}(y).$$

$$0\in\ldots$$

$$\|\varepsilon^{p}_{t+1}\|_{M}\leq\frac{c(p)}{\eta}\|w_{t+1}-w_{t}\|_{M},$$

$$c(p)=14\kappa(M)\frac{\tau^{p}}{1-\tau^{p}},$$

$$\mathbb{E}[F(x^{k})-F(x^{*})]\leq\mathcal{O}\Big(\Big(\frac{1}{1+\frac{1}{4}m\eta\sigma^{M}}\Big)^{k}\Big).$$

$$\mathbb{E}[F(x^{k})-F(x^{*})]\leq\mathcal{O}\Big(\Big(\frac{1}{1+\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma^{M}}}\Big)^{k}\Big).$$

$$p=\mathcal{O}\Big(\sqrt{\kappa(M)}\ln\big(\sqrt{\kappa^{M}_{f}}\,\kappa(M)\big)\Big)$$

$$C_{1}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+m}{\ln(1+\frac{1}{4}m\eta\sigma)}\ln\frac{1}{\varepsilon}\Big),$$

$$C_{2}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+m}{\ln(1+\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma})}\ln\frac{1}{\varepsilon}\Big).$$

$$C_{1}^{\prime}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+(1+pd)m}{\ln(1+\frac{1}{4}m\eta\sigma^{M})}\ln\frac{1}{\varepsilon}\Big),$$

$$C_{2}^{\prime}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+(1+pd)m}{\ln(1+\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma^{M}})}\ln\frac{1}{\varepsilon}\Big).$$

$$\frac{\min_{m\geq 1}C_{1}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{1}(m,\varepsilon)}\leq\mathcal{O}\Big(\frac{n^{\frac{1}{2}}}{\kappa_{f}}\Big).$$

$$\frac{\min_{m\geq 1}C_{1}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{1}(m,\varepsilon)}\leq\mathcal{O}\Big(\frac{d}{\sqrt{n\kappa_{f}}}\Big).$$

$$\frac{\min_{m\geq 1}C_{2}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{2}(m,\varepsilon)}\leq\mathcal{O}\Big(\sqrt{\frac{n^{\frac{1}{2}}}{\kappa_{f}}}\Big).$$

$$\frac{\min_{m\geq 1}C_{2}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{2}(m,\varepsilon)}\leq\mathcal{O}\Big(\frac{d}{n^{\frac{3}{4}}}\Big).$$


Code & Models

Repositories

uclaopt/IPSVRG (official)


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods

Full text


Abstract

Empirical risk minimization is an important class of optimization problems with many popular machine learning applications, and stochastic variance reduction methods are popular choices for solving them. Among these methods, SVRG and Katyusha X (a Nesterov accelerated SVRG) achieve fast convergence without substantial memory requirement. In this paper, we propose to accelerate these two algorithms by inexact preconditioning. The proposed methods employ fixed preconditioners; although the subproblem in each epoch becomes harder, it suffices to apply a fixed number of simple subroutines to solve it inexactly, without losing the overall convergence. As a result, this inexact preconditioning strategy gives provably better iteration complexity and gradient complexity than SVRG and Katyusha X. We also allow each function in the finite sum to be nonconvex while the sum is strongly convex. In our numerical experiments, we observe on average an $8\times$ speedup in the number of iterations and a $7\times$ speedup in runtime.

SVRG, Katyusha X, inexact preconditioning

1 Introduction

Empirical risk minimization is an important class of optimization problems that has many applications in machine learning, especially in the large-scale setting. In this paper, we formulate it as the minimization of the following objective

$$F(x)=f(x)+\psi(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)+\psi(x),\tag{1.1}$$

where the finite sum $f(x)$ is strongly convex, each $f_{i}(x)$ in the finite sum is smooth (a function $f$ is said to be smooth if its gradient $\nabla f$ is Lipschitz continuous) and can be nonconvex, and the regularizer $\psi(x)$ is proper, closed, and convex, but may be nonsmooth. A nonzero $\psi(x)$ is desirable in many applications, for example, $\ell_{1}$-regularization, which induces sparsity in the solution. Allowing $f_{i}$ to be nonconvex is also necessary in some applications, e.g., the shift-and-invert approach to solving PCA (Saad, 1992).

1.1 Related Work

To obtain a high-quality approximate solution $\hat{x}$ of (1.1), stochastic variance reduction algorithms are a preferred class of methods in the large-scale setting where $n$ is huge. If each $f_{i}$ is $\sigma$-strongly convex and $L$-smooth, and $\psi=0$, then SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014a), SAG (Roux et al., 2012), SARAH (Nguyen et al., 2017), SDCA (Shalev-Shwartz & Zhang, 2013), SDCA without duality (Shalev-Shwartz, 2016), and Finito/MISO (Defazio et al., 2014b; Mairal, 2013) can find such an $\hat{x}$ within $\mathcal{O}\big((n+\frac{L}{\sigma})\ln(\frac{1}{\varepsilon})\big)$ evaluations of component gradients $\nabla f_{i}$, while vanilla gradient descent needs $\mathcal{O}(n\frac{L}{\sigma}\ln\frac{1}{\varepsilon})$ evaluations. Recently, SCSG improved this complexity to $\mathcal{O}\big((n\wedge\frac{L}{\sigma\varepsilon}+\frac{L}{\sigma})\ln\frac{1}{\varepsilon}\big)$, where $a\wedge b\coloneqq\min\{a,b\}$. When $\psi\neq 0$, many of these algorithms can be extended accordingly and the same gradient complexity is preserved (Xiao & Zhang, 2014; Defazio et al., 2014a; Shalev-Shwartz & Zhang, 2016). Among these methods, SVRG has been a popular choice due to its low memory cost.

When the condition number $\frac{L}{\sigma}$ is large, the performance of these variance reduction methods may degrade considerably. In view of this, many schemes have been proposed that incorporate second-order information into variance reduction. In (Gonen et al., 2016), the problem data is first transformed by linear sketching in order to decrease the condition number, and then SVRG is applied. However, the strategy is only proposed for ridge regression and it is unclear whether it can be applied to other problems.

A larger family of algorithms, called Stochastic Quasi-Newton (SQN) methods, applies to more general settings. The idea is to first sample one or a few Hessian-vector products and perform an L-BFGS type update on the approximate Hessian inverse $H_{k}$ (Byrd et al., 2016; Moritz et al., 2016; Gower et al., 2016); then $H_{k}$ is applied to the SVRG-type stochastic gradient as a preconditioner. That is,

$$w_{t+1}=w_{t}-\eta H_{k}\tilde{\nabla}_{t},$$

where $\tilde{\nabla}_{t}$ is a variance-reduced stochastic gradient.

Linear convergence is established and competitive numerical performance is observed for SQN methods. However, the theoretical linear rate depends on the condition number of the approximate Hessian, which in turn depends poorly on the condition number of the objective, so it is not clear whether they are faster than SVRG in general. Furthermore, they support neither nondifferentiable regularizers nor nonconvexity of individual $f_{i}$. Recently, the first issue was partially resolved in (Lin et al., 2016), where the algorithm is at least as fast as SVRG. To deal with the second issue, (Wang et al., 2018) applied an $H_{k}$-preconditioned proximal mapping of $\psi$ after $H_{k}$ is applied to the variance-reduced stochastic gradient, but in order to evaluate this mapping efficiently, $H_{k}$ is required to be of the symmetric rank-one update form $\tau I_{d}+uu^{T}$, where $I_{d}\in\mathbb{R}^{d\times d}$ is the identity matrix and $u\in\mathbb{R}^{d}$. However, $H_{k}$ is still ill-conditioned, with a condition number of order $\mathcal{O}(\frac{1}{\varepsilon})$; therefore only a gradient complexity of order $\mathcal{O}\big((n+\kappa\frac{1}{\varepsilon})\ln(\frac{1}{\varepsilon})\big)$ can be guaranteed.

Another way of exploiting second-order information is to cyclically calculate one individual Hessian $\nabla^{2}f_{i}$ (or an approximation of it) (Rodomanov & Kropotov, 2016; Mokhtari et al., 2018); linear and locally superlinear convergence are established for these methods. However, they require at least an $O(n)$ amount of memory to store the local variables, which will be substantial when $n$ is large.

Aside from exploiting second-order information, it is also possible to apply Nesterov-type acceleration to SVRG. Recently, Katyusha (Allen-Zhu, 2017) and Katyusha X (Allen-Zhu, 2018) were developed in this spirit. Katyusha X also applies to the sum-of-nonconvex setting where each $f_{i}$ can be nonconvex. There are also "Catalyst" accelerated methods (Lin et al., 2015), where a small amount of strong convexity $\frac{c}{2}\|x-y^{k}\|^{2}$ is added to the objective, the result is minimized inexactly at each step, and then Nesterov acceleration is applied. However, Catalyst methods have an additional $\ln k$ factor in gradient complexity over Katyusha and Katyusha X.

1.2 Our Contributions

1. We propose to accelerate SVRG and Katyusha X by a fixed preconditioner, as opposed to the time-varying preconditioners in SQN methods. The subproblems are solved with a fixed number of simple subroutines.

2. If the preconditioner captures the second-order information of $f$, then there will be significant acceleration. With a good preconditioner $M$, when $\kappa_{f}\in(n^{\frac{1}{2}},n^{2}d^{-2})$, the gradient complexities of Algorithm 1 and Algorithm 2 are $\mathcal{O}(\frac{n^{\frac{1}{2}}}{\kappa_{f}})$ and $\mathcal{O}(\sqrt{\frac{n^{\frac{1}{2}}}{\kappa_{f}}})$ times those of SVRG and Katyusha X, respectively. When $\kappa_{f}>n^{2}d^{-2}$, these factors become $\mathcal{O}(\frac{d}{\sqrt{n\kappa_{f}}})$ and $\mathcal{O}(\frac{d}{n^{\frac{3}{4}}})$. We also demonstrate these accelerations for Lasso and logistic regression.

3. Our acceleration applies to the sum-of-nonconvex setting, where $f(x)$ in (1.1) is strongly convex, but each individual $f_{i}$ can be nonconvex. We also allow a nondifferentiable regularizer $\psi(x)$.

2 Preliminaries and Assumptions

Throughout this paper, we use $\|\cdot\|$ for the $\ell_{2}$-norm and $\langle\cdot,\cdot\rangle$ for the dot product; $\|\cdot\|_{1}$ denotes the $\ell_{1}$-norm.

The preconditioner $M\succ 0$ is a symmetric, positive definite matrix. We write $\lambda_{\text{min}}(M)$ and $\lambda_{\text{max}}(M)$ as the smallest and the largest eigenvalues of $M$ , respectively, and $\kappa(M)\coloneqq\frac{\lambda_{\text{max}}(M)}{\lambda_{\text{min}}(M)}$ as the condition number of $M$ . For $M\succ 0$ , let $\|\cdot\|_{M}$ and $\langle\cdot,\cdot\rangle_{M}$ denote the norm and inner product induced by $M$ , respectively, i.e., $\|x\|_{M}=\sqrt{x^{T}Mx},\langle x,y\rangle_{M}=x^{T}My$ .

We use $\lceil\cdot\rceil$ to denote the ceiling function. For $r\in(0,1]$, $N\sim\mathrm{Geom}(r)$ denotes a random variable $N$ that obeys the geometric distribution, i.e., $N=k$ with probability $(1-r)^{k}r$ for $k\in\mathbb{N}$. We have $\mathbb{E}[N]=\frac{1-r}{r}$.
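As a quick numerical check of the mean of this geometric distribution, the following sketch sums the series $\sum_{k}k(1-r)^{k}r$ directly and compares it with the closed form $\frac{1-r}{r}$. The function name, the truncation length, and the value of `m` are illustrative assumptions, not part of the paper.

```python
def geometric_mean_truncated(r, terms=10_000):
    """Approximate E[N] for P(N = k) = (1 - r)**k * r, k = 0, 1, 2, ..."""
    return sum(k * (1.0 - r) ** k * r for k in range(terms))

m = 100                      # illustrative epoch-length parameter
r = 1.0 / m                  # with r = 1/m the mean is (1 - r)/r = m - 1
approx = geometric_mean_truncated(r)
exact = (1.0 - r) / r
print(approx, exact)
```

Choosing $r=\frac{1}{m}$ therefore gives a random epoch length with mean $m-1$.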

Definition 1.

We say that $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $L_{f}$-smooth, if it is differentiable and satisfies

$$f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L_{f}}{2}\|y-x\|^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$

We say that $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $L^{M}_{f}$-smooth under $\|\cdot\|_{M}$, if it is differentiable and satisfies

$$f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L^{M}_{f}}{2}\|y-x\|_{M}^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$

Definition 2.

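We say that $f$ is $\sigma_{f}$-strongly convex, if

$$f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\sigma_{f}}{2}\|y-x\|^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$

We say that $f$ is $\sigma^{M}_{f}$-strongly convex under $\|\cdot\|_{M}$, if

$$f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\sigma^{M}_{f}}{2}\|y-x\|_{M}^{2},\quad\forall x,y\in\mathbb{R}^{d}.$$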

$L^{M}_{f}-$ smoothness under $\|\cdot\|_{M}$ is equivalent to $\|\nabla f_{i}(x)-\nabla f_{i}(y)\|_{M^{-1}}\leq L^{M}_{f}\|x-y\|_{M}$ . Also, $\sigma^{M}_{f}-$ strong convexity is equivalent to $\|\nabla f_{i}(x)-\nabla f_{i}(y)\|_{M^{-1}}\geq\sigma^{M}_{f}\|x-y\|_{M}$ . Cf. Section 2 of (Shalev-Shwartz & Zhang, 2016).

Definition 3.

We define the condition number of $f$ under $\|\cdot\|_{M}$ as $\kappa^{M}_{f}\coloneqq\frac{L^{M}_{f}}{\sigma^{M}_{f}}$ .

When $M=I$ , we have $\kappa^{M}_{f}=\kappa_{f}\coloneqq\frac{L_{f}}{\sigma_{f}}$ .

In this paper, we will choose $M$ such that $\kappa^{M}_{f}\ll\kappa_{f}$. For example, if $f(x)=\frac{1}{2}x^{T}Qx$ where $Q\succ 0$ is ill-conditioned, by choosing $M=Q$ we have

$$\|\nabla f(x)-\nabla f(y)\|_{M^{-1}}\equiv\|x-y\|_{Q},$$

which tells us that $L^{M}_{f}=\sigma^{M}_{f}=1$ and $\kappa^{M}_{f}=1$, while $\kappa_{f}=\kappa(Q)\gg 1$. That is, under the $Q$-metric, $f(x)$ has a much smaller condition number and can be minimized easily.
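To make the quadratic example concrete, here is a minimal sketch that compares plain gradient descent (step size $1/L_{f}$) with gradient descent in the $Q$-metric (step size $1$, since $L^{M}_{f}=1$) on $f(x)=\frac{1}{2}x^{T}Qx$. The diagonal $Q$, the starting point, and the stopping tolerance are assumptions chosen only for illustration.

```python
def grad_quadratic(q_diag, x):
    """Gradient of f(x) = 0.5 * x^T Q x for a diagonal Q."""
    return [q * xi for q, xi in zip(q_diag, x)]

def gradient_descent(q_diag, x0, step, tol=1e-8, max_iter=100_000):
    """Plain gradient descent; returns the number of iterations used."""
    x = list(x0)
    for it in range(max_iter):
        if max(abs(xi) for xi in x) < tol:
            return it
        g = grad_quadratic(q_diag, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return max_iter

def preconditioned_descent(q_diag, x0, tol=1e-8, max_iter=100_000):
    """Gradient descent in the Q-metric: x <- x - M^{-1} grad f(x), with M = Q."""
    x = list(x0)
    for it in range(max_iter):
        if max(abs(xi) for xi in x) < tol:
            return it
        g = grad_quadratic(q_diag, x)
        x = [xi - gi / q for xi, gi, q in zip(x, g, q_diag)]
    return max_iter

q_diag = [1.0, 1000.0]                 # kappa(Q) = 1000
x0 = [1.0, 1.0]
plain_iters = gradient_descent(q_diag, x0, step=1.0 / max(q_diag))
precond_iters = preconditioned_descent(q_diag, x0)
print(plain_iters, precond_iters)
```

With $M=Q$ the preconditioned step lands on the minimizer immediately, whereas the plain iteration count grows with $\kappa(Q)$.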

Definition 4.

For a proper closed convex function $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{+\infty\}$, its subdifferential at $x\in\mathrm{dom}(\phi)$ is written as

$$\partial\phi(x)=\{v\in\mathbb{R}^{d}\mid\phi(z)\geq\phi(x)+\langle v,z-x\rangle\ \ \forall z\in\mathbb{R}^{d}\}.$$

Definition 5.

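For a proper closed convex function $\psi:\mathbb{R}^{d}\rightarrow\mathbb{R}$, its $M$-preconditioned proximal mapping with step size $\eta>0$ is defined by

$$\mathbf{prox}^{M}_{\eta\psi}(x)=\operatorname*{arg\,min}_{y\in\mathbb{R}^{d}}\Big\{\psi(y)+\frac{1}{2\eta}\|x-y\|_{M}^{2}\Big\}.$$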

When $M=I$ , this reduces to the classical proximal mapping.

Finally, let us list the assumptions that will be effective throughout this paper.

Assumption 1.

In the objective function (1.1),

1. Each $f_{i}(x)$ is $L_{f}$-smooth and $L^{M}_{f}$-smooth under $\|\cdot\|_{M}$.

2. $f(x)$ is $\sigma_{f}$-strongly convex, and $\sigma^{M}_{f}$-strongly convex under $\|\cdot\|_{M}$, where $\sigma_{f}>0$ and $\sigma^{M}_{f}>0$.

3. The regularization term $\psi(x)$ is proper closed convex and $\mathbf{prox}_{\eta\psi}$ is easy to compute.

Remark 1.

1. In Assumption 1, we only require $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)$ to be strongly convex, while each $f_{i}(x)$ can be nonconvex.

2. Several common choices of regularizers have simple proximal mappings. For example, when $\psi(x)=\lambda\|x\|_{1}$ with $\lambda>0$, $\mathbf{prox}_{\eta\psi}$ can be computed componentwise as

$$\mathbf{prox}_{\eta\psi}(x)=\operatorname{sign}(x)\max\{|x|-\eta\lambda,0\}.$$
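The componentwise proximal mapping of the $\ell_{1}$ regularizer mentioned in this remark is the familiar soft-thresholding operator. A minimal sketch follows; the function name and the sample input are assumptions for illustration.

```python
def soft_threshold(x, eta, lam):
    """Componentwise prox of eta * lam * ||.||_1: sign(x_j) * max(|x_j| - eta*lam, 0)."""
    thresh = eta * lam
    out = []
    for xj in x:
        mag = abs(xj) - thresh
        if mag <= 0.0:
            out.append(0.0)          # entries inside the threshold are set to zero
        elif xj > 0:
            out.append(mag)
        else:
            out.append(-mag)
    return out

print(soft_threshold([3.0, -0.5, -2.0], eta=1.0, lam=1.0))
```

Entries with magnitude below $\eta\lambda$ are zeroed and the rest are shrunk toward zero by $\eta\lambda$, which is how the regularizer induces sparsity.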

3 Proposed Algorithms

As discussed in Sec. 1, SVRG and Katyusha X suffer from ill-conditioning like other first-order methods. In this section, we propose to accelerate them by applying inexact preconditioning. Let us illustrate the idea as follows.

1. We would like to apply a preconditioner $M\succ 0$ to the gradient descent step in SVRG, i.e.,

$$w_{t+1}=\operatorname*{arg\,min}_{y\in\mathbb{R}^{d}}\Big\{\psi(y)+\frac{1}{2\eta}\|y-w_{t}\|_{M}^{2}+\langle\tilde{\nabla}_{t},y\rangle\Big\},$$

where $\tilde{\nabla}_{t}$ is a variance-reduced stochastic gradient. When $\psi=0$ and this minimization is solved exactly, we have $w_{t+1}=w_{t}-\eta M^{-1}\tilde{\nabla}_{t}$, which is a preconditioned gradient update.

2. However, solving this minimization exactly may be expensive and impractical. In fact, it suffices to solve it highly inexactly by a fixed number of simple subroutines.

We summarize the resulting algorithm in Algorithm 1 and call it Inexact Preconditioned (IP-) SVRG, abbreviated iPreSVRG. Compared to SVRG, the only difference lies in line $7$.

Remark 2.

1. In line $2$, the epoch length $D^{k}$ obeys a geometric distribution with $\mathbb{E}[D^{k}]=m-1$. This is for the purpose of simplifying the analysis (motivated by (Lei & Jordan, 2017; Allen-Zhu, 2018)); in practice one can just set $D^{k}=m-1$. In our experiments, this still brings significant acceleration.

2. The choice of $m$ affects the performance. Intuitively, a larger $m$ means more gradient evaluations per epoch, but also more progress per epoch. Theoretically, we show that $m=\lceil\frac{n}{1+pd}\rceil$ gives faster convergence than SVRG, where $p$ is the number of subroutine iterations used in line $7$.

3. In line $6$, one can also sample a batch of gradients instead of one. It is straightforward to generalize our convergence results in Sec. 4 to this setting.

4. If $M=I$, line $7$ reduces to

$$w_{t+1}=\mathbf{prox}_{\eta\psi}(w_{t}-\eta\tilde{\nabla}_{t}),$$

and Algorithm 1 reduces to SVRG.
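The epoch structure described in these remarks can be sketched as follows for an $\ell_{1}$-regularized least-squares problem with a diagonal preconditioner, in which case the preconditioned proximal step has a closed form (coordinatewise soft-thresholding with threshold $\eta\lambda/M_{jj}$). This is only a sketch under stated assumptions and is not the authors' Algorithm 1: the epoch length is fixed, the last inner iterate is used as the next snapshot, and the data, step size, and all function names are made up for illustration.

```python
import random

def soft(v, t):
    """Scalar soft-thresholding: sign(v) * max(|v| - t, 0)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def objective(A, b, lam, x):
    """(1/2n) * ||Ax - b||^2 + lam * ||x||_1."""
    n = len(A)
    sq = sum((sum(a * xj for a, xj in zip(row, x)) - bi) ** 2
             for row, bi in zip(A, b))
    return sq / (2 * n) + lam * sum(abs(xj) for xj in x)

def ipsvrg_diag(A, b, lam, m_diag, eta, epochs, m, seed=0):
    """Preconditioned SVRG sketch with a diagonal M (exact subproblem solve)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d                                    # snapshot, initialized at 0
    for _ in range(epochs):
        # Residuals and full gradient of the smooth part at the snapshot.
        res = [sum(A[i][j] * x[j] for j in range(d)) - b[i] for i in range(n)]
        mu = [sum(A[i][j] * res[i] for i in range(n)) / n for j in range(d)]
        w = list(x)
        for _ in range(m):
            i = rng.randrange(n)
            r_w = sum(A[i][j] * w[j] for j in range(d)) - b[i]
            # Variance-reduced gradient: grad f_i(w) - grad f_i(x) + grad f(x).
            g = [A[i][j] * (r_w - res[i]) + mu[j] for j in range(d)]
            # M-preconditioned proximal step, closed form because M is diagonal.
            w = [soft(w[j] - eta * g[j] / m_diag[j], eta * lam / m_diag[j])
                 for j in range(d)]
        x = w                                        # assumption: last iterate is the new snapshot
    return x

# Tiny made-up problem with badly scaled columns.
A = [[1.0, 10.0], [2.0, -10.0], [-1.0, 20.0], [1.0, 5.0]]
x_true = [1.0, 0.5]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
lam, alpha = 1e-3, 1e-2
m_diag = [sum(row[j] ** 2 for row in A) / len(A) + alpha for j in range(2)]
x_hat = ipsvrg_diag(A, b, lam, m_diag, eta=0.05, epochs=50, m=10)
print(objective(A, b, lam, [0.0, 0.0]), objective(A, b, lam, x_hat))
```

Setting every entry of `m_diag` to `1.0` recovers a plain proximal SVRG step, which mirrors the reduction to SVRG when $M=I$.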

For $M\not\propto I$ , line $7$ contains an optimization problem that may not have a closed form solution:

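$$\operatorname*{arg\,min}_{y\in\mathbb{R}^{d}}\Big\{\psi(y)+\frac{1}{2\eta}\|y-w_{t}\|_{M}^{2}+\langle\tilde{\nabla}_{t},y\rangle\Big\}.\tag{3.2}$$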

To solve it inexactly, we propose to apply a fixed number of iterations of some simple subroutine, initialized at $w_{t}$. This procedure is summarized in Procedure 1.

Remark 3.

In Procedure 1, there are many choices for the iterator $S$. For example, one can use proximal gradient, FISTA (Beck & Teboulle, 2009) (or equivalently, Nesterov acceleration (Nesterov, 2013)), and FISTA with restart (O'Donoghue & Candes, 2015). Under these choices, line $3$ is easy to compute. For example, when $S$ is the proximal gradient step, line $3$ of Procedure 1 becomes

$$w_{t+1}^{i+1}=\mathbf{prox}_{\gamma\psi}\Big(w_{t+1}^{i}-\frac{\gamma}{\eta}M(w_{t+1}^{i}-w_{t})-\gamma\tilde{\nabla}_{t}\Big).$$
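The proximal gradient choice of $S$ can be sketched as below for $\psi=\lambda\|\cdot\|_{1}$: starting from $w_{t}$, it repeats $p$ times a gradient step on the quadratic part of the subproblem followed by $\mathbf{prox}_{\gamma\psi}$. The step size $\gamma=\eta/\lambda_{\max}(M)$ used in the demo is an assumption (the usual $1/L$ choice for the smooth part), and the matrix, vectors, and function names are made up for illustration.

```python
def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def soft(v, t):
    """Scalar soft-thresholding, the prox of t * |.|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def inexact_prox_step(M, w_t, grad, eta, gamma, lam, p):
    """Run p proximal-gradient iterations on the subproblem, started at w_t."""
    w = list(w_t)
    for _ in range(p):
        Md = matvec(M, [wi - wti for wi, wti in zip(w, w_t)])
        w = [soft(wi - (gamma / eta) * mdi - gamma * gi, gamma * lam)
             for wi, mdi, gi in zip(w, Md, grad)]
    return w

M = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues 1 and 3
w_t, grad, eta = [0.0, 0.0], [3.0, 0.0], 1.0
w_next = inexact_prox_step(M, w_t, grad, eta, gamma=eta / 3.0, lam=0.0, p=100)
print(w_next)
```

With $\lambda=0$ the iterates approach the exact preconditioned update $w_{t}-\eta M^{-1}\tilde{\nabla}_{t}$; a small $p$ returns a cruder approximation of it.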

Now, let us also apply the inexact preconditioning idea to Katyusha X (Algorithm 2 of (Allen-Zhu, 2018)). Similar to Katyusha X, we first apply a momentum step, then one epoch of iPreSVRG (i.e., lines $2$ to $9$ of Algorithm 1).

Remark 4.

1. When $\tau=\frac{1}{2}$, one can show that $x_{k+1}\equiv y_{k}$, and Algorithm 2 reduces to Algorithm 1.

2. When $M=I$ and the proximal mapping is solved exactly, Algorithm 2 reduces to Katyusha X.

3. The convergence of Algorithm 2 is established when $\tau=\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma^{M}_{f}}$. In practice, we found that many other choices of $\tau$ also work.

4 Main Theory

In this section, we proceed to establish the convergence of Algorithm 1 and Algorithm 2. The key idea is that when the preconditioned proximal gradient update in (3.2) is solved inexactly as in Procedure 1, the error can be bounded by $\|w_{t+1}-w_{t}\|_{M}$, under which we can still establish the overall convergence of Algorithm 1 and Algorithm 2. Combining this with the fixed number of simple subroutines in Procedure 1, we obtain a much lower gradient complexity when $\kappa_{f}>n^{\frac{1}{2}}$.

All the proofs in this section are deferred to the supplementary material.

First, let us analyze the error in the optimality condition of (3.2) when it is solved inexactly by FISTA with restart as in Procedure 1.

Specifically, let $h_{1}(y)=\psi(y)$ and $h_{2}(y)=\frac{1}{2\eta}\|y-w_{t}\|^{2}_{M}+\langle\tilde{\nabla}_{t},y\rangle$; then the subproblem (3.2) can be written as

$$\min_{y}\Psi(y)=h_{1}(y)+h_{2}(y).$$

Therefore, FISTA with restart applied to (3.2) can be summarized in the following algorithm.
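A minimal sketch of FISTA with restart applied to $\min_{y}h_{1}(y)+h_{2}(y)$ is given below for $\psi=\lambda\|\cdot\|_{1}$. It is not the authors' listing: it uses the standard FISTA momentum sequence $t_{k+1}=\frac{1+\sqrt{1+4t_{k}^{2}}}{2}$, resets the momentum every `p0` steps, and takes the step size $\gamma=\eta/\lambda_{\max}(M)$. The test matrix, vectors, and function names are assumptions for illustration.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def soft(v, t):
    """Scalar soft-thresholding, the prox of t * |.|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fista_restart(M, w_t, grad, eta, lam, lam_max, p0, cycles):
    """FISTA on psi(y) + ||y - w_t||_M^2 / (2 eta) + <grad, y>, restarted every p0 steps."""
    gamma = eta / lam_max                       # 1/L for the smooth part h2
    y = list(w_t)
    for _ in range(cycles):
        z, t = list(y), 1.0                     # restart: drop the momentum
        for _ in range(p0):
            Md = matvec(M, [zi - wi for zi, wi in zip(z, w_t)])
            # Gradient of h2 at z is M (z - w_t) / eta + grad.
            y_new = [soft(zi - gamma * (mdi / eta + gi), gamma * lam)
                     for zi, mdi, gi in zip(z, Md, grad)]
            t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
            beta = (t - 1.0) / t_new
            z = [yn + beta * (yn - yo) for yn, yo in zip(y_new, y)]
            y, t = y_new, t_new
    return y

M = [[2.0, 1.0], [1.0, 2.0]]                    # kappa(M) = 3
p0 = math.ceil(2.0 * math.e * math.sqrt(3.0))   # restart period, here 10
y_hat = fista_restart(M, [0.0, 0.0], [3.0, 0.0], eta=1.0, lam=0.0,
                      lam_max=3.0, p0=p0, cycles=10)
print(p0, y_hat)
```

With $\lambda=0$ the output approximates $w_{t}-\eta M^{-1}\tilde{\nabla}_{t}$, and the total number of inner iterations is `p0 * cycles`.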

Lemma 1.

Take Assumption 1. Suppose in Procedure 1, we choose $S$ as the iterator of FISTA with restart every $p_{0}=\lceil 2e\sqrt{\kappa(M)}\rceil$ steps, with step size $\gamma=\frac{\eta}{\lambda_{\mathrm{max}}(M)}$, and restart it $(r-1)$ times (that is, $p=rp_{0}$ iterations in total). (FISTA with restart can be replaced with any iterator with Q-linear convergence on the iterates. In our experiments, FISTA also works, and a simple choice of $p=20$ is enough.) Then, $w_{t+1}=w_{t+1}^{(r-1,p_{0})}$ is an approximate solution to (3.2) that satisfies

[TABLE]

where

[TABLE]

and

[TABLE]

With Lemma 1, the overall convergence of Algorithms 1 and 2 can be established. The analysis is similar to that of (Allen-Zhu, 2018).

Theorem 1.

Under Assumption 1, let $x^{*}=\operatorname*{arg\,min}_{x}F(x)$ , $64\kappa^{M}_{f}c^{2}(p)\leq 1$ , $\eta\leq\frac{1}{2\sqrt{m}L^{M}_{f}}$ , and $m\geq 4$ . Then the iPreSVRG in Algorithm 1 satisfies

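$$\mathbb{E}[F(x^{k})-F(x^{*})]\leq\mathcal{O}\Big(\Big(\frac{1}{1+\frac{1}{4}m\eta\sigma^{M}}\Big)^{k}\Big).\tag{4.3}$$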

Theorem 2.

Under Assumption 1, let $x^{*}=\operatorname*{arg\,min}_{x}F(x)$ , $64\kappa^{M}_{f}c^{2}(p)\leq 1$ , $\tau=\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma^{M}_{f}}$ , $\eta\leq\frac{1}{2\sqrt{m}L^{M}_{f}}$ , and $m\geq 4$ . Then the iPreKatX in Algorithm 2 satisfies

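$$\mathbb{E}[F(x^{k})-F(x^{*})]\leq\mathcal{O}\Big(\Big(\frac{1}{1+\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma^{M}}}\Big)^{k}\Big).\tag{4.4}$$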

Remark 5.

When $M=I$, we have $c(p)=0$, and Theorems 1 and 2 recover Theorems D.1 and 4.3 of (Allen-Zhu, 2018).

In Theorems 1 and 2, we need the number of simple subroutines $p$ to be large enough such that $64\kappa^{M}_{f}c^{2}(p)\leq 1$. The following lemma provides a sufficient condition for this.

Lemma 2.

If the subproblem iterator $S$ in Procedure 1 is FISTA with restart every $p_{0}=\lceil 2e\sqrt{\kappa(M)}\rceil$ steps, and with step size $\gamma=\frac{\eta}{\lambda_{\mathrm{max}}(M)}$ , then, in order for $64\kappa^{M}_{f}c^{2}(p)\leq 1$ to hold, it suffices to choose

[TABLE]

where $c_{1}=\frac{1}{64\cdot 14^{2}}$.

With (4.3), (4.4), and (4.5), we can now calculate the gradient complexities of Algorithm 1 and Algorithm 2, but let us first do that for SVRG and Katyusha X.

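In Assumption 1, we have assumed that $\mathbf{prox}_{\eta\psi}(\cdot)$ is cheap to evaluate. Therefore, each epoch of SVRG needs $n+m$ gradient evaluations, which is also true for Katyusha X. As a result, the gradient complexities for SVRG and Katyusha X to reach $\varepsilon$-suboptimality are

$$C_{1}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+m}{\ln(1+\frac{1}{4}m\eta\sigma)}\ln\frac{1}{\varepsilon}\Big),\tag{4.6}$$

$$C_{2}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+m}{\ln(1+\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma})}\ln\frac{1}{\varepsilon}\Big).\tag{4.7}$$

For Algorithm 1 and Algorithm 2, each iteration in Procedure 1 is at most as expensive as $d$ gradient computations (the most expensive step is multiplying $M$ with a vector, which is often cheaper than $d$ gradient computations) and is performed $p$ times. Therefore, one epoch of iPreSVRG/iPreKatX needs at most $n+(1+pd)m$ gradient computations.

Consequently, we can write the gradient complexities for Algorithm 1 and Algorithm 2 to reach $\varepsilon$-suboptimality as

$$C_{1}^{\prime}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+(1+pd)m}{\ln(1+\frac{1}{4}m\eta\sigma^{M})}\ln\frac{1}{\varepsilon}\Big),\tag{4.8}$$

$$C_{2}^{\prime}(m,\varepsilon)=\mathcal{O}\Big(\frac{n+(1+pd)m}{\ln(1+\frac{1}{2}\sqrt{\frac{1}{2}m\eta\sigma^{M}})}\ln\frac{1}{\varepsilon}\Big).\tag{4.9}$$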
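The trade-off behind these complexity estimates (a more expensive epoch against a faster per-epoch contraction) can be explored with a small calculator. The sketch below drops the constants hidden in $\mathcal{O}(\cdot)$, covers only the SVRG-type rate, and uses hypothetical values of $n$, $d$, $m$, $p$, $\eta$, and the strong-convexity constants; none of these numbers come from the paper.

```python
import math

def epoch_cost(n, m, p=0, d=0):
    """Gradient-equivalent cost of one epoch: n + (1 + p*d) * m (p = 0 gives SVRG)."""
    return n + (1 + p * d) * m

def complexity_estimate(n, m, eta, sigma, eps, p=0, d=0):
    """Epoch cost times the number of epochs implied by the linear rate."""
    epochs = math.log(1.0 / eps) / math.log(1.0 + 0.25 * m * eta * sigma)
    return epoch_cost(n, m, p, d) * epochs

# Hypothetical numbers, chosen only to illustrate the trade-off.
n, d, m, p, eps = 10_000, 100, 100, 20, 1e-6
plain = complexity_estimate(n, m, eta=1e-4, sigma=1e-3, eps=eps)
precond = complexity_estimate(n, m, eta=1e-2, sigma=0.5, eps=eps, p=p, d=d)
print(epoch_cost(n, m), epoch_cost(n, m, p, d))
print(plain, precond)
```

In this made-up setting each preconditioned epoch costs about twenty times more, yet the larger product $\eta\sigma^{M}$ reduces the number of epochs by far more than that factor.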

Remark 6.

1. According to Lemma 2, when $S$ is FISTA with restart, it suffices to choose $p$ by (4.5).

2. When the preconditioner $M$ is chosen appropriately, the step size $\eta$ in (4.8) and (4.9) can be much larger than that of (4.6) and (4.7).

Finally, we can compare $C_{1}(m,\varepsilon)$, $C_{2}(m,\varepsilon)$ with $C^{\prime}_{1}(m,\varepsilon)$, $C^{\prime}_{2}(m,\varepsilon)$, respectively. It turns out that there is a significant speedup when $\kappa_{f}>n^{\frac{1}{2}}$.

Theorem 3.

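Take Assumption 1. Let the iterator $S$ in Procedure 1 be FISTA with restart, and let an appropriate preconditioner $M$ be chosen such that $\kappa_{f}$ and $\kappa(M)$ are of the same order and $\kappa^{M}_{f}$ is small compared to them. Then:

1. If $\kappa_{f}>n^{\frac{1}{2}}$ and $\kappa_{f}<n^{2}d^{-2}$, then

$$\frac{\min_{m\geq 1}C_{1}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{1}(m,\varepsilon)}\leq\mathcal{O}\Big(\frac{n^{\frac{1}{2}}}{\kappa_{f}}\Big).$$

2. If $\kappa_{f}>n^{\frac{1}{2}}$ and $\kappa_{f}>n^{2}d^{-2}$, then

$$\frac{\min_{m\geq 1}C_{1}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{1}(m,\varepsilon)}\leq\mathcal{O}\Big(\frac{d}{\sqrt{n\kappa_{f}}}\Big).$$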

Theorem 4.

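Take Assumption 1. Let the iterator $S$ in Procedure 1 be FISTA with restart, and let an appropriate preconditioner $M$ be chosen such that $\kappa_{f}$ and $\kappa(M)$ are of the same order and $\kappa^{M}_{f}$ is small compared to them. Then:

1. If $\kappa_{f}>n^{\frac{1}{2}}$ and $\kappa_{f}<n^{2}d^{-2}$, then

$$\frac{\min_{m\geq 1}C_{2}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{2}(m,\varepsilon)}\leq\mathcal{O}\Big(\sqrt{\frac{n^{\frac{1}{2}}}{\kappa_{f}}}\Big).$$

2. If $\kappa_{f}>n^{\frac{1}{2}}$ and $\kappa_{f}>n^{2}d^{-2}$, then

$$\frac{\min_{m\geq 1}C_{2}^{\prime}(m,\varepsilon)}{\min_{m\geq 1}C_{2}(m,\varepsilon)}\leq\mathcal{O}\Big(\frac{d}{n^{\frac{3}{4}}}\Big).$$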

In Section 5, we provide practical choices of $M$ for Lasso and Logistic regression.

5 Experiments

To investigate the practical performance of Algorithms 1 and 2, we test on three problems: Lasso, logistic regression, and a synthetic sum-of-nonconvex problem. For the first two, each function in the finite sum is convex. To guarantee that the objective is strongly convex, a small $\ell_{2}$-regularization is added to Lasso and logistic regression.

In the following, we compare SVRG, iPreSVRG, Katyusha X, and iPreKatX on four datasets from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): w1a.t (47272 samples, 300 features), protein (17766 samples, 357 features), cod-rna.t (271617 samples, 8 features), australian (690 samples, 14 features), and one synthetic dataset. The implementation settings are listed below.

1. We choose the epoch length $m=100$ in all experiments, since we found that the choices $m\in\{\frac{n}{4},\frac{n}{2},n\}$ need more gradient evaluations.

2. For iPreSVRG and iPreKatX, we use FISTA as the subproblem iterator $S$. If the preconditioner $M$ is diagonal, then the number of subroutine iterations for solving the subproblem is $p=1$; otherwise, we set $p=20$.

3. In all the experiments, we tune the step size $\eta$ and momentum weight $\tau$ to their optimal values.

4. All algorithms are initialized at $x^{0}=\mathbf{0}$.

5. All algorithms are implemented in Matlab R2015b. To be fair, except for the subproblem routines for inexact preconditioning, the other parts of the code are identical in all algorithms. The experiments are conducted on a Windows system with an Intel Core i7 2.6 GHz CPU. The code is available at:

[TABLE]

5.1 Lasso

We formulate Lasso as

[TABLE]

where $a_{i}\in\mathbb{R}^{d}$ are feature vectors and $b_{i}\in\mathbb{R}$ are labels. Note that the first term is equivalent to $\frac{1}{2n}\|Ax-b\|^{2}$ , where $A=(a_{1},a_{2},...,a_{n})^{T}\in\mathbb{R}^{n\times d}$ and $b=(b_{1},b_{2},\dots,b_{n})\in\mathbb{R}^{n}$ .

For Lasso as in (5.1), we provide two choices of preconditioner $M$:

1. When $d$ is small, we choose

[TABLE]

which is the exact Hessian of the smooth part of the objective.

2. When $d$ is large and $A^{T}A$ is diagonally dominant, we choose

[TABLE]

where $\alpha>0$. In this case, the subproblem (3.2) can be solved exactly with $p=1$ iteration.

Our numerical results are presented in the following figures. We did not observe significant acceleration of Katyusha X over SVRG, or of iPreKatX over iPreSVRG. We suspect the reason is that $m=100$ and the optimal choices of step size $\eta$ make $m\eta\sigma_{f}>1$ or $m\eta\sigma^{M}_{f}>1$, so the complexities in (4.7) and (4.9) are not better than those in (4.6) and (4.8), respectively.

5.2 Logistic Regression

We formulate Logistic regression as

[TABLE]

where again $a_{i}\in\mathbb{R}^{d}$ are feature vectors and $b_{i}\in\mathbb{R}$ are labels.

For Logistic regression as in (5.2), the Hessian of the smooth part can be expressed as

[TABLE]

where $B=\text{diag}(b)A=\text{diag}(b)(a_{1},a_{2},...,a_{n})^{T}$. Inspired by this, we provide two choices of preconditioner $M$. (Here is a heuristic justification: by Definition 1 we know that $L^{M}_{f}=1$; since $\frac{\exp(-b_{i}a_{i}^{T}x)}{(1+\exp(-b_{i}a_{i}^{T}x))^{2}}\rightarrow 0$ only when $x$ is unbounded, we know that if the iterates $x^{k}$ of our algorithms are bounded, then $H(x^{k})\succcurlyeq\frac{c}{n}B^{T}B$ for some $c>0$, which gives $\sigma^{M}_{f}=4c$ according to Definition 2. When $c$ is not too small, one can expect $\kappa^{M}_{f}=\frac{1}{4c}\ll\kappa_{f}$.)

1. When $d$ is small, we choose

[TABLE]

2. When $d$ is large and $B^{T}B$ is diagonally dominant, we choose

[TABLE]

where $\alpha>0$. In this case, the subproblem (3.2) can be solved exactly with $p=1$ iteration.

Our results are presented in the following figures. Again, we did not observe significant acceleration of Katyusha X over SVRG, or of iPreKatX over iPreSVRG, for the reason mentioned in the last subsection.

5.3 Sum-of-nonconvex Example

Similar to (Allen-Zhu & Yuan, 2016), we generate a sum-of-nonconvex example by the following procedure:

We take $n$ normalized random vectors $a_{i}\in\mathbb{R}^{d}$, and also $d$ vectors of the form $g_{i}=(0,...0,5i,0,...0)$, where the nonzero element is at the $i$th coordinate.

And the sum-of-nonconvex problem is given by

[TABLE]

where $n=2000,d=100$ , and $\lambda_{1}=10^{-3}$ .

[TABLE]

Since the sum of the $D_{i}$'s is [math], they do not affect the condition number of the whole problem. However, they make most of the first half of the $f_{i}$ highly nonconvex. Overall, the condition number of this problem is equal to that of $\sum_{i=1}^{n}c_{i}c_{i}^{T}$, which is approximately 10000 in our tested data.

Since $\sum_{i=1}^{n}c_{i}c_{i}^{T}$ is diagonally dominant, we select $M=\text{diag}(\frac{1}{n}\sum_{i=1}^{n}c_{i}c_{i}^{T})+\alpha I$ as the preconditioner. Our algorithms also achieve significant acceleration in this sum-of-nonconvex setting.
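A diagonal preconditioner of the form $M=\text{diag}(\frac{1}{n}\sum_{i=1}^{n}c_{i}c_{i}^{T})+\alpha I$ only needs the mean squared entry of each coordinate. A minimal sketch follows; the function name and the sample vectors are assumptions for illustration.

```python
def diagonal_preconditioner(C, alpha):
    """Return the diagonal of M = diag((1/n) * sum_i c_i c_i^T) + alpha * I.

    C is a list of n vectors c_i, each of length d.
    """
    n, d = len(C), len(C[0])
    return [sum(C[i][j] ** 2 for i in range(n)) / n + alpha for j in range(d)]

C = [[1.0, 2.0], [3.0, 4.0]]          # two made-up vectors in R^2
m_diag = diagonal_preconditioner(C, alpha=0.5)
print(m_diag)
```

Because $M$ is diagonal, applying $M^{-1}$ is an elementwise division, which is consistent with the subproblem being solvable with $p=1$ iteration in the diagonal case.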

6 Conclusions and Future Work

In this paper, we propose to accelerate SVRG and Katyusha X by inexact preconditioning. With an appropriate preconditioner, both can be provably accelerated in terms of iteration complexity and gradient complexity. Our algorithms admit a nondifferentiable regularizer, as well as nonconvexity of individual functions. We confirm our theoretical results on Lasso, logistic regression, and a sum-of-nonconvex example, where simple choices of preconditioners lead to significant accelerations.

There are still open questions left for us to address in the future: (a) Do we have a theoretical guarantee when the subproblem iterator $S$ is chosen as a faster scheme such as APCG (Lin et al., 2014), NU_ACDM (Allen-Zhu et al., 2016), or A2BCD (Hannah et al., 2018a)? (b) In general, how can one choose a simple preconditioner that greatly reduces the condition number of the problem? (c) Is it possible to apply this inexact preconditioning technique to other stochastic algorithms?

Acknowledgements

We would like to thank Yunbei Xu for helpful discussions on the idea of inexact preconditioning. We also thank the reviewers for their valuable comments.

This work is supported in part by the National Key R&D Program of China 2017YFB02029, AFOSR MURI FA9550-18-1-0502, NSF DMS-1720237, and ONR N0001417121.

Appendix A Proof of Lemma 1

In this section, we prove the results on the error generated when solving the subproblem (3.2) inexactly by Procedure 1. Before proving Lemma 1, we will first prove a simpler case in Lemma 3, where the subproblem iterator $S$ is the proximal gradient step.

Lemma 3.

Take Assumption 1. Suppose that in Procedure 1 we choose $S$ as the proximal gradient step with step size $\gamma=\eta\frac{\lambda_{\mathrm{min}}(M)}{\lambda_{\mathrm{max}}^{2}(M)}$, and repeat it $p$ times, where $p\geq 1$. Then, $w_{t+1}=w_{t+1}^{p}$ is an approximate solution to (3.2) that satisfies

[TABLE]

where

[TABLE]

and $\tau=\sqrt{1-\kappa^{-2}(M)}<1$ .

Proof of Lemma 3.

The optimization problem in (3.2) is of the form

[TABLE]

for $h_{1}(y)=\psi(y)$ and $h_{2}(y)=\frac{1}{2\eta}\|y-w_{t}\|^{2}_{M}+\langle\tilde{\nabla},y\rangle.$ With our choice of $S$ as the proximal gradient descent step, the iterations in Procedure 1 are

[TABLE]

where $i=0,1,...,p-1$ . From the definition of $\mathbf{prox}_{\gamma h_{1}}$ , we have

[TABLE]

Comparing this with (A.1) gives

[TABLE]

To bound the right hand side, let $w_{t+1}^{\star}$ be the solution of (A.3), $\alpha=\frac{\lambda_{\text{min}}(M)}{\eta}$ , and $\beta=\frac{\lambda_{\text{max}}(M)}{\eta}$ . Then $h_{1}(y)$ is convex and $h_{2}(y)$ is $\alpha$ -strongly convex and $\beta$ -Lipschitz differentiable. Consequently, Prop. 26.16(ii) of (Bauschke et al., 2017) gives

[TABLE]

where $\tau=\sqrt{1-\gamma(2\alpha-\gamma\beta^{2})}$ .

Let $a_{i}=\|w_{t+1}^{i}-w_{t+1}^{\star}\|$ . Then, $a_{i}\leq\tau^{i}a_{0}$ . We can derive

[TABLE]

On the other hand, we have

[TABLE]

Combining these two equations yields

[TABLE]

where

[TABLE]

Finally, let the eigenvalues of $M$ be $0<\lambda_{1}\leq\lambda_{2}\leq...\leq\lambda_{d}$ , with orthonormal eigenvectors $v_{1},v_{2},...,v_{d}$ . Let $\varepsilon^{p}_{t+1}$ and $w_{t+1}-w_{t}$ be decomposed by

[TABLE]

then

[TABLE]

Combining these two inequalities with (A.4), we arrive at

[TABLE]

where

[TABLE]

∎
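The contraction used in this proof can be checked numerically. The following Python sketch runs the proximal gradient step on a subproblem of the form (3.2) with step size $\gamma=\eta\frac{\lambda_{\mathrm{min}}(M)}{\lambda_{\mathrm{max}}^{2}(M)}$ and verifies that the distance to the solution shrinks by at least $\tau=\sqrt{1-\kappa^{-2}(M)}$ per step. It assumes $\psi=\lambda\|\cdot\|_{1}$, and the matrix $M$, the vectors, and all helper names are hypothetical; the reference solution is itself approximated by a long run of the same iteration with step size $\eta/\lambda_{\mathrm{max}}(M)$.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad(M, w, g, eta, lam, gamma, steps):
    """Iterate y <- prox_{gamma*psi}(y - gamma * grad h2(y)), starting from y = w."""
    y = w.copy()
    iterates = [y.copy()]
    for _ in range(steps):
        grad_h2 = M @ (y - w) / eta + g       # gradient of (1/(2 eta))||y - w||_M^2 + <g, y>
        y = soft_threshold(y - gamma * grad_h2, gamma * lam)
        iterates.append(y.copy())
    return iterates

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
M = Q @ np.diag(np.linspace(1.0, 10.0, d)) @ Q.T
M = (M + M.T) / 2.0                           # hypothetical preconditioner with kappa(M) = 10
w, g = rng.standard_normal(d), rng.standard_normal(d)
eta, lam = 0.5, 0.1

eigs = np.linalg.eigvalsh(M)
lam_min, lam_max = eigs[0], eigs[-1]
alpha, beta = lam_min / eta, lam_max / eta    # strong convexity and smoothness of h2
gamma = eta * lam_min / lam_max**2            # step size from Lemma 3

tau_general = np.sqrt(1.0 - gamma * (2.0 * alpha - gamma * beta**2))
tau_closed = np.sqrt(1.0 - (lam_min / lam_max) ** 2)

y_star = prox_grad(M, w, g, eta, lam, eta / lam_max, 5000)[-1]   # reference solution
errors = [np.linalg.norm(y - y_star) for y in prox_grad(M, w, g, eta, lam, gamma, 50)]
print(tau_general, tau_closed, errors[0], errors[-1])
```

The two expressions for $\tau$ agree because $\gamma=\alpha/\beta^{2}$, so $\gamma(2\alpha-\gamma\beta^{2})=\alpha^{2}/\beta^{2}=\kappa^{-2}(M)$.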

Now we are ready to prove Lemma 1. The techniques are similar to those in the proof of Lemma 3.

Proof of Lemma 1.

We want to find $c(p)$ such that

[TABLE]

Take $i=r-1$ and $j=p_{0}-1$ , then the optimality condition of the problem in line 5 of Algorithm 3 is

[TABLE]

Comparing this with (A.7), we have

[TABLE]

where

[TABLE]

As a result,

[TABLE]

Let the solution of (3.2) be $w^{\star}_{t+1}$ . By Theorem 4.4 of (Beck & Teboulle, 2009), for any $0\leq i\leq r-1$ and $0\leq j\leq p_{0}$ we have

[TABLE]

On the other hand, the strong convexity of $\Psi=h_{1}+h_{2}$ gives

[TABLE]

Therefore,

[TABLE]

Now, let us use (A.11) repeatedly to bound the right hand side of (A.10). For example, the first term can be bounded as

[TABLE]

Similarly, the rest of the terms can be bounded as follows,

[TABLE]

where in the first and third estimate we have used $\frac{\theta_{p_{0}-2}-1}{\theta_{p_{0}-1}}\leq\frac{\theta_{p_{0}-2}}{\theta_{p_{0}-1}}<1$ . On the other hand, we have

[TABLE]

As a result, taking $\gamma=\frac{\lambda_{\mathrm{max}}(M)}{\eta}$ , $w^{(0,0)}_{t+1}=w_{t}$ , $w^{(r-1,p_{0})}_{t+1}=w_{t+1}$ and $\tau=(\frac{4\kappa(M)}{p_{0}^{2}})^{\frac{1}{2p_{0}}}$ yields

[TABLE]

where

[TABLE]

Similar to the end of proof of Lemma 3, we have

[TABLE]

Now, let us choose $p_{0}$ such that $\tau=(\frac{4\kappa(M)}{p_{0}^{2}})^{\frac{1}{2p_{0}}}$ is minimized. A simple calculation yields

[TABLE]

In order for $p_{0}$ to be an integer, we can take

[TABLE]

then

[TABLE]

Finally, let us show that $b(p)$ in (A.12) can be bounded by $7\tau^{p}$, from which the desired bound (A.8) on $\|\varepsilon^{p}_{t+1}\|_{M}$ follows.

First, we have

[TABLE]

and

[TABLE]

On the other hand, a simple calculation shows that $(\frac{p_{0}}{p_{0}-1})^{\frac{1}{p_{0}}}$ is decreasing in $p_{0}$ , therefore

[TABLE]

Similarly, one can show that

[TABLE]

Combining these two inequalities with (B.2) yields

[TABLE]

∎
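The choice of $p_{0}$ in this proof comes from minimizing $\tau=(\frac{4\kappa(M)}{p_{0}^{2}})^{\frac{1}{2p_{0}}}$. Setting the derivative of $\ln\tau$ with respect to $p_{0}$ to zero gives the real-valued minimizer $p_{0}=2e\sqrt{\kappa(M)}$, at which $\tau=\exp(-1/p_{0})$. The following Python sketch confirms this numerically for one hypothetical value of $\kappa(M)$; it treats $p_{0}$ as a real number and does not model the rounding to an integer.

```python
import math

def tau(p0, kappa):
    """Restart contraction factor (4 * kappa / p0^2)^(1 / (2 * p0))."""
    return (4.0 * kappa / p0**2) ** (1.0 / (2.0 * p0))

kappa = 100.0                                  # hypothetical value of kappa(M)
p_star = 2.0 * math.e * math.sqrt(kappa)       # stationary point of ln(tau)

# Brute-force search over a fine grid of real-valued p0 in [1, 201).
grid = [1.0 + 0.01 * k for k in range(20000)]
p_grid = min(grid, key=lambda p: tau(p, kappa))

tau_star = tau(p_star, kappa)
print(p_star, p_grid, tau_star, math.exp(-1.0 / p_star))
```

Since $p_{0}$ scales like $\sqrt{\kappa(M)}$, the per-iteration factor $\exp(-\frac{1}{2e\sqrt{\kappa(M)}})$ reflects the accelerated dependence on the condition number of $M$.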

Appendix B Proof of Theorem 1

In this section, we proceed to establish the convergence of inexact preconditioned SVRG as in Algorithm 1. The proof is similar to that of Theorem D.1 of (Allen-Zhu, 2018).

Before proving Theorem 1, let us first prove several lemmas.

First, the inexact optimality condition (4.1) gives the following descent:

Lemma 4.

Under Assumption 1, suppose that (4.1) holds. Then, for any $u\in\operatorname*{\mathbb{R}^{d}}$ we have

[TABLE]

Proof.

First, let us rewrite the left hand side as

[TABLE]

By (4.1) and the definition of subdifferential we have

[TABLE]

Combining these two gives

[TABLE]

where in the last equality we have applied

[TABLE]

∎

Based on Lemma 4, we have

Lemma 5.

Under Assumption 1, if the iterator $S$ in Procedure 1 is proximal gradient descent or FISTA with restart, then, for any $a>0$ , $\eta\leq\frac{1-2c(p)a}{2L^{M}_{f}}$ , and $u\in\operatorname*{\mathbb{R}^{d}}$ we have

[TABLE]

Proof.

We have

[TABLE]

where the first and second inequalities are due to the strong convexity and smoothness under $\|\cdot\|_{M}$ in Assumption 1, respectively. The last equality is due to $\mathbb{E}[\tilde{\nabla}_{t}]=\nabla f(w_{t})$.

On the other hand, recall that Lemma 4 gives

[TABLE]

For the last term we can apply Cauchy-Schwarz as follows:

[TABLE]

From Lemma 3 and Lemma 1 we know that

[TABLE]

Therefore, by Young’s inequality, we have for any $a>0$ that

[TABLE]

Applying this to Lemma 4 yields

[TABLE]

Applying this to (B.2), we arrive at

[TABLE]

where in the second inequality we have applied

[TABLE]

Finally, since $\eta\leq\frac{1-2c(p)a}{2L^{M}_{f}}$, we have $2\eta L^{M}_{f}\leq 1-2c(p)a$, that is, $2(1-c(p)a-\eta L^{M}_{f})\geq 1$. Hence $\frac{\eta}{2(1-c(p)a-\eta L^{M}_{f})}\leq\eta$, which gives the desired result.

∎

Lemma 6.

Under Assumption 1, we have

[TABLE]

Proof.

We have

[TABLE]

where in the first inequality we have applied $\mathbb{E}[\|\xi-\mathbb{E}\xi\|^{2}]=\mathbb{E}[\|\xi\|^{2}]-\|\mathbb{E}\xi\|^{2}$ with $\xi=M^{-\frac{1}{2}}\big(\nabla f_{i_{t}}(w_{t})-\nabla f_{i_{t}}(w_{0})\big)$, and the second inequality follows from Assumption 1. ∎
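The variance identity used in this proof can be verified on a small example. The following Python sketch forms the SVRG gradient estimator for every possible index $i_{t}$, checks that it is unbiased, and compares both sides of the identity in the $M^{-1}$-norm. The least-squares components $f_{i}(x)=\frac{1}{2}(a_{i}^{T}x-y_{i})^{2}$, the matrix $M$, and all variable names are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w, w0 = rng.standard_normal(d), rng.standard_normal(d)

def grads(x):
    """Row i is the gradient of f_i(x) = 0.5 * (a_i^T x - y_i)^2."""
    return A * (A @ x - y)[:, None]

G_w, G_w0 = grads(w), grads(w0)
full_w, full_w0 = G_w.mean(axis=0), G_w0.mean(axis=0)

# SVRG estimator for each possible sampled index (row i corresponds to i_t = i).
estimators = G_w - G_w0 + full_w0

# Hypothetical positive definite preconditioner M.
R = rng.standard_normal((d, d))
M_inv = np.linalg.inv(R @ R.T + np.eye(d))

def sq_norm_Minv(V):
    """Squared M^{-1}-norm of each row of V, i.e. ||M^{-1/2} v||^2."""
    return np.einsum("ij,jk,ik->i", V, M_inv, V)

xi = G_w - G_w0                                # plays the role of M^{1/2} xi in the proof
lhs = sq_norm_Minv(estimators - full_w).mean() # E || xi - E xi ||^2
second_moment = sq_norm_Minv(xi).mean()        # E || xi ||^2
rhs = second_moment - sq_norm_Minv(xi.mean(axis=0)[None, :])[0]
print(lhs, rhs, second_moment)
```

Dropping the nonnegative term $\|\mathbb{E}\xi\|^{2}$ is what turns the identity into the first inequality of the proof.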

Lemma 7 (Fact 2.3 of (Allen-Zhu, 2018)).

Let $C_{0},C_{1},C_{2},...$ be a sequence of numbers, and $N\sim\text{Geom}(p)$. Then

1. $\mathbb{E}_{N}\left[C_{N}-C_{N+1}\right]=\frac{p}{1-p}\mathbb{E}_{N}\left[C_{0}-C_{N}\right]$, and

2. $\mathbb{E}_{N}\left[C_{N}\right]=(1-p)\mathbb{E}_{N}\left[C_{N+1}\right]+pC_{0}.$
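Both identities in Lemma 7 can be checked numerically. The following Python sketch assumes the convention $P(N=k)=(1-p)^{k}p$ for $k=0,1,2,...$, which is the parametrization under which the two identities hold as stated; the test sequence and the truncation level are arbitrary hypothetical choices.

```python
import math

def geom_expectation(values, p):
    """E[values[N]] for P(N = k) = (1 - p)^k * p, truncated at len(values) terms."""
    return sum((1.0 - p) ** k * p * v for k, v in enumerate(values))

p = 0.1
K = 3000                                       # truncation level; (1 - p)^K is negligible
C = [math.cos(0.3 * k) + 1.0 / (k + 1) for k in range(K + 1)]   # arbitrary bounded sequence

E_CN = geom_expectation(C[:K], p)              # E[C_N]
E_CN1 = geom_expectation(C[1:K + 1], p)        # E[C_{N+1}]

item1_lhs = E_CN - E_CN1
item1_rhs = p / (1.0 - p) * (C[0] - E_CN)
item2_rhs = (1.0 - p) * E_CN1 + p * C[0]
print(item1_lhs, item1_rhs, E_CN, item2_rhs)
```

Item 2 follows by splitting off the $k=0$ term of the series and shifting the index, and item 1 is a rearrangement of item 2.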

Lemma 8.

Under Assumption 1, if $\eta\leq\min\{\frac{1-2c(p)a}{2L^{M}_{f}},\frac{1}{2\sqrt{m}L^{M}_{f}}\}$ and $m\geq 2$ , then, for any $u\in\operatorname*{\mathbb{R}^{d}}$ we have

[TABLE]

Proof.

By Lemmas 5 and 6, we know that

[TABLE]

Let $D\sim$ Geom $(\frac{1}{m})$ as in Algorithm 1 and take $t=D$ , then

[TABLE]

where the first equality follows from item 1 of Lemma 7 with $C_{N}=\|u-w_{N}\|_{M}^{2}$; the second inequality follows from item 2 with $C_{N}=\|w_{N}-w_{0}\|_{M}^{2}$, item 2 with $C_{N}=\|u-w_{0}\|_{M}^{2}-\|u-w_{N}\|_{M}^{2}$, and item 1 with $C_{N}=\|u-w_{N}\|_{M}^{2}$; the third inequality makes use of $m\geq 2$; and the fourth inequality makes use of $\eta\leq\frac{1}{2\sqrt{m}L^{M}_{f}}$.

∎

Now, let us proceed to prove Theorem 1. With Lemma 8, it can be proved in a similar way as Theorem 3 of (Hannah et al., 2018b).

Proof of Theorem 1.

Without loss of generality, we can assume $x^{\star}=\operatorname*{arg\,min}_{x\in\operatorname*{\mathbb{R}^{d}}}F(x)=\mathbf{0}$ and $F(x^{*})=0.$

According to Lemma 8, for any $u\in\operatorname*{\mathbb{R}}^{d}$ , and $\eta\leq\min\{\frac{1-2c(p)a}{2L^{M}_{f}},\frac{1}{2\sqrt{m}L^{M}_{f}}\}$ we have

[TABLE]

or equivalently,

[TABLE]

In the following proof, we will omit $\operatorname*{\mathbb{E}}$ .

Setting $u=x^{*}=0$ and $u=x^{j}$ yields the following two inequalities:

[TABLE]

Define $\tau=\frac{1}{2}m\eta(\sigma_{f}^{M}-\frac{2c(p)}{a\eta})$. Multiplying (B.3) by $(1+2\tau)$ and adding the result to (B.5) yields

[TABLE]

Multiplying both sides by $(1+\tau)^{j}$ gives

[TABLE]

Summing over $j=0,1,...,k-1$ , we have

[TABLE]

Since $F(x^{j})\geq 0$ , we have

[TABLE]

By the strong convexity of $F$ , we have $F(x^{0})\geq\frac{\sigma_{f}^{M}}{2}\|x^{0}\|_{M}^{2}$ , therefore

[TABLE]

Finally, recall that $a>0$ can be chosen arbitrarily, so we can take

[TABLE]

and

[TABLE]

In order for the choice of $\eta$ in (B.7) to be possible, we need

[TABLE]

to have one solution at least, which requires

[TABLE]

under which $\eta=\frac{1}{4L^{M}_{f}}$ satisfies (B.8). As a result, $m\geq 4$ turns (B.7) into

[TABLE]

and the desired convergence result follows from (B.6). ∎

Appendix C Proof of Lemma 2

Proof.

From Lemma 1, we know that

[TABLE]

where

[TABLE]

Therefore, in order for $64\kappa^{M}_{f}c^{2}(p)\leq 1$ , we need

[TABLE]

which is equivalent to

[TABLE]

Thus, it suffices to require that

[TABLE]

which gives

[TABLE]

∎

Appendix D Proof of Theorem 2

The proof of Theorem 2 is similar to that of Theorem 4.3 of (Allen-Zhu, 2018), so we provide a proof sketch here and omit the details.

1. In (Allen-Zhu, 2018), the proof of Theorem 4.3 is based on Lemma 3.3. Here, the proof of Theorem 2 is based on Lemma 8, which is an analog of Lemma 3.3 in our setting.

2. Based on Lemma 8, the proof of Theorem 2 follows in nearly the same way as Theorem 4.3 of (Allen-Zhu, 2018). The only difference is that one needs to replace $\sigma$ by $\sigma^{M}_{f}-\frac{2c(p)}{a\eta}$.

3. By setting

[TABLE]

and

[TABLE]

as in the proof of Theorem 1, the $\tau$ in Theorem 4.3 of (Allen-Zhu, 2018) becomes $\frac{1}{2}m\eta\sigma^{M}_{f}$ , and the convergence result of Theorem 2 follows.

Appendix E Proof of Theorems 3 and 4

Proof of Theorem 3.

From Remark 5, we know that the gradient complexity of SVRG can be expressed as

[TABLE]

Taking the largest possible step size $\eta=\frac{1}{2\sqrt{m}L_{f}}$ as in Theorem 1, we have

[TABLE]

Let us first find the optimal $m=m^{\star}$ for SVRG. Let

[TABLE]

then

[TABLE]

Differentiating the numerator gives

[TABLE]

Therefore, $m^{\star}$ is given by $g^{\prime}(m)=0$ . Let $z=\frac{\sqrt{m}}{8\kappa_{f}}>0$ , then

[TABLE]

Since $\ln(1+z)>\frac{z}{1+z}$ for $z>0$ , we know that $g^{\prime}(n)>0$ , therefore, $m^{\star}<n$ .

Let $m=n^{s}$ where $0<s<1$. We would like to have $g^{\prime}(n^{s})<0$, i.e.,

[TABLE]

so that $m^{\star}\in(n^{s},n)$ .

Since $\kappa_{f}>n^{\frac{1}{2}}$ and $m<n$, we have $z=\frac{\sqrt{m}}{8\kappa_{f}}<\frac{1}{8}$. On the other hand, we have

[TABLE]

Therefore, it suffices to have

[TABLE]

As a result, we have $m^{\star}\in(\frac{n}{c_{0}},n)$ , and

[TABLE]

where in the second equality we have used $\kappa_{f}>n^{\frac{1}{2}}$ .

For our iPreSVRG in Algorithm 1, we have

[TABLE]

thanks to Lemma 2, $p$ can be chosen as

[TABLE]

furthermore, we can take $\eta=\frac{1}{2\sqrt{m}L_{f}}$ due to Theorem 1.

Under these settings, we have

[TABLE]

Let us take $m=m^{\prime}=\lceil\frac{n}{1+pd}\rceil$ .

If $n>1+pd$ , or equivalently $\kappa_{f}<n^{2}d^{-2}$ , then

[TABLE]

Since $p={\mathcal{O}}\Big(\sqrt{\kappa(M)}\ln\big(\sqrt{\kappa^{M}_{f}}\kappa(M)\big)\Big)$, we know that when $(\kappa^{M}_{f})^{2}\sqrt{\kappa(M)}d<n$, or equivalently $\kappa_{f}<n^{2}d^{-2}$, we have

[TABLE]

therefore

[TABLE]

and

[TABLE]

If $n\leq 1+pd$ , or equivalently $\kappa_{f}>n^{2}d^{-2}$ , then $m=1$ and

[TABLE]

therefore

[TABLE]

Since $\kappa(M)\approx\kappa_{f}\gg\kappa^{M}_{f}$, this ratio becomes ${\mathcal{O}}(\frac{d}{\sqrt{n\kappa_{f}}})$. ∎

Proof of Theorem 4.

The proof of Theorem 4 is similar and is omitted. ∎

Bibliography

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM, 2017.
Allen-Zhu, Z. Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization. In ICML, 2018.
Allen-Zhu, Z. and Yuan, Y. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pp. 1080–1089, 2016.
Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pp. 1110–1119, 2016.
Bauschke, H. H., Combettes, P. L., et al. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 2011. Springer, 2017.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pp. 1646–1654, 2014a.