IS-ASGD: Accelerating Asynchronous SGD using Importance Sampling

Fei Wang; Jun Ye; Weichen Li; Guihai Chen

arXiv:1706.08210·cs.DC·April 11, 2018

TL;DR

This paper introduces IS-ASGD, a novel importance sampling-based asynchronous SGD method that accelerates convergence especially on large-scale sparse datasets, outperforming SVRG-ASGD in efficiency.

Contribution

The paper proposes a new importance sampling combined ASGD method with proven superior convergence bounds and practical efficiency for large-scale sparse datasets.

Findings

01

IS-ASGD outperforms SVRG-ASGD in convergence speed.

02

IS-ASGD is more efficient on large-scale sparse datasets.

03

Experimental results confirm theoretical advantages.

Abstract

Variance reduction (VR) techniques for accelerating the convergence rate of the stochastic gradient descent (SGD) algorithm have been developed with great effort recently. Two VR variants, stochastic variance-reduced gradient (SVRG-SGD) and importance sampling (IS-SGD), have achieved remarkable progress. Meanwhile, asynchronous SGD (ASGD) is becoming more critical due to the ever-increasing scale of optimization problems. Applying VR to ASGD to accelerate its convergence rate has therefore attracted much interest, and SVRG-ASGD algorithms have been proposed. However, we found that SVRG performs unsatisfactorily in accelerating ASGD when the datasets are sparse and large-scale. In such cases, SVRG-ASGD's per-iteration computation cost is orders of magnitude higher than that of ASGD, which makes it very slow. On the other hand, IS achieves an improved convergence rate with little extra computation cost and…

Tables

Table 1: Evaluation Datasets

Name	Dimension	Instances	$\nabla f_{i}$ sparsity	$\psi$	$\rho$	Source
News20	1,355,191	19,996	$10^{-3}$	0.972	$5^{-4}$	JMLR
URL	3,231,961	2,396,130	$10^{-5}$	0.964	$3^{-4}$	ICML
Algebra	20,216,830	8,407,752	$10^{-7}$	0.892	$1^{-4}$	KDD
Bridge	29,890,095	19,264,097	$10^{-7}$	0.877	$2^{-4}$	KDD

Equations

f_{i}(w) = \phi_{i}(w) + \eta r(w),

\min_{w \in \mathbb{R}^{d}} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_{i}(w).

w_{t+1} = w_{t} - \lambda \nabla f_{i_{t}}(w_{t}),

\mathbb{E}[\mathbb{V}(\nabla f_{i_{t}}(w_{t}) - \nabla F(w_{t}))],

\langle x - y, \nabla f_{i}(x) - \nabla f_{i}(y) \rangle \geq \mu \|x - y\|_{2}^{2}, \forall x, y \in \mathbb{R}^{d}

\|\nabla f_{i}(x) - \nabla f_{i}(y)\|_{2} \leq L_{i} \|x - y\|_{2}, \forall x, y \in \mathbb{R}^{d}

p_{i}^{t} = I_{i}^{t} / N, \quad \text{s.t.} \quad \sum_{i=1}^{N} p_{i}^{t} = 1

w_{t+1} = w_{t} - \frac{\lambda}{n p_{i_{t}}^{t}} \nabla f_{i_{t}}(w_{t})

E [F (w_{t + 1})

-\mu\mathbb{E}\|w_{\star}-w_{t}\|_{2}^{2}+\frac{\lambda_{t}}{\mu}\mathbb{E}\mathbb{V}\Big((np_{i_{t}}^{t})^{-1}\nabla f_{i_{t}}(w_{t})\Big)

\mathbb{V}\big((np^{t}_{i_{t}})^{-1}\nabla f_{i_{t}}(w_{t})\big)=\mathbb{E}\|(np^{t}_{i_{t}})^{-1}\nabla f_{i_{t}}(w_{t})-\nabla F(w_{t})\|_{2}^{2}

p_{i}^{t} = \frac{\|\nabla f_{i}(w_{t})\|_{2}}{\sum_{j=1}^{N} \|\nabla f_{j}(w_{t})\|_{2}}, \forall i \in \{1, 2, ..., N\}.

p_{i} = \frac{L_{i}}{\sum_{j=1}^{N} L_{j}}, \forall i \in \{1, 2, ..., N\}.

\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[F(w_{t}) - F(w_{\star})] \leq \frac{\|w_{\star} - w_{0}\|_{2}^{2}}{\sigma} \Big(\frac{\sum_{i=1}^{n} L_{i}}{n}\Big) \frac{1}{T},

\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[F(w_{t}) - F(w_{\star})] \leq \frac{\|w_{\star} - w_{0}\|_{2}^{2} \sum_{i=1}^{n} (L_{i}^{2})}{\sigma n} \frac{1}{T},

\psi = \frac{(\sum_{i=1}^{n} L_{i})^{2}}{\sum_{i=1}^{n} (L_{i}^{2})},

\|\nabla f_{i}(w)\|_{2} \leq 2 (1 + \|x_{i}\|_{2} / \lambda) \|x_{i}\|_{2} + \lambda

\mathcal{D} = \{x_{1}, x_{2}, x_{3}, x_{4}\}

\Phi_{a} = \sum_{i=1}^{N_{a}} L_{a_{i}}

\Phi_{a} = \Phi_{b}, \forall a, b \in \{1, ..., num_{T}\}

\rho = \frac{\sum_{i=1}^{N} (L_{i} - \mu)^{2}}{N},

w_{t+1} = w_{t} - \lambda \nabla f_{i_{t}}(w_{t} + \theta_{t})

∥ w_{t + 1} - w_{⋆} ∥_{2}^{2} =

=

\lambda^{2} \|\nabla f_{i_{t}}(\hat{w}_{t})\|_{2}^{2} + 2\lambda \langle \hat{w}_{t} - w_{t}, \nabla f_{i_{t}}(\hat{w}_{t}) \rangle

\langle \hat{w}_{t} - w_{\star}, \nabla f_{i_{t}}(\hat{w}_{t}) \rangle \geq \mu \|\hat{w}_{t} - w_{\star}\|_{2}^{2}

\mu \|\hat{w}_{t} - w_{\star}\|_{2}^{2} \geq \frac{\mu}{2} \|w_{t} - w_{\star}\|_{2}^{2} - \mu \|\hat{w}_{t} - w_{t}\|_{2}^{2}

\epsilon_{t+1} \leq (1 - \lambda\mu) \epsilon_{t}

+ 2\lambda R_{2}^{t} \mathbb{E}\langle \hat{w}_{t} - w_{t}, \nabla f_{i_{t}}(\hat{w}_{t}) \rangle

ϵ_{t + 1} \leq

+\underbrace{\lambda^{2}M^{2}\Big(8\tau\frac{\bar{\Delta}}{n}+4\lambda\mu\tau+16\lambda\mu\tau^{2}\frac{\bar{\Delta}}{n}\Big)}_{\delta}

k = O(1) \log(\epsilon_{0} / \epsilon) \Big(\frac{\bar{L}}{\mu} + \frac{\bar{L}}{\inf L} \frac{\sigma^{2}}{\mu^{2} \epsilon}\Big)

O\Big(\min\Big\{n/\bar{\Delta},\frac{\epsilon\mu\sup L+\sigma^{2}}{\epsilon\mu^{2}}\Big\}\Big)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Cryptography and Data Security

Full text

IS-ASGD: Accelerating Asynchronous SGD using Importance Sampling

Fei Wang, Department of Computer Science, Shanghai Jiao Tong University

Weichen Li, Carnegie Mellon University

Jason Ye, Intel Asia Pacific R&D Ltd.

Guihai Chen, Department of Computer Science, Shanghai Jiao Tong University

Abstract

Variance reduction (VR) techniques for accelerating the convergence rate of the stochastic gradient descent (SGD) algorithm have been developed with great effort recently. Two VR variants, stochastic variance-reduced gradient (SVRG-SGD) and importance sampling (IS-SGD), have achieved remarkable progress. Meanwhile, asynchronous SGD (ASGD) is becoming more critical due to the ever-increasing scale of optimization problems. Applying VR to ASGD to accelerate its convergence rate has therefore attracted much interest, and SVRG-ASGD algorithms have been proposed. However, we found that SVRG performs unsatisfactorily in accelerating ASGD when the datasets are sparse and large-scale. In such cases, SVRG-ASGD’s per-iteration computation cost is orders of magnitude higher than that of ASGD, which makes it very slow. On the other hand, IS achieves an improved convergence rate with little extra computation cost and is invariant to the sparsity of the dataset. This advantage makes it very suitable for accelerating ASGD on large-scale sparse datasets. In this paper we propose a novel IS-combined ASGD for effective convergence rate acceleration, namely, IS-ASGD. We theoretically prove the superior convergence bound of IS-ASGD. Experimental results also support our claims.

1 Introduction

For empirical risk minimization (ERM) problems, stochastic gradient descent (SGD) may be the most widely adopted solver. Let $w$ be the model to be learned, and denote

$$f_{i}(w)=\phi_{i}(w)+\eta r(w), \tag{1}$$

where $\phi_{i}$ , $i\in\{1,2,...,n\}$ are vector functions that map $\mathbb{R}^{d}\to\mathbb{R}$ and $r(w)$ is the regularizer and $\eta$ is the regularization factor. This paper studies the following ERM optimization problem:

$$\min_{w\in\mathbb{R}^{d}}F(w):=\frac{1}{n}\sum_{i=1}^{n}f_{i}(w). \tag{2}$$

For SGD algorithm, $w$ is updated as:

$$w_{t+1}=w_{t}-\lambda\nabla f_{i_{t}}(w_{t}), \tag{3}$$

where ${i_{t}}\sim P$ means that $i_{t}$ is drawn at each iteration from the sampling probability distribution $P$ , and $\lambda$ is the step size. With the growing concurrency of hardware, lock-free asynchronous SGD (ASGD) algorithms Recht et al. (2011) have been developed for speedup. With their improved speed and scalability, ASGD algorithms quickly became indispensable and are the de facto solvers for large-scale sparse optimization. As ASGD matured, interest naturally shifted to techniques for accelerating its convergence.

1.1 Variance Reduction for Convergence Acceleration of ASGD

It is commonly known that the variance of the stochastic gradient:

$$\mathbb{E}[\mathbb{V}(\nabla f_{i_{t}}(w_{t})-\nabla F(w_{t}))], \tag{4}$$

is one of the major reasons that slow down the convergence rate of SGD. The uses of variance reduction (VR) techniques to accelerate the iterative convergence rate of SGD have therefore attracted much interest recently. VR improves the iterative convergence rate by constructing variance-reduced gradient instead of using the original stochastic gradient directly for model update.

One VR algorithm, stochastic variance-reduced gradient (SVRG) Johnson and Zhang (2013), uses historical true gradients and model snapshots to reduce the gradient variance. SVRG and its variants, e.g., SAGA Defazio et al. (2014), have been reported to successfully accelerate the iterative convergence rate of SGD. Meanwhile, alongside the intensively studied SVRG-style VR algorithms, another VR technique, importance sampling (IS), also reduces the stochastic gradient variance and improves the convergence bound of SGD by sampling the training samples non-uniformly, as proposed in Zhao and Zhang (2015); Strohmer and Vershynin (2009); Needell et al. (2014); Csiba and Richtárik (2016).

Recently, with the success of VR techniques, combining ASGD with VR to further improve its convergence rate has shown practical significance, and several related works have been proposed. Interestingly, we found that all these works so far are based on SVRG-style ASGD (SVRG-ASGD), e.g., Zhao and Li (2016); Huo and Huang (2017); Meng et al. (2017); Liu et al. (2017); J. Reddi et al. (2015), while IS-style ASGD remains unexplored. One possible reason is that SVRG-SGD’s iterative convergence rate, i.e., with the iteration count as the x-axis of the convergence curve, is much higher than that of IS-SGD. However, in practical deployments, it is the absolute convergence rate, i.e., with wall-clock time as the x-axis of the convergence curve, that actually matters. Unfortunately, previous works validated SVRG-style ASGD on small-scale and relatively dense datasets, which do not reveal its drawbacks in absolute convergence rate on large-scale sparse datasets.

1.2 Absolute Convergence Acceleration: Sparsity and Performance

After intensive evaluation of the existing SVRG-ASGD algorithms, we found that their absolute convergence rate is severely limited on large-scale sparse datasets, which are, unfortunately, the de facto type of dataset that ASGD is supposed to work with. Algorithm 1 gives the generic scheme of SVRG-ASGD. Two bottlenecks drastically decrease its absolute convergence rate, and both have the same cause: SVRG is intrinsically dense.

Sparsity for Less Computation. As can be seen in line 7, each iteration of SVRG-ASGD needs two additional vector additions, i.e., $\nabla f_{i_{t}}(s)$ and $\mu$ . Intuitively, this increases the computation cost by up to two times. However, the actual increase can be far larger. As illustrated in Figure 1, in the large-scale sparse optimization problems where ASGD is applied, the stochastic gradient $\nabla f_{i_{t}}(w_{t})$ is very sparse (as shown in the three upper rows) and is therefore index-compressed, i.e., only the non-zero features are stored, together with their indices. The update of $w_{t}$ thus proceeds in an index-compressed way, and the add operation is executed very few times, e.g., $10^{-7}$ * $d$ , compared to the dimensionality $d$ . For sparse datasets with tens of millions of dimensions (which are not rare in modern optimization), index compression of the sparse gradient is the most efficient method for ASGD.

However, SVRG-ASGD involves the historical true gradient $\mu$ , which is in fact a dense gradient of size $d$ as shown in the last row of Figure 1, so index compression of $\mu$ is meaningless. The update of $w_{t}$ has to proceed on the raw vector of full length $d$ , which is typically five to seven orders of magnitude larger than the index-compressed stochastic gradient. Adding arrays of such size at every iteration is completely impractical. Given the large number of data samples, training can be extremely time-consuming even if it needs fewer iterations (SVRG accelerates ASGD in terms of iterations). In fact, for modern stochastic optimization, where sparse datasets with extremely high dimensionality are common, we found that SVRG-ASGD is computationally infeasible and often fails to complete training in a reasonable time because the loss of sparsity drastically increases the computation cost.
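The cost gap between an index-compressed update and a dense update can be illustrated with a small sketch. This is not the paper's implementation: the function names, the toy dimensionality, and the gradient values are all assumptions chosen for illustration, and the code only counts coordinate writes.

```python
# Hypothetical illustration: names and sizes are assumptions, not from the paper.

def sparse_update(w, grad_idx, grad_val, step):
    """Apply w[j] -= step * g[j] only on the stored non-zero coordinates of g."""
    for j, g in zip(grad_idx, grad_val):
        w[j] -= step * g
    return len(grad_idx)  # number of coordinate writes performed


def dense_update(w, dense_vec, step):
    """Apply w -= step * v for a dense vector v of full length d."""
    for j in range(len(w)):
        w[j] -= step * dense_vec[j]
    return len(w)  # number of coordinate writes performed


d = 100_000                      # toy dimensionality (assumption)
w = [0.0] * d

# Index-compressed stochastic gradient: three non-zero features.
sparse_writes = sparse_update(w, [3, 17, 42_000], [0.5, -1.0, 2.0], step=0.1)

# Dense vector standing in for the true gradient used by SVRG-style updates.
dense_writes = dense_update(w, [1e-3] * d, step=0.1)

write_ratio = dense_writes / sparse_writes
```

In this toy setting the dense update touches every coordinate while the sparse one touches three, which is the effect the paragraph above describes at much larger scale.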

For large datasets, when the true gradient is $10^{3}$ times larger than $\nabla f_{i}$ or more, the absolute convergence rate of SVRG-ASGD shows a large net decrease. Unfortunately, previous SVRG-ASGD works reported absolute convergence results either on datasets of relatively low dimensionality (around $10^{2}\sim 10^{5}$ ) or on larger datasets ( $10^{7}$ ) with comparisons limited to SVRG variants.

We also find that the (only) public version of SVRG-ASGD (https://github.com/CMU-ML-17-102/svrg.git), committed by the author of J. Reddi et al. (2015), does not follow the algorithm proposed in that paper. The public version omits the addition of the dense gradient $\mu$ at each iteration (in line 7) and performs it only once at the end of each epoch by multiplying $\mu$ by $n$ . The intention of this approximation seems to be to avoid the expensive dense gradient operation at each iteration. Unfortunately, we found that the convergence curve of this public version is far from that of the published version.

Sparsity for Less Conflicts. It is commonly known that one fundamental assumption of ASGD is that the datasets are sufficiently sparse; otherwise, conflicting updates to the global model raise the risk of non-convergence or inferior convergence curves. Consequently, for SVRG-ASGD, the loss of sparsity due to the dense true gradient $\mu$ not only increases the per-iteration time cost by orders of magnitude but also increases the potential for conflicting updates. Such conflicts weaken the benefit of using the variance-reduced gradient, which is another negative effect on the absolute convergence rate.

Obviously, in order to accelerate the absolute convergence rate of ASGD, a true-gradient-free VR algorithm has to be designed. Importance sampling, an elegant true-gradient-free VR algorithm, naturally comes into consideration.

1.3 IS-ASGD for Guaranteed Absolute Convergence Acceleration

Clearly, when designing a VR algorithm to achieve absolute convergence acceleration for ASGD, we want it both to keep the increase in per-iteration time cost minimal and to maintain a low potential for conflicting updates, which seems to be a difficult task. Fortunately, IS fits naturally: it does not rely on the variance-reduced gradient $v_{t}$ , which makes it free from the true gradient $\mu$ , and thus the performance bottlenecks mentioned above do not arise. In fact, IS can be implemented with no extra online computation by generating the sample sequences beforehand and letting the computation threads iterate over the generated sequences, which leaves the computation kernel the same as in ASGD. Calculating the sampling distribution is typically fast and negligible compared to the whole training time. That is, IS-ASGD preserves almost the same per-iteration computation cost and low rate of conflicting updates as ASGD while achieving a higher iterative convergence rate. These are the key advantages for achieving a high absolute convergence rate.

As mentioned above, since SVRG typically achieves much better iterative convergence acceleration, research so far has focused entirely on SVRG-ASGD algorithms, while IS-style ASGD is left unstudied. We consider this gap worth investigating because of its novelty and its practical significance for high-performance large-scale sparse optimization. Following this idea, we analyze and propose an algorithm that uses IS to effectively accelerate the absolute convergence rate of ASGD, i.e., IS-ASGD, as the novel contribution of this paper.

The rest of this paper is organized as follows. In Section 2 we analyze the potential problems of applying IS to ASGD and propose the IS-ASGD algorithm with a detailed discussion. Section 3 is dedicated to the theoretical analysis of the convergence bound improvement of IS for ASGD. In-depth evaluations of both iterative and absolute convergence are provided in Section 4. Finally, the conclusion of this paper is given in Section 5.
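The idea of generating the sample sequence beforehand can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `build_sequence`, the seed, and the toy weights are assumptions, not part of the paper's code.

```python
import random


def build_sequence(probs, length, seed=0):
    """Draw `length` sample indices offline according to the weights `probs`.

    The training loop can then simply iterate over the returned list, so the
    per-iteration kernel stays the same as with uniform sampling.
    """
    rng = random.Random(seed)  # local generator, keeps the sketch reproducible
    return rng.choices(range(len(probs)), weights=probs, k=length)


# Toy distribution over four samples (assumed values).
probs = [0.1, 0.2, 0.3, 0.4]
seq = build_sequence(probs, length=10_000)

# Empirical frequency of each index in the pre-generated sequence.
freq = [seq.count(i) / len(seq) for i in range(len(probs))]
```

Each worker would receive its own pre-generated sequence, so no sampling work remains inside the update loop.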

2 Importance Sampling for Asynchronous SGD

We first briefly introduce some key concepts of IS. Like most previous related work, we make the following assumptions for the convergence analysis of the stochastic optimization problem studied in this paper.

• $\mu$ -Convex: $f_{i}$ is strongly convex with parameter $\mu$ , that is:

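$$\langle x-y,\nabla f_{i}(x)-\nabla f_{i}(y)\rangle\geq\mu\|x-y\|_{2}^{2},\ \forall x,y\in\mathbb{R}^{d} \tag{5}$$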

• $L_{i}$ -Lipschitz: Each $f_{i}$ is continuously differentiable and $\nabla f_{i}$ has Lipschitz constant $L_{i}$ w.r.t. $\|\cdot\|_{2}$ , i.e.,

$$\|\nabla f_{i}(x)-\nabla f_{i}(y)\|_{2}\leq L_{i}\|x-y\|_{2},\ \forall x,y\in\mathbb{R}^{d} \tag{6}$$

where $\forall i\in\{1,2,...,N\}$ .

2.1 Importance Sampling

Importance sampling reduces the gradient variance through a non-uniform sampling procedure instead of drawing samples uniformly. In conventional stochastic optimization algorithms, the sampling probability of the $i$ -th sample at the $t$ -th iteration, denoted by $p_{i}^{t}$ , always equals $1/N$ , while in an IS scheme, $p_{i}^{t}$ is endowed with an importance factor $I_{i}^{t}$ , and thus the $i$ -th sample is drawn at the $t$ -th iteration with the weighted probability:

$$p_{i}^{t}=I_{i}^{t}/N,\quad\text{s.t.}\quad\sum_{i=1}^{N}p_{i}^{t}=1 \tag{7}$$

where $N$ is the number of training samples. With this non-uniform sampling procedure, to obtain an unbiased expectation, the update of $w$ is modified as:

$$w_{t+1}=w_{t}-\frac{\lambda}{np_{i_{t}}^{t}}\nabla f_{i_{t}}(w_{t}) \tag{8}$$

where $i_{t}$ is drawn i.i.d w.r.t the weighted sampling probability distribution $P^{t}=\{p_{i}^{t}\},\forall i\in\{1,2,...,N\}$ .
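To see why the step is rescaled by $1/(np_{i_{t}}^{t})$ , the following sketch checks on toy data that the expected importance-sampled direction equals the plain average gradient. The gradients and probabilities are made-up values, and the helper names are hypothetical; $n$ here is the number of samples (written $N$ elsewhere in the text).

```python
def is_step(w, grads, probs, i, step):
    """One importance-sampled SGD step using sample i: w - step/(n*p_i) * grad_i."""
    n = len(grads)
    scale = step / (n * probs[i])
    return [wj - scale * gj for wj, gj in zip(w, grads[i])]


def expected_direction(grads, probs):
    """Expectation over i ~ probs of grad_i / (n * p_i)."""
    n, d = len(grads), len(grads[0])
    out = [0.0] * d
    for g, p in zip(grads, probs):
        for j in range(d):
            out[j] += p * g[j] / (n * p)
    return out


# Toy per-sample gradients and a non-uniform distribution (assumed values).
grads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-4.0, 1.0]]
probs = [0.1, 0.2, 0.3, 0.4]

mean_grad = [sum(g[j] for g in grads) / len(grads) for j in range(2)]
expected = expected_direction(grads, probs)   # matches mean_grad up to rounding

w_next = is_step([0.0, 0.0], grads, probs, i=0, step=0.1)
```

The rescaling cancels the sampling probability in expectation, so the update stays unbiased for any distribution with all $p_{i}>0$ .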

2.2 Importance Sampling for Variance Reduction

Recall the optimization problem in Equation 2. Using the analysis of Zhao and Zhang (2015), we have the following lemma:

Lemma 1.

Let $\sigma^{2}=\mathbb{E}\|\nabla f_{i}(w_{\star})\|_{2}^{2}$ where $w_{\star}=\arg\underset{w}{\min}F(w)$ . Set $\lambda\leq\frac{1}{\mu}$ . With the update scheme defined in Equation 8, the following inequality holds:

[TABLE]

where the variance is defined as:

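$$\mathbb{V}\big((np^{t}_{i_{t}})^{-1}\nabla f_{i_{t}}(w_{t})\big)=\mathbb{E}\|(np^{t}_{i_{t}})^{-1}\nabla f_{i_{t}}(w_{t})-\nabla F(w_{t})\|_{2}^{2} \tag{10}$$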

and the expectation is estimated w.r.t distribution $P^{t}$ .

It is easy to verify that in order to minimize the gradient variance, the optimal sampling probability $p_{i}^{t}$ should be set as:

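$$p_{i}^{t}=\frac{\|\nabla f_{i}(w_{t})\|_{2}}{\sum_{j=1}^{N}\|\nabla f_{j}(w_{t})\|_{2}},\ \forall i\in\{1,2,...,N\}. \tag{11}$$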

Obviously, such iterative re-estimation of $P^{t}$ is completely impractical. The authors propose to use the supremum of $\|\nabla f_{i}(w_{t})\|_{2}$ as an approximation. Since $\nabla f_{i}$ is $L_{i}$ -Lipschitz, by further assuming $\|w_{t}\|\leq R$ for any $t$ , we have $\|\nabla f_{i}(w_{t})\|_{2}\leq RL_{i}$ , i.e., $\sup\|\nabla f_{i}(w_{t})\|_{2}=RL_{i}$ . Thus the actual sampling probability $p_{i}$ is calculated as:

$$p_{i}=\frac{L_{i}}{\sum_{j=1}^{N}L_{j}},\ \forall i\in\{1,2,...,N\}. \tag{12}$$

With such definition, $P$ needs no update and is used throughout the training procedure. The authors further prove that with Equation 12, IS accelerated SGD achieves a convergence bound as:

[TABLE]

while for standard SGD solver that actually samples $x_{i}$ w.r.t uniform distribution, the convergence bound is:

[TABLE]

when $\lambda$ is set as $\sqrt{\sigma\|w_{\star}-w_{0}\|_{2}^{2}}/\left(\frac{\sum_{i=1}^{n}L_{i}}{n}\sqrt{T}\right)$ . According to Cauchy-Schwarz inequality, we always have $\frac{(\sum_{i=1}^{n}L_{i})^{2}}{n\sum_{i=1}^{n}L_{i}^{2}}\leq 1$ , which implies that IS does improve convergence bound. Denote

$$\psi=\frac{(\sum_{i=1}^{n}L_{i})^{2}}{\sum_{i=1}^{n}(L_{i}^{2})}, \tag{15}$$

we can conclude that the improvement gets larger when $\psi\ll n$ .
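As a quick numerical check of the Cauchy-Schwarz statement, the sketch below evaluates $\psi$ for two made-up sets of Lipschitz constants. The function name and the input values are assumptions for illustration only.

```python
def psi(lipschitz):
    """psi = (sum of L_i)^2 / (sum of L_i^2) for a list of Lipschitz constants."""
    total = sum(lipschitz)
    total_sq = sum(L * L for L in lipschitz)
    return total * total / total_sq


# All constants equal: Cauchy-Schwarz holds with equality, so psi equals n.
uniform = [2.0, 2.0, 2.0, 2.0]
psi_uniform = psi(uniform)

# One sample dominates: psi falls far below n.
skewed = [1.0, 1.0, 1.0, 97.0]
psi_skewed = psi(skewed)

# Normalised ratio psi / n, the quantity bounded by 1 in the text.
ratio_uniform = psi_uniform / len(uniform)
ratio_skewed = psi_skewed / len(skewed)
```

With equal constants the ratio is 1 and importance sampling coincides with uniform sampling; the more skewed the constants, the smaller the ratio.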

For example, for L2-regularized SVM optimization problem with squared hinge loss, i.e., $f_{i}(w)=(\lfloor 1-y_{i}w^{T}x_{i}\rfloor_{+})^{2}+\frac{\lambda}{2}\|w\|_{2}^{2}$ , where $x_{i}$ is the $i$ -th sample and $y_{i}\in\{-1,+1\}$ is the corresponding label, $\|\nabla f_{i}(w)\|_{2}$ can be bounded as

[TABLE]

The pseudo code of the practical IS-SGD algorithm is shown in Algorithm 2. As can be seen, the core procedure of IS is the construction of the sampling distribution $P$ . Once $P$ is constructed, IS-SGD works the same as SGD except that the training samples are selected w.r.t. the weighted probability distribution $P$ and the step size is scaled by ${1}/{np_{i}}$ . It is clear that IS-ASGD does not rely on the true gradient $\mu$ , which frees it from the bottlenecks that drastically deteriorate the absolute convergence rate of SVRG-ASGD. This means that IS, as an effective VR technique, is very suitable for ASGD solvers on large-scale sparse datasets.

2.3 Importance Imbalance

In most ASGD implementations that solve large-scale optimization problems, each training thread/process runs on its own core/node and typically works on its local dataset for the sake of performance and scalability. For IS-ASGD, such data segmentation causes a problem: each thread/process (indexed with $i$ ) can only calculate the sampling probability distribution $P_{i}$ based on its local dataset instead of the whole dataset, which leads to sub-optimal VR performance of IS for ASGD. As illustrated in Figure 2, assume we have two working cores and the whole training dataset is:

$$\mathcal{D}=\{x_{1},x_{2},x_{3},x_{4}\} \tag{17}$$

with subset $\mathcal{D}_{1}=\{x_{1},x_{2}\}$ located on core/node 1 while $\mathcal{D}_{2}=\{x_{3},x_{4}\}$ located on core/node 2. Without loss of generality, we further assume their corresponding Lipschitz constants as $\{1,2,3,4\}$ . For comparison, in IS-SGD where the only training process works on the whole dataset, the probability distribution of being chosen as the training sample is $P=\{p_{1}=0.1,p_{2}=0.2,p_{3}=0.3,p_{4}=0.4\}$ while in IS-ASGD with local-data-training, the sampling probabilities are $P_{1}=\{p_{1}=0.33,p_{2}=0.67\}$ and $P_{2}=\{p_{3}=0.43,p_{4}=0.57\}$ respectively for each core/node. In global-data-training algorithm e.g., IS-SGD, $p_{4}$ is much larger (twice over) than $p_{2}$ while in IS-ASGD, $p_{4}$ is even smaller than $p_{2}$ which is a heavy distortion from the theoretical optimum.
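The numbers in this example can be reproduced with a few lines. The helper name `sampling_probs` is hypothetical; the Lipschitz constants $\{1,2,3,4\}$ and the two-way split are taken from the example above.

```python
def sampling_probs(lipschitz):
    """Sampling probabilities proportional to the given Lipschitz constants."""
    total = sum(lipschitz)
    return [L / total for L in lipschitz]


L_all = [1.0, 2.0, 3.0, 4.0]

# Single process working on the whole dataset (IS-SGD).
global_p = sampling_probs(L_all)

# Two cores/nodes, each seeing only its local half of the data.
local_p1 = sampling_probs(L_all[:2])   # samples x1, x2
local_p2 = sampling_probs(L_all[2:])   # samples x3, x4

rounded_global = [round(p, 2) for p in global_p]
rounded_local1 = [round(p, 2) for p in local_p1]
rounded_local2 = [round(p, 2) for p in local_p2]
```

Globally $x_{4}$ is sampled twice as often as $x_{2}$ , but with local normalisation its probability on core 2 ends up below that of $x_{2}$ on core 1.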

2.4 Importance Balancing for IS-ASGD

To reduce such imbalance, the dataset should be rearranged before its segments are divided and dispatched to their corresponding cores/nodes. As illustrated in the second row of Figure 2, to achieve a balanced importance segmentation, we design a simple balancing algorithm, shown in Algorithm 3. This procedure generates the rearranged dataset indices $\mathcal{D}_{r}$ by placing the indices $\mathcal{D}_{s}[i]$ and $\mathcal{D}_{s}[n-1-i]$ next to each other sequentially. Denote by $\Phi_{a}$ the importance sum of core/node $a$ :

$$\Phi_{a}=\sum_{i=1}^{N_{a}}L_{a_{i}} \tag{18}$$

where $L_{a_{i}}$ is the Lipschitz constant of the $i$ -th data sample on core/node $a$ , $N_{a}$ is the number of data samples on core/node $a$ . According to Equation 12, we have the sampling probability of $i$ -th data sample on core/node $a$ as $p_{a_{i}}=\frac{L_{a_{i}}}{\Phi_{a}},\forall i\in\left\{1,2,...,N_{a}\right\}$ . It is easy to prove that by satisfying

$$\Phi_{a}=\Phi_{b},\ \forall a,b\in\{1,...,num_{T}\} \tag{19}$$

where $num_{T}$ is the number of cores/nodes, the importance imbalance is eliminated. We call this dataset rearrangement procedure importance balancing. Obviously, Algorithm 3 is not guaranteed to produce an equal-importance dataset segmentation. However, segmenting a dataset into a given number (e.g., $num_{T}$ ) of equal-importance subsets is a typical NP-hard problem that cannot be solved easily. We therefore use this simple head-tail sequential matching procedure, since it is a fast approximation and generally works well in practice.

Meanwhile, it has to be pointed out that if the distribution of the Lipschitz constants is close to uniform or the dataset is sufficiently large, random shuffling works just fine for IS to perform VR, since the risk of severe importance imbalance is low. We empirically define a metric $\rho$ , which measures the potential for imbalance to some extent:
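The head-tail matching idea can be sketched as below. Algorithm 3 itself is not reproduced in this text, so this is only one plausible reading of the description: sort by Lipschitz constant, then interleave the smallest remaining index with the largest remaining one. The function names and the contiguous split are assumptions.

```python
def importance_balance(lipschitz):
    """Return sample indices rearranged by head-tail pairing of the sorted order."""
    order = sorted(range(len(lipschitz)), key=lambda i: lipschitz[i])
    n = len(order)
    rearranged = []
    for i in range(n // 2):
        rearranged.append(order[i])          # i-th smallest
        rearranged.append(order[n - 1 - i])  # i-th largest
    if n % 2 == 1:
        rearranged.append(order[n // 2])     # unpaired middle element
    return rearranged


def contiguous_split(indices, num_parts):
    """Split a list into num_parts contiguous chunks of near-equal length."""
    base, extra = divmod(len(indices), num_parts)
    parts, start = [], 0
    for k in range(num_parts):
        end = start + base + (1 if k < extra else 0)
        parts.append(indices[start:end])
        start = end
    return parts


L = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
parts = contiguous_split(importance_balance(L), num_parts=2)
importance_sums = [sum(L[i] for i in part) for part in parts]
```

On this toy input both parts receive the same importance sum; in general the pairing is only an approximation, as the text notes.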

$$\rho=\frac{\sum_{i=1}^{N}(L_{i}-\mu)^{2}}{N}, \tag{20}$$

where $\mu=\sum_{i=1}^{N}L_{i}/N$ (here $\mu$ denotes the mean of the Lipschitz constants rather than the strong-convexity parameter). A lower $\rho$ indicates a lower potential for severe importance imbalance and vice versa. Accordingly, IS-ASGD is designed to perform importance balancing adaptively depending on $\rho$ . The pseudo code of IS-ASGD is shown in Algorithm 4, where $\zeta$ is empirically set to $5^{-4}$ .
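The metric $\rho$ is the population variance of the Lipschitz constants, which the following sketch computes. The decision rule shown (balance when $\rho$ exceeds the threshold, otherwise shuffle) is an assumption inferred from the description, since Algorithm 4 is not reproduced here; the function names and the threshold value used in the example are likewise hypothetical.

```python
def imbalance_metric(lipschitz):
    """rho: mean squared deviation of the Lipschitz constants from their mean."""
    n = len(lipschitz)
    mean = sum(lipschitz) / n
    return sum((L - mean) ** 2 for L in lipschitz) / n


def choose_arrangement(lipschitz, zeta):
    """Pick a data arrangement strategy from rho.

    Assumption: balancing is applied only when rho is above the threshold zeta.
    """
    return "balance" if imbalance_metric(lipschitz) > zeta else "shuffle"


rho_spread = imbalance_metric([1.0, 2.0, 3.0, 4.0])
rho_flat = imbalance_metric([2.0, 2.0, 2.0, 2.0])

# Threshold chosen arbitrarily for this toy example.
strategy_spread = choose_arrangement([1.0, 2.0, 3.0, 4.0], zeta=0.5)
strategy_flat = choose_arrangement([2.0, 2.0, 2.0, 2.0], zeta=0.5)
```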

3 Convergence Analysis of IS-ASGD

Among the many analyses of the convergence bound of ASGD, Mania et al. (2017) model ASGD as SGD with perturbed inputs, i.e., the inconsistent state of the model is treated as the true model with added noise. Compared to other convergence analyses, this scheme is more general and compact and, most importantly, makes the analysis of the effect of IS in ASGD relatively simple. We first give a brief introduction to the perturbed iterate analysis, which serves as the base of our analysis. For ease of analysis, we presume that the dataset is perfectly importance-balanced, i.e., $\Phi_{a}=\Phi_{b}$ , $\forall a,b\in\{0,1,...,num_{T}\}$ , that is, IS achieves the theoretical convergence bound proved in the previous literature.

3.1 Perturbed Iterate Analysis

In perturbed iterate analysis Mania et al. (2017), the update of $w_{t}$ is modeled as:

$$w_{t+1}=w_{t}-\lambda\nabla f_{i_{t}}(w_{t}+\theta_{t}) \tag{21}$$

where $\theta_{t}$ is the asynchrony error term caused by lock-free update at iteration $t$ . Let $\hat{w_{t}}=w_{t}+\theta_{t}$ . We have:

[TABLE]

Recall the convexity assumption i.e., $f_{i}$ is strongly convex with parameter $\mu$ , we have:

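$$\langle\hat{w}_{t}-w_{\star},\nabla f_{i_{t}}(\hat{w}_{t})\rangle\geq\mu\|\hat{w}_{t}-w_{\star}\|_{2}^{2} \tag{23}$$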

Denote by $\epsilon_{t}$ the relative error of $\hat{w_{t}}$ , i.e, $\mathbb{E}\|\hat{w}_{t}-w_{\star}\|_{2}^{2}$ . By substituting Equation 23 back to 22, we obtain:

[TABLE]

Among the three labeled terms, notice that $R_{0}^{t}$ is a common term that exists in both SGD and ASGD, while $R_{1}^{t}$ and $R_{2}^{t}$ are additional error terms introduced by the inconsistency of the model. $R_{1}^{t}$ reflects the difference between the true model and the perturbed (noise-added) one, and $R_{2}^{t}$ measures the projection of such noise onto the gradient at each iteration. The convergence bound can be obtained once $R_{0}^{t}$ , $R_{1}^{t}$ and $R_{2}^{t}$ are bounded. The authors first bound $\mathbb{E}\|\nabla f_{i_{t}}(\hat{w_{t}})\|_{2}\leq M$ , i.e., $R_{0}^{t}\leq M^{2}$ . Next, to bound $R_{1}^{t}$ and $R_{2}^{t}$ , the concept of a conflict graph is introduced as follows.

Conflict graph

Denote by $c_{i}\subseteq\{j\}_{j=0}^{d}$ the set of feature index of data sample $x_{i}$ , i.e., $j\in c_{i}$ only if the $j$ -th feature is provided in $x_{i}$ . In a conflict graph $G=\{e_{ij},v_{i}\}$ , $i,j\in\{0,1,...,n\},\ i\neq j$ , vertices $v_{i}$ and $v_{j}$ are connected with edge $e_{ij}$ if and only if $c_{i}\cap c_{j}\neq\varnothing$ . Further define two factors that reflect the extent of conflict update:

• Delay parameter, $\tau$ , i.e., the maximum lag between when a gradient is computed and when it is applied to $w$ . It is assumed that $\tau$ is linearly related to the concurrency.

• Conflict parameter, $\bar{\Delta}$ , which is the average degree of the conflict graph $G$ . Datasets with higher $\bar{\Delta}$ suffer more severe conflicting updates and vice versa.
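The conflict parameter can be computed directly from the definition of the conflict graph given above. The sketch below is a brute-force illustration on toy feature sets; the function name and data are assumptions, and the quadratic pairwise loop is only suitable for small examples.

```python
from itertools import combinations


def average_conflict_degree(feature_sets):
    """Average vertex degree of the conflict graph.

    Two samples are connected when their sets of non-zero feature indices
    intersect. This is an O(n^2) check intended for toy inputs only.
    """
    n = len(feature_sets)
    degree = [0] * n
    for i, j in combinations(range(n), 2):
        if feature_sets[i] & feature_sets[j]:
            degree[i] += 1
            degree[j] += 1
    return sum(degree) / n


# Non-zero feature indices of four toy samples (assumed values).
samples = [{0, 1}, {1, 2}, {3}, {2, 4}]
avg_degree = average_conflict_degree(samples)

# Fully disjoint samples never conflict.
disjoint_degree = average_conflict_degree([{0}, {1}, {2}])
```

Here samples 0 and 1 share feature 1, and samples 1 and 3 share feature 2, giving two edges over four vertices.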

These two parameters measure the extent of inconsistency from two aspects. $\tau$ serves as a proxy for the concurrency of ASGD, which can be controlled by the user, while $\bar{\Delta}$ measures the intrinsic potential for conflicting updates of the dataset, which is independent of the algorithm’s settings. The authors prove that $R_{1}^{t}$ is bounded as $R_{1}^{t}\leq\lambda^{2}M^{2}\Big(2\tau+8\tau^{2}\frac{\bar{\Delta}}{n}\Big)$ and $R_{2}^{t}$ is bounded as $R_{2}^{t}\leq 4\lambda M^{2}\tau\frac{\bar{\Delta}}{n}$ . Thus the recursion can be obtained by plugging $R_{1}^{t}$ and $R_{2}^{t}$ back into Equation 24:

[TABLE]

3.2 Bounding IS-ASGD

Now the recursion of $\epsilon_{t}$ is divided into two parts, i.e., accurate SGD term $\xi$ and noise term $\delta$ . With such scheme of modeling, the difficulty of the analysis caused by the inconsistency can be greatly simplified. We thus have the following lemma that bounds IS-ASGD.

Lemma 2.

For the IS-ASGD algorithm that follows the scheme of Algorithm 4, assume the convexity and continuity conditions in Equation 5 and Equation 6 hold. Denote by $\sigma$ the residual, i.e., $\mathbb{E}\|\nabla f_{i}(w_{\star})\|_{2}$ . With the step size $\lambda=\epsilon\mu/(2\epsilon\mu\sup L+2\sigma^{2})$ , the number of iterations $k$ sufficient to achieve $\mathbb{E}\|w_{k}-w_{\star}\|_{2}^{2}\leq\epsilon$ is:

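$$k=O(1)\log(\epsilon_{0}/\epsilon)\Big(\frac{\bar{L}}{\mu}+\frac{\bar{L}}{\inf L}\frac{\sigma^{2}}{\mu^{2}\epsilon}\Big) \tag{26}$$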

when $\tau$ is bounded as

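$$\tau=O\Big(\min\Big\{n/\bar{\Delta},\frac{\epsilon\mu\sup L+\sigma^{2}}{\epsilon\mu^{2}}\Big\}\Big) \tag{27}$$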

where $\epsilon_{0}:=\max_{0\leq t\leq T}\mathbb{E}\|\hat{w_{t}}-w_{\star}\|_{2}^{2}$ .

Proof.

Using the analysis of Needell et al. (2014), we know that for $\xi$ , the convergence bound of SGD is obtained as

[TABLE]

when $\lambda=\epsilon\mu/(2\epsilon\mu\sup L+2\sigma^{2})$ and the convergence bound of $\xi$ is further reduced from supremum dependence of $L$ to average dependence through the application of IS, i.e.,

[TABLE]

With the accurate SGD term $\xi$ bounded as Equation 29, we are left to bound the noise term $\delta$ as an order-wise constant in order to achieve nearly linear speedup of IS-SGD.

According to the definition of $p_{i}$ as shown in Equation 12, we have $(np_{i})^{-1}=\frac{\bar{L}}{L_{i}}$ . Since $\nabla f_{i_{t}}$ is scaled with $(np_{i})^{-1}$ in IS, $M$ is also scaled as $M_{s}:=(np_{i})^{-1}M$ . Since $\frac{\bar{L}}{L_{i}}\leq\frac{\bar{L}}{\inf L}$ , $M_{s}$ is thus bounded as:

[TABLE]
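The scaling argument above can be checked numerically. The sketch below assumes sampling probabilities proportional to the Lipschitz constants, $p_{i}=L_{i}/\sum_{j}L_{j}$, which is what the identity $(np_{i})^{-1}=\bar{L}/L_{i}$ implies; the helper names and the example constants are hypothetical.

```python
import random


def importance_probabilities(lipschitz):
    """p_i = L_i / sum_j L_j (assumed form of the IS distribution)."""
    total = sum(lipschitz)
    return [L / total for L in lipschitz]


def gradient_scales(lipschitz):
    """Per-sample scaling (n * p_i)^(-1), which equals mean(L) / L_i."""
    mean_L = sum(lipschitz) / len(lipschitz)
    return [mean_L / L for L in lipschitz]


# Made-up Lipschitz constants for four samples.
lipschitz = [1.0, 2.0, 4.0, 9.0]
probs = importance_probabilities(lipschitz)
scales = gradient_scales(lipschitz)

# The largest scale is attained at the smallest L_i: mean(L) / inf L.
worst_case_scale = (sum(lipschitz) / len(lipschitz)) / min(lipschitz)

# Draw a few sample indices from the importance distribution.
rng = random.Random(0)
sampled = rng.choices(range(len(lipschitz)), weights=probs, k=5)
print(probs, scales, worst_case_scale, sampled)
```

Since every scale is at most $\bar{L}/\inf L$, a gradient bound $M$ for uniform sampling translates into a scaled bound of at most $(\bar{L}/\inf L)\,M$ under IS.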

With this result, from Equation 25, it can be concluded that $\delta$ is bounded as an order-wise constant when the following conditions are satisfied:

[TABLE]

Considering that $\lambda$ is set as $\epsilon\mu/(2\epsilon\mu\sup L+2\sigma^{2})$ , Equation 31 is thus satisfied by bounding $\tau$ as:

[TABLE]

Thus the recursion of IS-ASGD is the same as that of IS-SGD plus an additional order-wise constant. We thus have the convergence bound of IS-ASGD as shown in Lemma 2. ∎

This bound inherits the superiority of IS-SGD over ASGD, and it shows that IS-ASGD achieves a nearly linear speedup over IS-SGD. This is similar to the previous result of Fang and Lin (2017), which shows that SVRG-ASGD achieves a nearly linear speedup over SVRG-SGD.

In brief, the key to the convergence bound analysis is the serialization of the asynchrony, which divides the update scheme into two self-bounded terms, i.e., $\xi$ and $\delta$. This separation makes the analysis much simpler: IS reduces the convergence bound of $\xi$ exactly as it does in SGD, while the two bounded error terms caused by the asynchrony, i.e., $R^{t}_{1}$ and $R^{t}_{2}$, increase the convergence bound by at most a constant when certain conditions are met.

4 Experimental Results

To make our evaluation representative and convincing, we use the following configuration.

Testbed Our testbed is a 2-socket server with Intel Xeon E5-2699 v4 CPUs, 44 cores in total (with Hyper-Threading off), and 128 GB of main memory.

Code Base We base our IS-ASGD code on a well-validated open-source version of the ASGD algorithm (http://i.stanford.edu/hazy/victor/Hogwild/). We also make our IS-ASGD evaluation code publicly available (https://github.com/FayW/IS-ASGD.git).

Datasets The evaluation datasets are from LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). News20 (low dimensionality and relatively dense) was used for validation in previous SVRG-ASGD works. Such a small dataset cannot expose the performance bottlenecks discussed in section 1, yet we still select it for comparison. For the other three large-scale datasets, SVRG-ASGD fails to finish training in a reasonable time (for the KDD datasets, SVRG-ASGD takes about 2 hours to finish 1 epoch with 44 threads), and thus we show no comparison with it.

According to the empirical threshold we set for $\rho$ in section 2, News20 is importance-balanced, while for the other datasets we use simple random shuffling for IS-ASGD.

Objective Functions We evaluate IS-ASGD on the most widely used objective function in classification problems, i.e., the L1-regularized cross-entropy loss. It has been widely adopted in models ranging from simple linear models to neural networks.

Concurrency 16, 32 and 44 threads are evaluated.

Algorithms Besides the ASGD algorithm, which is our target to accelerate, we also evaluate SGD (as the baseline) and SVRG-ASGD.

SVRG-ASGD We implement SVRG-ASGD by strictly following the algorithm proposed in J. Reddi et al. (2015), without the skip-$\mu$ approximation used in the publicly available version, since this approximation deteriorates the convergence rate significantly.

Metrics Two metrics are evaluated: root mean squared error (RMSE, with the objective value as the error) and error rate (i.e., misclassification). The error rate is updated whenever a better result is obtained.
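For concreteness, the sketch below shows one common form of an L1-regularized cross-entropy objective for a binary linear classifier, together with the misclassification rate. It is an illustration under stated assumptions, not the evaluation code: labels are assumed to be in {-1, +1}, each sample is a sparse dict mapping feature index to value, and all names are hypothetical.

```python
import math


def margin(w, x, y):
    """y * <w, x> for a sparse sample x given as {feature_index: value}."""
    return y * sum(w[j] * v for j, v in x.items())


def logistic_loss(w, x, y):
    """Numerically stable log(1 + exp(-margin))."""
    m = margin(w, x, y)
    if m > 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def objective(w, data, l1):
    """Average cross-entropy loss plus an L1 penalty with weight l1."""
    avg_loss = sum(logistic_loss(w, x, y) for x, y in data) / len(data)
    return avg_loss + l1 * sum(abs(v) for v in w)


def error_rate(w, data):
    """Fraction of samples with non-positive margin (ties count as errors)."""
    wrong = sum(1 for x, y in data if margin(w, x, y) <= 0)
    return wrong / len(data)


# Toy data: two sparse samples in a 3-dimensional feature space.
data = [({0: 1.0}, 1), ({2: 2.0}, -1)]
w = [1.0, 0.0, -1.0]
print(objective(w, data, l1=0.1), error_rate(w, data))
```

Only the coordinates present in a sample contribute to its margin, which is the property that keeps the per-iteration cost of ASGD and IS-ASGD proportional to the number of non-zeros.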

4.1 Iterative Convergence Rate Acceleration

For the iterative comparison, our expectation is that for the same number of epochs, IS-ASGD achieves a lower RMSE than ASGD, and that the error rate of IS-ASGD is roughly (not strictly) lower than that of ASGD, since a lower cost does not necessarily imply a lower error rate. Figure 3 shows the iterative convergence results for all four datasets in their corresponding sub-figures.

Comparing to SVRG-ASGD From Figure 3-a, we see that SVRG-ASGD achieves the best iterative convergence rate by a large margin, which comes at the price of an iterative computation cost that is orders of magnitude higher. It can also be seen that the improvements of SVRG diminish quickly as the concurrency increases. This agrees with our previous analysis that SVRG suffers from a higher potential for conflicting updates due to its loss of sparsity, which makes it more concurrency-sensitive.

Comparing to ASGD We notice that the iterative convergence metrics of ASGD are the worst. ASGD is worse than SGD on datasets that are relatively dense, e.g., News20 and URL, while on datasets that are sufficiently sparse, e.g., the KDD datasets, its convergence rate is close to that of SGD. It is also clear that the iterative convergence rate of IS-ASGD is much better than that of ASGD in all cases. In fact, IS-ASGD also reaches a better optimum, i.e., a lower final error rate and RMSE, as can be seen from the results.

The two algorithms also differ in concurrency-robustness. For instance, in Figure 3-c, when $\tau=16$, ASGD approaches the convergence rate of SGD as the number of epochs increases. However, its convergence metrics deteriorate quickly when $\tau$ increases to 32 and 44. Meanwhile, IS-ASGD appears unaffected: it stays close to SGD at all concurrency levels, which is a large improvement over ASGD and demonstrates its concurrency-robustness.

Figure 3-b and d show similar results: IS-ASGD achieves significant convergence rate acceleration compared to ASGD and SGD. These two datasets, i.e., KDD2010_Alg. and Bri., are sparse and have extremely large dimensionality and numbers of samples. They also have lower $\psi$. As mentioned in Equation 15, section 2.2, the convergence bound improvement from applying IS in SGD is negatively correlated with $\psi$, and thus IS-ASGD achieves much more significant convergence improvements on these two datasets, even surpassing SGD by a large margin. In Figure 3-a and Figure 3-c, where the datasets are relatively small (potentially higher imbalance), dense and have higher $\psi$, its convergence is close to that of SGD.

In fact, when the conditions discussed in section 3 are satisfied, i.e., the datasets are sufficiently sparse, the iterative convergence rate of IS-ASGD will be no worse than that of SGD, while the iterative convergence rate of ASGD will be no better than that of SGD. When the datasets are larger in scale and have lower $\psi$, the convergence rate improvement of IS-ASGD increases significantly. We conclude that IS-ASGD effectively accelerates the iterative convergence rate of ASGD thanks to the superior convergence bound it inherits from IS-SGD. Such improvements translate directly into absolute convergence rate acceleration, since the iterative time cost of IS-ASGD remains almost the same as that of ASGD and, most importantly, it preserves the sparsity.

4.2 Absolute Convergence Rate Acceleration

We present the results of absolute convergence acceleration in two forms, as shown in Figure 4 and Figure 5 respectively. While the iterative convergence results shown above are mainly of academic interest, the absolute convergence rate is the metric that matters for practical deployments, since practitioners always want to obtain a trained model in less time. Figure 4 plots the absolute convergence curves with wall-clock time in seconds on the x-axis. Note that the first column provides the RMSE comparison between all algorithms, while the second and third columns show only the RMSE and error rate comparison between ASGD and IS-ASGD for better resolution, since their curves are very short compared to those of SGD and SVRG-ASGD.

As can be seen from Figure 4-a, SVRG-ASGD takes much longer than the other algorithms to achieve the same accuracy, despite its superior iterative convergence rate shown in Figure 3-a, since its iterative computation cost is several orders of magnitude higher. For the comparison between ASGD and IS-ASGD, we plot the final best error rate (referred to as the optimum) of ASGD as a red circle, while the blue dot marks the point at which IS-ASGD reaches the same optimum. This comparison directly shows the final absolute speedup of IS for ASGD. In Figure 4-d, IS-ASGD achieves a maximum acceleration of 1.8x, while in other cases its acceleration of the optimum varies depending on the dataset and concurrency. Figure 4-c shows that IS-ASGD also achieves a better final optimum and a higher acceleration over ASGD as concurrency increases.

The results also show that the optimums of the error rate are reached much earlier than those of the RMSE, which implies that accelerating the early stage of convergence is more important, since in the later stage the error rate improvements are very limited. We thus present error-rate/speedup curves for an in-depth inspection.

In-depth: Slice Inspection Figure 5 is derived directly from Figure 4. The two 3D figures in each sub-figure show the speedups of IS-ASGD over ASGD and SGD in a slicing manner for a deeper inspection of the convergence procedure. The x-axis is the error rate, the y-axis is the concurrency, and the z-axis is the absolute speedup in reaching the corresponding error rate (values are linearly interpolated when needed).
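The slicing described above can be sketched as follows: for a target error rate, find the wall-clock time at which each algorithm's curve first reaches it, interpolating linearly between recorded points, and take the ratio of the two times. This is an assumed reconstruction of the procedure, with hypothetical function names and made-up curves.

```python
def time_to_reach(times, errors, target):
    """First wall-clock time at which the error curve drops to `target`.

    Linearly interpolates between the two recorded points that bracket
    the target. Returns None if the curve never reaches it.
    """
    if errors[0] <= target:
        return times[0]
    for k in range(1, len(errors)):
        e0, e1 = errors[k - 1], errors[k]
        if e1 <= target < e0:
            frac = (e0 - target) / (e0 - e1)
            return times[k - 1] + frac * (times[k] - times[k - 1])
    return None


def speedup(baseline, accelerated, target):
    """Ratio of baseline time to accelerated time for reaching `target`."""
    t_base = time_to_reach(*baseline, target)
    t_acc = time_to_reach(*accelerated, target)
    if t_base is None or t_acc is None or t_acc == 0:
        return None
    return t_base / t_acc


# Made-up (times, error rates) curves for illustration only.
asgd_curve = ([0.0, 10.0, 20.0], [0.5, 0.3, 0.1])
is_asgd_curve = ([0.0, 10.0, 20.0], [0.5, 0.2, 0.1])
print(speedup(asgd_curve, is_asgd_curve, target=0.25))
```

Evaluating `speedup` over a grid of target error rates and concurrency levels yields one speedup surface per dataset, which is the kind of view Figure 5 presents.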

From the speedup curves, we can see that the speedups are largest at the early stage and drop in the middle. We can also conclude that the scale of the dataset affects the speedup curves in two respects. First, for large-scale datasets, i.e., in Figure 5-b and d, the speedups rise at the final stage of the convergence procedure, which implies that IS-ASGD achieves its best acceleration on large-scale datasets when searching for the optimal models. Second, for large-scale datasets, the average speedups of IS-ASGD over ASGD seem to be invariant to the concurrency, as the curves show similar shapes and means, which indicates concurrency-robustness.

In summary, the average speedups of IS-ASGD over ASGD range from 1.26 to 1.97, while the optimum speedups range from 1.13 to 1.54. For the raw computational speedup, the speedups of IS-ASGD over SGD with 16 threads range from 6.39 to 12.29 and increase to between 11.89 and 23.53 when the thread count increases to 44, depending on the size of the dataset. In general, small datasets do not achieve a good raw computational speedup. Taking the sampling time into consideration, the raw computational speedups of IS-ASGD are typically 1.1% to 7.7% lower than those of ASGD, which is a relatively small difference. If we generate the sample sequence of IS-ASGD for each thread only once and simply shuffle it every epoch, there is no computational performance gap between ASGD and IS-ASGD. In fact, this approximation works well in practice according to our evaluation.
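The approximation mentioned at the end of the previous paragraph, i.e., drawing each thread's importance-sampled sequence once and only reshuffling it afterwards, can be sketched as below. The function names, seeds and probabilities are hypothetical, and the sketch is single-threaded for clarity.

```python
import random


def make_thread_sequence(probs, length, seed):
    """Draw one importance-sampled index sequence for a thread (done once)."""
    rng = random.Random(seed)
    sequence = rng.choices(range(len(probs)), weights=probs, k=length)
    return sequence, rng


def reshuffle_for_epoch(sequence, rng):
    """Reuse the same multiset of indices in a new random order."""
    rng.shuffle(sequence)
    return sequence


# Made-up sampling probabilities over four samples.
probs = [0.1, 0.2, 0.3, 0.4]
sequence, rng = make_thread_sequence(probs, length=8, seed=42)
first_epoch = list(sequence)
second_epoch = list(reshuffle_for_epoch(sequence, rng))
print(first_epoch, second_epoch)
```

The per-epoch cost then reduces to a shuffle, which uniform-sampling ASGD already pays, at the price of reusing the same draw counts per sample in every epoch.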

4.3 Discussion: When Datasets are Dense

The main reason that SVRG performs inefficiently on large-scale sparse datasets is its reliance on the dense gradient $\mu$, which is orders of magnitude larger than the stochastic gradient $\nabla f_{i}$. On the other hand, if the datasets are dense, e.g., when the sparsity of $\nabla f_{i}$ is higher than $10^{-3}$ and thus close to that of $\mu$, SVRG-ASGD prevails, since the iterative computation costs are of the same order of magnitude and the iterative convergence rate of SVRG-ASGD is much higher. Additionally, when datasets are small-scale, the whole training procedure tends to finish quickly. In this case, the performance bottleneck is the overhead of multi-process scheduling, all-reduce operations, etc., rather than the computation, and thus SVRG-ASGD is likely to outperform the other algorithms. Since small datasets are of only academic interest, the proper applications for SVRG-ASGD appear to be scenarios that use ASGD on relatively dense datasets. However, for most large-scale optimization problems, the datasets are typically sparse, with sparsity significantly lower than $10^{-3}$.
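A rough back-of-the-envelope calculation illustrates the cost gap. The sketch below assumes a simplified cost model in which a sparse update touches only the non-zero coordinates of $\nabla f_{i}$ while an update involving the dense $\mu$ touches every coordinate, so the per-iteration cost ratio is about the inverse of the gradient sparsity. The sparsity values are the order-of-magnitude figures listed in Table 1; the function name is hypothetical.

```python
# Order-of-magnitude gradient sparsity (fraction of non-zeros) from Table 1.
GRADIENT_SPARSITY = {"News20": 1e-3, "URL": 1e-5, "Algebra": 1e-7}


def dense_to_sparse_cost_ratio(sparsity):
    """Assumed cost model: dense update cost / sparse update cost = 1 / sparsity."""
    return 1.0 / sparsity


for name, sparsity in GRADIENT_SPARSITY.items():
    ratio = dense_to_sparse_cost_ratio(sparsity)
    print(f"{name}: a dense update costs about {ratio:.0e}x a sparse one")
```

Under this model the gap grows from roughly three orders of magnitude on News20 to roughly seven on Algebra, which is consistent with SVRG-ASGD being practical only on the densest of the evaluated datasets.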

5 Conclusion

Techniques for accelerating the convergence rate of asynchronous stochastic optimization are of great importance and have long been an active research field. In this paper we identify several previously unrecognized bottlenecks of current SVRG-based ASGD acceleration algorithms on large-scale sparse datasets and propose the novel IS-ASGD algorithm, which naturally avoids these bottlenecks. Its key advantage lies in its ability to accelerate the iterative convergence rate of ASGD with little increase in the iterative time cost, which in turn results in effective acceleration of the absolute convergence rate. We use an importance-balancing trick to balance the importance between asynchronous cores/nodes, which helps preserve the optimal VR performance of IS. Moreover, we theoretically prove the convergence bound of IS-ASGD, which shows that IS-ASGD speeds up IS-SGD almost linearly and consequently inherits its superior convergence bound over ASGD and SGD. The experimental results verify that IS-ASGD achieves a 1.13x to 1.54x absolute convergence rate acceleration over ASGD. The evaluation code is publicly accessible.

Bibliography

Csiba and Richtárik (2016) Dominik Csiba and Peter Richtárik. 2016. Importance sampling for minibatches. arXiv preprint arXiv:1602.02283 (2016).
Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. 2014. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems. 1646–1654.
Fang and Lin (2017) Cong Fang and Zhouchen Lin. 2017. Parallel Asynchronous Stochastic Variance Reduction for Nonconvex Optimization. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 794–800.
Huo and Huang (2017) Zhouyuan Huo and Heng Huang. 2017. Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. 2043–2049.
J. Reddi et al. (2015) Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex J Smola. 2015. On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2647–2655.
Johnson and Zhang (2013) Rie Johnson and Tong Zhang. 2013. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In Advances in Neural Information Processing Systems. 315–323.
Liu et al. (2017) Yuanyuan Liu, Fanhua Shang, and James Cheng. 2017. Accelerated Variance Reduced Stochastic ADMM. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 2287–2293.
