Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD

Rosa Candela; Giulio Franzese; Maurizio Filippone; Pietro Michiardi

arXiv:1910.09466 · cs.LG · January 19, 2021 · 1 cites

TL;DR

This paper shows that applying sparsification in distributed asynchronous SGD does not degrade the convergence rate, so communication can be reduced without sacrificing training performance in large-scale machine learning.

Contribution

First theoretical proof showing sparsification does not harm convergence in asynchronous, non-convex SGD with staleness, supported by empirical validation.

Findings

01 Sparsification maintains the same convergence rate as standard SGD.

02 Empirical results confirm negligible impact of sparsification on convergence.

03 Sparsification effectively reduces communication overheads in distributed systems.

Abstract

Large-scale machine learning increasingly relies on distributed optimization, whereby several machines contribute to the training process of a statistical model. In this work we study the performance of asynchronous, distributed settings when applying sparsification, a technique used to reduce communication overheads. In particular, for the first time in an asynchronous, non-convex setting, we theoretically prove that, in the presence of staleness, sparsification does not harm SGD performance: the ergodic convergence rate matches the known result for standard SGD, that is, $O(1/\sqrt{T})$. We also carry out an empirical study to complement our theory, and confirm that the effects of sparsification on the convergence rate are negligible when compared to 'vanilla' SGD, even in the challenging scenario of an asynchronous, distributed system.
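To make the setting in the abstract concrete, here is a minimal single-process sketch of sparsified SGD with stale gradients. It is an illustration under stated assumptions, not the authors' implementation: the top-k operator, the toy quadratic objective, the uniformly random delay model, and the names `top_k_sparsify` and `stale_sparse_sgd` are all hypothetical choices made for this example.

```python
import numpy as np


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero the rest.

    Top-k is assumed here for illustration; it is one common sparsifier.
    """
    if k >= grad.size:
        return grad.copy()
    sparse = np.zeros_like(grad)
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse[idx] = grad[idx]
    return sparse


def stale_sparse_sgd(dim=20, k=5, max_delay=4, steps=2000, lr=0.05, seed=0):
    """Simulate asynchronous SGD on f(x) = 0.5 * ||x - target||^2.

    Staleness is emulated by computing each gradient at a parameter
    vector that is up to max_delay updates old (an assumed delay model).
    """
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)
    x = np.zeros(dim)
    history = [x.copy()]  # past parameter versions, oldest first
    for _ in range(steps):
        delay = rng.integers(0, max_delay + 1)
        stale_x = history[max(0, len(history) - 1 - delay)]
        # Noisy gradient evaluated at the stale parameters.
        grad = (stale_x - target) + 0.01 * rng.normal(size=dim)
        # Only the sparsified gradient is "communicated" and applied.
        x = x - lr * top_k_sparsify(grad, k)
        history.append(x.copy())
    return x, target


if __name__ == "__main__":
    x, target = stale_sparse_sgd()
    print("distance to optimum:", np.linalg.norm(x - target))
```

With `k=5` out of 20 coordinates, each update transmits a quarter of the gradient entries. The sketch uses no error-feedback memory for the dropped entries; whether to add one is a separate design choice that this example does not address.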

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Privacy-Preserving Technologies in Data

Methods: Stochastic Gradient Descent