DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification

Daegun Yoon; Sangyoon Oh

arXiv:2307.03500·cs.LG·July 17, 2023

TL;DR

DEFT is a scalable gradient sparsification method for distributed deep learning. It partitions the gradient selection task among workers, which reduces both computational cost and communication traffic while maintaining high convergence performance.

Contribution

DEFT introduces a partitioned gradient selection scheme in which each worker selects gradients only within its own non-intersecting partition, so selection cost falls as workers are added and gradient build-up is avoided.

Findings

1. Significant speedup in gradient selection compared to existing methods.
2. Maintains high convergence performance.
3. Reduces communication traffic regardless of the number of workers.

Abstract

Gradient sparsification is a widely adopted solution for reducing the excessive communication traffic in distributed deep learning. However, most existing gradient sparsifiers have relatively poor scalability because of the considerable computational cost of gradient selection and/or increased communication traffic owing to gradient build-up. To address these challenges, we propose a novel gradient sparsification scheme, DEFT, that partitions the gradient selection task into subtasks and distributes them to workers. DEFT differs from existing sparsifiers, in which every worker selects gradients among all gradients. Consequently, the computational cost can be reduced as the number of workers increases. Moreover, gradient build-up can be eliminated because DEFT allows workers to select gradients in partitions that are non-intersecting (between workers). Therefore, even if the number of…
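The contrast drawn in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes flat gradient vectors, equal-sized contiguous partitions, and an equal selection budget per worker, whereas the paper's title indicates that DEFT uses per-layer gradient norms. The function names (`topk_indices`, `conventional_selection`, `partitioned_selection`) are hypothetical, and plain top-k by magnitude stands in for the selection rule. The sketch compares how many distinct indices must be communicated when every worker selects over all gradients versus when each worker selects only inside its own partition.

```python
import heapq
import random


def topk_indices(values, indices, k):
    """Return the k indices (from `indices`) with the largest |values[i]|."""
    return set(heapq.nlargest(k, indices, key=lambda i: abs(values[i])))


def conventional_selection(worker_grads, k):
    """Each worker runs top-k over the whole gradient vector."""
    n = len(worker_grads[0])
    return [topk_indices(g, range(n), k) for g in worker_grads]


def partitioned_selection(worker_grads, k):
    """Worker w runs top-(k / num_workers) inside its own disjoint slice.

    Assumption for this sketch: equal contiguous slices and equal budgets.
    """
    n = len(worker_grads[0])
    num_workers = len(worker_grads)
    k_local = k // num_workers
    selected = []
    for w, g in enumerate(worker_grads):
        start = n * w // num_workers
        stop = n * (w + 1) // num_workers
        selected.append(topk_indices(g, range(start, stop), k_local))
    return selected


random.seed(0)
num_workers, n, k = 4, 1000, 40
# Each worker holds its own local gradient vector (random toy data).
worker_grads = [[random.gauss(0.0, 1.0) for _ in range(n)]
                for _ in range(num_workers)]

conv = conventional_selection(worker_grads, k)
part = partitioned_selection(worker_grads, k)

# The union of selected indices is what must be exchanged after gathering.
conv_union = set().union(*conv)
part_union = set().union(*part)

print("conventional: distinct indices =", len(conv_union))
print("partitioned:  distinct indices =", len(part_union))
```

In the conventional case the union can grow toward `k * num_workers` as workers disagree on which gradients are largest, which is the build-up the abstract describes. In the partitioned case the union is exactly `k` by construction, and each worker searches only `n / num_workers` candidates.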

Code & Models

Repositories

kljp/deft (PyTorch, official)

Taxonomy

Methods: Gradient Sparsification · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings