Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment

Daegun Yoon; Sangyoon Oh

arXiv:2209.08497·cs.LG·September 20, 2022

TL;DR

This paper empirically analyzes the inefficiencies of Top-k gradient sparsification in distributed deep learning on GPUs, highlighting communication bottlenecks and suggesting directions for more efficient methods.

Contribution

The paper provides an empirical evaluation of Top-k SGD's performance limitations on GPUs, offering insights for developing more efficient gradient sparsification techniques.

Findings

01 Top-k SGD is inefficient due to gradient sorting on GPUs.

02 Gradient sparsification reduces communication but has performance trade-offs.

03 Empirical analysis reveals bottlenecks in current sparsification methods.

Abstract

To train deep learning models faster, distributed training on multiple GPUs has become a very popular scheme in recent years. However, communication bandwidth is still a major bottleneck for training performance. To improve overall training performance, recent works have proposed gradient sparsification methods that significantly reduce communication traffic. Most of them, such as Top-k gradient sparsification (Top-k SGD), require gradient sorting to select meaningful gradients. However, Top-k SGD is limited in how much it can speed up overall training because gradient sorting is significantly inefficient on GPUs. In this paper, we conduct experiments that show the inefficiency of Top-k SGD and provide insight into its low performance. Based on observations from our empirical analysis, we plan to develop a high-performance gradient sparsification method as future work.
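To make the selection step concrete, here is a minimal sketch of Top-k gradient sparsification in Python with NumPy. It is not the authors' implementation: the function names `topk_sparsify` and `densify` are hypothetical, the gradient is assumed to be a single dense array, and the code runs on the CPU, so it does not reproduce the GPU sorting cost the paper measures. It only illustrates what "selecting the k most meaningful gradients" means, taking "meaningful" to mean largest in magnitude.

```python
import numpy as np


def topk_sparsify(grad, k):
    """Return (indices, values) of the k largest-magnitude gradient entries.

    Hypothetical helper for illustration; `grad` is any dense NumPy array.
    """
    flat = grad.ravel()
    if not 0 < k <= flat.size:
        raise ValueError("k must be between 1 and grad.size")
    # argpartition moves the k largest |g| into the last k slots
    # without fully sorting the array.
    idx = np.argpartition(np.abs(flat), flat.size - k)[-k:]
    return idx, flat[idx]


def densify(indices, values, size):
    """Rebuild a dense flat gradient from the (indices, values) pair."""
    out = np.zeros(size, dtype=values.dtype)
    out[indices] = values
    return out


# Example: keep 2 of 5 gradient entries.
grad = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
idx, vals = topk_sparsify(grad, k=2)
sparse_grad = densify(idx, vals, grad.size)
```

In a distributed setting, each worker would exchange only the `(indices, values)` pair instead of the full gradient, which is where the reduction in communication traffic comes from. The selection itself still has to examine every gradient entry, and that is the step the abstract identifies as inefficient on GPUs.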

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Brain Tumor Detection and Classification · Face and Expression Recognition

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent · Gradient Sparsification