Auto-Precision Scaling for Distributed Deep Learning

Ruobing Han; James Demmel; Yang You

arXiv:1911.08907·cs.DC·May 18, 2021

TL;DR

Auto-Precision Scaling (APS) makes low-precision gradient communication accurate enough for distributed deep learning: it reduces the bandwidth needed to synchronize gradients with little or no loss in model accuracy, and it is implemented in an open-source system integrated with PyTorch.

Contribution

The paper introduces APS, a novel algorithm that improves low-precision gradient accuracy in distributed training, along with a hybrid-precision technique and an open-source simulation system.

Findings

01  For many applications, APS trains state-of-the-art models with 8-bit gradients at no or only a tiny accuracy loss (<0.05%).

02  APS provides significant speedup over existing methods.

03  The CPD system allows flexible simulation of low-precision training.

Abstract

The communication cost of synchronizing gradients has been reported to be a bottleneck that limits the scalability of distributed deep learning. Using low-precision gradients is a promising technique for reducing the bandwidth requirement. In this work, we propose Auto-Precision Scaling (APS), an algorithm that improves accuracy when gradients are communicated as low-precision floating-point values. APS improves accuracy for all precisions at a trivial communication cost. Our experimental results show that for many applications, APS can train state-of-the-art models with 8-bit gradients at no or only a tiny accuracy loss (<0.05%). Furthermore, we can avoid any accuracy loss by designing a hybrid-precision technique. Finally, we propose a performance model to evaluate the proposed method. Our experimental results show that APS can get a significant speedup over…
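The abstract does not spell out how APS chooses its scaling, so the following is only an illustrative sketch of the general idea of scaling gradients before sending them in a low-precision format. It assumes a toy 8-bit float (1 sign bit, 5 exponent bits, 2 mantissa bits, no subnormals) and a single power-of-two scale derived from the largest gradient magnitude across workers. The functions `quantize`, `choose_scale`, and `low_precision_sum` are hypothetical names made up for this example; they are not part of CPD or PyTorch.

```python
import math

def quantize(x, exp_bits=5, man_bits=2):
    """Round x to a toy float format (assumption: no subnormals, no inf/nan).
    Values below the smallest normal flush to zero; overflow saturates."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    min_exp, max_exp = 1 - bias, bias
    m, e = math.frexp(abs(x))          # abs(x) == m * 2**e with 0.5 <= m < 1
    exp = e - 1                        # rewrite as (2*m) * 2**exp, 1 <= 2*m < 2
    if exp < min_exp:
        return 0.0                     # underflow: the gradient is lost
    steps = 2 ** man_bits
    mant = round(m * 2 * steps) / steps
    max_value = (2 - 1 / steps) * 2.0 ** max_exp
    return math.copysign(min(mant * 2.0 ** exp, max_value), x)

def choose_scale(local_max_abs, world_size, exp_bits=5):
    """Pick a power-of-two scale that keeps the summed gradients in range.
    max() stands in for an all-reduce(max) over one scalar per worker."""
    bias = 2 ** (exp_bits - 1) - 1
    global_max = max(local_max_abs)
    if global_max == 0.0:
        return 1.0
    _, e = math.frexp(global_max * world_size)   # worst-case sum < 2**e
    return 2.0 ** (bias - e)

def low_precision_sum(worker_grads, scale):
    """Element-wise sum across workers with every value held in the toy format."""
    total = []
    for values in zip(*worker_grads):
        s = quantize(sum(quantize(v * scale) for v in values))
        total.append(s / scale)        # undo the scaling in full precision
    return total

# Two workers, two gradient elements each (made-up values for illustration).
worker_grads = [[1e-6, -3e-6], [2e-6, 5e-7]]
local_max = [max(abs(v) for v in g) for g in worker_grads]
scale = choose_scale(local_max, world_size=len(worker_grads))

unscaled = low_precision_sum(worker_grads, 1.0)   # small values underflow
scaled = low_precision_sum(worker_grads, scale)   # values survive quantization
print(unscaled)
print(scaled)
```

In this toy setting the unscaled gradients fall below the smallest representable magnitude and vanish, while the scaled ones come back with only the rounding error of a 2-bit mantissa. The only extra communication in the sketch is one scalar maximum per worker, which is one plausible reading of a "trivial communication cost"; the paper's actual procedure may differ.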

Code & Models

Repositories

drcut/CPD (PyTorch, official)
