Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

Shaohuai Shi; Lin Zhang; Bo Li

arXiv:2107.06533·cs.DC·July 15, 2021

TL;DR

This paper introduces SPD-KFAC, a novel method that enhances distributed second-order optimization for deep learning by smartly parallelizing computation and communication, significantly reducing training time on GPU clusters.

Contribution

The paper proposes SPD-KFAC, which optimizes distributed KFAC training through pipelining and load balancing, addressing performance bottlenecks and improving efficiency.

Findings

1. Achieves 10%-35% faster training compared to existing methods.

2. Effectively reduces iteration time in distributed second-order optimization.

3. Demonstrates scalability on a 64-GPU cluster with high-speed interconnect.

Abstract

Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training of deep models. However, SGD uses only first-order gradient information in model parameter updates, so training may take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up training, and Kronecker-Factored Approximate Curvature (KFAC) has emerged as one of the most efficient approximation algorithms for training deep models. Yet training models with distributed KFAC (D-KFAC) on GPU clusters incurs extensive computation and introduces extra communication in each iteration. In this work, we propose SPD-KFAC, a D-KFAC scheme with smart parallelism of computing and communication tasks, to reduce the iteration time. Specifically, 1) we first characterize the…
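To make the Kronecker-factored approximation mentioned in the abstract concrete, here is a minimal single-device sketch for one fully connected layer. It is not the paper's SPD-KFAC implementation and contains no distributed logic. The function name `kfac_precondition`, the argument names, and the choice of adding the damping term to each factor separately are assumptions made for illustration; scaling and damping conventions differ between KFAC implementations.

```python
import numpy as np

def kfac_precondition(acts, grads_out, damping=1e-3):
    """Illustrative KFAC-style preconditioning for one linear layer.

    acts:      (batch, in_dim) layer inputs
    grads_out: (batch, out_dim) per-sample gradients w.r.t. layer outputs
    Returns the preconditioned weight gradient, shape (out_dim, in_dim).
    """
    n = acts.shape[0]
    # Kronecker factors: covariance of inputs and of output gradients.
    A = acts.T @ acts / n            # (in_dim, in_dim)
    G = grads_out.T @ grads_out / n  # (out_dim, out_dim)
    # Ordinary first-order weight gradient for the layer.
    grad_W = grads_out.T @ acts / n  # (out_dim, in_dim)
    # Assumption: damping is added to each factor separately.
    A_d = A + damping * np.eye(A.shape[0])
    G_d = G + damping * np.eye(G.shape[0])
    # (A_d kron G_d)^{-1} vec(grad_W) == vec(G_d^{-1} grad_W A_d^{-1}),
    # so two small solves replace one solve of size in_dim * out_dim.
    left = np.linalg.solve(G_d, grad_W)
    return np.linalg.solve(A_d, left.T).T

rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 3))
grads_out = rng.standard_normal((8, 2))
update = kfac_precondition(acts, grads_out, damping=0.1)
print(update.shape)
```

The point of the factorization is cost: the two solves act on matrices of size `in_dim` and `out_dim` rather than on one matrix of size `in_dim * out_dim`. In a distributed setting the factors would additionally have to be aggregated across workers, which is the kind of extra communication the abstract refers to.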

Taxonomy

Topics: Advanced Neuroimaging Techniques and Applications · Tensor decomposition and applications · Advanced Neural Network Applications

Methods: Stochastic Gradient Descent