Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks
Shaohuai Shi, Lin Zhang, Bo Li

TL;DR
This paper introduces SPD-KFAC, a novel method that enhances distributed second-order optimization for deep learning by smartly parallelizing computation and communication, significantly reducing training time on GPU clusters.
Contribution
The paper proposes SPD-KFAC, which optimizes distributed KFAC training through pipelining and load balancing, addressing performance bottlenecks and improving efficiency.
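As a rough illustration of what load balancing can mean here, the sketch below spreads per-layer workloads (for example, factor inversions) across workers with a generic largest-first greedy heuristic. This is an assumption made for illustration, not necessarily the placement algorithm used in the paper; the function name `assign_tasks` and the cost values are hypothetical.

```python
import heapq

def assign_tasks(costs, num_workers):
    """Greedy largest-first placement (generic LPT heuristic, illustrative only).

    costs:       hypothetical per-task cost estimates, one entry per task
    num_workers: number of workers (e.g. GPUs) to spread the tasks over
    Returns a dict mapping task index -> worker index.
    """
    # Min-heap of (current load, worker id); the least-loaded worker is popped first.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    placement = {}
    # Visit tasks from most to least expensive.
    for task in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, worker = heapq.heappop(heap)
        placement[task] = worker
        heapq.heappush(heap, (load + costs[task], worker))
    return placement

# Usage with made-up costs: five tasks spread over two workers.
placement = assign_tasks([8, 1, 1, 4, 2], num_workers=2)
```

The heuristic keeps the most expensive tasks apart, so no single worker ends up with several large workloads while others sit idle.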
Findings
Achieves 10%-35% faster training compared to existing methods.
Effectively reduces iteration time in distributed second-order optimization.
Demonstrates scalability on a 64-GPU cluster with high-speed interconnect.
Abstract
Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD uses only first-order gradient information in model parameter updates, so training may still take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, among which Kronecker-Factored Approximate Curvature (KFAC) has emerged as one of the most efficient approximation algorithms for training deep models. Yet training models on GPU clusters with distributed KFAC (D-KFAC) incurs extensive computation and introduces extra communication in each iteration. In this work, we propose D-KFAC with smart parallelism of computing and communication tasks (SPD-KFAC) to reduce the iteration time. Specifically, 1) we first characterize the…
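For readers unfamiliar with KFAC, the following is a minimal single-layer, single-process sketch of Kronecker-factored preconditioning, assuming a fully connected layer and NumPy. It does not show the distributed scheme proposed in the paper; the function name `kfac_precondition` and the `damping` parameter are hypothetical choices made for illustration.

```python
import numpy as np

def kfac_precondition(grad_w, acts, grads_out, damping=1e-3):
    """Precondition one linear layer's gradient with Kronecker factors.

    grad_w:    (out, in) gradient of the loss w.r.t. the weight matrix
    acts:      (batch, in) inputs to the layer
    grads_out: (batch, out) gradients w.r.t. the layer's outputs
    damping:   hypothetical damping term added to each factor
    """
    batch = acts.shape[0]
    # Kronecker factors: second moments of the inputs and of the output gradients.
    A = acts.T @ acts / batch            # (in, in)
    G = grads_out.T @ grads_out / batch  # (out, out)
    A_damped = A + damping * np.eye(A.shape[0])
    G_damped = G + damping * np.eye(G.shape[0])
    # (A kron G)^{-1} vec(grad_w) equals vec(G^{-1} grad_w A^{-1}),
    # so the two small factors are inverted instead of one large matrix.
    return np.linalg.solve(G_damped, grad_w) @ np.linalg.inv(A_damped)

# Usage with random data: batch of 8, 3 inputs, 2 outputs.
rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 3))
grads_out = rng.standard_normal((8, 2))
grad_w = grads_out.T @ acts / 8
update = kfac_precondition(grad_w, acts, grads_out)
```

Building and inverting these factors for every layer is the extra per-iteration computation the abstract mentions, and keeping them consistent across GPUs is a source of the extra communication.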
Taxonomy
Topics: Advanced Neuroimaging Techniques and Applications · Tensor decomposition and applications · Advanced Neural Network Applications
Methods: Stochastic Gradient Descent
