Convolutional Neural Network Training with Distributed K-FAC

J. Gregory Pauloski; Zhao Zhang; Lei Huang; Weijia Xu; Ian T. Foster

arXiv:2007.00784·cs.LG·July 3, 2020

TL;DR

This paper presents a scalable distributed K-FAC method for training convolutional neural networks efficiently at large scale, demonstrating faster convergence than SGD on ResNet models.

Contribution

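It introduces a scalable distributed K-FAC design for CNN training that combines layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence.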

Findings

01

Converges faster than SGD on ImageNet-1k with ResNet-50.

02

Reaches the 75.9% MLPerf baseline accuracy in 18-25% less time than SGD.

03

Demonstrates scalability across GPU clusters.

Abstract

Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our…
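To make the idea of a K-FAC gradient preconditioner concrete, the following is a minimal NumPy sketch for a single fully connected layer. It is not the authors' implementation. The abstract does not give formulas, so the sketch assumes the standard K-FAC approximation in which a layer's Fisher block is the Kronecker product of an activation factor A and an output-gradient factor G, and it reads "inverse-free" as one plausible option: preconditioning through eigendecompositions of A and G instead of explicit matrix inverses. All function names, shapes, and the damping value are hypothetical.

```python
import numpy as np


def kronecker_factors(acts, grads_out):
    """Estimate the two Kronecker factors from one mini-batch.

    acts:      (batch, n_in)  layer inputs
    grads_out: (batch, n_out) gradients of the loss w.r.t. layer outputs
    """
    batch = acts.shape[0]
    A = acts.T @ acts / batch            # (n_in, n_in) activation factor
    G = grads_out.T @ grads_out / batch  # (n_out, n_out) gradient factor
    return A, G


def precondition_eigen(grad_w, A, G, damping=1e-3):
    """Apply (A kron G + damping * I)^-1 to grad_w without forming an inverse.

    Assumed formulation: rotate the gradient into the eigenbases of G and A,
    divide elementwise by the damped eigenvalue products, and rotate back.
    """
    v_a, q_a = np.linalg.eigh(A)  # A and G are symmetric, so eigh applies
    v_g, q_g = np.linalg.eigh(G)
    rotated = q_g.T @ grad_w @ q_a
    scaled = rotated / (np.outer(v_g, v_a) + damping)
    return q_g @ scaled @ q_a.T


def precondition_dense(grad_w, A, G, damping=1e-3):
    """Reference check: solve the full damped Kronecker system directly."""
    fisher = np.kron(A, G)
    vec = grad_w.flatten(order="F")  # column-major vec matches kron(A, G)
    sol = np.linalg.solve(fisher + damping * np.eye(fisher.shape[0]), vec)
    return sol.reshape(grad_w.shape, order="F")


# Toy data standing in for one layer's statistics (hypothetical sizes).
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 4))
grads_out = rng.normal(size=(32, 3))
grad_w = grads_out.T @ acts / 32  # (n_out, n_in) weight gradient

A, G = kronecker_factors(acts, grads_out)
fast = precondition_eigen(grad_w, A, G)
slow = precondition_dense(grad_w, A, G)
print("eigen form matches dense solve:", np.allclose(fast, slow))
```

The dense solve is included only as a correctness check. It builds an (n_in * n_out) square matrix, which is what the Kronecker factorization avoids. In a distributed setting, the per-layer eigendecompositions are the natural unit of work to assign to different workers, which is one way to picture the layer-wise distribution strategies the abstract mentions.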
