TL;DR
This paper presents a distributed K-FAC design for training convolutional neural networks at scale and shows that it converges faster than SGD on ResNet models.
Contribution
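It introduces a scalable K-FAC design for CNN training that combines layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence.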
Findings
Converges faster than SGD on ImageNet-1k with ResNet-50.
Reaches the 75.9% MLPerf accuracy baseline in 18-25% less time than SGD.
Demonstrates scalability across GPU clusters.
Abstract
Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling to reduce training time while preserving convergence. We use residual neural networks (ResNet) applied to the CIFAR-10 and ImageNet-1k datasets to evaluate the correctness and scalability of our K-FAC gradient preconditioner. With ResNet-50 on the ImageNet-1k dataset, our…
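To make the preconditioning step in the abstract concrete, here is a minimal sketch for a single fully connected layer. It is not the paper's implementation. The function names `kfac_factors` and `kfac_precondition`, the damping value, and the use of NumPy are assumptions made for illustration. The sketch builds the two Kronecker factors from a batch and applies the damped preconditioner through eigendecompositions of the factors rather than explicit matrix inverses, which is one way to read "inverse-free" evaluation; the paper's exact formulation may differ.

```python
import numpy as np


def kfac_factors(a, g):
    """Estimate the two Kronecker factors for one linear layer (hypothetical helper).

    a: (batch, in)  layer inputs (activations)
    g: (batch, out) gradients of the loss w.r.t. the layer outputs
    """
    n = a.shape[0]
    A = a.T @ a / n  # (in, in) input covariance factor
    G = g.T @ g / n  # (out, out) output-gradient covariance factor
    return A, G


def kfac_precondition(grad_W, A, G, damping=1e-3):
    """Apply (A kron G + damping * I)^-1 to grad_W without forming an inverse.

    grad_W: (out, in) gradient of the loss w.r.t. the weight matrix.
    """
    # Eigendecompose the symmetric factors.
    vA, QA = np.linalg.eigh(A)
    vG, QG = np.linalg.eigh(G)
    # Rotate the gradient into the eigenbases of both factors.
    V1 = QG.T @ grad_W @ QA
    # Scale each entry by the damped product of the matching eigenvalues.
    V2 = V1 / (np.outer(vG, vA) + damping)
    # Rotate back to the original parameter space.
    return QG @ V2 @ QA.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((32, 4))   # batch of 32 inputs, 4 features
    g = rng.standard_normal((32, 3))   # matching output gradients, 3 units
    grad_W = g.T @ a / 32              # (3, 4) batch-averaged weight gradient
    A, G = kfac_factors(a, g)
    print(kfac_precondition(grad_W, A, G).shape)
```

The result `X` satisfies `G @ X @ A + damping * X == grad_W` up to floating-point error, which is the matrix form of solving against the damped Kronecker product. Working with the small per-layer factors keeps the cost far below that of forming the full Fisher approximation.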
