Peering Beyond the Gradient Veil with Distributed Auto Differentiation
Bradley T. Baker, Aashis Khanal, Vince D. Calhoun, Barak Pearlmutter, Sergey M. Plis

TL;DR
This paper introduces distributed auto-differentiation (dAD), a communication-efficient method for training distributed deep neural networks that exploits the outer-product structure of gradients.
Contribution
The paper presents dAD, a new distributed training algorithm that leverages gradient structure for improved communication efficiency over traditional gradient-sharing methods.
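The outer-product structure mentioned above can be illustrated with a single dense layer. The sketch below is not the paper's dAD algorithm; it only checks the standard identity that a dense layer's weight gradient is the product of the transposed layer inputs and the backpropagated deltas, so that stacking two sites' factors reproduces the sum of their gradients. The helper names (`local_factors`, `layer_gradient`), the layer sizes, and the random integer data are hypothetical, and treating the factors as the shared statistic is an assumption made for illustration.

```python
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def local_factors(batch_size, d_in, d_out, rng):
    # Hypothetical stand-ins for one site's layer inputs A (batch x d_in)
    # and backpropagated deltas D (batch x d_out). Integers keep the
    # arithmetic exact so the two routes can be compared with ==.
    acts = [[rng.randint(-3, 3) for _ in range(d_in)] for _ in range(batch_size)]
    deltas = [[rng.randint(-3, 3) for _ in range(d_out)] for _ in range(batch_size)]
    return acts, deltas


def layer_gradient(acts, deltas):
    # Weight gradient of a dense layer: A^T D, i.e. a sum over the batch
    # of outer products between each input row and its delta row.
    return matmul(transpose(acts), deltas)


rng = random.Random(0)
d_in, d_out, batch = 64, 32, 4
sites = [local_factors(batch, d_in, d_out, rng) for _ in range(2)]

# Route 1 (gradient sharing): each site forms its d_in x d_out gradient
# locally, and the aggregator sums the two matrices.
grad_shared = add(layer_gradient(*sites[0]), layer_gradient(*sites[1]))

# Route 2 (factor sharing, assumed here): each site sends A and D, the
# aggregator stacks the rows and multiplies once.
stacked_acts = sites[0][0] + sites[1][0]
stacked_deltas = sites[0][1] + sites[1][1]
grad_from_factors = layer_gradient(stacked_acts, stacked_deltas)

# Numbers each site would have to transmit for this one layer.
values_gradient = d_in * d_out           # full gradient matrix
values_factors = batch * (d_in + d_out)  # the two thin factors

print(grad_from_factors == grad_shared)
print(values_gradient, values_factors)
```

In this toy setting the factors are smaller than the gradient only because the per-site batch size is small relative to the layer width; with a large enough batch the comparison reverses.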
Findings
dAD trains more efficiently than state-of-the-art methods on transformers.
dAD reduces communication overhead in distributed deep learning.
dAD is effective on large-scale text and imaging datasets.
Abstract
Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs) tends to be communication-heavy, often requiring additional adaptations such as sparsity constraints, compression, quantization, and more, to curtail bandwidth. We introduce an innovative, communication-friendly approach for training distributed DNNs, which capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation. The exposed structure of the…
Taxonomy
Topics: AI in cancer detection · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
