GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models
Tolga Dimlioglu, Anna Choromanska

TL;DR
This paper introduces GRAWA, a gradient-based weighted averaging algorithm for distributed deep learning that improves convergence speed and model quality by prioritizing flat regions in the optimization landscape.
Contribution
It proposes a novel weighted averaging method with theoretical convergence guarantees and demonstrates superior empirical performance over existing methods.
Findings
Faster convergence compared to baseline methods
Achieves better quality and flatter local optima
Requires less communication and fewer updates
Abstract
We study distributed training of deep learning models in time-constrained environments. We propose a new algorithm that periodically pulls workers towards the center variable computed as a weighted average of workers, where the weights are inversely proportional to the gradient norms of the workers such that recovering the flat regions in the optimization landscape is prioritized. We develop two asynchronous variants of the proposed algorithm that we call Model-level and Layer-level Gradient-based Weighted Averaging (resp. MGRAWA and LGRAWA), which differ in terms of the weighting scheme that is either done with respect to the entire model or is applied layer-wise. On the theoretical front, we prove the convergence guarantee for the proposed approach in both convex and non-convex settings. We then experimentally demonstrate that our algorithms outperform the competitor methods by…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Medical Imaging and Analysis · Brain Tumor Detection and Classification
