Beyond Gradient Averaging in Parallel Optimization: Improved Robustness through Gradient Agreement Filtering

Francois Chaubard; Duncan Eddy; Mykel J. Kochenderfer

arXiv:2412.18052 · cs.LG · December 31, 2024

TL;DR

Gradient Agreement Filtering (GAF) improves distributed deep learning by filtering out conflicting microbatch gradients, identified by their cosine similarity, before they are averaged. This yields better generalization, higher accuracy, and reduced computation while allowing smaller microbatches.

Contribution

This paper introduces Gradient Agreement Filtering, a novel method that filters conflicting microgradients to improve robustness and efficiency in distributed training.

Findings

01 Outperforms traditional gradient averaging in accuracy.

02 Enables smaller microbatch sizes without training instability.

03 Reduces training computation nearly tenfold.

Abstract

We introduce Gradient Agreement Filtering (GAF) to improve on gradient averaging in distributed deep learning optimization. Traditional distributed data-parallel stochastic gradient descent involves averaging gradients of microbatches to calculate a macrobatch gradient that is then used to update model parameters. We find that gradients across microbatches are often orthogonal or negatively correlated, especially in late stages of training, which leads to memorization of the training set, reducing generalization. In this paper, we introduce a simple, computationally effective way to reduce gradient variance by computing the cosine distance between micro-gradients during training and filtering out conflicting updates prior to averaging. We improve validation accuracy with significantly smaller microbatch sizes. We also show this reduces memorizing noisy labels. We demonstrate the…
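
The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold `tau` and its default value, the choice to compare each micro-gradient against a running average, and the rule of skipping the update when fewer than two micro-gradients agree are all assumptions made for illustration. Gradients are represented as flat lists of floats.

```python
import math


def cosine_distance(a, b, eps=1e-12):
    """Return 1 - cosine similarity of two flat gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, eps)


def gaf_aggregate(micro_grads, tau=0.97):
    """Average only the micro-gradients that agree with the running average.

    `tau` is a hypothetical cosine-distance threshold (assumption): a
    micro-gradient is folded in only if its distance to the current
    aggregate is at most `tau`. Returns (aggregate, number_kept); the
    aggregate is None when fewer than two micro-gradients agree, which
    this sketch treats as "skip the parameter update".
    """
    agg = [float(x) for x in micro_grads[0]]
    kept = 1
    for g in micro_grads[1:]:
        if cosine_distance(agg, g) <= tau:
            # Incremental mean over the micro-gradients kept so far.
            agg = [(a * kept + x) / (kept + 1) for a, x in zip(agg, g)]
            kept += 1
    if kept < 2:
        return None, kept
    return agg, kept


if __name__ == "__main__":
    # Two roughly aligned micro-gradients and one pointing the opposite way.
    grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
    print(gaf_aggregate(grads))
```

In this toy input the third micro-gradient points almost exactly opposite to the running average, so it is filtered out and only the first two are averaged. Plain gradient averaging would instead let it cancel most of the update.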

Code & Models

Repositories

Fchaubard/gradient_agreement_filtering
PyTorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Optimization and Search Problems · Advanced Bandit Algorithms Research