PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model Training

Matan Haroush; Daniel Soudry

arXiv:2505.18313·cs.LG·May 27, 2025

TL;DR

PLUMAGE is a probabilistic low-rank gradient estimator that is unbiased and has low variance, improving the stability and performance of large model training without introducing hyperparameters beyond the rank r and the projection update interval.

Contribution

It introduces an unbiased, low-variance low-rank gradient estimator and addresses the optimizer state misalignment that occurs when the projection is updated, making large language model training more efficient.

Findings

01

Reduces the gap to full-rank optimization by 33% on average.

02

Improves average training loss on the GLUE benchmark by 28%.

03

Operates within a computational and memory footprint similar to that of existing methods.

Abstract

Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models (LLMs) with billions of parameters. Existing low-rank gradient estimators such as GaLoRE and FLORA compress gradients and optimizer tensors by projecting weight gradients onto a rank-r subspace, enabling LLM training on consumer hardware. Yet these methods are either biased or subject to high estimator variance. Moreover, the optimizer state, which holds first- and second-moment estimates expressed in the previous subspace, becomes misaligned whenever the projection is updated, leading to instabilities during training. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. PLUMAGE is a drop-in replacement for existing low-rank gradient estimators. It does not introduce new hyperparameters beyond the chosen rank r and the update interval.…
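
To make the bias versus variance trade-off in the abstract concrete, the sketch below contrasts two generic rank-r gradient compressors in NumPy: a deterministic top-r SVD projection (in the spirit of GaLoRE, biased) and a Gaussian random projection (in the spirit of FLORA, unbiased but noisy). This is an illustrative assumption only. It is not PLUMAGE's estimator, nor the exact GaLoRE or FLORA implementation, and the function names and the toy 8x6 gradient are hypothetical.

```python
import numpy as np


def topr_estimate(grad, r):
    """Deterministic rank-r estimate: project onto the top-r left singular vectors.

    Assumption: a GaLoRE-like projection. The result is biased because the
    discarded singular directions are never recovered.
    """
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    p = u[:, :r]                 # (m, r) projection basis
    low_rank = p.T @ grad        # (r, n) compressed gradient
    return p @ low_rank          # (m, n) reconstruction


def random_estimate(grad, r, rng):
    """Randomized rank-r estimate using a Gaussian projection.

    Assumption: a FLORA-like projection. Entries of p have variance 1/r,
    so E[p @ p.T] is the identity and the estimate is unbiased.
    """
    m = grad.shape[0]
    p = rng.standard_normal((m, r)) / np.sqrt(r)
    return p @ (p.T @ grad)


def relative_error(estimate, grad):
    """Frobenius-norm error of an estimate relative to the true gradient."""
    return np.linalg.norm(estimate - grad) / np.linalg.norm(grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal((8, 6))   # toy stand-in for a weight gradient
    r = 2

    deterministic = topr_estimate(grad, r)
    single_draw = random_estimate(grad, r, rng)
    averaged = np.mean(
        [random_estimate(grad, r, rng) for _ in range(5000)], axis=0
    )

    print("top-r error (bias, never averages out):", relative_error(deterministic, grad))
    print("random, single draw (high variance):  ", relative_error(single_draw, grad))
    print("random, mean of 5000 draws (unbiased): ", relative_error(averaged, grad))
```

In this toy setting the top-r error is fixed no matter how many times the estimate is recomputed, whereas the random estimate's error shrinks only when many draws are averaged. That gap is what an unbiased estimator with lower variance aims to close.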
