From $O(mn)$ to $O(r^2)$: Two-Sided Low-Rank Communication for Adam in Distributed Training with Memory Efficiency
Sizhe Dang, Jiaqi Shao, Xiaodong Zheng, Guang Dai, Yan Song, Haishan Ye

TL;DR
This paper introduces TSR-Adam, a low-rank communication method for distributed training that significantly reduces communication overhead while maintaining performance, enabling more efficient large-scale model pretraining.
Contribution
TSR-Adam extends low-rank communication to Adam updates with a two-sided approach, reducing communication complexity from $O(mn)$ to $O(r^2)$ and incorporating randomized SVD refresh for further efficiency.
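To make the complexity claim concrete, the per-step payload for a single weight matrix can be counted directly. The sketch below is illustrative only: the function name `payload_elements` is hypothetical, and the one-sided count assumes a left projection that yields an r x n object, which is one common form of one-sided low-rank synchronization rather than something this summary specifies.

```python
def payload_elements(m: int, n: int, r: int) -> dict:
    """Count the elements communicated per step for one m x n gradient matrix.

    Assumptions (not taken from the paper): the one-sided scheme sends an
    r x n projected gradient, and the two-sided scheme sends an r x r core.
    """
    return {
        "dense": m * n,      # full gradient, O(mn)
        "one_sided": r * n,  # left-projected gradient, O(rn)
        "two_sided": r * r,  # compact core, O(r^2)
    }


counts = payload_elements(m=4096, n=4096, r=128)
for scheme, elements in counts.items():
    print(f"{scheme}: {elements} elements")
```

For m = n = 4096 and r = 128 this gives 16,777,216, 524,288, and 16,384 elements respectively. These are per-matrix, per-step counts; they are not the end-to-end reduction factors reported under Findings.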
Findings
Reduces average communication by 13x during pretraining.
Achieves 25x reduction in communication during fine-tuning.
Maintains comparable model performance with significantly less communication.
Abstract
As foundation models continue to scale, pretraining increasingly relies on data-parallel distributed optimization, making bandwidth-limited gradient synchronization a key bottleneck. Orthogonally, projection-based low-rank optimizers were mainly designed for memory efficiency, but remain suboptimal for communication-limited training: one-sided synchronization still transmits an $O(rn)$ object for an $m \times n$ matrix gradient, and refresh steps can dominate peak communicated bytes. We propose TSR, which brings two-sided low-rank communication to Adam-family updates (TSR-Adam) by synchronizing a compact $r \times r$ core, reducing the dominant per-step payload from $O(mn)$ to $O(r^2)$ while keeping moment states in low-dimensional cores. To further reduce the peak communication from subspace refresh, TSR-Adam adopts a randomized SVD-based refresh that avoids…
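The two-sided idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the bases `U` and `V` are random orthonormal placeholders (the abstract does not say here how the subspaces are chosen or refreshed), the helper names are hypothetical, and a plain average stands in for the all-reduce across workers. Adam moment updates are omitted.

```python
import numpy as np


def orthonormal_basis(rows: int, r: int, rng: np.random.Generator) -> np.ndarray:
    """Return a rows x r matrix with orthonormal columns (placeholder subspace)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, r)))
    return q


def two_sided_sync(local_grads, U, V):
    """Project each worker's m x n gradient to an r x r core, then average.

    The average is a stand-in for an all-reduce; only the r x r cores
    would need to be communicated in this sketch.
    """
    cores = [U.T @ g @ V for g in local_grads]
    return np.mean(cores, axis=0)


rng = np.random.default_rng(0)
m, n, r, workers = 64, 48, 4, 3

U = orthonormal_basis(m, r, rng)  # shared left basis, m x r
V = orthonormal_basis(n, r, rng)  # shared right basis, n x r
grads = [rng.standard_normal((m, n)) for _ in range(workers)]

core = two_sided_sync(grads, U, V)  # r x r object that is synchronized
lifted = U @ core @ V.T             # low-rank m x n update direction

print(core.shape, lifted.shape)
```

Because the projection is linear, averaging the cores gives the same result as projecting the averaged full gradient, which is why synchronizing only the core loses nothing beyond the projection itself.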
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Tensor decomposition and applications
