MARS: Unleashing the Power of Variance Reduction for Training Large Models
Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, Quanquan Gu

TL;DR
This paper introduces MARS, a unified framework that combines variance reduction with preconditioned gradient methods, significantly improving the efficiency of training large neural models like GPT-2.
Contribution
The paper proposes MARS, a novel optimization framework that integrates variance reduction with preconditioned gradient methods for large-scale neural network training.
Findings
MARS outperforms AdamW in training GPT-2 models.
The framework is instantiated in three variants, built on AdamW, Lion, and Shampoo respectively.
In the reported experiments, MARS improves on existing optimizers by a large margin.
Abstract
Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce…
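To make the phrase "scaled stochastic recursive momentum" combined with a preconditioner more concrete, here is a minimal sketch on a toy noisy quadratic. It is an illustration of the general idea only, not the update rule from the paper: the truncated abstract above does not give the formulas, so the scaling factor `gamma * beta1 / (1 - beta1)`, the Adam-style diagonal preconditioner, the toy loss, and every name and default value below (`run_sketch`, `TARGET`, `gamma`, and so on) are assumptions made for this example.

```python
import math
import random

TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum of the toy problem


def stochastic_grad(x, noise):
    # Gradient of the toy loss sum_i 0.5 * (x_i - TARGET_i - noise_i)^2.
    return [xi - ti - ni for xi, ti, ni in zip(x, TARGET, noise)]


def run_sketch(steps=500, lr=0.05, beta1=0.9, beta2=0.99, gamma=0.025,
               eps=1e-8, noise_std=0.1, seed=0):
    rng = random.Random(seed)
    dim = len(TARGET)
    x = [0.0] * dim
    x_prev = list(x)          # previous iterate, needed for the correction
    m = [0.0] * dim           # first-moment (momentum) estimate
    v = [0.0] * dim           # second-moment estimate (diagonal preconditioner)
    scale = gamma * beta1 / (1.0 - beta1)  # assumed scaling of the correction

    for t in range(1, steps + 1):
        # Draw ONE noise sample and evaluate the gradient at both the current
        # and the previous iterate with it; the difference is the
        # variance-reduction correction term.
        noise = [rng.gauss(0.0, noise_std) for _ in range(dim)]
        g_now = stochastic_grad(x, noise)
        g_old = stochastic_grad(x_prev, noise)
        c = [gn + scale * (gn - go) for gn, go in zip(g_now, g_old)]

        x_prev = list(x)
        for i in range(dim):
            # Adam-style moments computed on the corrected gradient c.
            m[i] = beta1 * m[i] + (1.0 - beta1) * c[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * c[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x
```

Calling `run_sketch()` returns the final iterate, which should land near `TARGET`. Setting `gamma=0.0` switches the correction off, so the loop falls back to a plain Adam-style update; that makes it easy to compare the two on the same noise sequence.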
Taxonomy
Topics: Machine Learning and Data Classification · Reservoir Engineering and Simulation Methods
Methods: Attention Is All You Need · Layer Normalization · Dropout · Cosine Annealing · Adam · Residual Connection · Weight Decay · Byte Pair Encoding · Linear Warmup With Cosine Annealing
