MARS: Unleashing the Power of Variance Reduction for Training Large Models

Huizhuo Yuan; Yifeng Liu; Shuang Wu; Xun Zhou; Quanquan Gu

arXiv:2411.10438 · cs.LG · September 5, 2025

TL;DR

This paper introduces MARS, a unified framework that combines variance reduction with preconditioned gradient methods, significantly improving the efficiency of training large neural models like GPT-2.

Contribution

The paper proposes MARS, a novel optimization framework that integrates variance reduction with preconditioned gradient methods for large-scale neural network training.

Findings

01 MARS outperforms AdamW in training GPT-2 models.

02 Three instances of MARS are introduced, based on AdamW, Lion, and Shampoo.

03 Experiments show that MARS improves on existing optimizers by a large margin.

Abstract

Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce…
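To make "scaled stochastic recursive momentum" combined with a preconditioner more concrete, here is a minimal sketch in plain Python. The abstract above is truncated and does not give the update rule, so everything in the block is an assumption made for illustration: the function name `mars_style_step`, the correction scaling `gamma * beta1 / (1 - beta1)`, the unit-norm clipping, the hyperparameter values, and the toy quadratic objective in `run_demo`. It should be read as one plausible AdamW-flavoured instance of the idea, not as the authors' reference implementation.

```python
import math
import random


def mars_style_step(x, x_prev, m, v, t, grad_fn, sample,
                    lr=0.05, beta1=0.9, beta2=0.99, gamma=0.025,
                    eps=1e-8, weight_decay=0.0):
    """One hypothetical MARS-style step with an AdamW-like preconditioner."""
    # Evaluate the gradient at the current and previous iterate on the SAME sample.
    g = grad_fn(x, sample)
    g_prev = grad_fn(x_prev, sample)

    # Corrected gradient: stochastic gradient plus a scaled gradient difference.
    scale = gamma * beta1 / (1.0 - beta1)  # assumed scaling, not taken from this page
    c = [gi + scale * (gi - gpi) for gi, gpi in zip(g, g_prev)]

    # Clip the corrected gradient to unit Euclidean norm (assumed safeguard).
    norm = math.sqrt(sum(ci * ci for ci in c))
    if norm > 1.0:
        c = [ci / norm for ci in c]

    # Adam-style first and second moment estimates of the corrected gradient.
    m = [beta1 * mi + (1.0 - beta1) * ci for mi, ci in zip(m, c)]
    v = [beta2 * vi + (1.0 - beta2) * ci * ci for vi, ci in zip(v, c)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]

    # Preconditioned update with decoupled weight decay.
    x_new = [xi - lr * (mh / (math.sqrt(vh) + eps) + weight_decay * xi)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x_new, m, v


def run_demo(steps=600, seed=0):
    """Minimise E[0.5 * ||x - xi||^2], where xi is a noisy copy of `target`."""
    rng = random.Random(seed)
    target = [1.0, 2.0]

    def grad_fn(x, xi):
        # Gradient of 0.5 * ||x - xi||^2 with respect to x.
        return [a - b for a, b in zip(x, xi)]

    x = [5.0, -3.0]
    x_prev = list(x)  # no previous iterate yet, so the first correction is zero
    m, v = [0.0, 0.0], [0.0, 0.0]
    init_dist = math.dist(x, target)
    for t in range(1, steps + 1):
        sample = [rng.gauss(mu, 0.5) for mu in target]
        x_next, m, v = mars_style_step(x, x_prev, m, v, t, grad_fn, sample)
        x_prev, x = x, x_next
    return init_dist, math.dist(x, target)
```

Calling `run_demo()` returns the distance to the target before and after training, which gives a quick check that the iterate moves toward the minimiser. The design point the sketch tries to show is that the gradient difference is computed on a single shared sample, so the correction cancels part of the sampling noise before the moment estimates and the preconditioner see it. On this toy quadratic the difference reduces to `x - x_prev`, so the example shows the mechanics of the update rather than any benefit in practice.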

Code & Models

Models

🤗 rwightman/timm-optim-caution · model · ♡ 9

Videos

MARS: Unleashing the Power of Variance Reduction for Training Large Models · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification · Reservoir Engineering and Simulation Methods

Methods: Attention Is All You Need · Layer Normalization · Dropout · Cosine Annealing · Adam · Residual Connection · Weight Decay · Byte Pair Encoding · Linear Warmup With Cosine Annealing