MARS-M: When Variance Reduction Meets Matrices

Yifeng Liu; Angela Yuan; Quanquan Gu

arXiv:2510.21800·cs.LG·February 2, 2026

TL;DR

MARS-M is an optimizer that combines MARS-style variance reduction with Muon's matrix-based preconditioning. It converges faster in theory and reaches lower losses when training large neural networks on language modeling and computer vision tasks.

Contribution

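This paper introduces MARS-M, which integrates MARS-style variance reduction with Muon, and proves that under standard regularity conditions it converges to a first-order stationary point at a rate of $\tilde{O}(T^{-1/3})$, improving on Muon's $\tilde{O}(T^{-1/4})$.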
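
To make the combination concrete, here is a minimal NumPy sketch of one plausible MARS-M-style step. It is an illustration under stated assumptions, not the authors' implementation: the names `newton_schulz` and `mars_m_step`, the hyperparameter defaults, the clipping threshold of 1, and the plain cubic Newton-Schulz iteration (swapped in for the tuned polynomial that Muon implementations typically use) are all choices made here. This summary does not spell out the update rule, so the exact form in the paper may differ.

```python
import numpy as np


def newton_schulz(G, steps=10, eps=1e-7):
    """Approximately orthogonalize G by pushing its singular values toward 1.

    Uses the classical cubic iteration X <- 1.5 X - 0.5 X X^T X
    (an assumption of this sketch, not necessarily the paper's choice).
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the cubic iteration inside its convergence region.
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X


def mars_m_step(X, X_prev, M, grad_fn, batch,
                lr=0.02, beta=0.95, gamma=0.025, ns_steps=10):
    """One hypothetical MARS-M-style update for a weight matrix X.

    grad_fn(W, batch) must return the stochastic gradient at W on `batch`.
    All default hyperparameters are illustrative assumptions.
    """
    # Evaluate the SAME mini-batch at the current and previous iterates.
    g_cur = grad_fn(X, batch)
    g_old = grad_fn(X_prev, batch)

    # MARS-style variance-reduced gradient: the stochastic gradient plus a
    # scaled correction built from the gradient difference.
    C = g_cur + gamma * (beta / (1.0 - beta)) * (g_cur - g_old)

    # Clip the corrected gradient to Frobenius norm at most 1 (assumed).
    c_norm = np.linalg.norm(C)
    if c_norm > 1.0:
        C = C / c_norm

    # Momentum as an exponential moving average of the corrected gradient.
    M_new = beta * M + (1.0 - beta) * C

    # Muon-style step: move along the orthogonalized momentum.
    X_new = X - lr * newton_schulz(M_new, steps=ns_steps)
    return X_new, M_new
```

In a training loop the caller would keep the previous iterate around, for example `X_new, M = mars_m_step(X, X_prev, M, grad_fn, batch)` followed by `X_prev, X = X, X_new`. Evaluating the gradient twice per step roughly doubles the gradient cost of this sketch; it is written this way only to keep the correction term explicit.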

Findings

01

MARS-M converges at a rate of $\tilde{O}(T^{-1/3})$

02

MARS-M achieves lower losses in language modeling and vision tasks

03

Empirical results show improved downstream benchmark performance

Abstract

Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Recent benchmark studies of LLM pretraining optimizers have demonstrated that variance-reduction techniques such as MARS can substantially speed up training compared with standard optimizers that do not employ variance reduction. In this paper, we introduce MARS-M, a new optimizer that integrates MARS-style variance reduction with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{O}(T^{-1/3})$, improving upon the $\tilde{O}(T^{-1/4})$ rate attained by Muon. Empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and…
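
To see what the gap between the two rates means in iterations, the following back-of-the-envelope calculation may help. It is only a sketch: the constants $C_1, C_2$ and the target accuracy $\varepsilon$ are symbols introduced here, logarithmic factors hidden by $\tilde{O}$ are ignored, and the numeric example assumes both constants equal 1.

```latex
% Rate bound => iterations needed to reach accuracy \varepsilon
% (C_1, C_2, \varepsilon are illustrative symbols; log factors dropped).
\begin{align*}
  C_1\, T^{-1/3} \le \varepsilon
    &\iff T \ge \left(\frac{C_1}{\varepsilon}\right)^{3}
    && \text{(rate stated for MARS-M)} \\
  C_2\, T^{-1/4} \le \varepsilon
    &\iff T \ge \left(\frac{C_2}{\varepsilon}\right)^{4}
    && \text{(rate stated for Muon)} \\
  \varepsilon = 10^{-2},\; C_1 = C_2 = 1
    &\implies T \ge 10^{6} \ \text{versus}\ T \ge 10^{8}.
\end{align*}
```

Under these assumptions the better exponent saves a factor of $1/\varepsilon$ in the iteration bound. These are worst-case bounds, so they do not by themselves predict the size of the speedup seen in practice.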
