The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient
Jichu Li, Xuan Tang, Difan Zou

TL;DR
This paper analyzes how mini-batch stochastic steepest descent methods, including momentum and variance reduction, influence the implicit bias and convergence behavior in multi-class classification, revealing conditions for full-batch-like behavior and limitations of stochastic updates.
Contribution
It provides a unified, explicit analysis of the implicit bias and convergence rates of mini-batch stochastic steepest descent, including effects of batch size, momentum, and variance reduction.
Findings
Large batches are needed for convergence without momentum.
Momentum enables convergence with small batches, but at a slower rate.
Variance reduction can recover full-batch implicit bias for any batch size.
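The roles of batch size and momentum in the findings above can be made concrete with a toy sketch. It is not the paper's algorithm or experimental setup: it assumes binary logistic loss instead of the multi-class setting, the l-infinity (sign) geometry, and an exponential-moving-average momentum buffer. The function name `minibatch_sign_descent` and its parameters `batch_size`, `beta`, `lr`, and `steps` are hypothetical names chosen for illustration.

```python
import numpy as np

def minibatch_sign_descent(X, y, batch_size, beta, lr=0.01, steps=500, seed=0):
    """Toy mini-batch sign descent with momentum on binary logistic loss.

    X: (n, d) features, y: (n,) labels in {-1, +1}.
    Illustrative sketch only; the paper's setting is multi-class.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    m = np.zeros(d)  # momentum buffer
    for _ in range(steps):
        # Sample a mini-batch without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        margins = y[idx] * (X[idx] @ w)
        # Gradient of mean log(1 + exp(-margin)) over the mini-batch.
        coeff = -y[idx] / (1.0 + np.exp(margins))
        g = (coeff[:, None] * X[idx]).mean(axis=0)
        # beta = 0 recovers the momentum-free update.
        m = beta * m + (1.0 - beta) * g
        # Steepest descent step under the l-infinity norm.
        w -= lr * np.sign(m)
    return w
```

Calling it with `batch_size` equal to the dataset size and `beta=0` gives full-batch sign descent, while a small `batch_size` with `beta` close to 1 corresponds to the small-batch, momentum-driven regime described in the findings.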
Abstract
A variety of widely used optimization methods like SignSGD and Muon can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-p norms. We show that without momentum, convergence only occurs with large batches, yielding a batch-dependent margin gap but the full-batch convergence rate. In contrast, momentum enables small-batch convergence through a batch-momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit…
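To illustrate what "steepest descent under different norm-induced geometries" means, the sketch below computes the unit-norm direction D that maximizes the inner product with a gradient matrix under three norms. It is a minimal illustration, not the paper's code: the function name `steepest_direction` and the geometry labels are hypothetical, and a nonzero gradient is assumed. The sign direction is the SignSGD-style update, and the orthogonalized direction U V^T is the idealized form of a Muon-style update.

```python
import numpy as np

def steepest_direction(grad, geometry):
    """Return a unit-norm D maximizing <grad, D> under the chosen norm.

    Assumes grad is a nonzero 2-D array.
    """
    if geometry == "linf":
        # Entry-wise max norm: the maximizer is the sign pattern of the gradient.
        return np.sign(grad)
    if geometry == "frobenius":
        # Entry-wise l2 (Frobenius) norm: the normalized gradient.
        return grad / np.linalg.norm(grad)
    if geometry == "spectral":
        # Spectral (Schatten-infinity) norm: orthogonalize the gradient via its SVD.
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown geometry: {geometry}")

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 4))
for name in ("linf", "frobenius", "spectral"):
    D = steepest_direction(G, name)
    # <G, D> equals the dual norm of G: l1, Frobenius, and nuclear norm respectively.
    print(name, float(np.sum(G * D)))
```

In each case the attained inner product equals the dual norm of the gradient, which is why the choice of norm changes the update direction and not just the step length.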
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Metaheuristic Optimization Algorithms Research · Privacy-Preserving Technologies in Data
