Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance

Robin Yadav; Shuo Xie; Tianhao Wang; Zhiyuan Li

arXiv:2512.00763 · cs.LG · December 2, 2025

TL;DR

This paper shows that sign descent, i.e., steepest descent with respect to the $ℓ_{\infty}$ norm, provably converges faster than gradient descent in a minimal next-token prediction setting with heavy-tailed class imbalance.

Contribution

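The paper proves, in a minimal next-token prediction setting, that sign descent converges faster than gradient descent when class frequencies are heavy-tailed. The benefit is derived directly from the data distribution rather than from an assumed smoothness condition on the loss.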

Findings

01

Sign descent converges faster than GD under heavy-tailed class imbalance.

02

The analysis is based on a minimal model of next-token prediction.

03

Heavy-tailed class imbalance, a property of the data distribution, is what the analysis uses to separate coordinate-wise methods such as sign descent from GD.

Abstract

Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages of adaptive algorithms with structured preconditioners, e.g., coordinate-wise or layer-wise, and steepest descent methods w.r.t. non-Euclidean norms, e.g., $ℓ_{\infty}$ norm or spectral norm, over GD. However, it remains unclear how these smoothness assumptions manifest in language modelling tasks. In this work, we aim to analyze the benefit of $ℓ_{\infty}$-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction, where we can provably show faster convergence of coordinate-wise algorithms such as sign descent…
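
To make the two update rules concrete, here is a minimal sketch in Python with NumPy. It is not the paper's model: the single-softmax setting, the Zipf-distributed target frequencies, the class count `K`, the helper names `loss_and_grad` and `run`, and both step-size schedules are assumptions chosen only for illustration. GD updates the logits along the raw gradient, while sign descent moves every coordinate by the same amount in the direction of the gradient's sign.

```python
import numpy as np

# Toy setting (an assumption for illustration, not the paper's model):
# one softmax over K classes whose target frequencies follow a Zipf law.
K = 1000
ranks = np.arange(1, K + 1, dtype=float)
target = (1.0 / ranks) / (1.0 / ranks).sum()  # heavy-tailed class frequencies


def loss_and_grad(logits, target):
    """Cross-entropy between target and softmax(logits), plus its gradient."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    loss = -(target * log_probs).sum()
    grad = np.exp(log_probs) - target  # softmax(logits) - target
    return loss, grad


def run(direction, step_size, steps=200):
    """Iterate logits <- logits - step_size(t) * direction(grad) from zero logits.

    Returns the loss measured before each update.
    """
    logits = np.zeros(K)
    losses = []
    for t in range(steps):
        loss, grad = loss_and_grad(logits, target)
        losses.append(loss)
        logits = logits - step_size(t) * direction(grad)
    return np.array(losses)


# Gradient descent: step along the raw gradient with a constant step size.
gd_losses = run(direction=lambda g: g, step_size=lambda t: 1.0)

# Sign descent: each coordinate moves by the same amount, however small its
# gradient entry is. The 1/sqrt(t + 1) decay is a hypothetical, untuned choice.
sign_losses = run(direction=np.sign, step_size=lambda t: 0.3 / np.sqrt(t + 1))

print(f"initial loss:            {gd_losses[0]:.4f}")
print(f"GD loss, last iterate:   {gd_losses[-1]:.4f}")
print(f"sign loss, last iterate: {sign_losses[-1]:.4f}")
```

In this toy, the gradient entry for a rare class is bounded in magnitude by roughly that class's frequency, so GD changes rare-class logits slowly, whereas sign descent changes all logits at the same rate. The printed numbers depend on the untuned step sizes above and should not be read as a reproduction of the paper's results.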

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Natural Language Processing Techniques · Speech Recognition and Synthesis