Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law

Frederik Kunstner; Francis Bach

arXiv:2505.19227·cs.LG·May 27, 2025

TL;DR

This paper analyzes how heavy-tailed, Zipf-distributed word frequencies affect the efficiency of gradient descent and sign descent when training linear bigram models, and finds that sign descent performs significantly better with large vocabularies.

Contribution

The paper derives new scaling laws for gradient descent and sign descent under power-law token frequencies, highlighting the difficulty gradient descent faces at the Zipf exponent (α=1) and the advantage of sign descent in that regime.

Findings

1. For Zipf-distributed data, the number of iterations sign descent needs scales with the square root of the dimension.

2. Gradient descent is hardest at α=1, where the number of iterations scales almost linearly with the dimension.

3. Sign descent outperforms gradient descent on heavy-tailed data.

Abstract

Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency of the $k$th most frequent word $π_{k}$ is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $π_{k} \propto 1/k^{α}$ parameterized by the exponent $α > 0$. We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent $α$. Existing theoretical investigations in scaling laws assume that the eigenvalues of the…
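To make the setup in the abstract concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's model or experiment. It assumes a toy diagonal quadratic loss whose per-coordinate curvature equals the token frequency $π_{k}$, a target of all ones, and a zero initialization; the function names (`zipf_frequencies`, `gradient_descent`, `sign_descent`, `compare`) are hypothetical. It only illustrates why rarely seen coordinates move slowly under gradient descent, while sign descent moves every coordinate by the same amount per step.

```python
import numpy as np


def zipf_frequencies(d, alpha=1.0):
    """Return pi_k proportional to 1 / k**alpha for k = 1..d, normalized to sum to 1."""
    ranks = np.arange(1, d + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()


def loss(w, pi, w_star):
    """Toy frequency-weighted quadratic (an assumption, not the paper's objective)."""
    return 0.5 * np.sum(pi * (w - w_star) ** 2)


def gradient_descent(pi, w_star, steps):
    """Gradient descent with the standard 1/L step size, where L = max(pi)."""
    w = np.zeros_like(w_star)
    lr = 1.0 / pi.max()
    for _ in range(steps):
        # Coordinate k shrinks its error by a factor (1 - pi_k / pi_1) per step,
        # so low-frequency coordinates barely move.
        w = w - lr * pi * (w - w_star)
    return w


def sign_descent(pi, w_star, steps, lr):
    """Sign descent: every coordinate moves by exactly lr per step."""
    w = np.zeros_like(w_star)
    for _ in range(steps):
        grad = pi * (w - w_star)
        w = w - lr * np.sign(grad)
    return w


def compare(d=1000, alpha=1.0, steps=100):
    """Return (gradient descent loss, sign descent loss) on the toy problem."""
    pi = zipf_frequencies(d, alpha)
    w_star = np.ones(d)  # assumed target, chosen for illustration only
    w_gd = gradient_descent(pi, w_star, steps)
    # Step size tuned to the known distance (1.0) and horizon: an idealization.
    w_sd = sign_descent(pi, w_star, steps, lr=1.0 / steps)
    return loss(w_gd, pi, w_star), loss(w_sd, pi, w_star)


if __name__ == "__main__":
    gd_loss, sd_loss = compare()
    print(f"gradient descent loss: {gd_loss:.3e}")
    print(f"sign descent loss:     {sd_loss:.3e}")
```

The sign descent step size here is tuned to the known initial distance and step budget, which would not be available in practice, and this diagonal toy problem should not be expected to reproduce the scaling exponents reported in the paper.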

Taxonomy

Topics: Random Matrices and Applications · Markov Chains and Monte Carlo Methods

Methods: Adam