Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law
Frederik Kunstner, Francis Bach

TL;DR
This paper analyzes how the heavy-tailed Zipf distribution of words affects the efficiency of gradient and sign descent in training linear bigram models, revealing that sign descent performs significantly better with large vocabularies.
Contribution
It derives new scaling laws for gradient and sign descent under Zipf's law, especially highlighting the challenges when the tail is heavy (α=1) and the advantages of sign descent.
Findings
For Zipf-distributed data, the number of iterations sign descent needs scales with the square root of the dimension.
Gradient descent has the hardest time at α=1, where the number of iterations it needs scales almost linearly with the dimension.
Sign descent outperforms gradient descent in heavy-tailed data scenarios.
Abstract
Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency of the k-th most frequent word is proportional to 1/k, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law parameterized by the exponent α. We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent α. Existing theoretical investigations in scaling laws assume that the eigenvalues of the…
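To make the setting concrete, here is a minimal Python sketch of Zipf-distributed frequencies and the two update rules. It is not the paper's linear bigram model: the diagonal quadratic loss, the target w_star, the step sizes, and all function names are assumptions chosen only for illustration, with the token frequencies standing in for the eigenvalues of the problem.

```python
import numpy as np

def zipf_frequencies(d, alpha=1.0):
    """Frequencies p_k proportional to 1/k^alpha for k = 1..d, normalized to sum to 1."""
    k = np.arange(1, d + 1, dtype=float)
    p = k ** (-alpha)
    return p / p.sum()

def loss(w, p, w_star):
    """Toy diagonal quadratic (an assumption): 0.5 * sum_k p_k (w_k - w_star_k)^2."""
    return 0.5 * np.sum(p * (w - w_star) ** 2)

def grad(w, p, w_star):
    """Gradient of the toy loss; coordinate k is scaled by its frequency p_k."""
    return p * (w - w_star)

def gradient_descent(p, w_star, steps):
    """Plain gradient descent with step size 1 / (largest frequency)."""
    w = np.zeros_like(w_star)
    eta = 1.0 / p.max()
    for _ in range(steps):
        w = w - eta * grad(w, p, w_star)
    return w

def sign_descent(p, w_star, steps, eta):
    """Sign descent: every coordinate moves by the same fixed amount eta per step."""
    w = np.zeros_like(w_star)
    for _ in range(steps):
        w = w - eta * np.sign(grad(w, p, w_star))
    return w

if __name__ == "__main__":
    d = 1000
    p = zipf_frequencies(d, alpha=1.0)
    w_star = np.ones(d)  # hypothetical target
    w_gd = gradient_descent(p, w_star, steps=50)
    w_sd = sign_descent(p, w_star, steps=50, eta=0.01)
    print("initial loss:", loss(np.zeros(d), p, w_star))
    print("gradient descent loss:", loss(w_gd, p, w_star))
    print("sign descent loss:", loss(w_sd, p, w_star))
```

In this toy, gradient descent shrinks the error on coordinate k by a factor (1 - p_k/p_1) per step, so rare coordinates barely move, while sign descent moves every coordinate at the same rate regardless of frequency. The sketch uses a fixed sign descent step size for simplicity and makes no claim about the paper's step-size choices or its scaling results.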
Taxonomy
Topics: Random Matrices and Applications · Markov Chains and Monte Carlo Methods
Methods: Adam
