Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Milind Naphade, Sambit Sahu, Irina Rish, Ekaterina Lobacheva

TL;DR
This paper investigates the training dynamics of language models, revealing loss deceleration caused by zero-sum learning, and shows how scaling mitigates this effect, offering insights to improve model training.
Contribution
It introduces the concept of zero-sum learning as a cause of loss deceleration and demonstrates how scaling alleviates this issue in language model training.
Findings
Loss deceleration occurs early in training with piecewise linear loss curves.
Scaling reduces the loss at which deceleration occurs and improves post-deceleration loss improvement rates.
Zero-sum learning causes destructive interference among per-example gradients, hindering training progress.
Abstract
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
