The Power of Power Law: Asymmetry Enables Compositional Reasoning
Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu

TL;DR
This paper shows, with both theoretical and empirical evidence, that training language models on power-law-distributed data, which mirrors the frequency structure of natural language, yields better compositional reasoning and skill acquisition than training on uniformly distributed data.
Contribution
The paper shows that power-law data distributions let models learn compositional skills from less data, and it provides a theoretical explanation for this advantage.
Findings
Power-law training outperforms uniform training on compositional tasks.
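On a minimalist skill-composition task, learning under a power-law distribution provably requires significantly less training data.
Power-law sampling induces an asymmetry that improves an otherwise pathological loss landscape, aiding skill learning.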
Abstract
Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill…
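To make the contrast between the two sampling regimes concrete, the following is a minimal sketch of drawing "skills" from a power-law versus a uniform distribution. It is an illustration only, not the paper's experimental setup: the function names, the number of skills (100), the exponent `alpha = 1.0`, and the sample count are all assumptions chosen for readability.

```python
import random
from collections import Counter


def skill_probabilities(num_skills, alpha):
    """Return p(k) proportional to k**(-alpha) for ranks k = 1..num_skills.

    alpha = 0.0 recovers the uniform distribution; larger alpha gives a
    heavier head and a longer, rarer tail. (Hypothetical helper.)
    """
    weights = [rank ** (-alpha) for rank in range(1, num_skills + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_skills(num_skills, alpha, num_samples, seed=0):
    """Draw skill indices (0-based) according to skill_probabilities."""
    rng = random.Random(seed)
    probs = skill_probabilities(num_skills, alpha)
    return rng.choices(range(num_skills), weights=probs, k=num_samples)


# Assumed illustrative values, not taken from the paper.
NUM_SKILLS = 100
uniform_p = skill_probabilities(NUM_SKILLS, alpha=0.0)
power_p = skill_probabilities(NUM_SKILLS, alpha=1.0)

# Under the power law, the most frequent skill is far more likely than
# the rarest one; under the uniform distribution every skill is equal.
head_to_tail_ratio = power_p[0] / power_p[-1]

# Empirical counts from a power-law sample show the same asymmetry.
counts = Counter(sample_skills(NUM_SKILLS, alpha=1.0, num_samples=10_000))
```

With these assumed settings, every skill has the same probability under uniform sampling, while under the power law the highest-ranked skill is about `NUM_SKILLS` times more likely than the lowest-ranked one. This frequency asymmetry is what the abstract refers to; the sketch does not model training or the loss landscape.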
