gzip Predicts Data-dependent Scaling Laws
Rohan Pandey

TL;DR
This paper investigates how data complexity influences neural language model scaling laws, demonstrating that data compressibility affects performance predictions and proposing a new data-dependent scaling law incorporating gzip-compressibility.
Contribution
It introduces a novel data-dependent scaling law for language models that accounts for data complexity via gzip-compressibility, challenging the assumption of data-agnostic scaling laws.
Findings
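Scaling laws are sensitive to differences in training data complexity.
gzip-compressibility is an effective predictor of how data complexity affects scaling properties.
Under the new law, the compute-optimal allocation shifts toward dataset size (over parameter count) as training data becomes harder to compress.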
Abstract
Past work has established scaling laws that predict the performance of a neural language model (LM) as a function of its parameter count and the number of tokens it's trained on, enabling optimal allocation of a fixed compute budget. Are these scaling laws agnostic to training data, as some prior work suggests? We generate training datasets of varying complexities by modulating the syntactic properties of a probabilistic context-free grammar (PCFG), finding that 1) scaling laws are sensitive to differences in data complexity and that 2) gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties. We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility; its compute-optimal frontier increases in dataset size preference (over parameter count preference) as training data becomes harder to compress.
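As a rough illustration of the gzip-compressibility measure the abstract refers to, the sketch below computes the ratio of gzip-compressed size to raw size for a piece of text. This is a minimal sketch and not the paper's code: the function name `gzip_compressibility`, the choice of compressed-to-raw byte ratio as the metric, and the two toy corpora are all assumptions made for illustration.

```python
import gzip
import random
import string


def gzip_compressibility(text: str) -> float:
    """Return compressed size / raw size in bytes (assumed metric).

    Lower values mean the text is easier to compress, i.e. less complex.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(gzip.compress(raw)) / len(raw)


# Toy corpora (hypothetical, for illustration only).
# A highly repetitive string stands in for low-complexity data.
repetitive = "the cat sat on the mat. " * 400

# Random letters and spaces of the same length stand in for
# high-complexity data.
rng = random.Random(0)
alphabet = string.ascii_lowercase + " "
noisy = "".join(rng.choice(alphabet) for _ in range(len(repetitive)))

low = gzip_compressibility(repetitive)
high = gzip_compressibility(noisy)
print(f"repetitive: {low:.3f}  noisy: {high:.3f}")
```

Under this assumed metric, the repetitive corpus scores far lower than the random one. Per the abstract, data that is harder to compress is the case where the proposed law favors dataset size over parameter count.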
