gzip Predicts Data-dependent Scaling Laws

Rohan Pandey

arXiv:2405.16684·cs.CL·May 28, 2024·2 cites

TL;DR

This paper investigates how data complexity influences neural language model scaling laws, demonstrating that data compressibility affects performance predictions and proposing a new data-dependent scaling law incorporating gzip-compressibility.

Contribution

It introduces a novel data-dependent scaling law for language models that accounts for data complexity via gzip-compressibility, challenging the assumption of data-agnostic scaling laws.

Findings

01

Scaling laws are sensitive to data complexity.

02

gzip-compressibility is an effective predictor of how data complexity affects scaling properties.

03

The new law shifts the compute-optimal allocation toward dataset size (over parameter count) as training data becomes harder to compress.
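
To make "compute-optimal allocation" concrete, here is a minimal sketch. It assumes the widely used Chinchilla-style loss form L(N, D) = E + A/N^alpha + B/D^beta and the approximation C ≈ 6·N·D; this page does not spell out the paper's exact functional form, so treat both as assumptions. The function names are hypothetical, and the numeric parameters are the published Chinchilla fit, used only for illustration and not taken from this paper. A data-dependent law would presumably let such fitted parameters vary with gzip-compressibility, which changes the exponents below.

```python
def loss(N, D, E, A, B, alpha, beta):
    """Assumed Chinchilla-style parametric loss (not confirmed as this paper's exact form)."""
    return E + A / N**alpha + B / D**beta


def compute_optimal_exponents(alpha, beta):
    """Exponents a, b such that N_opt ∝ C**a and D_opt ∝ C**b under C ≈ 6*N*D."""
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    return a, b


def compute_optimal_allocation(C, A, B, alpha, beta):
    """Closed-form minimizer of the assumed loss subject to 6*N*D = C."""
    a, b = compute_optimal_exponents(alpha, beta)
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** a
    D_opt = (C / 6.0) ** b / G
    return N_opt, D_opt


# Illustrative parameters only (published Chinchilla fit, NOT this paper's values).
params = dict(A=406.4, B=410.7, alpha=0.34, beta=0.28)
C = 1e21  # hypothetical FLOP budget

a, b = compute_optimal_exponents(params["alpha"], params["beta"])
N_opt, D_opt = compute_optimal_allocation(C, **params)
print(f"N_opt ∝ C^{a:.3f}, D_opt ∝ C^{b:.3f}")
print(f"N_opt ≈ {N_opt:.3e} parameters, D_opt ≈ {D_opt:.3e} tokens")
```

Under this form, a larger alpha relative to beta raises the data exponent b, so the frontier prefers more tokens per parameter as compute grows. That is the kind of shift the finding describes for less compressible data.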

Abstract

Past work has established scaling laws that predict the performance of a neural language model (LM) as a function of its parameter count and the number of tokens it's trained on, enabling optimal allocation of a fixed compute budget. Are these scaling laws agnostic to training data as some prior work suggests? We generate training datasets of varying complexities by modulating the syntactic properties of a PCFG, finding that 1) scaling laws are sensitive to differences in data complexity and that 2) gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties. We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility; its compute-optimal frontier increases in dataset size preference (over parameter count preference) as training data becomes harder to compress.
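
As a rough illustration of the compressibility measure, the sketch below computes the ratio of gzip-compressed size to raw size using Python's standard `gzip` module. The ratio definition (compressed bytes over raw bytes, so lower means easier to compress), the function name `gzip_compressibility`, and the two toy corpora are assumptions made for this example; they are not taken from the paper's code.

```python
import gzip
import random


def gzip_compressibility(text: str) -> float:
    """Compressed size / raw size in bytes; lower values mean the text is easier to compress."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(gzip.compress(raw)) / len(raw)


random.seed(0)

# Toy "simple" corpus: one sentence repeated many times.
repetitive = "the cat sat on the mat . " * 400

# Toy "complex" corpus: tokens drawn uniformly from a 500-word hypothetical vocabulary.
vocab = [f"w{i}" for i in range(500)]
noisy = " ".join(random.choice(vocab) for _ in range(2000))

r_simple = gzip_compressibility(repetitive)
r_complex = gzip_compressibility(noisy)
print(f"repetitive: {r_simple:.3f}  noisy: {r_complex:.3f}")
```

The repetitive corpus yields a much smaller ratio than the random-token corpus, which is the sense in which gzip serves as a cheap proxy for data complexity.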

Code & Models

Repositories

KhoomeiK/complexity-scaling
PyTorch · Official
