InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
Fengze Liu, Weidong Zhou, Binbin Liu, Ping Guo, Zijun Wang, Bingni Zhang, Yifan Zhang, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang

TL;DR
InfoLaw is a data-aware scaling framework for large language models that predicts performance based on data quality, mixture, and repetition, improving data recipe selection during training.
Contribution
InfoLaw introduces an information-based model that predicts LLM performance across data mixtures and scales, addressing the failure of standard scaling laws to extrapolate across mixture recipes and under repetition.
Findings
Predicts model loss with 0.15% mean absolute error up to 7B parameters.
Accurately extrapolates performance across data recipes and overtraining levels.
Enables efficient data recipe selection under varying compute budgets.
Abstract
Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. Standard scaling laws, however, do not reliably extrapolate across mixture recipes or under repetition, which leaves the choice of an optimal data recipe at scale underdetermined. To address this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. Then we build up the modeling…
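The link between upweighting and repetition described in the abstract can be illustrated with simple arithmetic. The sketch below is not InfoLaw's formula (the abstract above is truncated before any functional form is given). It only computes how many times each quality bucket is seen for a fixed token budget. The function name `epochs_per_bucket`, the bucket names, and all token counts are hypothetical, chosen for illustration.

```python
def epochs_per_bucket(bucket_tokens, weights, budget_tokens):
    """Return how many passes each bucket receives under a sampling mixture.

    bucket_tokens: unique tokens available in each bucket (hypothetical sizes).
    weights: sampling weight per bucket; normalized here so they sum to 1.
    budget_tokens: total tokens consumed during training.
    """
    total_weight = sum(weights.values())
    return {
        name: (weights[name] / total_weight) * budget_tokens / bucket_tokens[name]
        for name in bucket_tokens
    }


# Hypothetical corpus: 10B high-quality tokens and 90B lower-quality tokens.
bucket_tokens = {"high": 10_000_000_000, "low": 90_000_000_000}
budget = 100_000_000_000  # 100B training tokens

# Sampling in proportion to bucket size: every bucket is seen about once.
proportional = epochs_per_bucket(bucket_tokens, {"high": 0.1, "low": 0.9}, budget)

# Upweighting the high-quality bucket to half the mixture: it is now
# repeated about 5 times, while part of the low-quality bucket goes unseen.
upweighted = epochs_per_bucket(bucket_tokens, {"high": 0.5, "low": 0.5}, budget)

print(proportional)
print(upweighted)
```

With the budget fixed, every increase in the high-quality weight raises that bucket's repetition count in direct proportion, which is the regime where the abstract says stronger upweighting can start to degrade performance.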
