InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

Fengze Liu; Weidong Zhou; Binbin Liu; Ping Guo; Zijun Wang; Bingni Zhang; Yifan Zhang; Yifeng Yu; Xiaohuan Zhou; Taifeng Wang

arXiv:2605.02364·cs.CL·May 5, 2026


TL;DR

InfoLaw is a data-aware scaling framework for large language models that predicts performance based on data quality, mixture, and repetition, improving data recipe selection during training.

Contribution

It introduces a novel information-based model that accurately predicts LLM performance across different data mixtures and scales, addressing limitations of standard scaling laws.

Findings

01. Predicts model loss with 0.15% mean absolute error up to 7B parameters.

02. Accurately extrapolates performance across data recipes and overtraining levels.

03. Enables efficient data recipe selection under varying compute budgets.

Abstract

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. Standard scaling laws, however, do not reliably extrapolate across mixture recipes or under repetition, which leaves the choice of an optimal data recipe at scale underdetermined. To address this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. Then we build up the modeling…
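The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the InfoLaw formula, which this page does not give. It assumes a generic saturating-exponential form for repeated data, a per-source "density" multiplier standing in for quality, and a Chinchilla-style loss with made-up constants. It also omits the scale dependence of the repetition penalty that the abstract mentions. All function names, parameters, and numbers are hypothetical.

```python
import math

def effective_information(tokens_consumed, weights, densities, unique_tokens):
    """Toy information total for a data mixture (hypothetical form, not InfoLaw).

    weights       -- mixture weights per source, summing to 1
    densities     -- assumed information per fresh token for each source
    unique_tokens -- number of distinct tokens available in each source
    """
    total = 0.0
    for w, d, u in zip(weights, densities, unique_tokens):
        consumed = w * tokens_consumed  # tokens drawn from this source
        # Saturating form: roughly `consumed` when consumed << u,
        # and capped at `u` under heavy repetition.
        effective = u * -math.expm1(-consumed / u)
        total += d * effective
    return total

def toy_loss(info, n_params, e=1.7, a=400.0, alpha=0.34, b=400.0, beta=0.28):
    """Chinchilla-style loss in which `info` replaces raw token count.

    The constants are illustrative placeholders, not fitted values.
    """
    return e + a / n_params ** alpha + b / info ** beta

# Hypothetical setup: a small high-quality source and a large low-quality one.
densities = (1.0, 0.5)
unique_tokens = (1e9, 100e9)
budget = 20e9      # tokens consumed during training
n_params = 1e9

for weights in [(0.0, 1.0), (0.1, 0.9), (0.9, 0.1)]:
    info = effective_information(budget, weights, densities, unique_tokens)
    print(weights, f"info={info:.3e}", f"loss={toy_loss(info, n_params):.4f}")
```

Under these assumed numbers, giving the high-quality source a 10% weight yields slightly more effective information than ignoring it, while a 90% weight repeats the small source many times and yields far less. This mirrors the qualitative behavior the abstract describes, without claiming anything about the paper's fitted model.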
