TL;DR
This paper introduces relative scaling laws for language models, revealing how performance disparities across different test distributions evolve with scale, and highlighting that scaling does not uniformly improve all aspects of model performance.
Contribution
It proposes a novel approach to measure performance gaps across distributions as models scale, supported by extensive experiments on 255 models trained under matched compute budgets.
Findings
Academic domains on MMLU converge toward parity.
Performance gaps across regional English dialects shift depending on population size.
Capability- and influence-related AI risks increase during pretraining, while adversarial risks do not.
Abstract
Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from -- FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These…
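To make the idea concrete, here is a minimal sketch (not the authors' code) of how a relative scaling law could be fit. It assumes the loss on each test distribution follows an absolute power law in compute, L_i(F) = alpha_i * F^(-beta_i), so that the ratio G(F) = L_A(F) / L_B(F) is itself a power law gamma * F^(delta_beta), with gamma = alpha_A / alpha_B and delta_beta = beta_B - beta_A. The ratio definition of the gap, the sign convention, the function name `fit_relative_scaling_law`, and every number below are illustrative assumptions; the data are synthetic and are not results from the paper.

```python
import numpy as np


def fit_relative_scaling_law(flops, loss_a, loss_b):
    """Fit G(F) = gamma * F**delta_beta to the loss ratio loss_a / loss_b.

    Hypothetical helper for illustration only. A power law is a straight
    line in log-log space, so an ordinary least-squares fit of
    log G = log(gamma) + delta_beta * log(F) recovers both parameters.
    """
    flops = np.asarray(flops, dtype=float)
    ratio = np.asarray(loss_a, dtype=float) / np.asarray(loss_b, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest:
    # slope (delta_beta) first, then intercept (log gamma).
    delta_beta, log_gamma = np.polyfit(np.log(flops), np.log(ratio), deg=1)
    return float(np.exp(log_gamma)), float(delta_beta)


# Synthetic compute budgets and losses (made-up values, assumed power laws).
flops = np.array([1e18, 1e19, 1e20, 1e21])
loss_a = 60.0 * flops ** -0.07  # distribution A: worse at first, improves faster
loss_b = 20.0 * flops ** -0.05  # distribution B: better at first, improves slower

gamma, delta_beta = fit_relative_scaling_law(flops, loss_a, loss_b)
print(f"gamma = {gamma:.3f}, delta_beta = {delta_beta:.4f}")
```

Under this convention, a negative delta_beta means the ratio shrinks with compute (the gap narrows), a positive value means it widens, and a value near zero means both distributions improve at the same rate. The fit is only meaningful when both losses are measured at the same compute budgets F, which is the condition one of the reviews below questions.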
Peer Reviews
Decision · ICLR 2026 Conference Desk Rejected Submission
s1: The proposed framework helps us study how different capabilities evolve as models develop certain capabilities, rather than only examining scaling with respect to model size and tokens. For instance, Section 4.3 shows that as models improve at self-improvement tasks, capability-related and influence-related risks increase proportionally while adversarial risks (scheming, incorrigibility) do not emerge during pretraining. s2: The paper carefully studies scaling laws, e.g. they run extensi
w1: Section 2's mathematical formulation lacks rigor in notation and assumptions. The relative scaling law $G(F) = \gamma F^{\Delta \beta}$ is presented as following "directly" from absolute scaling laws, but the derivation glosses over when this approximation holds. What range of scales is required for the power law assumption to be valid? Under what conditions does the ratio of two power laws remain a power law (this requires both to use the same scale variable F, which may not hold if data m
1. Novelty: Moves beyond the limitations of aggregate scaling laws by quantifying the impact of scale on subdomains, providing a more detailed view of model scaling. 2. Experiments: The experiments are solid, covering 255 models trained on three distinct datasets under fixed compute budgets, which supports the reliability of the results. 3. Clarity and Impact: The paper is well written, with clear conclusions. It highlights practical implications for multiple subdomains, making it relevant to researchers.
The conclusions are strongly tied to the specific models and datasets. The behavior of relative scaling laws under other data distributions remains unexplored. This limits their direct use in guiding next-generation model training pipelines.
1. Reframes scaling laws as relative dynamics between subpopulations. This is a minimal extension of existing theory that adds interpretive depth to scaling analyses. 2. The experimental setup is sufficiently rigorous. 3. Clear visualizations: the figures effectively show convergence and divergence dynamics, aiding conceptual understanding.
1. No statistical or mathematical validation of the proposed scaling fit function. The relative law is presented as an empirical regression without a clear formal derivation or underlying theoretical justification. 2. No significance testing of the fitted scaling laws. 3. While "relative scaling" is novel, the empirical insights (e.g., domain convergence, dialect disparity) largely reaffirm known intuitions about representation bias and data imbalance. Not able to appreciate the practical signi
- The paper is well written and justifies its choices well.
- The paper also touches on aspects such as prompt design for evaluation, which are crucial for reliable scaling laws.
- The authors validate the relative scaling laws in several settings, finding cases where disparities go away with more compute and cases where they do not.
- The released models will be an excellent resource for future researchers.
I do not see any obvious flaws/weaknesses
