Relative Scaling Laws for LLMs

William Held; David Hall; Percy Liang; Diyi Yang

arXiv:2510.24626·cs.CL·January 16, 2026

4 Reviews

TL;DR

This paper introduces relative scaling laws for language models, revealing how performance disparities across different test distributions evolve with scale, and highlighting that scaling does not uniformly improve all aspects of model performance.

Contribution

It proposes a novel approach to measure performance gaps across distributions as models scale, supported by extensive experiments on 255 models trained under matched compute budgets.

Findings

01

Academic domains on MMLU converge toward parity.

02

Regional dialect performance varies with population size.

03

Risks related to capability and influence increase during pretraining.

Abstract

Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$ to $10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These…
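As a rough illustration of the idea in the abstract, the sketch below fits a relative law of the form $G(F) = \gamma F^{\Delta \beta}$ (the form quoted in the reviews) to the ratio of two loss curves. Everything in it is an assumption made for illustration: the helper `fit_power_law`, the synthetic coefficients, and the sign convention that the ratio is comparison loss over reference loss. None of the numbers are results from the paper, and the authors' actual fitting procedure may differ.

```python
import numpy as np

def fit_power_law(flops, values):
    """Fit values ~ coef * flops**exponent by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(flops), np.log(values), deg=1)
    return float(np.exp(intercept)), float(slope)

# Hypothetical compute budgets spanning the range named in the abstract.
flops = np.logspace(18, 20, num=9)

# Synthetic, made-up losses for two test distributions (not data from the paper).
loss_a = 40.0 * flops ** -0.060  # reference distribution
loss_b = 55.0 * flops ** -0.050  # comparison distribution, improves more slowly

# Relative law: fit the ratio of the two losses directly.
gamma, delta_beta = fit_power_law(flops, loss_b / loss_a)

# Cross-check: the ratio of two pure power laws in the same variable F is
# itself a power law whose exponent is the difference of the two exponents.
coef_a, exp_a = fit_power_law(flops, loss_a)
coef_b, exp_b = fit_power_law(flops, loss_b)

print(f"relative fit:       gamma={gamma:.3f}, delta_beta={delta_beta:.4f}")
print(f"from absolute fits: gamma={coef_b / coef_a:.3f}, delta_beta={exp_b - exp_a:.4f}")
```

Under this convention a positive exponent means the comparison distribution falls further behind as compute grows, and a negative one means the gap closes. The identity only holds exactly when both curves are pure power laws in the same scale variable; an irreducible-loss offset in either curve would make the ratio deviate from a straight line in log-log space.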

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 6·Confidence 4

Strengths

s1: The proposed framework helps us study how different capabilities evolve as models develop certain capabilities, rather than only examining scaling with respect to model size and tokens. For instance, Section 4.3 shows that as models improve at self-improvement tasks, capability-related and influence-related risks increase proportionally, while adversarial risks (scheming, incorrigibility) do not emerge during pretraining. s2: The paper carefully studies scaling laws, e.g. they run extensi

Weaknesses

w1: Section 2's mathematical formulation lacks rigor in notation and assumptions. The relative scaling law $G(F) = \gamma F^{\Delta \beta}$ is presented as following "directly" from absolute scaling laws, but the derivation glosses over when this approximation holds. What range of scales is required for the power law assumption to be valid? Under what conditions does the ratio of two power laws remain a power law (this requires both to use the same scale variable F, which may not hold if data m

Reviewer 02·Rating 8·Confidence 2

Strengths

1. Novelty: Moves beyond a limitation of standard scaling laws by quantifying the impact of scale on subdomains, providing a more detailed view of model scaling. 2. Experiment: The experiments are solid, covering 255 models trained on three distinct datasets under fixed compute budgets, which supports the reliability of the results. 3. Clarity and Impact: The paper is well written with clear conclusions. It highlights practical implications for multiple subdomains, making it relevant to researchers.

Weaknesses

The conclusions are strongly tied to the specific models and datasets. The behavior of relative scaling laws under other data distributions remains unexplored. This limits how directly the results can guide next-generation model training pipelines.

Reviewer 03·Rating 6·Confidence 4

Strengths

1. Reframes scaling laws as relative dynamics between subpopulations. This is a minimal extension of existing theory that adds interpretive depth to scaling analyses. 2. Experimental setup is rigorous enough. 3. Clear visualizations: the figures effectively show convergence/divergence dynamics, aiding conceptual understanding.

Weaknesses

1. No statistical or mathematical validation of the proposed scaling fit function. The relative law is presented as an empirical regression without a clear formal derivation or underlying theoretical justification. 2. No significance testing of the fitted scaling laws. 3. While "relative scaling" is novel, the empirical insights (e.g., domain convergence, dialect disparity) largely reaffirm known intuitions about representation bias and data imbalance. Not able to appreciate the practical signi

Reviewer 04·Rating 10·Confidence 3

Strengths

- The paper is well written and justifies its choices well.
- The paper also touches on aspects such as prompt design for evaluation, which are crucial for reliable scaling laws.
- The authors validate the relative scaling laws in several settings, finding disparities in some, cases where these disparities go away with more compute, and cases where they do not.
- The released models will be an excellent resource for future researchers.

Weaknesses

I do not see any obvious flaws or weaknesses.
