Relative-Based Scaling Law for Neural Language Models

Baoqing Yue; Jinyuan Zhou; Zixi Wei; Jingtao Zhan; Qingyao Ai; Yiqun Liu

arXiv:2510.20387·cs.LG·October 24, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a new scaling law based on relative ordering, using a novel metric called RBP, to better predict language model performance across scales, complementing traditional cross-entropy metrics.

Contribution

It proposes the Relative-Based Scaling Law and RBP metric, providing a more comprehensive understanding of model performance and emergence phenomena in language models.

Findings

01

RBP effectively measures relative token ranking.

02

The scaling law accurately predicts RBP improvements with model size.

03

The law aids in understanding emergence and fundamental theories.

Abstract

Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token, but ignores the relative ordering between correct and incorrect tokens. Yet relative ordering is crucial for language models, for example in greedy-sampling scenarios. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four…
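To make the contrast between absolute probability and relative ordering concrete, here is a minimal Python sketch that computes cross-entropy alongside a top-k hit rate on toy logits. It assumes that RBP at cutoff k can be estimated as the fraction of positions whose correct token ranks within the top k predictions; the paper's exact definition may differ. The function names and toy values are hypothetical.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-probability of the correct token (absolute view)."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(targets)

def top_k_hit_rate(logits, targets, k):
    """Fraction of positions whose correct token ranks in the top k.

    Assumption: this is used here as a stand-in estimate of RBP at
    cutoff k; it is not taken from the paper's code.
    """
    hits = 0
    for row, t in zip(logits, targets):
        # Rank = 1 + number of tokens scored strictly above the correct one.
        rank = 1 + sum(1 for x in row if x > row[t])
        hits += rank <= k
    return hits / len(targets)

# Toy example: 3 positions, vocabulary of 3 tokens (values are made up).
logits = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.5, 2.0],
]
targets = [0, 2, 1]

ce = cross_entropy(logits, targets)
rates = [top_k_hit_rate(logits, targets, k) for k in (1, 2, 3)]
print(ce, rates)
```

In this toy data the correct token is ranked first, second, and third across the three positions, so the hit rate rises from 1/3 at k = 1 to 1 at k = 3, while cross-entropy summarizes only the probability mass on the correct token.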

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Its strength lies in the novelty of the proposed idea and its substantiation by a large body of experimental results.

Weaknesses

Objectively, the paper’s contribution is limited. Although it provides a new scaling perspective, the conclusions derived do not appear to be significantly more informative or useful compared to those from the traditional log scaling law.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper is well-written, and takeaways are easy to understand.
- The proposed RBP metric addresses a key limitation of cross-entropy as a metric and better aligns with real-world inference practices, which often involve greedy decoding or top-k sampling.

Weaknesses

- Previous works have put forth the view that whether model performance exhibits "emergence" depends on the metric being measured (Schaeffer et al., 2023); it is thus unclear what new information has been learned by proposing a specific metric that can explain away emergence.
- The paper would be strengthened by showing that the RBP-based scaling law leads to new insights about how model performance scales (for example, are there different compute-optimal data-to-parameter ratios?) that would no

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper is written well: it explains the motivation clearly, has a clear focus, and is easy to read and follow.

Weaknesses

1. The main weakness is that, while the relative-based metric can measure different aspects of model performance compared to cross-entropy loss, as clearly shown in Figure 1, the experiments and findings did not uncover novel knowledge of scaling behavior distinct from that known from CE loss. As such, many readers like me would interpret the paper's significance as a confirmation of the scaling law, previously known w.r.t. the CE loss, now for the rank-based metric as well. But again the conf

Reviewer 04 · Rating 2 · Confidence 5

Strengths

- A new scaling-law approach that quantifies the relative ranking of tokens, which is important for top-k sampling.
- Explanation of "emergence" using RBP.
- The authors released code that can be used to reproduce their findings.

Weaknesses

- No validation on held-out points, no confidence intervals.
- The scaling laws should be fitted not on all the data but on the Pareto frontier [1, 2] (e.g. dividing the compute axis into bins and taking the point with minimal loss in each bin). The points that were used for the fits in the paper are not necessarily on the Pareto frontier, so at best they are not scaling laws but scaling trends.
- The number of points for each fit is < 10, and for all of the fits except Pythia it is < 5. The sample size is too s
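The frontier-based fitting procedure that Reviewer 04 describes (bin the compute axis, keep the minimal-loss point per bin, then fit) might look roughly like the Python sketch below. This illustrates the reviewer's suggestion, not the authors' method. The log-spaced binning, the bin count, the power-law form, and the synthetic data are all assumptions made for the example.

```python
import math

def frontier_by_bins(points, n_bins):
    """Keep the minimal-loss (compute, loss) point in each log-compute bin."""
    logs = [math.log(c) for c, _ in points]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    best = {}
    for (c, loss), lc in zip(points, logs):
        b = min(int((lc - lo) / width), n_bins - 1)  # clip the right edge
        if b not in best or loss < best[b][1]:
            best[b] = (c, loss)
    return [best[b] for b in sorted(best)]

def fit_power_law(points):
    """Least-squares fit of log(loss) = log(a) + b * log(compute)."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(l) for _, l in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

# Synthetic runs (made up): frontier points follow loss = 10 * C**-0.1,
# and the extra points at 2x compute are deliberately worse (dominated).
runs = [(c, 10 * c ** -0.1) for c in (1e1, 1e2, 1e3, 1e4)]
runs += [(2 * c, 1.5 * 10 * (2 * c) ** -0.1) for c in (1e1, 1e2, 1e3)]

frontier = frontier_by_bins(runs, n_bins=4)
a, b = fit_power_law(frontier)
print(frontier, a, b)
```

Fitting on `frontier` rather than on `runs` keeps the dominated points from biasing the exponent. With as few points as the reviewer notes, any such fit would still carry wide uncertainty.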

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Generative Adversarial Networks and Image Synthesis