Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

Jakub Krajewski; Amitis Shidani; Dan Busbridge; Sam Wiseman; Jason Ramapuram

arXiv:2512.08894 · cs.LG · December 10, 2025



TL;DR

This paper shows that, at a fixed token-to-parameter ratio, downstream task performance of large language models can be modeled and predicted directly from the training budget with a simple power law, and that this direct fit extrapolates better than the earlier two-stage procedure.

Contribution

The paper introduces a framework that models downstream performance scaling directly, outperforming the traditional two-stage procedure, and provides functional forms that extend predictions across token-to-parameter ratios and to inference compute under repeated sampling.

Findings

01 · Power law accurately models log accuracy scaling

02 · Direct approach outperforms two-stage procedures

03 · Validated on models up to 17B parameters and 350B tokens

Abstract

While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two…
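To make the abstract's central claim concrete, here is a minimal sketch of fitting a power law to log accuracy and extrapolating to a larger budget. It rests on several assumptions that do not come from the paper: the functional form `-log(accuracy) = A * C**(-alpha)` with no irreducible term, the function names `fit_log_accuracy_power_law` and `predict_accuracy`, a least-squares fit in log-log space, and purely synthetic data points. The paper's actual parameterization and fitting procedure may differ.

```python
import numpy as np


def fit_log_accuracy_power_law(compute, accuracy):
    """Fit -log(accuracy) = A * compute**(-alpha) by linear least squares.

    Assumed form (not taken from the paper). Taking logs of both sides gives
    log(-log(accuracy)) = log(A) - alpha * log(compute), a straight line.
    """
    x = np.log(np.asarray(compute, dtype=float))
    y = np.log(-np.log(np.asarray(accuracy, dtype=float)))
    slope, intercept = np.polyfit(x, y, 1)
    return np.exp(intercept), -slope  # (A, alpha)


def predict_accuracy(compute, A, alpha):
    """Invert the assumed law: accuracy = exp(-A * compute**(-alpha))."""
    return np.exp(-A * np.asarray(compute, dtype=float) ** (-alpha))


# Synthetic training budgets (FLOPs) and accuracies generated from known
# parameters; these are illustrative values, not results from the paper.
compute = np.array([1e18, 1e19, 1e20, 1e21])
true_A, true_alpha = 50.0, 0.1
accuracy = predict_accuracy(compute, true_A, true_alpha)

# Fit on the smaller budgets, then extrapolate one order of magnitude further.
A_hat, alpha_hat = fit_log_accuracy_power_law(compute, accuracy)
extrapolated = float(predict_accuracy(1e22, A_hat, alpha_hat))

print(f"A = {A_hat:.3f}, alpha = {alpha_hat:.4f}")
print(f"predicted accuracy at 1e22 FLOPs: {extrapolated:.3f}")
```

Because the fit targets accuracy directly from the budget, there is only one regression whose error can propagate. A two-stage pipeline would instead chain a budget-to-loss fit with a loss-to-accuracy fit, which is where the compounding errors mentioned in the abstract can arise.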

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The paper is clearly written, and I appreciate the measured and non-sensational tone throughout. The proposed accuracy scaling law fits the evaluated metrics well, showing strong consistency across scales. I also like that the paper carefully compares the one-stage approach to the traditional two-stage setup, showing lower MAE, MRE, and higher R² for the proposed method.

Weaknesses

Please see Questions section.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- wide coverage of benchmarks and types of tasks, various compute budgets (both scale and TPR)
- simplicity of approach, taking into account the nature of these benchmarks (e.g. S-shaped)
- results demonstrate some extrapolation to larger compute budgets
- compares directly with prior works (two-stage approach)

Weaknesses

- lack of motivation for practical usage of this compared to standard scaling laws
- scaling laws are often used to justify design decisions (e.g. architectural or dataset choices)
- there is a lack of these alternatives and showing that the scaling laws preserve the ordering of the "better" design decision

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- One-stage approach to directly predict downstream performance of LLMs from N and D.
- Two different approaches to downstream task modeling: BSNL and a simple power-law relationship.
- Authors validated their predictions using error on held-out points.
- Works for different downstream metrics: pass@k, accuracy etc.
- Baselines like two-stage approaches that are commonly used (FLOPs-to-NLL and then NLL-to-accuracy).
- Direct downstream prediction is more reliable and accurate than two-stage

Weaknesses

- No code or data is currently available, which makes reproduction and independent verification of the authors' claims impossible. The authors do, though, promise to release the model losses and downstream evaluation results.
- Better performance of one-stage approach compared to two-stage one on individual benchmarks can be due to the fact that individual task performance may not be correlated to the specific downstream task performance (which also depends on the validation set) and not only com


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods