TL;DR
This paper introduces proxy metrics, built from token-level statistics over expert-written solutions, that forecast the downstream performance of language models more reliably than loss- and compute-based signals across three model development settings.
Contribution
It proposes a novel approach using token-level proxy metrics derived from expert solutions to improve performance forecasting during language model development.
Findings
Proxy metrics outperform loss- and compute-based baselines in model ranking.
Proxy metrics rank candidate pretraining corpora with 10,000x less compute.
Proxy metrics forecast downstream accuracy with half the error of existing methods.
Abstract
Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean…
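The token-level statistics named in the abstract (entropy, top-k accuracy, and expert token rank) can be illustrated with a minimal sketch. It assumes the candidate model's next-token logits are already available at each position of an expert-written solution. The function names `token_stats` and `proxy_metrics`, the tie-breaking rule, and the use of a plain mean as the aggregator are illustrative assumptions; the abstract does not specify how the paper aggregates these statistics.

```python
import math
from statistics import mean


def token_stats(logits, expert_id, k=5):
    """Statistics for one position, given next-token logits and the expert's token id."""
    # Numerically stable softmax over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Shannon entropy (in nats) of the next-token distribution.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)

    # Rank of the expert token: 1 means it is the model's most likely token.
    # Assumption: ties are resolved in the expert token's favor.
    rank = 1 + sum(1 for x in logits if x > logits[expert_id])

    return {"entropy": entropy, "rank": rank, "top_k_hit": 1.0 if rank <= k else 0.0}


def proxy_metrics(logit_rows, expert_ids, k=5):
    """Aggregate per-token statistics over a solution (assumed aggregator: simple mean)."""
    stats = [token_stats(row, tok, k) for row, tok in zip(logit_rows, expert_ids)]
    return {
        "mean_entropy": mean(s["entropy"] for s in stats),
        "top_k_accuracy": mean(s["top_k_hit"] for s in stats),
        "mean_expert_rank": mean(s["rank"] for s in stats),
    }


if __name__ == "__main__":
    # Toy example: a 3-token vocabulary and a 2-token expert solution.
    rows = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    print(proxy_metrics(rows, expert_ids=[1, 2], k=1))
```

In practice the logits would come from a forward pass of the candidate model over the tokenized expert solution, and the resulting aggregates would be compared across models, corpora, or checkpoints.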
