Train-before-Test Harmonizes Language Model Rankings
Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt

TL;DR
This paper proposes train-before-test, an evaluation method that gives each language model identical benchmark-specific fine-tuning before testing. Ranking models by this post-fine-tuning potential yields more consistent and valid rankings across benchmarks.
Contribution
The paper introduces train-before-test as a novel approach for model evaluation and provides a comprehensive empirical analysis demonstrating its advantages over traditional methods.
Findings
Model potential rankings are consistent across benchmarks with train-before-test.
Train-before-test restores the link between perplexity and downstream performance.
Model potential is dominated by a single latent factor.
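One common way to check whether benchmark scores are dominated by a single latent factor is to measure how much variance the first principal component of the model-by-benchmark score matrix explains. The sketch below illustrates that idea under stated assumptions and is not the paper's analysis code: `top_component_share` is a hypothetical helper, and the toy score matrix is invented for demonstration.

```python
import math

def top_component_share(scores, iters=500):
    """Fraction of total variance captured by the first principal component.

    scores: list of rows (one per model), each a list of benchmark scores.
    Needs at least two models. Hypothetical helper for illustration only.
    """
    n, m = len(scores), len(scores[0])
    # Center each benchmark column so only differences between models remain.
    means = [sum(row[j] for row in scores) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in scores]
    # Benchmark-by-benchmark sample covariance matrix.
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1) for b in range(m)]
           for a in range(m)]
    total = sum(cov[j][j] for j in range(m))  # trace equals total variance
    if total == 0:
        return 0.0
    # Power iteration for the leading eigenvector. The non-uniform start
    # vector is an arbitrary choice; it fails only if it happens to be
    # exactly orthogonal to the leading eigenvector.
    v = [float(j + 1) for j in range(m)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            return 0.0
        v = [c / norm for c in w]
    # Rayleigh quotient of the unit vector v gives the leading eigenvalue.
    top = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m))
    return top / total

# Invented toy data: 4 models x 3 benchmarks, every column a multiple of one factor.
toy_scores = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12]]
share = top_component_share(toy_scores)  # close to 1.0 for this rank-one toy matrix
```

A share near 1 means one factor accounts for almost all differences between models; a share near 1/m (for m benchmarks of equal variance) means the benchmarks carry largely independent information.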
Abstract
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external…
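The procedure described in the abstract can be sketched as follows: fine-tune every model on a benchmark's training split, score it on the test split, then compare the resulting rankings across benchmarks with a rank correlation. This is a minimal sketch under assumptions, not the authors' implementation. The `fine_tune` and `evaluate` callables, the `"train"`/`"test"` dictionary keys, and the use of Kendall's tau as the agreement measure are all hypothetical choices made for illustration.

```python
from itertools import combinations

def train_before_test(models, benchmark, fine_tune, evaluate):
    """Score each model after identical fine-tuning on the benchmark's training split.

    fine_tune(model, train_split) and evaluate(model, test_split) are
    hypothetical callables supplied by the caller.
    """
    return {
        name: evaluate(fine_tune(model, benchmark["train"]), benchmark["test"])
        for name, model in models.items()
    }

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same model names.

    Tied pairs count as neither concordant nor discordant.
    """
    names = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        s = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / pairs

# Toy stand-ins (invented): a "model" is a number, fine-tuning adds to it,
# and evaluation scales it. Real code would train and score actual models.
toy_models = {"small": 1.0, "large": 2.0}
toy_benchmark = {"train": 0.5, "test": 10}
toy_scores = train_before_test(
    toy_models,
    toy_benchmark,
    fine_tune=lambda model, train: model + train,
    evaluate=lambda model, test: model * test,
)
```

Applying `kendall_tau` to the score dicts of two benchmarks gives a value between -1 (reversed rankings) and 1 (identical rankings), which is one way to quantify how consistent the rankings are.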
Peer Reviews
Decision·ICLR 2026 Oral
The problem addressed by the paper is important. The experiments conducted by the authors are very extensive, comprising a large set of LLMs and benchmarks. The paper is well written and easy to follow. Compared with the NeurIPS submission, the revised framing in terms of model potential is persuasive and clarifies the significance of the experiments and results. I also liked the added sections connecting the work to the scaling-laws literature.
- The method proposed by the authors only works in the narrow setting of LLM evaluation _where subsequent task-specific fine-tuning is guaranteed_. As a result, train-before-test provides little evidence about performance in typical deployment without such fine-tuning. The authors note this in the discussion, but because deployment without task-specific fine-tuning is far more common, this is a substantial limitation and should be stated upfront, in both the abstract and the introduction &
- The paper is generally well motivated, easy to follow, and well written.
- The experimental setup is mostly sound (see weaknesses below), and the paper studies models across many LM families.
- I appreciate the framing of model potential as a means to pick the model that best adapts to a task. The paper would benefit from stating the goal of train-before-test even more explicitly in the abstract: the presented technique is NOT an intrinsic evaluation of models as finished products, but as starting point for
- One limitation of this approach is that it does not provide an estimate of the magnitude of improvement. This could have been achieved by either proposing a way to average scores or using the rankings to determine whether any two models are statistically different. Given the focus on practitioners using this method to choose models for downstream applications, providing a single, easy-to-interpret number is crucial.
- The paper lacks comparison with other techniques that could be used to improve ranki
This is a good paper and merits publication at ICLR. While it is a follow-up to a previous ICLR paper, it adds net new contributions, notably showing how “train on test” fits into the existing language model benchmarking paradigm, which the previous work did not show explicitly. This work also presents an interesting finding about how to unlock the correlation between perplexity and downstream benchmark performance, which has long been a frustrating topic in language modeling research. Experiments are sensible, varie
I think the work is good, so I do not have many critiques. First, the heat map figures for rank correlation are hard to read (Fig 3, 4, 5). Consider using a different color scheme, annotating where readers should pay attention, and making the captions more self-contained so that the figure and caption alone convey the finding without cross-referencing the body text. Given the chosen models, I would’ve liked to have seen more findings that take advantage of: (1) comparing models o
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
