Training on the Test Task Confounds Evaluation and Emergence

Ricardo Dominguez-Olmedo; Florian E. Dorner; Moritz Hardt

arXiv:2407.07890·cs.CL·April 22, 2025·1 cites

TL;DR

This paper highlights how training on evaluation tasks can bias model assessments and the perceived emergence of capabilities, proposing a method to mitigate this effect for fairer benchmarking.

Contribution

It introduces a method to adjust for training on test tasks in evaluations and demonstrates its impact on model performance and emergent behavior claims.

Findings

01

Training on test tasks confounds evaluation results.

02

Adjusting for test task training alters perceived model capabilities.

03

Apparent emergent behaviors diminish when all models are fine-tuned on the same task-relevant data before evaluation.

Abstract

We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of practices that utilize knowledge about evaluation tasks at training time. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for the effect of training on the test task on benchmark evaluations. Put simply, to fine-tune each model under comparison on the same task-relevant data prior to evaluation. We then show that instances of…
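The adjustment described in the abstract (fine-tune every model under comparison on the same task-relevant data, then evaluate) can be illustrated with a toy simulation. This is a minimal sketch and not the authors' code: the scoring function, the `capability` and `task_exposure` fields, the model names, and all numbers are hypothetical, and the actual procedure fine-tunes real language models rather than updating a number.

```python
import math

CHANCE = 0.25  # chance accuracy on a hypothetical 4-way multiple-choice benchmark


def benchmark_score(capability, task_exposure):
    """Toy score: capability only shows up once the model is familiar with the task."""
    familiarity = 1.0 - math.exp(-task_exposure)
    return CHANCE + (1.0 - CHANCE) * capability * familiarity


def fine_tune_on_task(model, extra_exposure=10.0):
    """Stand-in for fine-tuning on task-relevant data: returns a copy with more exposure."""
    return {**model, "task_exposure": model["task_exposure"] + extra_exposure}


# Two hypothetical models with identical capability but different task exposure.
models = {
    "older_model": {"capability": 0.6, "task_exposure": 0.2},
    "newer_model": {"capability": 0.6, "task_exposure": 3.0},
}


def score_gap(models, adjust):
    """Score difference (newer minus older), optionally after the same fine-tuning."""
    scores = {}
    for name, model in models.items():
        if adjust:
            model = fine_tune_on_task(model)
        scores[name] = benchmark_score(model["capability"], model["task_exposure"])
    return scores["newer_model"] - scores["older_model"]


raw_gap = score_gap(models, adjust=False)
adjusted_gap = score_gap(models, adjust=True)
print(f"gap before adjustment: {raw_gap:.3f}")
print(f"gap after adjustment:  {adjusted_gap:.3f}")
```

In this toy setting the two models differ only in task exposure, so the gap seen in the raw scores nearly disappears once both receive the same fine-tuning. That is the sense in which the abstract says training on the test task can confound relative comparisons.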

Peer Reviews

Decision·ICLR 2025 Oral

Reviewer 01·Rating 8·Confidence 3

Strengths

* They test their method on a large set of base models of different sizes. This also allows them to compare scaling on benchmarks.
* The proposed method is promising for better evaluations and comparisons, and it is light on compute, requiring only fine-tuning.

Weaknesses

* They only do so for pre-trained base models. In practical settings, we often look at the benchmarks of fine-tuned chat models, or of models further fine-tuned from base models (e.g. Llama derivatives). It is not clear whether their findings would apply to already fine-tuned models.
* They classify 'old' models as those released before November 2023, which seems somewhat arbitrary. I understand that choosing Nov 2023 reveals some sort of improvement given the same compute, but this could be due to reasons other th

Reviewer 02·Rating 8·Confidence 4

Strengths

- This paper explores an increasingly important area of LLMs: the "science of evaluations."
- It explores the implications of task-specific training for language models, providing evidence that training with task-specific data is beneficial. They find that many models from November 2023 gain advantages from this type of training. Additionally, they show that task-specific training is more likely responsible for recent boosts in model performance when FLOPs are held constant.
- The paper's i

Weaknesses

- How was November 2023 chosen as the cutoff? I wonder what would have happened if you had chosen November 2022. How would the results be affected?
- It is unclear how much hyperparameter tuning might affect results when training on task-specific data. How much could hparam tuning affect the results? How was the initial sweep in Appendix A.2 chosen? Although a full sweep might be expensive, additional clarity on how these were chosen will alleviate my concern.
- Lack of consideration of instruc

Reviewer 03·Rating 8·Confidence 4

Strengths

- The paper is very clearly presented and argues persuasively. The figures are easily understood at a glance.
- The conclusion argued for is surprising and novel, and has broad implications for evaluation methodology of LLMs and emergent behaviors in LLMs.

Weaknesses

- The p-values for significance in Figure 1 overestimate significance because of correlation due to models being from only a few model families.
- The paper never actually directly checks the hypothesis of more in-distribution data being the cause of higher MMLU performance out of the box, despite arguing for it via implication from the fine-tuning result.

Code & Models

Repositories

socialfoundations/training-on-the-test-task
Official

Videos

Training on the Test Task Confounds Evaluation and Emergence·SlidesLive

Taxonomy

Topics·Real-time simulation and control systems

Methods·Sparse Evolutionary Training