Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Linbo Cao; Jinman Zhao

arXiv:2507.17747 · cs.CL · August 11, 2025

TL;DR

This paper introduces a debate-driven evaluation framework that turns existing QA benchmarks into adversarial debates, making language model assessment more robust and less vulnerable to data contamination.

Contribution

The paper presents a pipeline that systematically converts QA tasks into debate-based assessments, and a public benchmark that demonstrates more robust and scalable evaluation of language models.

Findings

01

Debate-based evaluation increases difficulty and penalizes shallow memorization.

02

Models fine-tuned on the test questions perform worse in debates, which suggests the evaluation is robust to data contamination.

03

Even weaker judges can reliably evaluate stronger debaters.

Abstract

As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates--where one model is given the official answer to defend, and another constructs and defends an alternative answer--adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions, complete with…
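The debate setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Model` callable interface, the `run_debate` function, the `DebateRecord` type, the prompt wording, and the A/B label shuffling are all assumptions introduced here. The abstract specifies only that one debater defends the official answer, another constructs and defends an alternative, the exchange runs over multiple rounds, and the judge does not know which answer is correct.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical model interface: prompt text in, reply text out.
Model = Callable[[str], str]


@dataclass
class DebateRecord:
    question: str
    answers: Dict[str, str]   # judge-facing label ("A"/"B") -> answer text
    pro_label: str            # label assigned to the official-answer side
    transcript: List[Tuple[str, str]] = field(default_factory=list)


def _render(transcript: List[Tuple[str, str]]) -> str:
    return "\n".join(f"{label}: {text}" for label, text in transcript)


def run_debate(question: str, official_answer: str, pro: Model, con: Model,
               judge: Model, rounds: int = 2,
               rng: Optional[random.Random] = None) -> Tuple[str, DebateRecord]:
    """Turn one QA item into a debate; return ("pro" or "con", record)."""
    rng = rng or random.Random()
    # Assumption: the con debater is shown the answer it must differ from.
    con_answer = con(f"Question: {question}\n"
                     f"Propose a different answer than: {official_answer}")
    # Assumption: labels are shuffled so neither label nor speaking order
    # reveals which side holds the answer key.
    pro_label, con_label = rng.sample(["A", "B"], 2)
    record = DebateRecord(question,
                          {pro_label: official_answer, con_label: con_answer},
                          pro_label)
    sides = sorted([(pro_label, pro, official_answer),
                    (con_label, con, con_answer)], key=lambda s: s[0])
    for round_no in range(1, rounds + 1):
        for label, model, answer in sides:
            reply = model(f"Question: {question}\nDefend: {answer}\n"
                          f"Debate so far:\n{_render(record.transcript)}\n"
                          f"Round {round_no} argument:")
            record.transcript.append((label, reply))
    # The judge sees both positions and the transcript, but is never told
    # which position came from the dataset.
    verdict = judge(f"Question: {question}\n"
                    f"Position A: {record.answers['A']}\n"
                    f"Position B: {record.answers['B']}\n"
                    f"Transcript:\n{_render(record.transcript)}\n"
                    f"Reply with A or B.")
    winner = "pro" if verdict.strip().upper().startswith(pro_label) else "con"
    return winner, record
```

Passing a seeded `random.Random` makes the label assignment reproducible. Any callable that maps a prompt to a reply, such as a wrapper around an LLM client, can stand in for `pro`, `con`, and `judge`.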

Taxonomy

Topics: Educational Assessment and Pedagogy · Evaluation and Performance Assessment · Machine Learning and Algorithms