Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Ellis Brown; Jihan Yang; Shusheng Yang; Rob Fergus; Saining Xie

arXiv:2511.04655·cs.CV·November 7, 2025

TL;DR

This paper argues that benchmark designers should diagnose and mitigate non-visual biases in multimodal benchmarks by "training on the test set" to reveal exploitable shortcuts, leading to more robust evaluation standards.

Contribution

It introduces a diagnostic framework combining stress-testing and bias pruning to identify and reduce non-visual shortcuts in multimodal benchmarks.

Findings

01 Models can exploit non-visual biases in multimodal benchmarks.

02 Debiasing reduces the accuracy achievable through shortcuts alone.

03 Debiasing widens the gap between vision-enabled and vision-blind performance.

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set": probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test"…
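
The diagnostic principle in the abstract can be illustrated with a small sketch. The abstract is truncated before it describes the actual procedure, so everything below is an assumption: the k-fold split, the naive Bayes text-only model, the toy data, and the names `train_nb`, `predict_nb`, and `blind_cv_accuracy` are hypothetical and are not the authors' implementation. The sketch only shows the general idea: fit a model that never sees images on part of the test set, score it on the held-out part, and treat accuracy well above chance as a sign of non-visual shortcuts.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a tiny multinomial naive Bayes on (question_text, answer) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in text.lower().split():
            token_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, token_counts, vocab


def predict_nb(model, text):
    """Return the most likely answer given only the question text."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        # Laplace smoothing over the training vocabulary.
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for tok in text.lower().split():
            if tok in vocab:
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def blind_cv_accuracy(examples, k=5):
    """k-fold cross-validated accuracy of a text-only model on the test set.

    Assumption: each fold is held out once while the blind model is
    trained on the remaining folds, so every test item is scored by a
    model that never saw it.
    """
    correct = 0
    for fold in range(k):
        held_out = examples[fold::k]
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        model = train_nb(train)
        correct += sum(predict_nb(model, q) == a for q, a in held_out)
    return correct / len(examples)


# Toy "benchmark" (fabricated for illustration only): the question
# wording alone determines the answer, so no image is needed.
biased = (
    [("how many chairs are in the room", "2")] * 10
    + [("is the lamp left or right of the sofa", "left")] * 10
)
print(blind_cv_accuracy(biased, k=5))
```

On this toy set the question wording fully determines the answer, so the blind model should score far above the 50% chance level. A benchmark that genuinely requires the image should keep this number close to chance.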

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques