Pitfalls of Evaluating Language Models with Open Benchmarks

Md. Najib Hasan (1); Md Mahadi Hassan Sibat (2); Mohammad Fakhruddin Babar (3); Souvika Sarkar (1); Monowar Hasan (3); Santu Karmaker (2) ((1) Wichita State University; (2) University of Central Florida; (3) Washington State University)

arXiv:2507.00460·cs.CL·January 8, 2026

TL;DR

This paper highlights the risks of data leakage in open LLM benchmarks, demonstrates how models can cheat, and advocates for more robust evaluation practices to ensure fair and meaningful model assessment.

Contribution

It reveals vulnerabilities in open benchmarks, introduces cheating models to illustrate these issues, and proposes strategies to improve evaluation reliability.

Findings

01. Cheating models excel on the open benchmarks they were fine-tuned on but fail to generalize to comparable unseen test sets.

02. Open benchmarks alone may not reflect real-world model utility.

03. Private and dynamic benchmarks are necessary for trustworthy evaluation.

Abstract

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet this openness also creates substantial risks of data leakage during LM testing, whether deliberate or inadvertent, thereby undermining the fairness and reliability of leaderboard rankings and leaving them vulnerable to manipulation by unscrupulous actors. We illustrate the severity of this issue by intentionally constructing cheating models: smaller variants of BART, T5, and GPT-2, fine-tuned directly on publicly available test sets. As expected, these models excel on the target benchmarks but fail terribly to generalize to comparable unseen testing sets. We then examine task-specific, simple paraphrase-based safeguarding strategies to mitigate…
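The leakage effect described in the abstract can be illustrated with a deliberately simplified toy. This is a sketch under stated assumptions, not the authors' setup: instead of fine-tuning BART, T5, or GPT-2, it uses a hypothetical lookup-table "model" (`MemorizingModel`) that memorizes a leaked test set, and the question-answer pairs are invented purely for illustration. It shows only the qualitative pattern: perfect scores on the leaked set and failure on paraphrased versions of the same questions.

```python
# Toy illustration of test-set leakage. All names and data here are
# hypothetical; this is NOT the paper's experimental setup.

def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


class MemorizingModel:
    """Stand-in for a model that was trained directly on a leaked test set."""

    def __init__(self, fallback="unknown"):
        self.table = {}
        self.fallback = fallback

    def fit(self, examples):
        # "Training" is pure memorization of (question, answer) pairs.
        for question, answer in examples:
            self.table[normalize(question)] = answer

    def predict(self, question):
        # Anything not seen verbatim gets the fallback answer.
        return self.table.get(normalize(question), self.fallback)


def accuracy(model, examples):
    """Fraction of examples whose prediction exactly matches the answer."""
    correct = sum(model.predict(q) == a for q, a in examples)
    return correct / len(examples)


# Invented "open" test set that the cheating model gets to see.
open_test = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "8"),
    ("What gas do plants absorb?", "carbon dioxide"),
]

# Paraphrases of the same questions, standing in for an unseen test set.
paraphrased_test = [
    ("Which city is France's capital?", "Paris"),
    ("A spider has how many legs?", "8"),
    ("Plants take in which gas?", "carbon dioxide"),
]

cheater = MemorizingModel()
cheater.fit(open_test)

leaked_acc = accuracy(cheater, open_test)
unseen_acc = accuracy(cheater, paraphrased_test)

print(f"accuracy on leaked open test set: {leaked_acc:.2f}")
print(f"accuracy on paraphrased test set: {unseen_acc:.2f}")
```

A fine-tuned neural model would degrade less abruptly than a lookup table, so the gap here is exaggerated by design; the point is only that a high score on an exposed test set says little about performance on comparable unseen inputs.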

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Computational and Text Analysis Methods