LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Ming Zhang; Yujiong Shen; Jingyi Deng; Yuhui Wang; Huayu Sha; Kexin Tan; Qiyuan Peng; Yue Zhang; Junzhe Wang; Shichun Liu; Yueyuan Huang; Jingqi Tong; Changhao Jiang; Yilong Wu; Zhihao Zhang; Mingqi Wu; Mingxu Chai; Zhiheng Xi; Shihan Dou; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2508.05452 · cs.CL · April 16, 2026

TL;DR

LLMEval-Fair introduces a dynamic, contamination-resistant evaluation framework for large language models. A 30-month longitudinal study of nearly 60 models reveals capabilities and vulnerabilities that static benchmarks do not detect.

Contribution

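This work presents a dynamic evaluation framework that combines automated integrity checks (contamination-resistant data curation and an anti-cheating architecture), an LLM judge calibrated against human experts, and a 30-month longitudinal study, with the aim of fairer and more robust assessment of LLMs.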

Findings

01 Performance ceiling on knowledge memorization identified

02 Data contamination vulnerabilities exposed in static benchmarks

03 Ranking stability demonstrated over 30 months

Abstract

Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. LLMEval-Fair is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 30-month longitudinal study of nearly 60 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks.…
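The abstract describes two mechanisms that are easy to picture in code: drawing a fresh, unseen test set from the question bank for each evaluation run, and measuring how often the LLM judge agrees with human experts. The following is a minimal sketch under stated assumptions, not the authors' pipeline. The function names `sample_unseen` and `agreement_rate`, the per-model `served` set, and the toy bank of integer IDs are all hypothetical and do not come from the paper or its repository.

```python
import random


def sample_unseen(bank_ids, served_ids, k, seed):
    """Draw k question IDs that have not been served to this model before.

    Hypothetical helper: the paper does not specify its sampling procedure.
    """
    pool = sorted(set(bank_ids) - set(served_ids))
    if k > len(pool):
        raise ValueError("not enough unseen questions left in the bank")
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return rng.sample(pool, k)


def agreement_rate(judge_labels, human_labels):
    """Fraction of items on which the LLM judge and the human expert agree."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy bank of 1000 integer IDs (assumption; the real bank holds 220k questions).
bank = range(1000)
served = set()

run1 = sample_unseen(bank, served, 50, seed=1)
served.update(run1)  # remember what this model has already seen
run2 = sample_unseen(bank, served, 50, seed=2)

# Toy correctness labels from a judge and a human on four answers.
rate = agreement_rate([1, 0, 1, 1], [1, 0, 0, 1])
```

Tracking served IDs per model means no question is repeated across runs for that model, which is one simple way to keep each test set unseen. The agreement figure is plain percent agreement; whether the paper's 90% figure uses this or another statistic is not stated in the abstract.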

Code & Models

Repositories

llmeval/LLMEval-Fair
github

Datasets

llmeval-fdu/LLMEval-Fair
dataset · 444 dl
