CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Jingqian Zhao; Bingbing Wang; Geng Tu; Yice Zhang; Qianlong Wang; Bin Liang; Jing Li; Ruifeng Xu

arXiv:2511.18889·cs.CL·November 25, 2025

TL;DR

CoreEval automatically updates evaluation datasets with recent real-world knowledge so that they resist data contamination, supporting more reliable evaluation of large language models.

Contribution

The paper introduces a contamination-resilient evaluation framework that uses entity extraction and knowledge retrieval to update datasets while preserving their semantic integrity.

Findings

01 CoreEval effectively reduces performance overestimation due to data contamination.

02 The approach enhances dataset relevance and semantic coherence.

03 Experiments show improved evaluation reliability across multiple NLP tasks.

Abstract

Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies attempt to mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant, up-to-date knowledge. The retrieved…
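The first two steps named in the abstract (extracting entity relationships, then retrieving related up-to-date knowledge) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the regex-based extractor, the `Relation` type, the `RECENT_EVENTS` list, and the predicate-matching lookup are hypothetical stand-ins, not the authors' method and not the GDELT API. The names and events are fictional.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A (subject, predicate, object) triple; hypothetical representation."""
    subject: str
    predicate: str
    obj: str


# Toy pattern (assumption): "<Capitalized Name> <verb> <Capitalized Name>".
_NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
_PATTERN = re.compile(rf"({_NAME}) (criticized|praised|met) ({_NAME})")


def extract_relations(text):
    """Extract entity relationships with the toy pattern above."""
    return [Relation(*m.groups()) for m in _PATTERN.finditer(text)]


# In-memory stand-in for a store of recent events (assumption, fictional data).
# A real system would query an external source such as GDELT instead.
RECENT_EVENTS = [
    Relation("Alice Smith", "criticized", "Acme Corp"),
    Relation("Bob Jones", "met", "Carol White"),
]


def retrieve_recent(relation, store):
    """Return stored events that share the relation's predicate (toy retrieval)."""
    return [event for event in store if event.predicate == relation.predicate]


if __name__ == "__main__":
    sample = "John Doe criticized Big Bank yesterday."
    for rel in extract_relations(sample):
        print(rel, "->", retrieve_recent(rel, RECENT_EVENTS))
```

The sketch stops at retrieval because the abstract is truncated at that point; how the retrieved knowledge is used to update a sample is not shown here.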

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification