Search-Time Data Contamination

Ziwen Han; Meher Mankikar; Julian Michael; Zifan Wang

arXiv:2508.13180·cs.AI·August 20, 2025

TL;DR

This paper identifies search-time contamination in evaluating search-based LLM agents, showing how data leaks from online sources like HuggingFace can artificially inflate performance metrics and compromise benchmark validity.

Contribution

It introduces the concept of search-time contamination, demonstrates its impact on benchmark results, and proposes best practices for more trustworthy evaluation of search-based LLM agents.

Findings

01

For approximately 3% of questions, search-based agents retrieve the answer directly from datasets hosted on HuggingFace.

02

Blocking HuggingFace sources reduces accuracy on contaminated questions by about 15%.

03

Public evaluation datasets can be a source of search-time contamination.
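
The two headline quantities above (the share of contaminated questions and the accuracy change when the source is blocked) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `QuestionRun` record, its field names, and the toy data are all hypothetical, and the numbers it produces are unrelated to the paper's results.

```python
from dataclasses import dataclass


@dataclass
class QuestionRun:
    """Hypothetical per-question record; field names are assumptions."""
    question_id: str
    contaminated: bool      # a leaking source appeared in the retrieved results
    correct_open: bool      # graded correct with unrestricted search
    correct_blocked: bool   # graded correct with the leaking source blocked


def contamination_rate(runs):
    """Fraction of questions whose run retrieved a leaking source."""
    return sum(r.contaminated for r in runs) / len(runs)


def accuracy_drop_on_contaminated(runs):
    """Accuracy with open search minus accuracy with blocking,
    measured only on the contaminated subset."""
    subset = [r for r in runs if r.contaminated]
    if not subset:
        return 0.0
    acc_open = sum(r.correct_open for r in subset) / len(subset)
    acc_blocked = sum(r.correct_blocked for r in subset) / len(subset)
    return acc_open - acc_blocked


# Toy data for illustration only.
runs = [
    QuestionRun("q1", True, True, False),
    QuestionRun("q2", True, True, True),
    QuestionRun("q3", False, True, True),
    QuestionRun("q4", False, False, False),
]
rate = contamination_rate(runs)
drop = accuracy_drop_on_contaminated(runs)
```

Restricting the comparison to the contaminated subset matters: averaged over the whole benchmark, a drop concentrated in a few percent of questions would be diluted.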

Abstract

Data contamination refers to the leakage of evaluation data into model training data, resulting in overfitting to supposedly held-out test sets and compromising test validity. We identify an analogous issue, search-time contamination (STC), in evaluating search-based LLM agents, which use tools to gather information from online sources when answering user queries. STC occurs when the retrieval step surfaces a source containing the test question (or a near-duplicate) alongside its answer, enabling agents to copy rather than genuinely infer or reason, undermining benchmark integrity. We find that HuggingFace, an online platform hosting evaluation datasets, appears among retrieved sources in search-based agent logs. Consequently, agents often explicitly acknowledge discovering question-answer pairs from HuggingFace within their reasoning chains. On three commonly used capability benchmarks:…
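
As a rough illustration of how a leaking source might be spotted in, or excluded from, an agent's retrieved results, the sketch below matches URL hosts against a blocklist. It is an assumption-laden example rather than the paper's method: the function names and the `BLOCKED_DOMAINS` tuple are hypothetical, and a domain match only shows that the source appeared, not that the page actually contained the test question and its answer.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; the abstract names HuggingFace as one such source.
BLOCKED_DOMAINS = ("huggingface.co",)


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the URL's host is a blocked domain or one of its subdomains.
    URLs without a scheme have no parsed hostname and are treated as not blocked."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def flag_contaminated_sources(urls):
    """Return the retrieved URLs that point at a blocked domain."""
    return [u for u in urls if is_blocked(u)]


def filter_search_results(urls):
    """Drop blocked URLs before they reach the agent."""
    return [u for u in urls if not is_blocked(u)]


# Made-up URLs for illustration.
retrieved = [
    "https://huggingface.co/datasets/example/benchmark",
    "https://en.wikipedia.org/wiki/Example",
    "https://nothuggingface.co/page",
]
flagged = flag_contaminated_sources(retrieved)
allowed = filter_search_results(retrieved)
```

Matching on the parsed hostname rather than on a substring avoids false positives such as a lookalike domain or a blocked domain name that merely appears in another site's path.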
