Search-Time Data Contamination
Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang

TL;DR
This paper identifies search-time contamination in evaluating search-based LLM agents, showing how data leaks from online sources like HuggingFace can artificially inflate performance metrics and compromise benchmark validity.
Contribution
It introduces the concept of search-time contamination, demonstrates its impact on benchmark results, and proposes best practices for more trustworthy evaluation of search-based LLM agents.
Findings
Approximately 3% of questions were directly retrieved from HuggingFace datasets.
Blocking HuggingFace sources reduces accuracy on contaminated questions by about 15%.
Public evaluation datasets can be a source of search-time contamination.
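The blocking experiment in the findings above can be pictured with a small sketch. This is an illustrative filter under stated assumptions, not the authors' implementation: the blocklist contents, the function names, and the log format (a mapping from question id to the list of retrieved URLs) are all hypothetical.

```python
from urllib.parse import urlparse

# Assumption: the summary names HuggingFace only; extend this set as needed.
BLOCKED_DOMAINS = {"huggingface.co"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    # URLs without a scheme have no parsed hostname and are treated as not blocked.
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_results(urls, blocked=BLOCKED_DOMAINS):
    """Drop blocked sources before the agent sees the search results."""
    return [u for u in urls if not is_blocked(u, blocked)]


def flagged_fraction(logs, blocked=BLOCKED_DOMAINS):
    """Fraction of questions whose retrieved sources include a blocked domain.

    `logs` is a hypothetical mapping: question id -> list of retrieved URLs.
    """
    if not logs:
        return 0.0
    flagged = sum(
        1 for urls in logs.values() if any(is_blocked(u, blocked) for u in urls)
    )
    return flagged / len(logs)
```

A flagged question is only a candidate for contamination: a blocked domain appearing in the logs does not by itself show that the agent copied an answer from it.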
Abstract
Data contamination refers to the leakage of evaluation data into model training data, resulting in overfitting to supposedly held-out test sets and compromising test validity. We identify an analogous issue, search-time contamination (STC), in evaluating search-based LLM agents, which use tools to gather information from online sources when answering user queries. STC occurs when the retrieval step surfaces a source containing the test question (or a near-duplicate) alongside its answer, enabling agents to copy rather than genuinely infer or reason, undermining benchmark integrity. We find that HuggingFace, an online platform hosting evaluation datasets, appears among retrieved sources in search-based agent logs. Consequently, agents often explicitly acknowledge discovering question-answer pairs from HuggingFace within their reasoning chains. On three commonly used capability benchmarks:…
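The STC definition in the abstract (a retrieved source containing the test question, or a near-duplicate, alongside its answer) suggests a simple check. The sketch below is one possible heuristic, not the detection method used in the paper: it measures word n-gram overlap between the question and the retrieved page and requires the reference answer to appear as well. The function names, the n-gram size, and the threshold are all assumptions.

```python
import re


def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(tokens, n):
    """Set of contiguous n-token tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def question_overlap(question, page_text, n=5):
    """Fraction of the question's word n-grams that also occur in the page."""
    q_tokens = normalize(question).split()
    n = max(1, min(n, len(q_tokens)))  # shrink n for very short questions
    q_grams = ngrams(q_tokens, n)
    if not q_grams:
        return 0.0
    p_grams = ngrams(normalize(page_text).split(), n)
    return len(q_grams & p_grams) / len(q_grams)


def looks_contaminated(question, answer, page_text, threshold=0.8, n=5):
    """Heuristic: page holds a near-duplicate of the question and the answer."""
    ans = normalize(answer)
    page = normalize(page_text)
    # Pad with spaces so the answer matches whole tokens, not substrings.
    has_answer = bool(ans) and f" {ans} " in f" {page} "
    return has_answer and question_overlap(question, page_text, n) >= threshold
```

A check like this is crude by design: short answers such as "yes" or a single digit will match many unrelated pages, so flagged cases would still need manual review.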
