The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination
Moritz Staudinger, Wojciech Kusa, Allan Hanbury

TL;DR
This paper systematically analyzes how large language models influence IR benchmark results, revealing an apparent LLM effect that may be confounded by data contamination and pretraining memorization.
Contribution
It provides the first comprehensive meta-analysis of LLM impact on IR benchmarks, highlighting contamination issues and the difficulty in distinguishing genuine progress from memorization.
Findings
Recent systems with LLM components achieve 8.8% higher nDCG@10 on DL20 than the best TREC 2020 result and approximately 20% higher on Robust04 since 2023.
Possible contamination of benchmark data through LLM pretraining complicates the interpretation of reported LLM effectiveness.
Wider confidence intervals make it hard to confirm true methodological advances.
Abstract
Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an "LLM effect": recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However,…
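For readers unfamiliar with the metric behind these percentages, the following is a minimal sketch of nDCG@10 and of a relative gain over a baseline. It is not the authors' evaluation code. It assumes linear gain with a log2 rank discount (one common formulation; an exponential-gain variant also exists), and it assumes the reported percentages are relative rather than absolute differences, which the abstract does not state. The function names and the toy relevance labels are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance labels.

    Assumption: linear gain, log2(rank + 1) discount, ranks starting at 1.
    """
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_gains: relevance labels of the retrieved documents, in rank order.
    judged_gains: relevance labels of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


# Toy query (hypothetical labels, not taken from DL20 or Robust04).
judged = [3, 2, 1, 0]
system_ranking = [2, 3, 0, 1]
score = ndcg_at_k(system_ranking, judged)
```

A benchmark score is usually the mean of this per-query value over all topics, so a relative gain such as the one above would be computed on those means.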
