The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination

Moritz Staudinger; Wojciech Kusa; Allan Hanbury

arXiv:2604.05766·cs.IR·April 8, 2026

TL;DR

This paper systematically analyzes how large language models influence IR benchmark results, revealing an apparent LLM effect that may be confounded by data contamination and pretraining memorization.

Contribution

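The paper presents the first comprehensive meta-analysis of how LLMs affect IR benchmark results, covering 143 publications that report on TREC Robust04 and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark. It highlights contamination issues and the difficulty of distinguishing genuine progress from memorization.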

Findings

01. Recent systems with LLM components achieve 8.8% higher nDCG@10 on DL20 than the best result from TREC 2020, and approximately 20% higher on Robust04 since 2023.

02. Data contamination in the benchmarks complicates the interpretation of LLM effectiveness.

03. Wider confidence intervals make it hard to confirm that the gains reflect true methodological advances.

Abstract

Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an "LLM effect": recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However,…
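For readers unfamiliar with the metric, the sketch below shows one common way to compute nDCG@10 and the kind of relative gain over a baseline that the abstract reports as a percentage. It is an illustration only, not the authors' evaluation code. The function names are hypothetical, the linear-gain DCG variant is an assumption (an exponential-gain variant also exists), and all relevance labels and scores are made up.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels.

    Assumes the linear-gain variant: gain / log2(rank + 1), ranks from 1.
    """
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG.

    ranked_gains: relevance labels of the retrieved documents, in rank order.
    judged_gains: relevance labels of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def relative_gain_pct(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score


# Made-up labels for a single query (not taken from the paper).
judged = [3, 2, 1, 0]
print(ndcg_at_k([3, 2, 1, 0], judged))  # ideal ordering
print(ndcg_at_k([0, 1, 2, 3], judged))  # reversed ordering scores lower

# Made-up scores: a relative gain is always stated against a chosen baseline.
print(relative_gain_pct(0.55, 0.50))
```

Because the gain is relative, the same absolute score looks more or less impressive depending on which baseline it is compared against, which is why baseline strength matters in a meta-analysis of this kind.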
