Redefining Retrieval Evaluation in the Era of LLMs
Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri

TL;DR
This paper introduces UDCG, a new retrieval evaluation metric tailored for LLM-based systems, addressing limitations of traditional IR metrics by considering both relevant and distracting documents to better predict RAG performance.
Contribution
The paper proposes a utility-based annotation schema and a novel metric, UDCG, that better aligns retrieval evaluation with LLM consumption and improves correlation with answer accuracy.
Findings
UDCG correlates with answer accuracy up to 36% better than traditional IR metrics do.
Traditional IR metrics poorly predict RAG performance in LLM settings.
The new metric accounts for both relevant and distracting documents in evaluation.
Abstract
Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact…
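The two misalignments described in the abstract can be illustrated with a small sketch. The `dcg` and `ndcg` functions below use the standard log2 position discount. The `signed_utility_mean` function is a hypothetical stand-in introduced only for illustration: it is not the paper's UDCG formula, which is not given in this summary. It simply averages signed per-passage utilities (positive for helpful passages, negative for distracting ones) with no position discount. The utility values in the usage lines are made-up numbers.

```python
import math


def dcg(gains):
    """Standard DCG: the gain at 1-based rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def signed_utility_mean(utilities):
    """Hypothetical, position-agnostic score (NOT the paper's UDCG):
    the mean of signed per-passage utilities, where negative values
    stand for distracting passages."""
    return sum(utilities) / len(utilities) if utilities else 0.0


# Misalignment 1: position discount.
# nDCG halves the score when the only relevant passage moves from rank 1 to rank 3.
print(ndcg([1, 0, 0]))  # 1.0
print(ndcg([0, 0, 1]))  # 0.5

# Misalignment 2: relevance vs. utility.
# With binary relevance, a distractor and a harmless passage both get gain 0,
# so nDCG cannot tell these two result lists apart.
harmless = [1.0, 0.0, 0.0]     # illustrative utilities: one helpful, two neutral
distracting = [1.0, 0.0, -0.5]  # illustrative utilities: one helpful, one distractor
print(signed_utility_mean(harmless) > signed_utility_mean(distracting))  # True
```

The signed mean is order-invariant by construction, which mirrors the abstract's point that an LLM consumes the retrieved set as a whole; any real metric would need empirically grounded utilities rather than the placeholder numbers used here.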
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems
