Redefining Retrieval Evaluation in the Era of LLMs

Giovanni Trappolini; Florin Cuconasu; Simone Filice; Yoelle Maarek; Fabrizio Silvestri

arXiv:2510.21440·cs.CL·October 27, 2025

TL;DR

This paper introduces UDCG, a new retrieval evaluation metric tailored for LLM-based systems, addressing limitations of traditional IR metrics by considering both relevant and distracting documents to better predict RAG performance.

Contribution

The paper proposes a utility-based annotation schema and a novel metric, UDCG, that better aligns retrieval evaluation with LLM consumption and improves correlation with answer accuracy.

Findings

1. UDCG correlates up to 36% better with answer accuracy than traditional IR metrics.

2. Traditional IR metrics are poor predictors of RAG performance when the retrieved documents are consumed by an LLM.

3. UDCG accounts for both relevant and distracting documents.

Abstract

Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact…
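To make the position-discount assumption concrete, here is a minimal sketch of two of the traditional metrics named above, nDCG (in its common log2-discount form) and reciprocal rank (the per-query quantity that MRR averages). It illustrates the classical metrics only and is not the paper's UDCG; the function names and the toy relevance lists are assumptions chosen for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with the standard log2 position discount."""
    # Rank 1 has discount log2(2) = 1; lower ranks contribute progressively less.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances):
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


# Hypothetical rankings: 1 = relevant, 0 = not relevant.
relevant_first = [1, 0, 0]
relevant_last = [0, 0, 1]

print(ndcg(relevant_first))            # 1.0
print(ndcg(relevant_last))             # 0.5
print(reciprocal_rank(relevant_last))  # 1/3
```

Two properties of this sketch mirror the misalignments the abstract describes. The same relevant document is worth half as much at rank 3 as at rank 1, which encodes a human reading order. And a document labeled 0 contributes nothing rather than a penalty, so `ndcg([1, 0, 0, 0])` equals `ndcg([1])`: a distracting passage cannot lower the score.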

Code & Models

Datasets

florin-hf/kilt_corpus_wiki_dump2019 (dataset, 43 downloads)

Videos

Redefining Retrieval Evaluation in the Era of LLMs (Underline)

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems