A Comparison of Methods for Evaluating Generative IR
Negar Arabzadeh, Charles L. A. Clarke

TL;DR
This paper compares various evaluation methods for generative information retrieval systems, emphasizing the use of LLM-generated labels and assessing their alignment with human judgments across multiple tasks.
Contribution
It introduces and validates several evaluation approaches for Gen-IR, focusing on LLM-based labels and their effectiveness compared to human assessments.
Findings
LLM-based evaluation methods can approximate human judgments
Different evaluation strategies vary in autonomy and auditability
Validation across TREC tasks shows promising results for some methods
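As an illustration of what "approximating human judgments" can mean in practice, one common way to compare two sets of labels is Cohen's kappa, which corrects raw agreement for chance. This summary does not state which agreement measures the paper itself uses, so the sketch below is an assumption, and the function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both sources.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both sources pick the same label independently.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data (not from the paper): 1 = relevant, 0 = not relevant.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa(human_labels, llm_labels)
print(f"kappa = {kappa:.2f}")
```

On this toy data the two label sets agree on 6 of 8 items, and with balanced labels the chance agreement is 0.5, so kappa comes out at 0.5. A value near 1 would indicate that the LLM labels track the human labels closely, while a value near 0 would indicate agreement no better than chance.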
Abstract
Information retrieval systems increasingly incorporate generative components. For example, in a retrieval augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, a large language model (LLM) might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human…
Taxonomy
Topics: Smart Systems and Machine Learning · Machine Learning and ELM
Methods: Focus · ALIGN
