Generative Information Retrieval Evaluation
Marwah Alaofi, Negar Arabzadeh, Charles L. A. Clarke, Mark Sanderson

TL;DR
This paper explores the evolving landscape of generative information retrieval evaluation, emphasizing the role of large language models in assessment, the challenges of evaluating GenIR systems, and the need to balance automated and human evaluations.
Contribution
It provides a comprehensive review of LLM-based evaluation methods for IR and proposes approaches to address circularity and maintain human-grounded assessment.
Findings
LLMs may outperform crowdworkers and other paid assessors on basic relevance judgment tasks
Evaluation of GenIR systems can be viewed as 'slow search'
Human assessment remains essential despite automation
Abstract
In this chapter, we consider generative information retrieval evaluation from two distinct but interrelated perspectives. First, large language models (LLMs) themselves are rapidly becoming tools for evaluation, with current research indicating that LLMs may be superior to crowdsource workers and other paid assessors on basic relevance judgement tasks. We review past and ongoing related research, including speculation on the future of shared task initiatives, such as TREC, and a discussion on the continuing need for human assessments. Second, we consider the evaluation of emerging LLM-based generative information retrieval (GenIR) systems, including retrieval augmented generation (RAG) systems. We consider approaches that focus both on the end-to-end evaluation of GenIR systems and on the evaluation of a retrieval component as an element in a RAG system. Going forward, we expect the…
Taxonomy
Topics: Information Retrieval and Search Behavior
Methods: Attention Is All You Need · Dropout · Adam · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Weight Decay · Byte Pair Encoding · Dense Connections
