Generative Information Retrieval Evaluation

Marwah Alaofi; Negar Arabzadeh; Charles L. A. Clarke; Mark Sanderson

arXiv:2404.08137 · cs.IR · January 31, 2025 · 2 cites

Open Access

TL;DR

This paper explores the evolving landscape of generative information retrieval evaluation, emphasizing the role of large language models in assessment, the challenges of evaluating GenIR systems, and the need to balance automated and human evaluations.

Contribution

It provides a comprehensive review of LLM-based evaluation methods for IR and proposes approaches to address circularity and maintain human-grounded assessment.

Findings

01. LLMs may outperform crowdworkers in relevance judgments

02. Evaluation of GenIR systems can be viewed as 'slow search'

03. Human assessment remains essential despite automation

Abstract

In this chapter, we consider generative information retrieval evaluation from two distinct but interrelated perspectives. First, large language models (LLMs) themselves are rapidly becoming tools for evaluation, with current research indicating that LLMs may be superior to crowdsource workers and other paid assessors on basic relevance judgement tasks. We review past and ongoing related research, including speculation on the future of shared task initiatives, such as TREC, and a discussion on the continuing need for human assessments. Second, we consider the evaluation of emerging LLM-based generative information retrieval (GenIR) systems, including retrieval augmented generation (RAG) systems. We consider approaches that focus both on the end-to-end evaluation of GenIR systems and on the evaluation of a retrieval component as an element in a RAG system. Going forward, we expect the…

Taxonomy

Topics: Information Retrieval and Search Behavior

Methods: Attention Is All You Need · Dropout · Adam · Linear Layer · Linear Warmup With Linear Decay · Layer Normalization · Weight Decay · Byte Pair Encoding · Dense Connections