A Workbench for Autograding Retrieve/Generate Systems

Laura Dietz

arXiv:2405.13177·cs.IR·May 24, 2024

TL;DR

This paper introduces a workbench for evaluating LLM-based retrieve/generate IR systems with three LLM-assisted methods: direct relevance judgments, nugget (key fact) coverage, and exam question answering. It targets the limits of traditional passage-level judgments, which do not transfer well to the diverse responses such systems generate.

Contribution

It presents an evaluation workbench that uses LLMs to assess IR system responses in three ways: asking whether a response is relevant, checking which nuggets it covers, and checking which exam questions can be answered from it.

Findings

01 LLMs can effectively judge response relevance.

02 The workbench enables development of new test collections.

03 Evaluation methods impact system ranking and development.

Abstract

This resource paper addresses the challenge of evaluating Information Retrieval (IR) systems in the era of autoregressive Large Language Models (LLMs). Traditional methods relying on passage-level judgments are no longer effective due to the diversity of responses generated by LLM-based systems. We provide a workbench to explore several alternative evaluation approaches to judge the relevance of a system's response that incorporate LLMs:

1. Asking an LLM whether the response is relevant;
2. Asking the LLM which set of nuggets (i.e., relevant key facts) is covered in the response;
3. Asking the LLM to answer a set of exam questions with the response.

This workbench aims to facilitate the development of new, reusable test collections. Researchers can manually refine sets of nuggets and exam questions, observing their impact on system evaluation and leaderboard rankings. Resource…
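
As an illustration of the nugget-based approach and of how refining a nugget set can shift a leaderboard, here is a minimal sketch in Python. It is not the workbench's actual API: the names `keyword_judge`, `nugget_coverage`, and `leaderboard`, the toy responses, and the nugget sets are all hypothetical, and the judge is a plain substring match standing in for an LLM prompt. The same structure would apply to exam questions by swapping in a judge that checks whether a question can be answered from the response.

```python
from typing import Callable, Dict, List

Judge = Callable[[str, str], bool]


def keyword_judge(nugget: str, response: str) -> bool:
    """Hypothetical stand-in for an LLM judge (assumption, not the paper's method).

    Returns True if the nugget text appears in the response. A real setup
    would prompt an LLM to decide whether the response covers the nugget.
    """
    return nugget.lower() in response.lower()


def nugget_coverage(response: str, nuggets: List[str],
                    judge: Judge = keyword_judge) -> float:
    """Fraction of nuggets that the judge marks as covered by the response."""
    if not nuggets:
        return 0.0
    return sum(judge(n, response) for n in nuggets) / len(nuggets)


def leaderboard(responses: Dict[str, str], nuggets: List[str],
                judge: Judge = keyword_judge) -> List[str]:
    """System names sorted by descending coverage; ties are broken by name."""
    scores = {name: nugget_coverage(text, nuggets, judge)
              for name, text in responses.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


# Toy data, invented for illustration only.
responses = {
    "system_a": "Photosynthesis converts sunlight into glucose.",
    "system_b": "Photosynthesis uses chlorophyll and releases oxygen.",
}
nuggets_v1 = ["sunlight", "glucose", "oxygen"]
nuggets_v2 = ["chlorophyll", "oxygen"]  # a manually refined nugget set

print(leaderboard(responses, nuggets_v1))
print(leaderboard(responses, nuggets_v2))
```

With these toy inputs the two nugget sets rank the systems in opposite orders, which is the kind of effect the abstract describes researchers observing when they refine nuggets or exam questions.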
