Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I.

Harrie Oosterhuis; Rolf Jagerman; Zhen Qin; Xuanhui Wang; Michael Bendersky

arXiv:2407.02464·cs.IR·July 3, 2024

TL;DR

This paper introduces methods to generate reliable confidence intervals for IR evaluation metrics using generative AI annotations, addressing errors and variability with theoretical guarantees, thus enabling cost-effective and dependable IR system assessment.

Contribution

It proposes two novel methods leveraging prediction-powered inference and conformal risk control to produce reliable confidence intervals for IR evaluation metrics based on AI-generated relevance annotations.
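
As a rough illustration of the idea behind prediction-powered inference, the sketch below builds a confidence interval for a mean from a small set of human labels and a large set of LLM labels. It is the generic textbook estimator with a normal approximation, not necessarily the exact procedure used in the paper, and it targets a simple per-item mean (for example, mean relevance) rather than a full ranking metric. The function name `ppi_mean_ci` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(human_labels, llm_on_labeled, llm_on_unlabeled, alpha=0.05):
    """Sketch of a prediction-powered CI for a mean (normal approximation).

    human_labels:     human annotations on a small labeled sample
    llm_on_labeled:   LLM annotations for those same items (same order)
    llm_on_unlabeled: LLM annotations for a large sample without human labels
    Returns (lower, estimate, upper).
    """
    n = len(human_labels)
    big_n = len(llm_on_unlabeled)

    # Rectifier: average gap between human and LLM labels on the labeled
    # sample. It corrects the systematic error of the LLM annotations.
    residuals = [y - f for y, f in zip(human_labels, llm_on_labeled)]

    # Point estimate: LLM-only mean plus the estimated correction.
    estimate = fmean(llm_on_unlabeled) + fmean(residuals)

    # The two terms come from independent samples, so their variances add.
    std_err = math.sqrt(
        variance(llm_on_unlabeled) / big_n + variance(residuals) / n
    )

    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate, estimate + z * std_err
```

The interval narrows when the LLM labels track the human labels closely (small residual variance) and widens when they disagree, which is the behavior the summary describes as reflecting both variance and bias in the generated annotations.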

Findings

1. Confidence intervals accurately reflect variance and bias in LLM-based annotations

2. Conformal risk control method adapts CIs per query and document for ranking metrics

3. Methods outperform empirical bootstrapping in capturing evaluation uncertainty

Abstract

The traditional evaluation of information retrieval (IR) systems is generally very costly as it requires manual relevance annotation from human experts. Recent advancements in generative artificial intelligence -- specifically large language models (LLMs) -- can generate relevance annotations at an enormous scale with relatively small computational costs. Potentially, this could alleviate the costs traditionally associated with IR evaluation and make it applicable to numerous low-resource applications. However, generated relevance annotations are not immune to (systematic) errors, and as a result, directly using them for evaluation produces unreliable results. In this work, we propose two methods based on prediction-powered inference and conformal risk control that utilize computer-generated relevance annotations to place reliable confidence intervals (CIs) around IR evaluation…