Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I.
Harrie Oosterhuis, Rolf Jagerman, Zhen Qin, Xuanhui Wang, Michael Bendersky

TL;DR
This paper introduces methods to generate reliable confidence intervals for IR evaluation metrics using generative AI annotations, addressing errors and variability with theoretical guarantees, thus enabling cost-effective and dependable IR system assessment.
Contribution
It proposes two novel methods leveraging prediction-powered inference and conformal risk control to produce reliable confidence intervals for IR evaluation metrics based on AI-generated relevance annotations.
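To make the prediction-powered inference idea concrete, here is a minimal sketch of the generic prediction-powered estimate of a mean with a normal-approximation confidence interval. It is not the paper's exact procedure: the function name `ppi_mean_ci`, its arguments, and the synthetic data are all assumptions made for illustration. The idea is to average the LLM labels over a large unlabeled pool, then correct that average with the mean difference between human and LLM labels on a small human-annotated sample.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(human_labeled, llm_on_labeled, llm_on_unlabeled, alpha=0.05):
    """Generic prediction-powered CI for a mean (illustrative sketch).

    human_labeled    : human relevance labels on the small annotated sample
    llm_on_labeled   : LLM labels for the same annotated items
    llm_on_unlabeled : LLM labels for the large pool without human labels
    Returns (lower, estimate, upper).
    """
    n = len(human_labeled)
    big_n = len(llm_on_unlabeled)
    # Rectifier: how far the LLM labels are from the human labels.
    rectifier = [y - f for y, f in zip(human_labeled, llm_on_labeled)]
    # LLM-only mean on the large pool, corrected by the average rectifier.
    estimate = fmean(llm_on_unlabeled) + fmean(rectifier)
    # Variance of the sum of two independent sample means.
    std_err = math.sqrt(variance(llm_on_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate, estimate + z * std_err


# Synthetic data (assumption): binary relevance, with an LLM that
# flips 20% of non-relevant items to relevant (a systematic bias).
rng = random.Random(0)


def noisy_llm(label):
    return 1 if (label == 1 or rng.random() < 0.2) else 0


human = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
llm_labeled = [noisy_llm(y) for y in human]
hidden_truth = [1 if rng.random() < 0.3 else 0 for _ in range(5000)]
llm_unlabeled = [noisy_llm(y) for y in hidden_truth]

lower, estimate, upper = ppi_mean_ci(human, llm_labeled, llm_unlabeled)
print(f"PPI estimate {estimate:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

In this sketch the interval width is driven by two terms: the variance of the LLM labels on the large pool, which shrinks as the pool grows, and the variance of the human-versus-LLM differences, which shrinks as the LLM labels become more accurate or the annotated sample becomes larger.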
Findings
Confidence intervals accurately reflect variance and bias in LLM-based annotations
Conformal risk control method adapts CIs per query and document for ranking metrics
Methods outperform empirical bootstrapping in capturing evaluation uncertainty
Abstract
The traditional evaluation of information retrieval (IR) systems is generally very costly as it requires manual relevance annotation from human experts. Recent advancements in generative artificial intelligence -- specifically large language models (LLMs) -- can generate relevance annotations at an enormous scale with relatively small computational costs. Potentially, this could alleviate the costs traditionally associated with IR evaluation and make it applicable to numerous low-resource applications. However, generated relevance annotations are not immune to (systematic) errors, and as a result, directly using them for evaluation produces unreliable results. In this work, we propose two methods based on prediction-powered inference and conformal risk control that utilize computer-generated relevance annotations to place reliable confidence intervals (CIs) around IR evaluation…
