Uncertainty Quantification in Retrieval Augmented Question Answering

Laura Perez-Beltrachini; Mirella Lapata

arXiv:2502.18108·cs.CL·September 12, 2025

TL;DR

This paper quantifies the uncertainty of a retrieval-augmented question answering model by predicting how useful each retrieved passage is, which gives an efficient estimate of whether the answer is correct.

Contribution

It proposes a lightweight neural model to estimate passage utility, outperforming traditional information-theoretic metrics and matching sampling-based methods.

Findings

01

Neural passage utility prediction correlates with answer correctness.

02

The approach outperforms simple metrics in estimating passage usefulness.

03

Efficiently approximates more expensive sampling methods.

Abstract

Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful at answering correctly. In this work, we propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at https://github.com/lauhaide/ragu.
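To make the comparison in the abstract concrete, here is a minimal sketch of one simple information-theoretic baseline (the mean negative log-likelihood of the answer tokens) and of how an uncertainty score can be evaluated against answer correctness with AUROC. The function names, the choice of this particular baseline, and the toy numbers are assumptions for illustration; they are not taken from the paper or its repository.

```python
import math


def mean_token_nll(token_probs):
    """Mean negative log-likelihood of the answer tokens (higher = more uncertain)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def auroc(uncertainty_scores, is_incorrect):
    """Probability that a randomly chosen incorrect answer receives a higher
    uncertainty score than a randomly chosen correct one (ties count as 0.5)."""
    wrong = [s for s, y in zip(uncertainty_scores, is_incorrect) if y == 1]
    right = [s for s, y in zip(uncertainty_scores, is_incorrect) if y == 0]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))


# Hypothetical per-token probabilities for four generated answers.
answers = [
    [0.90, 0.95, 0.80],  # confident answer
    [0.30, 0.40, 0.20],  # hesitant answer
    [0.85, 0.70, 0.90],  # confident answer
    [0.50, 0.20, 0.30],  # hesitant answer
]
is_incorrect = [0, 1, 0, 1]  # hypothetical correctness labels (1 = wrong answer)

scores = [mean_token_nll(a) for a in answers]
print(auroc(scores, is_incorrect))
```

Any other uncertainty estimate, including a learned passage utility score converted to an uncertainty, can be passed to `auroc` in the same way, which is what makes the methods directly comparable.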

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The authors ran experiments with two QA models, and in many of the settings the utility ranker outperforms existing metrics at estimating uncertainty from the retrieved passages. The experimental results also suggest that the proposed method is robust on OOD datasets that the ranker was not trained on.

Weaknesses

- Many of the notations are unclear; see Questions.
- The QA models used for evaluation are limited to Gemma2-9b-instruct and Llama3.1-8b-instruct, which are of similar size. More experiments should be done with models of various sizes to see whether similar conclusions still hold.
- For Llama3.1-8b-instruct, the results in Table 3 seem to suggest that the Utility Ranker does no better than just looking at the probability of generating "True" as the next token. Is the training of this ranker real

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- The approach of training a separate, smaller neural network to predict passage utility scores is novel. The construction of the data and loss for the scoring model, using entailment and accuracy, is also interesting and original.
- The paper provides an efficient way to predict the error rate at the example level, which could be very useful for latency-sensitive systems when making a triggering decision for question answering.
- The overall flow of the paper is good, it is succinctly written, a

Weaknesses

- One strong shortcoming of this approach arises when multiple passages are needed to answer the question correctly, i.e. with multihop reasoning. In such cases, the utility of each passage in isolation could be low, which would hurt the error prediction. Most of the baselines that use the entire passage set would be robust to this.
- The modeling of the utility scores used to create the ranking dataset has room for improvement. The scores could have smoother accuracy or entailment values instead of
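The multihop concern can be illustrated with a toy sketch. It assumes, purely for illustration, that answer-level uncertainty is derived from the single most useful passage (one minus the maximum per-passage utility). The aggregation actually used in the paper is not stated on this page, and the utility values below are hypothetical.

```python
def answer_uncertainty(passage_utilities):
    """Toy aggregation (an assumption, not the paper's method): the answer is
    treated as only as reliable as the most useful single passage."""
    return 1.0 - max(passage_utilities)


# Hypothetical per-passage utility scores in [0, 1].
single_hop = [0.92, 0.10, 0.05]  # one passage contains the full answer
multi_hop = [0.30, 0.25, 0.05]   # the answer needs two passages combined

print(answer_uncertainty(single_hop))
print(answer_uncertainty(multi_hop))
```

Under this assumption the multihop question is flagged as highly uncertain even if the two partially useful passages together support a correct answer, which is the failure mode the reviewer describes.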

Reviewer 03 · Rating 3 · Confidence 4

Strengths

This work presents a method for uncertainty estimation in retrieval-based QA. Their method trains a separate smaller LM to estimate uncertainty in the base QA system's predictions based on a passage, question, and predicted answer. This system is trained on

Weaknesses

## Related Work + Baselines

Similar methods that use small, additional trained models to estimate uncertainty have been proposed by [1] and [2] ([1] is referenced in related work, but not compared against). Additionally, [3] has also noted the overlap between this passage utility / calibration task and similarly uses pretrained NLI models to verify / estimate uncertainty in QA system predictions. Given the similarity of these methods, they are important points of comparison to understand how th

Code & Models

Repositories

lauhaide/ragu
PyTorch · Official

Taxonomy

Methods · Sparse Evolutionary Training