Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain

Shintaro Ozaki; Yuta Kato; Siyuan Feng; Masayo Tomita; Kazuki Hayashi; Wataru Hashimoto; Ryoma Obara; Masafumi Oyamada; Katsuhiko Hayashi; Hidetaka Kamigaito; Taro Watanabe

arXiv:2412.20309 · cs.CL · August 20, 2025

TL;DR

This study investigates how Retrieval Augmented Generation (RAG) influences the confidence levels of Large Language Models in the medical domain, analyzing whether models can assess the relevance of retrieved documents through output probabilities.

Contribution

It provides an empirical analysis of confidence mechanisms in RAG models within the medical field, highlighting their ability to judge document relevance based on output probabilities.

Findings

01

Certain models can assess the relevance of retrieved documents.

02

Model confidence correlates with answer correctness.

03

Evaluation metrics like calibration error and entropy are effective.

Abstract

Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to improve the accuracy of responses to queries. The approach is widely applied across fields because it can inject the most up-to-date information, and researchers are working to understand and improve this capability to unlock the full potential of RAG in high-stakes applications. However, despite RAG's potential to address these needs, the mechanisms behind the confidence levels of its outputs remain underexplored. Our study focuses on the impact of RAG, specifically examining whether RAG improves the confidence of LLM outputs in the medical domain. We conduct this analysis across various configurations and models. We evaluate confidence by treating the model's predicted probability as its output and calculating several evaluation…
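
The abstract is cut off before it names the metrics, but the Findings above mention calibration error and entropy. The following is a minimal sketch of how those two quantities are commonly computed from predicted probabilities. It assumes a multiple-choice setting, equal-width confidence bins, and natural-log entropy; the function names and the bin count are illustrative choices, not taken from the paper or its repository.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one distribution over answer options."""
    # Zero-probability options contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not specified in the source
) -> float:
    """Expected calibration error with equal-width bins over [0, 1].

    confidences[i] is the probability assigned to the predicted answer,
    correct[i] says whether that answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the paper.
    print(entropy([0.25, 0.25, 0.25, 0.25]))
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False]))
```

A lower calibration error means the predicted probability tracks how often the answer is correct, and a lower entropy means the probability mass is concentrated on fewer options.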

Code & Models

Repositories

naist-nlp/CC_RAG (PyTorch, official)

Taxonomy

Topics: Recommender Systems and Techniques · Intelligent Tutoring Systems and Adaptive Learning

Methods: Attention Is All You Need · Byte Pair Encoding · Attention Dropout · Linear Layer · Softmax · Dense Connections · Linear Warmup With Linear Decay · Dropout · WordPiece