CARE-RAG - Clinical Assessment and Reasoning in RAG

Deepthi Potluri; Aby Mammen Mathew; Jeffrey B DeWitt; Alexander L. Rasgon; Yide Hao; Junyuan Hong; Ying Ding

arXiv:2511.15994 · cs.AI · November 21, 2025

TL;DR

This paper investigates the challenges of ensuring correct reasoning in retrieval-augmented language models within clinical contexts, proposing an evaluation framework to measure reasoning accuracy, consistency, and fidelity.

Contribution

It introduces a novel evaluation framework for assessing reasoning in RAG models in clinical settings, highlighting the importance of rigorous reasoning assessment for safe deployment.

Findings

1. Errors persist despite authoritative retrieval.
2. Retrieval-augmented generation can constrain outputs.
3. Rigorous reasoning evaluation is essential for safety.

Abstract

Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.

Datasets

Kafoo/therascribe-gold-1M-with-images (dataset · 22 downloads)

Taxonomy

Topics: Topic Modeling · Neurobiology of Language and Bilingualism · Artificial Intelligence in Healthcare and Education