CARE-RAG - Clinical Assessment and Reasoning in RAG
Deepthi Potluri, Aby Mammen Mathew, Jeffrey B DeWitt, Alexander L. Rasgon, Yide Hao, Junyuan Hong, Ying Ding

TL;DR
This paper investigates the challenges of ensuring correct reasoning in retrieval-augmented language models within clinical contexts, proposing an evaluation framework to measure reasoning accuracy, consistency, and fidelity.
Contribution
The paper introduces an evaluation framework for assessing reasoning in RAG models in clinical settings, using Written Exposure Therapy (WET) guidelines as a testbed, and highlights the importance of rigorous reasoning assessment for safe deployment.
Findings
Model errors persist even when authoritative passages are provided.
Retrieval-augmented generation (RAG) can constrain outputs, but it does not by itself guarantee correct reasoning.
Safe deployment requires assessing reasoning as rigorously as retrieval.
Abstract
Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. In evaluating model responses to curated clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures accuracy, consistency, and fidelity of reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
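The abstract names three quantities (accuracy, consistency, and fidelity) without defining them here. As a rough illustration only, the sketch below shows one way such scores could be computed over repeated model runs. Everything in it is an assumption and not the authors' method: the `Trial` record, its field names, and the specific definitions (exact match against a clinician-vetted answer for accuracy, majority agreement across repeated runs for consistency, and citing only supplied passages for fidelity) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One model run on one question (hypothetical record layout)."""
    question_id: str
    answer: str            # model's final answer label
    gold: str              # clinician-vetted reference answer
    cited: frozenset       # passage ids the model says it relied on
    provided: frozenset    # passage ids actually supplied by retrieval


def accuracy(trials):
    """Fraction of runs whose answer exactly matches the reference."""
    return sum(t.answer == t.gold for t in trials) / len(trials)


def consistency(trials):
    """Mean, over questions, of the share of runs giving the modal answer."""
    by_question = {}
    for t in trials:
        by_question.setdefault(t.question_id, []).append(t.answer)
    shares = [
        Counter(answers).most_common(1)[0][1] / len(answers)
        for answers in by_question.values()
    ]
    return sum(shares) / len(shares)


def fidelity(trials):
    """Fraction of runs that cite at least one passage and only supplied ones."""
    grounded = sum(bool(t.cited) and t.cited <= t.provided for t in trials)
    return grounded / len(trials)


# Toy data: question q1 is asked three times, q2 twice.
P = frozenset({"wet-1", "wet-2"})
trials = [
    Trial("q1", "A", "A", frozenset({"wet-1"}), P),
    Trial("q1", "A", "A", frozenset({"wet-2"}), P),
    Trial("q1", "B", "A", frozenset({"other-9"}), P),  # cites an unsupplied passage
    Trial("q2", "C", "C", frozenset({"wet-1"}), P),
    Trial("q2", "C", "C", frozenset({"wet-1", "wet-2"}), P),
]

scores = {
    "accuracy": accuracy(trials),
    "consistency": consistency(trials),
    "fidelity": fidelity(trials),
}
print(scores)
```

Keeping the three scores separate, instead of folding them into one number, reflects the abstract's point that a model can be given the right passages and still reason incorrectly: a run can be accurate yet ungrounded, or grounded yet inconsistent across repeats.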
Taxonomy
Topics: Topic Modeling · Neurobiology of Language and Bilingualism · Artificial Intelligence in Healthcare and Education
