CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

Yuexi Du; Jinglu Wang; Shujie Liu; Nicha C. Dvornek; Yan Lu

arXiv:2603.01607·cs.AI·March 12, 2026

TL;DR

CARE introduces a modular, evidence-grounded framework for multi-modal medical reasoning that enhances accuracy and accountability by mimicking clinical workflows and explicitly utilizing localized evidence.

Contribution

The paper proposes a novel agentic framework with decoupled modules and reinforcement learning to improve medical reasoning accuracy and accountability over existing end-to-end models.

Findings

01. CARE-Flow improves accuracy by 10.9% over SOTA.

02. CARE-Coord outperforms heavily pre-trained models by 5.2%.

03. Explicit evidence and modular design enhance trust and correctness.

Abstract

Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence;…
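The staged decomposition described in the abstract (entity proposal, then pixel-level localization, then reasoning over the localized evidence, as Reviewer 01 also summarizes it) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`run_pipeline`, `Evidence`, `GroundedAnswer`, and the `toy_*` stand-ins) is hypothetical, and the three stages are replaced by trivial rule-based functions so that the control flow runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical representation: a mask is a list of (row, col) pixel coordinates.
Mask = List[Tuple[int, int]]
Image = List[List[int]]


@dataclass
class Evidence:
    entity: str  # entity name proposed in stage 1
    mask: Mask   # pixel-level ROI produced in stage 2


@dataclass
class GroundedAnswer:
    answer: str
    evidence: List[Evidence]  # returned with the answer so it can be audited


def run_pipeline(
    image: Image,
    question: str,
    propose: Callable[[Image, str], List[str]],
    segment: Callable[[Image, str], Mask],
    reason: Callable[[Image, str, List[Evidence]], str],
) -> GroundedAnswer:
    """Chain the three stages and keep the evidence alongside the answer."""
    entities = propose(image, question)            # stage 1: hypothesize
    evidence = []
    for entity in entities:
        mask = segment(image, entity)              # stage 2: localize
        if mask:                                   # drop entities with no ROI
            evidence.append(Evidence(entity, mask))
    answer = reason(image, question, evidence)     # stage 3: grounded reasoning
    return GroundedAnswer(answer, evidence)


# Toy stand-ins for the three models (assumptions, for illustration only).
def toy_propose(image: Image, question: str) -> List[str]:
    return ["lesion"] if "lesion" in question.lower() else []


def toy_segment(image: Image, entity: str) -> Mask:
    if entity != "lesion":
        return []
    return [(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value > 0]


def toy_reason(image: Image, question: str, evidence: List[Evidence]) -> str:
    return "yes" if evidence else "no"


TOY_IMAGE = [[0, 0, 0],
             [0, 1, 1],
             [0, 0, 0]]
```

Calling `run_pipeline(TOY_IMAGE, "Is there a lesion?", toy_propose, toy_segment, toy_reason)` returns the answer together with the mask that supports it. Keeping the stages as separate callables means each one can be inspected or swapped on its own, which is the accountability argument the abstract makes.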

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper's primary strength lies in its principled and clinically inspired architecture. The decomposition of medical reasoning into discrete, specialist-driven stages [hypothesize (entity proposal) → localize (segmentation) → reason (grounded VQA)] is a significant conceptual advance. This design directly addresses the "black box" problem by providing explicit, pixel-level evidence for each decision, thereby enhancing both interpretability and debuggability. Empirically, the work is robust, de

Weaknesses

**Baseline Comparison Fairness and Transparency**

The validity of the primary results in Table 1 depends critically on understanding each baseline's training regime. It remains unclear which comparison models were fine-tuned on the same in-domain datasets (OmniMedVQA, VQA-RAD, SLAKE) and which were evaluated zero-shot. It is unfair to label the columns in-domain and OOD solely on the basis of your own configuration and then compare against other models under those labels. The authors should explicitly disclos

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Well-motivated workflow design. The authors tackle the core problem—controlling hallucinations in large medical VLMs—through a structured, interpretable workflow rather than pure end-to-end training. This decomposition aligns well with clinical reasoning and is conceptually elegant.
- Coordinator design. The introduction of an LLM-based coordinator for planning and quality control is intuitive and methodologically sound. It also adds an additional layer of reliability that most prior medical

Weaknesses

See questions

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper proposes a novel framework for agentic visual reasoning with tool usage in medicine. The authors identify core workflow steps helpful for medical visual reasoning and train dedicated agents for these tasks. While the core concept of tool-supported visual reasoning exists for generalist VQA with similar training strategies, the originality and significance of this work arise from a well-executed specialization for the medical domain that I can see being easily built upon. The method, module

Weaknesses

1. **Missing related work:** The paper claims to beat the state of the art on VQA-RAD and SLAKE; however, multiple methods with better performance on VQA-RAD and SLAKE are not mentioned in the comparison, e.g. [1,2,3]. These works should be included for a complete contextualization of the work, and the claim adapted accordingly.
2. **Unclear contribution of the coordinator:** In Figures 3 and 18 we can see that the coordinator (GPT-5) is able to overwrite the answers of the medical entity prop

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)