CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework
Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu

TL;DR
CARE introduces a modular, evidence-grounded framework for multi-modal medical reasoning that enhances accuracy and accountability by mimicking clinical workflows and explicitly utilizing localized evidence.
Contribution
The paper proposes a novel agentic framework with decoupled modules and reinforcement learning to improve medical reasoning accuracy and accountability over existing end-to-end models.
Findings
CARE-Flow improves accuracy by 10.9% over SOTA.
CARE-Coord outperforms heavily pre-trained models by 5.2%.
Explicit evidence and modular design enhance trust and correctness.
Abstract
Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence;…
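The staged workflow described in the abstract and the reviews (propose entities, localize them, then reason over the evidence) can be sketched as a pipeline of three pluggable callables. This is a minimal illustration under stated assumptions, not the authors' implementation: `Evidence`, `run_pipeline`, and the three `demo_*` stubs are hypothetical names, the image is a toy nested list, and the stubs stand in for the compact VLM, the expert segmentation model, and the grounded VQA model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a pixel-level ROI: a binary mask as nested lists.
Mask = List[List[int]]
Image = List[List[int]]


@dataclass
class Evidence:
    """One piece of localized evidence: an entity name and its mask."""
    entity: str
    mask: Mask


def run_pipeline(
    image: Image,
    question: str,
    propose_entities: Callable[[Image, str], Sequence[str]],
    segment: Callable[[Image, str], Mask],
    answer: Callable[[Image, str, Sequence[Evidence]], str],
) -> Tuple[str, List[Evidence]]:
    """Run the three decoupled stages and return the answer with its evidence."""
    # Stage 1 (hypothesize): propose medical entities relevant to the question.
    entities = propose_entities(image, question)
    # Stage 2 (localize): produce one mask per proposed entity.
    evidence = [Evidence(entity, segment(image, entity)) for entity in entities]
    # Stage 3 (reason): answer using the image, the question, and the evidence.
    return answer(image, question, evidence), evidence


# --- Toy stubs (assumptions for illustration only, not real models) ---

def demo_propose(image: Image, question: str) -> Sequence[str]:
    return ["left lung"] if "lung" in question else []


def demo_segment(image: Image, entity: str) -> Mask:
    # Mark every non-zero pixel as belonging to the entity.
    return [[1 if px > 0 else 0 for px in row] for row in image]


def demo_answer(image: Image, question: str, evidence: Sequence[Evidence]) -> str:
    if not evidence:
        return "insufficient evidence"
    area = sum(sum(row) for row in evidence[0].mask)
    return f"{evidence[0].entity}: {area} px"


toy_image = [[0, 5], [7, 0]]
result, used = run_pipeline(
    toy_image, "Is the lung clear?", demo_propose, demo_segment, demo_answer
)
print(result)
```

Returning the evidence alongside the answer is the point of the design: each stage can be inspected or swapped independently, which is what makes the decision traceable.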
Peer Reviews
Decision: ICLR 2026 Poster
The paper's primary strength lies in its principled and clinically inspired architecture. The decomposition of medical reasoning into discrete, specialist-driven stages [hypothesize (entity proposal) → localize (segmentation) → reason (grounded VQA)] is a significant conceptual advance. This design directly addresses the "black box" problem by providing explicit, pixel-level evidence for each decision, thereby enhancing both interpretability and debuggability. Empirically, the work is robust, de
### **Baseline Comparison Fairness and Transparency**
The validity of the primary results in Table 1 depends critically on understanding each baseline’s training regime. It remains unclear which comparison models were fine-tuned on the same in-domain datasets (OmniMedVQA, VQA-RAD, SLAKE) and which were evaluated zero-shot. It is unfair to label the columns in-domain and OOD based solely on your own configuration and then compare them against other models. The authors should explicitly disclos
- Well-motivated workflow design. The authors tackle the core problem, controlling hallucinations in large medical VLMs, through a structured, interpretable workflow rather than pure end-to-end training. This decomposition aligns well with clinical reasoning and is conceptually elegant.
- Coordinator design. The introduction of an LLM-based coordinator for planning and quality control is intuitive and methodologically sound. It also adds a layer of reliability that most prior medical
See questions
The paper proposes a novel framework for agentic visual reasoning with tool usage in medicine. The authors identify core workflow steps helpful for medical visual reasoning and train dedicated agents for these tasks. While the core concept of tool-supported visual reasoning exists for generalist VQA with similar training strategies, the originality and significance of this work arise from a well-executed specialization for the medical domain that I can see to be easily built upon. The method, module
1. **Missing related work:** The paper claims to beat the state of the art on VQA-RAD and SLAKE; however, multiple methods with better performance on VQA-RAD and SLAKE are not mentioned in the comparison, e.g. [1,2,3]. These works should be included for a complete contextualization of the work, and the claim should be adapted.
2. **Unclear contribution of the coordinator:** In Figures 3 and 18 we can see that the coordinator (GPT-5) is able to overwrite the answers of the medical entity prop
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
