Making medical vision-language models think causally across modalities with retrieval-augmented cross-modal reasoning

Weiqin Yang; Haowen Xue; Qingyi Peng; Hexuan Hu; Qian Huang; Tingbo Zhang

arXiv:2601.18356 · cs.LG · January 27, 2026

Open Access

TL;DR

This paper introduces a causal reasoning framework for medical vision-language models that leverages retrieval of causal information to improve accuracy, robustness, and interpretability in clinical tasks.

Contribution

It presents Multimodal Causal Retrieval-Augmented Generation, integrating causal inference with multimodal retrieval to enhance medical VLM reasoning beyond superficial correlations.

Findings

01 Improved factual accuracy in radiology report generation
02 Enhanced robustness to distribution shifts
03 Increased interpretability of model reasoning

Abstract

Medical vision-language models (VLMs) achieve strong performance in diagnostic reporting and image-text alignment, yet their underlying reasoning mechanisms remain fundamentally correlational, exhibiting reliance on superficial statistical associations that fail to capture the causal pathophysiological mechanisms central to clinical decision-making. This limitation makes them fragile, prone to hallucinations, and sensitive to dataset biases. Retrieval-augmented generation (RAG) offers a partial remedy by grounding predictions in external knowledge. However, conventional RAG depends on semantic similarity, introducing new spurious correlations. We propose Multimodal Causal Retrieval-Augmented Generation, a framework that integrates causal inference principles with multimodal retrieval. It retrieves clinically relevant exemplars and causal graphs from external sources, conditioning model…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Artificial Intelligence in Healthcare and Education