KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering

Xianyao Zheng; Hong Yu; Hui Cui; Changming Sun; Xiangyu Li; Ran Su; Leyi Wei; Jia Zhou; Junbo Wang; Qiangguo Jin

arXiv:2604.00601 · cs.CV · April 2, 2026

TL;DR

KG-CMI is a framework for medical visual question answering (Med-VQA) that integrates medical knowledge graphs with multi-task learning, with reported gains in accuracy and interpretability on the VQA-RAD, SLAKE, and OVQA datasets.

Contribution

The paper introduces a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework that uses multi-task learning to improve Med-VQA performance.

Findings

01. Outperforms state-of-the-art methods on the VQA-RAD, SLAKE, and OVQA datasets.

02. Integrates medical knowledge graphs to improve cross-modal feature alignment.

03. Shows improved interpretability in validation experiments.

Abstract

Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits its ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced…
