KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering
Xianyao Zheng, Hong Yu, Hui Cui, Changming Sun, Xiangyu Li, Ran Su, Leyi Wei, Jia Zhou, Junbo Wang, Qiangguo Jin

TL;DR
KG-CMI is a framework for medical visual question answering (Med-VQA) that integrates medical knowledge graphs and multi-task learning, improving accuracy and interpretability on three benchmark datasets.
Contribution
It introduces KG-CMI, a knowledge-graph-enhanced cross-Mamba interaction framework that uses multi-task learning to improve Med-VQA performance.
Findings
Outperforms state-of-the-art methods on VQA-RAD, SLAKE, and OVQA datasets.
Integrates medical knowledge graphs to improve cross-modal feature alignment.
Improves interpretability, as supported by validation experiments.
Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits a model's ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced…
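The abstract contrasts classification over a predefined answer set with learning from free-form answers, and the summary credits multi-task learning. As a rough illustration only, the sketch below combines a cross-entropy loss over a fixed answer set with a mean token-level loss on the free-form answer text via a weighted sum. The weighted-sum form, the weight `lam`, and the function names `cross_entropy` and `multitask_loss` are hypothetical; the abstract is truncated before it describes the actual objective, so this is not the paper's formulation.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax probability of `target`, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multitask_loss(
    cls_logits: Sequence[float],
    answer_id: int,
    token_logits: List[Sequence[float]],
    token_ids: Sequence[int],
    lam: float = 0.5,  # hypothetical task weight, not taken from the paper
) -> float:
    """Hypothetical combined objective:
    classification loss over a predefined answer set
    + lam * mean token-level loss over the free-form answer tokens."""
    cls_loss = cross_entropy(cls_logits, answer_id)
    gen_loss = sum(
        cross_entropy(step_logits, tok) for step_logits, tok in zip(token_logits, token_ids)
    ) / len(token_ids)
    return cls_loss + lam * gen_loss


# Uniform logits: 4 candidate answers, 2 answer tokens over a 3-token vocabulary.
loss = multitask_loss([0.0] * 4, 1, [[0.0] * 3, [0.0] * 3], [0, 2])
print(loss)
```

With uniform logits, the classification term equals log 4 and the generation term equals log 3, so the total is log 4 + 0.5 · log 3. A weighted sum is simply the most common way to combine two task losses; the paper may weight, schedule, or structure its tasks differently.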
