CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering
Qiangguo Jin, Xianyao Zheng, Hui Cui, Changming Sun, Yuqi Fang, Cong Cong, Ran Su, Leyi Wei, Ping Xuan, Junbo Wang

TL;DR
This paper introduces CMI-MTL, a novel multi-task learning framework that enhances medical visual question answering by effectively aligning cross-modal features and leveraging free-form answers, outperforming existing methods on multiple datasets.
Contribution
The paper proposes a new CMI-MTL framework with three modules that improve cross-modal feature alignment and answer diversity handling in Med-VQA tasks.
Findings
Outperforms state-of-the-art methods on the VQA-RAD, SLAKE, and OVQA datasets
Improves interpretability of Med-VQA models
Enhances open-ended answer generation capabilities
Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention-based methods struggle to handle cross-modal semantic alignment between vision and language effectively. Moreover, classification-based methods rely on predefined answer sets. Treating the task as simple classification may prevent a model from adapting to the diversity of free-form answers and cause it to overlook their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning…
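The abstract is cut off before it describes the multi-task objective, so the sketch below is only a generic illustration of the idea it motivates: pairing an answer-classification loss with an auxiliary loss over the tokens of a free-form answer. The function names, the weighted-sum form, and the `aux_weight` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target_index])


def multi_task_loss(cls_logits, cls_target, token_logits, token_targets,
                    aux_weight=0.5):
    """Hypothetical combined objective (an assumption, not the paper's loss).

    cls_logits / cls_target: scores over a predefined answer set and the
        index of the correct answer (the classification task).
    token_logits / token_targets: per-token vocabulary scores and target
        token ids for the free-form answer (the auxiliary task).
    aux_weight: illustrative weight on the auxiliary term.
    """
    cls_loss = cross_entropy(cls_logits, cls_target)
    # Average the token-level losses so answer length does not dominate.
    gen_loss = sum(
        cross_entropy(logits, target)
        for logits, target in zip(token_logits, token_targets)
    ) / len(token_targets)
    return cls_loss + aux_weight * gen_loss
```

Setting `aux_weight=0` in this sketch reduces it to plain classification over the predefined answer set, which is the setting the abstract argues is too restrictive.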
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
