CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Qiangguo Jin; Xianyao Zheng; Hui Cui; Changming Sun; Yuqi Fang; Cong Cong; Ran Su; Leyi Wei; Ping Xuan; Junbo Wang

arXiv:2511.01357 · cs.CV · November 4, 2025

Open Access

TL;DR

This paper introduces CMI-MTL, a novel multi-task learning framework that enhances medical visual question answering by effectively aligning cross-modal features and leveraging free-form answers, outperforming existing methods on multiple datasets.

Contribution

The paper proposes a new CMI-MTL framework with three modules that improve cross-modal feature alignment and answer diversity handling in Med-VQA tasks.

Findings

01 Outperforms state-of-the-art methods on the VQA-RAD, SLAKE, and OVQA datasets

02 Improves the interpretability of Med-VQA models

03 Enhances open-ended answer generation

Abstract

Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention-based methods struggle to align cross-modal semantics between vision and language. Moreover, classification-based methods rely on predefined answer sets; treating the task as a simple classification problem can leave a model unable to adapt to the diversity of free-form answers and cause it to overlook their detailed semantic information. To tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning…
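
The abstract is truncated here and does not state how the multi-task objective is formed. As a generic illustration only, and not the authors' formulation, the sketch below shows one common way to pair a classification loss over a predefined answer set with an auxiliary token-level loss on a free-form answer. The function names, the weighted-sum form, and the `aux_weight` parameter are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    return -math.log(softmax(logits)[target])


def multi_task_loss(cls_logits, cls_target, token_logits, token_targets,
                    aux_weight=0.5):
    """Hypothetical combined objective (an assumption, not from the paper).

    cls_logits / cls_target: scores over a fixed answer set and the gold index.
    token_logits / token_targets: per-step vocabulary scores and gold token
    ids for the free-form answer.
    aux_weight: illustrative weight on the auxiliary free-form answer loss.
    """
    cls_loss = cross_entropy(cls_logits, cls_target)
    # Average the token-level losses so answer length does not dominate.
    gen_loss = sum(
        cross_entropy(step, tok) for step, tok in zip(token_logits, token_targets)
    ) / len(token_targets)
    return cls_loss + aux_weight * gen_loss


if __name__ == "__main__":
    loss = multi_task_loss(
        cls_logits=[2.0, 0.5, -1.0],
        cls_target=0,
        token_logits=[[1.0, 0.0], [0.0, 1.0]],
        token_targets=[0, 1],
    )
    print(f"combined loss: {loss:.4f}")
```

Setting `aux_weight` to 0 in this sketch reduces it to the plain classification setup that the abstract argues is too restrictive for free-form answers.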

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning