RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding
Jiayan Yang, Zhuoyu Wu, Wenqi Fang

TL;DR
RoiMAM is an efficient vision-language model for medical visual question answering that focuses on lesion regions and improves accuracy with significantly reduced model size.
Contribution
It introduces a training-free ROI Generation Module with Semantic Selective Suppression, along with a Text Prompt Enhancer that adds no trainable parameters, improving both efficiency and accuracy over existing models.
Findings
Achieves approximately 2% higher accuracy than MedVInT-TD on the SLAKE dataset.
Reduces model size to less than 20% of MedVInT-TD's.
Improves accuracy by 4.6% over MedVInT-TD on PMC-VQA.
Abstract
Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.
