RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

Jiayan Yang; Zhuoyu Wu; Wenqi Fang

arXiv:2605.15561·cs.CV·May 18, 2026

TL;DR

RoiMAM is an efficient vision-language model for medical visual question answering that focuses on lesion regions and improves accuracy with significantly reduced model size.

Contribution

It introduces a training-free ROI Generation Module with Semantic Selective Suppression, along with a Text Prompt Enhancer that adds no training parameters, achieving better efficiency and accuracy than the widely used MedVInT-TD model.

Findings

01

Improves accuracy by approximately 2% on SLAKE compared to MedVInT-TD.

02

Reduces model size to less than 20% of that of MedVInT-TD.

03

Improves accuracy by 4.6% on PMC-VQA compared to MedVInT-TD.

Abstract

Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.
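The abstract does not specify how the ROI Generation Module, Semantic Selective Suppression, or the Text Prompt Enhancer are implemented. The following is only a minimal sketch of the general idea of parameter-free region focusing and prompt enrichment, not the authors' method. Everything in it is an assumption made for illustration: the saliency grid, the threshold rule, the bounding-box ROI, the attenuation factor, the function names (`roi_bounding_box`, `suppress_outside`, `enhance_prompt`), and the modality hint strings.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive

# Hypothetical modality hints; the paper's actual prompt text is not given.
MODALITY_HINTS: Dict[str, str] = {
    "ct": "This is a CT scan.",
    "mri": "This is an MRI scan.",
    "xray": "This is an X-ray image.",
}


def roi_bounding_box(saliency: Grid, threshold: float) -> Box:
    """Return the smallest box covering every cell with saliency >= threshold.

    No parameters are learned here; the rule is a fixed threshold (assumption).
    """
    hits = [
        (r, c)
        for r, row in enumerate(saliency)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not hits:
        # Nothing salient: fall back to the whole image.
        return 0, 0, len(saliency) - 1, len(saliency[0]) - 1
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)


def suppress_outside(features: Grid, box: Box, factor: float = 0.0) -> Grid:
    """Scale every value outside the box by `factor`; keep values inside as-is."""
    top, left, bottom, right = box
    return [
        [
            value if top <= r <= bottom and left <= c <= right else value * factor
            for c, value in enumerate(row)
        ]
        for r, row in enumerate(features)
    ]


def enhance_prompt(question: str, modality: str) -> str:
    """Prepend a fixed modality hint to the question, if one is known."""
    hint = MODALITY_HINTS.get(modality.lower())
    return f"{hint} {question}" if hint else question


saliency = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.1],
    [0.0, 0.1, 0.0, 0.0],
]
box = roi_bounding_box(saliency, threshold=0.5)
masked = suppress_outside(saliency, box)
print(box)  # (1, 1, 2, 2)
print(enhance_prompt("Is there a lesion?", "CT"))  # This is a CT scan. Is there a lesion?
```

In this toy version, "training-free" simply means that both steps are fixed rules with no learnable weights; how RoiMAM obtains its regions and which semantics it suppresses would need to be taken from the full paper.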
