Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Mst. Fahmida Sultana Naznin; Adnan Ibney Faruq; Mushfiqur Rahman; Niloy Kumar Mondal; Md. Mehedi Hasan Shawon; Md Rakibul Hasan

arXiv:2603.29901 · cs.CV · April 1, 2026

TL;DR

This paper shows that selectively focusing on high-importance visual regions improves multimodal radiology report summarization, challenging the belief that more visual data is always better.

Contribution

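The authors introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that fuses visual and textual data. Its stages include ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, and Shapley-guided adaptive patch clustering. The authors report state-of-the-art results in radiology report summarization.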

Findings

01. ViTAS achieves a BLEU-4 score of 29.25% and a ROUGE-L score of 69.83%.

02. Selective visual attention improves factual alignment and human evaluation scores.

03. Focusing on relevant visual patches outperforms using full images.
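For readers unfamiliar with the ROUGE-L metric cited in the findings, the following is a minimal sketch of how an LCS-based ROUGE-L F1 score can be computed between a reference impression and a generated one. It assumes whitespace tokenization, lowercasing, and an unweighted F1; the paper's exact scoring configuration is not stated on this page, and the function names `lcs_length` and `rouge_l_f1` are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 under assumed whitespace tokenization and lowercasing."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical impression pair, for illustration only.
    print(rouge_l_f1("no acute cardiopulmonary process", "no acute process"))
```

In this toy pair the candidate drops one of the four reference tokens, so precision is 1 and recall is 3/4. Published scores are normally averaged over a full test set and may use a different tokenizer or stemming.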

Abstract

Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS $\to$ IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS (Visual-Text Attention Summarizer), a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering,…
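To make the idea of Shapley-guided patch selection more concrete, here is a toy sketch that computes exact Shapley values for a handful of image patches and keeps the highest-scoring ones. It is not the ViTAS algorithm: the value function `toy_value`, the patch weights, and the top-k rule are all hypothetical stand-ins, and the paper's adaptive clustering step is not reproduced here. Exact enumeration over permutations is only feasible for a few patches; practical systems would typically approximate.

```python
from itertools import permutations


def shapley_values(n_patches, value):
    """Exact Shapley value of each patch under a coalition value function."""
    phi = [0.0] * n_patches
    orders = list(permutations(range(n_patches)))
    for order in orders:
        coalition = set()
        prev = value(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value(frozenset(coalition))
            phi[p] += cur - prev  # marginal contribution of patch p
            prev = cur
    return [v / len(orders) for v in phi]


def select_top_k(scores, k):
    """Indices of the k highest-scoring patches, returned in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-patch relevance weights (assumption, not from the paper).
WEIGHTS = [0.1, 0.5, 0.0, 0.4]


def toy_value(coalition):
    """Toy utility: additive weights plus a bonus when patches 1 and 3 co-occur."""
    total = sum(WEIGHTS[i] for i in coalition)
    if {1, 3} <= coalition:
        total += 0.2
    return total


if __name__ == "__main__":
    phi = shapley_values(len(WEIGHTS), toy_value)
    print(select_top_k(phi, k=2))  # → [1, 3]
```

The interaction bonus is split evenly between patches 1 and 3, which illustrates why Shapley attribution can rank patches differently from their standalone weights when patches are informative only in combination.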
