Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models

Mingi Jung, Saehyung Lee, Eunji Kim, and Sungroh Yoon

arXiv:2502.01419 · cs.CV · June 5, 2025

Open Access

TL;DR

This paper introduces SPARC, a training-free method that improves detailed image captioning in multimodal large language models by selectively recalibrating visual attention during decoding, yielding a better balance between precision and recall.

Contribution

SPARC is a training-free approach that strengthens the influence of selected visual tokens during decoding, counteracting the weakening and increasingly noisy visual attention that appears as captions lengthen.

Findings

01. SPARC improves both precision and recall in image captioning.

02. SPARC outperforms existing methods with minimal computational overhead.

03. Human evaluations confirm the quality improvements.

Abstract

Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as…
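Observations (1) and (2) in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it assumes that "attention differences across time steps" can be approximated by the ratio of a visual token's attention at the current step to its attention at the previous step, and that amplification is a simple multiplicative boost followed by renormalization. The function names `select_by_attention_ratio` and `amplify_selected`, and the parameters `ratio_threshold` and `alpha`, are hypothetical and chosen only for illustration. The third observation is truncated in the abstract above and is not modeled here.

```python
def select_by_attention_ratio(prev_visual_attn, curr_visual_attn, ratio_threshold=1.5):
    """Return indices of visual tokens whose attention grew sharply since the last step.

    Assumption: a token counts as 'critical' when curr / prev exceeds ratio_threshold.
    """
    eps = 1e-12  # avoids division by zero for tokens with no previous attention
    return [
        i
        for i, (prev, curr) in enumerate(zip(prev_visual_attn, curr_visual_attn))
        if curr / (prev + eps) > ratio_threshold
    ]


def amplify_selected(attn, selected, alpha=2.0):
    """Boost only the selected positions by alpha, then renormalize to sum to 1.

    Assumption: amplification is a multiplicative boost on attention weights.
    """
    chosen = set(selected)
    boosted = [w * alpha if i in chosen else w for i, w in enumerate(attn)]
    total = sum(boosted)
    return [w / total for w in boosted]


# Toy example: four visual tokens followed by one aggregated text position.
prev_visual = [0.10, 0.10, 0.10, 0.10]
curr_visual = [0.05, 0.20, 0.08, 0.09]
curr_attn = curr_visual + [0.58]  # full attention row at the current step

selected = select_by_attention_ratio(prev_visual, curr_visual)
recalibrated = amplify_selected(curr_attn, selected)
```

In this toy example only the second visual token doubles its attention between steps, so it is the only one selected and boosted; every other position loses a little weight after renormalization. Boosting all visual tokens uniformly would correspond to passing every visual index as `selected`, which is the setting the abstract reports as reducing recall.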

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling