From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception
Jilong Zhu, Yang Feng

TL;DR
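This paper introduces the Variational Information Flow (VIF) framework, which improves fine-grained visual perception in multimodal large language models by counteracting visual attenuation: the suppression of sparse fine-grained visual signals by dominant textual tokens during network propagation.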
Contribution
It proposes VIF, a probabilistic method that uses a Conditional Variational Autoencoder (CVAE) to model visual saliency, enhancing fine-grained perception in existing MLLMs.
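For readers unfamiliar with Conditional Variational Autoencoders (CVAEs), the generic conditional evidence lower bound is sketched below. This is the textbook form, not the paper's own objective, which this summary does not state. The symbols are assumptions for illustration only: x is the conditioning input (for example, the image and question), y is the target (for example, the answer), z is a latent variable, q_phi is the approximate posterior, and p_theta covers the conditional prior and decoder.

```latex
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log p_\theta(y \mid x, z) \right]
  \;-\;
  D_{\mathrm{KL}}\!\left( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x) \right)
```

The first term rewards latents that help reconstruct the target. The KL term pulls the posterior, which sees the target during training, toward a prior that depends only on the conditioning input, so that the prior can stand in for the posterior at inference time.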
Findings
VIF improves performance on diverse visual perception benchmarks.
VIF effectively models visual saliency relevant to question-answer pairs.
Extensive evaluations show VIF's competitive advantage over previous methods.
Abstract
While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as…
