From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

Jilong Zhu; Yang Feng

arXiv:2604.12508 · cs.CV · April 15, 2026

TL;DR

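This paper introduces the Variational Information Flow (VIF) framework, which improves fine-grained visual perception in Multimodal Large Language Models (MLLMs) by counteracting Visual Attenuation: the suppression of sparse fine-grained visual signals by dominant textual tokens during network propagation.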

Contribution

The paper proposes VIF, a probabilistic method that uses a Conditional Variational Autoencoder (CVAE) to model visual saliency and thereby enhance fine-grained perception in existing MLLMs.

Findings

01. VIF improves performance on diverse visual perception benchmarks.

02. VIF effectively models visual saliency relevant to question-answer pairs.

03. Extensive evaluations show VIF's competitive advantage over previous methods.

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as…
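For readers unfamiliar with CVAEs, the following is a minimal sketch of the generic CVAE training objective (a reconstruction term plus a KL divergence between a conditional posterior and a conditional prior, both assumed here to be diagonal Gaussians). The abstract is cut off before it states how VIF defines its latent variable or loss, so this is not the authors' formulation; the function names `gaussian_kl`, `reparameterize`, and `negative_elbo`, and the `beta` weight, are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over the last axis.

    q = N(mu_q, diag(exp(logvar_q))) plays the role of the conditional posterior,
    p = N(mu_p, diag(exp(logvar_p))) plays the role of the conditional prior.
    """
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def negative_elbo(recon_log_likelihood, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """Generic CVAE loss: -log p(target | z, condition) + beta * KL(q || p).

    `recon_log_likelihood` is whatever log-likelihood a decoder assigns to the
    target; `beta` is an assumed weighting knob, not something from the paper.
    """
    return -recon_log_likelihood + beta * gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu_q, logvar_q = np.ones((2, 4)), np.zeros((2, 4))   # toy posterior parameters
    mu_p, logvar_p = np.zeros((2, 4)), np.zeros((2, 4))  # toy prior parameters
    z = reparameterize(mu_q, logvar_q, rng)
    print(z.shape, negative_elbo(np.zeros(2), mu_q, logvar_q, mu_p, logvar_p))
```

In a typical CVAE, the posterior sees more information during training than the prior does at inference time, and the KL term pulls the two together so that sampling from the prior alone remains useful. Which inputs VIF gives to each network is not recoverable from the truncated abstract.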
