Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

Chengxin Liu; Wonseok Choi; Chenshuang Zhang; Tae-Hyun Oh

arXiv:2604.15809·cs.CV·April 20, 2026

TL;DR

This paper improves vision-language models at inference time by modulating information flow according to token importance, so that text tokens draw on the important visual tokens rather than on irrelevant regions. The authors report better task performance across multiple datasets.

Contribution

The authors propose a token dynamics-based approach to selectively enhance relevant visual information during inference in VLMs.

Findings

01. Significant performance improvements on visual question answering and grounding tasks.

02. Effective identification of important visual tokens using activation patterns.

03. Enhanced perception accuracy without retraining the models.

Abstract

Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token…
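The core idea in the abstract, that text tokens should attend only to important visual tokens during decoding, can be illustrated with a small attention-masking sketch. This is not the authors' implementation: the abstract is truncated before it describes how importance is computed, so the `importance` scores, the `keep_ratio` parameter, and the function name below are hypothetical stand-ins, and the sketch handles a single attention head without a causal mask.

```python
import numpy as np

def mask_text_to_visual_attention(scores, is_visual, importance, keep_ratio=0.5):
    """Softmax attention where text queries ignore low-importance visual keys.

    scores:     (T, T) pre-softmax attention logits (rows = queries, cols = keys)
    is_visual:  (T,) bool, True at visual token positions
    importance: (T,) float, hypothetical per-token importance score
                (only the entries at visual positions are used)
    keep_ratio: hypothetical fraction of visual tokens that text tokens may see
    """
    scores = np.asarray(scores, dtype=float)
    is_visual = np.asarray(is_visual, dtype=bool)
    importance = np.asarray(importance, dtype=float)

    visual_idx = np.flatnonzero(is_visual)
    text_idx = np.flatnonzero(~is_visual)

    # Keep the top-scoring visual tokens; the rest are treated as irrelevant.
    n_keep = max(1, int(round(keep_ratio * visual_idx.size)))
    order = np.argsort(-importance[visual_idx], kind="stable")
    keep = visual_idx[order[:n_keep]]
    drop = np.setdiff1d(visual_idx, keep)

    # Block the flow from dropped visual keys to text queries only;
    # visual-to-visual attention is left unchanged.
    masked = scores.copy()
    masked[np.ix_(text_idx, drop)] = -np.inf

    # Numerically stable row-wise softmax.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, keep

# Toy usage: 4 visual tokens followed by 2 text tokens, uniform logits.
toy_scores = np.zeros((6, 6))
toy_is_visual = np.array([True, True, True, True, False, False])
toy_importance = np.array([0.9, 0.1, 0.5, 0.2, 0.0, 0.0])
toy_weights, toy_keep = mask_text_to_visual_attention(
    toy_scores, toy_is_visual, toy_importance, keep_ratio=0.5
)
```

In the toy call, the text rows of `toy_weights` place zero weight on the two visual tokens with the lowest scores and renormalize over the remaining keys. Masking before the softmax, rather than zeroing weights afterwards, keeps each row a valid probability distribution.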

Code & Models

Repositories

https://cxliu0.github.io/AIF
github
