Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

Xinpeng Dong; Min Zhang; Kairong Han; Xu Tan; Fei Wu; Kun Kuang

arXiv:2605.18160·cs.CV·May 19, 2026

TL;DR

The paper introduces Vision Inference Former (VIF), a lightweight module that enhances visual consistency in multimodal large language models by directly injecting visual semantics during decoding.

Contribution

VIF is an architectural approach that keeps generation visually grounded, improving vision-language alignment and performance across 14 benchmark tasks.

Findings

01 VIF improves performance on 14 benchmark tasks.

02 VIF maintains visual grounding during generation.

03 VIF introduces minimal additional computational overhead.

Abstract

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) as generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we…
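
To make the connector-based paradigm mentioned in the abstract concrete, here is a minimal sketch of the general idea: visual features are linearly projected to the text embedding size and placed in the same sequence as the text tokens. It illustrates the baseline paradigm only, not VIF itself, whose design is not described in the text above. All names and sizes (`project`, `build_input_sequence`, `D_VIS`, `D_TXT`, the random weights) are hypothetical and chosen for illustration.

```python
import random


def project(visual_feats, weight):
    """Linear connector: map each visual feature (size d_vis) to size d_txt.

    `weight` is a d_vis x d_txt matrix stored as a list of rows.
    """
    columns = list(zip(*weight))  # d_txt columns, each of length d_vis
    return [
        [sum(v * w for v, w in zip(feat, col)) for col in columns]
        for feat in visual_feats
    ]


def build_input_sequence(visual_feats, text_embeds, weight):
    """Prepend projected visual tokens to the text embeddings as one sequence."""
    visual_tokens = project(visual_feats, weight)
    return visual_tokens + text_embeds


# Hypothetical sizes, for illustration only.
random.seed(0)
D_VIS, D_TXT = 4, 3
N_PATCHES, N_TEXT = 5, 2

weight = [[random.uniform(-1, 1) for _ in range(D_TXT)] for _ in range(D_VIS)]
visual_feats = [[random.uniform(-1, 1) for _ in range(D_VIS)] for _ in range(N_PATCHES)]
text_embeds = [[random.uniform(-1, 1) for _ in range(D_TXT)] for _ in range(N_TEXT)]

sequence = build_input_sequence(visual_feats, text_embeds, weight)
print(len(sequence), len(sequence[0]))  # 7 tokens, each of size 3
```

In this layout the projected visual tokens are ordinary entries in the input sequence, which is the sense in which the abstract describes them as treated on par with textual tokens.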

Code & Models

Repositories

Dong-Xinpeng/VIF (GitHub)