Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

Seulbi Lee; Sangheum Hwang

arXiv:2602.17186 · cs.CV · May 22, 2026

TL;DR

This paper introduces Visual Information Gain (VIG), a perplexity-based metric that quantifies how much the visual input reduces a model's prediction uncertainty. VIG is then used to train large vision language models selectively, which strengthens visual grounding and reduces language bias.

Contribution

The paper proposes a novel VIG metric and a VIG-guided selective training scheme that improves visual grounding with less supervision by focusing on visually informative data.

Findings

01

VIG effectively highlights visually grounded elements like colors and spatial relations.

02

Selective training based on VIG improves model performance and reduces language bias.

03

The approach achieves better results with significantly less supervision.

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. Prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, but these approaches typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves…
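
The abstract describes VIG only in words, so the following is a minimal sketch of one plausible reading, not the authors' formulation. It assumes that per-token log-probabilities of the reference answer are already available from two forward passes (one with the image, one without), that token-level gain is the difference of those log-probabilities, and that sample-level gain is the log-ratio of the two perplexities. The function names, the `keep_ratio` parameter, and the toy numbers are all hypothetical.

```python
import math

def perplexity(log_probs):
    """Perplexity of a sequence given per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def token_vig(logp_with_image, logp_without_image):
    """Assumed token-level gain: how much the image raises each token's log-probability."""
    return [w - wo for w, wo in zip(logp_with_image, logp_without_image)]

def sample_vig(logp_with_image, logp_without_image):
    """Assumed sample-level gain: log of (text-only perplexity / image-conditioned perplexity)."""
    return math.log(perplexity(logp_without_image)) - math.log(perplexity(logp_with_image))

def select_tokens(gains, keep_ratio):
    """Boolean loss mask that keeps the top `keep_ratio` fraction of tokens by gain."""
    k = max(1, round(keep_ratio * len(gains)))
    ranked = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(gains))]

# Toy log-probabilities for the answer tokens "The car is red" (made-up values).
tokens = ["The", "car", "is", "red"]
with_image = [-0.1, -0.5, -0.1, -0.2]
without_image = [-0.1, -2.0, -0.1, -3.0]

gains = token_vig(with_image, without_image)
mask = select_tokens(gains, keep_ratio=0.5)
for tok, g, m in zip(tokens, gains, mask):
    print(f"{tok:>4}  gain={g:.2f}  train={m}")
print(f"sample-level gain: {sample_vig(with_image, without_image):.3f}")
```

In this toy case the function words get zero gain while the object and color tokens get the largest gains, which mirrors the abstract's claim that VIG highlights visually grounded elements. A training loop could apply the mask to the per-token loss, or rank whole samples by the sample-level score; how the paper actually thresholds or weights these scores is not stated here.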

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques