Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
Yuhuan Wu, Cong Wei, Fangzhen Lin, Wenhu Chen, Haozhe Wang

TL;DR
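This paper introduces Starve to Perceive, a training paradigm that constrains visual bandwidth in vision-language models by restricting each observation to a tight token budget, so that the model must actively look (zoom, crop, pan) to solve the task. The authors report improved performance across benchmarks.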
Contribution
The paper proposes a simple, plug-in training method that induces active perception in VLMs by limiting the visual information available in any single observation, with reported performance gains.
Findings
Models trained with perceptual starvation improve by 5% on average across benchmarks.
The approach does not require auxiliary losses, reward shaping, or architectural changes.
Active perception becomes the primary strategy for task success under constrained visual bandwidth.
Abstract
Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion,…
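To make the idea of a per-observation token budget concrete, here is a minimal sketch, not the authors' implementation. It assumes a ViT-style encoder that emits one token per square patch, and the names `num_tokens` and `fit_to_token_budget`, the patch size of 14, and the budget of 256 are all hypothetical choices made for illustration. The function computes the largest patch-aligned resolution at which a view fits the budget.

```python
import math


def num_tokens(width, height, patch=14):
    """Visual tokens for a view, assuming one token per patch (hypothetical encoder)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def fit_to_token_budget(width, height, budget, patch=14):
    """Return a (width, height) no larger than the input whose token count is <= budget.

    Illustrative only: the patch size and the budget are assumptions,
    not values taken from the paper.
    """
    if num_tokens(width, height, patch) <= budget:
        return width, height  # the view already fits, leave it untouched

    # Uniform scale so that (w / patch) * (h / patch) is roughly the budget.
    scale = math.sqrt(budget * patch * patch / (width * height))
    # Snap down to multiples of the patch size, never below one patch.
    w = max(patch, int(width * scale) // patch * patch)
    h = max(patch, int(height * scale) // patch * patch)

    # Safety guard for extreme aspect ratios: shrink the longer side if needed.
    while num_tokens(w, h, patch) > budget and (w > patch or h > patch):
        if w >= h and w > patch:
            w -= patch
        else:
            h -= patch
    return w, h


# A full high-resolution view and a zoomed crop get the same budget.
full_view = fit_to_token_budget(4096, 2304, budget=256)
zoom_crop = fit_to_token_budget(512, 512, budget=256)
print("full view resized to", full_view, "->", num_tokens(*full_view), "tokens")
print("512x512 crop resized to", zoom_crop, "->", num_tokens(*zoom_crop), "tokens")
```

Under these assumptions the full view is downscaled far more aggressively than the crop, so fine detail survives only in the zoomed observation. That is the pressure the abstract describes: when no single view suffices, the model has to choose where to look.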
