The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

Anjie Liu; Ziqin Gong; Yan Song; Yuxiang Chen; Xiaolong Liu; Hengtong Lu; Kaike Zhang; Chen Wei; Jun Wang

arXiv:2605.01345·cs.CV·May 12, 2026

TL;DR

This paper introduces a sequential experimental design framework for vision-language models, enabling active visual reasoning by selectively acquiring task-relevant evidence in high-resolution settings.

Contribution

It formalizes active evidence acquisition as Bayesian optimal experimental design and proposes FOVEA, a training-free method for improved high-resolution visual reasoning.

Findings

01. FOVEA improves reasoning performance on high-resolution benchmarks.

02. The approach yields strong gains in remote-sensing search tasks.

03. It outperforms direct and ReAct-style baselines.

Abstract

Visual perception in modern Vision-Language Models (VLMs) is constrained by a perceptual bandwidth bottleneck: a broad field of view preserves global context but sacrifices the fine-grained details required for complex reasoning. We argue that high-resolution visual reasoning is therefore not only semantic reasoning but also task-relevant evidence acquisition under limited perceptual bandwidth. Inspired by active vision and information foraging, we formalise this process as sequential Bayesian optimal experimental design (S-BOED), where an agent decides which visual evidence to acquire before answering. Since exact Bayesian inference is intractable in continuous gigapixel spaces, we derive a tractable coverage-resolution objective as a proxy for task-relevant information gain. We instantiate this framework with FOVEA, a training-free procedure that refines VLM crop proposals through…
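
To make the experimental-design framing concrete, the sketch below computes expected information gain for a toy discrete search problem: a target hides in one of a few image cells, and "acquiring evidence" means inspecting one cell with a noisy binary detector. This is not FOVEA and not the paper's coverage-resolution objective; it is the textbook one-step (greedy) Bayesian design criterion that the abstract says becomes intractable in continuous gigapixel spaces. All names (`expected_information_gain`, `choose_crop`) and the detector rates `p_hit` and `p_fa` are assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def posterior(prior, cell, seen, p_hit=0.9, p_fa=0.1):
    """Bayes update after inspecting `cell` and observing `seen` (True/False).

    Hypothesis i means "the target is in cell i". The detector (hypothetical)
    fires with probability p_hit on the true cell and p_fa elsewhere.
    Returns (posterior, probability of this observation under the prior).
    """
    unnorm = []
    for i, q in enumerate(prior):
        p_pos = p_hit if i == cell else p_fa
        like = p_pos if seen else 1.0 - p_pos
        unnorm.append(like * q)
    z = sum(unnorm)
    if z == 0:
        return list(prior), 0.0
    return [u / z for u in unnorm], z

def expected_information_gain(prior, cell, p_hit=0.9, p_fa=0.1):
    """Prior entropy minus expected posterior entropy for inspecting `cell`."""
    h0 = entropy(prior)
    eig = 0.0
    for seen in (True, False):
        post, p_obs = posterior(prior, cell, seen, p_hit, p_fa)
        eig += p_obs * (h0 - entropy(post))
    return eig

def choose_crop(prior, p_hit=0.9, p_fa=0.1):
    """Greedy one-step design: inspect the cell with the highest gain."""
    return max(range(len(prior)),
               key=lambda c: expected_information_gain(prior, c, p_hit, p_fa))

if __name__ == "__main__":
    belief = [0.7, 0.1, 0.1, 0.1]
    cell = choose_crop(belief)
    belief, _ = posterior(belief, cell, seen=True)
    print(cell, [round(b, 3) for b in belief])
```

A sequential agent would repeat the choose-observe-update loop until the belief is sharp enough to answer. In this toy setting the loop is exact because the hypothesis space has only a handful of cells; the paper's motivation for a proxy objective is that no such enumeration is possible over continuous crops of a very large image.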
