Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Juhong Min, Lazar Valkov, Vitali Petsiuk, Hossein Souri, Deen Dayal Mohan

TL;DR
This paper introduces Foveated Reasoner, a vision-language model that mimics human foveation by selectively focusing high-resolution analysis on important regions, improving accuracy with fewer visual tokens.
Contribution
It unifies foveation and reasoning in a single autoregressive framework, trained with reinforcement learning to optimize selective high-resolution evidence acquisition.
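To make the training objective concrete, here is a minimal sketch of a reward that trades task accuracy against the amount of high-resolution evidence acquired. The summary does not give the paper's actual reward, so the function name `foveation_reward`, the linear penalty, and the `penalty` weight are all assumptions chosen for illustration.

```python
def foveation_reward(correct: bool, tokens_used: int, max_tokens: int,
                     penalty: float = 0.5) -> float:
    """Hypothetical reward: task success minus a cost for acquired evidence.

    tokens_used -- high-resolution visual tokens the policy requested
    max_tokens  -- cost of viewing the entire image at high resolution
    penalty     -- assumed weight on the acquisition cost (not from the paper)
    """
    task_term = 1.0 if correct else 0.0
    # Fraction of the "see-everything" cost that was actually spent, capped at 1.
    cost_fraction = min(1.0, tokens_used / max_tokens)
    return task_term - penalty * cost_fraction


# A correct answer with no foveation scores highest; a correct answer that
# looked at everything is rewarded less than one that looked selectively.
selective = foveation_reward(True, tokens_used=25, max_tokens=100)
see_everything = foveation_reward(True, tokens_used=100, max_tokens=100)
```

Under this assumed shaping, a policy that always requests the full high-resolution image can still be correct but never reaches the maximum reward, which is one simple way to discourage the trivial solution.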
Findings
Achieves higher accuracy under limited visual-token budgets across benchmarks.
Learns effective foveation policies that focus on important regions.
Reduces compute overhead by selectively acquiring high-resolution evidence.
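The token savings behind these findings can be illustrated with simple arithmetic. The sketch below assumes a ViT-style encoder that emits one token per square patch; the patch size of 14 and the image sizes are illustrative assumptions, not figures from the paper.

```python
import math


def visual_tokens(height: int, width: int, patch: int = 14) -> int:
    """Patch-token count for an image, assuming one token per patch x patch tile."""
    return math.ceil(height / patch) * math.ceil(width / patch)


# Assumed sizes for illustration only.
full_res = visual_tokens(1344, 1344)  # whole image at high resolution
coarse = visual_tokens(336, 336)      # downsampled overview of the same image
fovea = visual_tokens(336, 336)       # one high-resolution crop of a selected region

foveated_total = coarse + fovea
print(full_res, foveated_total, foveated_total / full_res)
```

Because the token count grows with image area, a coarse view plus a single high-resolution crop costs a small fraction of encoding the full image at high resolution in this setup.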
Abstract
Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show…
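The control flow described in the abstract (coarse view first, foveate only when needed, inject the retrieved evidence into the same trajectory) can be sketched as a small loop. This is a toy illustration under stated assumptions, not the paper's implementation: the `Foveate` and `Answer` action types, the `run` loop, the foveation budget, and `toy_policy` are all hypothetical, and the "image" is a plain nested list rather than real pixels or tokens.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (top, left, height, width) in full-resolution pixels


@dataclass
class Foveate:  # hypothetical action: request high-resolution evidence for a region
    box: Box


@dataclass
class Answer:  # hypothetical action: stop and emit the final answer
    text: str


def downsample(image, factor):
    """Keep every `factor`-th pixel to form the coarse view."""
    return [row[::factor] for row in image[::factor]]


def crop(image, box):
    """Cut the requested region out of the full-resolution image."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]


def run(image, question, policy, max_foveations=3, factor=4):
    """Single trajectory: the context grows as foveated evidence is injected."""
    context = [("question", question), ("coarse", downsample(image, factor))]
    for _ in range(max_foveations):
        action = policy(context)
        if isinstance(action, Answer):
            return action.text, context
        # Inject the high-resolution crop back into the same context.
        context.append(("fovea", action.box, crop(image, action.box)))
    # Budget spent: ask once more and accept only an answer.
    final = policy(context)
    return (final.text if isinstance(final, Answer) else None), context


def toy_policy(context):
    """Stand-in for the model: foveate once on a fixed region, then answer."""
    foveas = [item for item in context if item[0] == "fovea"]
    if not foveas:
        return Foveate(box=(0, 4, 4, 4))
    patch = foveas[-1][2]
    return Answer(text=str(max(max(row) for row in patch)))


# An 8x8 image whose only bright pixel is invisible in the coarse view.
image = [[0] * 8 for _ in range(8)]
image[1][5] = 9
answer, context = run(image, "What is the brightest value?", toy_policy)
print(answer, len(context))
```

In this toy run the coarse view misses the bright pixel, so the policy needs one foveation before it can answer. A learned policy would instead decide from the context whether and where to look.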
