Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

Juhong Min; Lazar Valkov; Vitali Petsiuk; Hossein Souri; Deen Dayal Mohan

arXiv:2604.21079·cs.CV·April 24, 2026

TL;DR

This paper introduces Foveated Reasoner, a vision-language model that mimics human foveation by selectively focusing high-resolution analysis on important regions, improving accuracy with fewer visual tokens.

Contribution

It unifies foveation and reasoning in a single autoregressive framework, trained with reinforcement learning to optimize selective high-resolution evidence acquisition.
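One way to picture the reinforcement-learning objective is a reward that pays for a correct answer and charges for the visual tokens spent to get it. The sketch below is illustrative only: the summary here does not give the paper's reward, so the function name `reward`, the linear cost term, and the `cost_weight` parameter are all assumptions.

```python
def reward(correct: bool, visual_tokens: int, full_res_tokens: int,
           cost_weight: float = 0.5) -> float:
    """Hypothetical reward: task accuracy minus a visual-token cost.

    Assumption, not the paper's formula: the cost grows linearly with the
    fraction of the full-resolution token count that the model consumed.
    """
    if full_res_tokens <= 0:
        raise ValueError("full_res_tokens must be positive")
    # Fraction of the full-resolution budget that was used, capped at 1.
    usage = min(visual_tokens / full_res_tokens, 1.0)
    accuracy_term = 1.0 if correct else 0.0
    return accuracy_term - cost_weight * usage


# A correct answer with selective viewing beats a correct answer
# obtained by looking at everything in high resolution.
selective = reward(correct=True, visual_tokens=20, full_res_tokens=64)
see_everything = reward(correct=True, visual_tokens=64, full_res_tokens=64)
```

Under this assumed shaping, both rollouts answer correctly, but the selective one scores higher, which is the kind of pressure that would push a policy toward acquiring high-resolution evidence only where it helps.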

Findings

01. Achieves stronger accuracy under limited visual tokens across benchmarks.

02. Learns effective foveation policies that focus on important regions.

03. Reduces compute overhead by selectively acquiring high-resolution evidence.

Abstract

Vision-language models benefit from high-resolution images, but the resulting increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show…
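The decoding loop described in the abstract (coarse view first, foveate on demand, inject the retrieved evidence into the same trajectory) can be sketched as a toy program. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Foveate`, `Emit`, `Trajectory`, `decode`, and `toy_policy` are hypothetical, the "image" is a grid of integers, one pixel stands in for one visual token, and the hand-written policy stands in for the learned model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), full-res pixels
Image = List[List[int]]


@dataclass
class Foveate:
    """Action: request high-resolution evidence from a region."""
    box: Box


@dataclass
class Emit:
    """Action: emit an ordinary text token."""
    token: str


@dataclass
class Trajectory:
    context: List[str] = field(default_factory=list)
    visual_tokens: int = 0


def encode(patch: Image) -> List[str]:
    # Stand-in for a vision encoder: one token per pixel (assumption).
    return [f"<v:{p}>" for row in patch for p in row]


def decode(image: Image, policy: Callable[[List[str]], Union[Foveate, Emit]],
           stride: int = 4, max_steps: int = 32) -> Trajectory:
    traj = Trajectory()
    # Start from a low-resolution view: keep every `stride`-th pixel.
    coarse = encode([row[::stride] for row in image[::stride]])
    traj.context += coarse
    traj.visual_tokens += len(coarse)
    for _ in range(max_steps):
        action = policy(traj.context)
        if isinstance(action, Foveate):
            left, top, right, bottom = action.box
            fine = encode([row[left:right] for row in image[top:bottom]])
            # Inject the evidence into the same trajectory, after a marker.
            traj.context += ["<fov>"] + fine
            traj.visual_tokens += len(fine)
        else:
            traj.context.append(action.token)
            if action.token == "<eos>":
                break
    return traj


def toy_policy(context: List[str]) -> Union[Foveate, Emit]:
    # Hand-written stand-in for the learned policy: look once at the
    # bottom-right quadrant, then answer whether a 9 was seen.
    if "<fov>" not in context:
        return Foveate(box=(4, 4, 8, 8))
    if context[-1] not in ("yes", "no"):
        return Emit("yes" if "<v:9>" in context else "no")
    return Emit("<eos>")


# 8x8 image with a single small detail that the coarse view misses.
image = [[0] * 8 for _ in range(8)]
image[5][6] = 9
traj = decode(image, toy_policy)
```

In this toy run the coarse view contributes 4 visual tokens and the single foveated crop contributes 16, so the answer is reached with 20 visual tokens instead of the 64 a full-resolution pass would need. Keeping text tokens and retrieved visual tokens in one `context` list mirrors the "single decoding trajectory" idea; how the real model represents the foveation action and the retrieved evidence is not specified in this summary.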
