GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti

TL;DR
GazeVLM is a multimodal architecture that controls its own attention internally to improve spatial reasoning and reduce hallucinations; it reports state-of-the-art performance on high-resolution multimodal tasks.
Contribution
The paper proposes GazeVLM, a model that internalizes active-vision principles through autonomous gaze control, improving reasoning without external cropping or an expanded context.
Findings
GazeVLM surpasses state-of-the-art VLMs in its class by nearly 4%.
It outperforms agentic multimodal pipelines by over 5% on HRBench datasets.
The model demonstrates improved spatial reasoning and reduced hallucinations.
Abstract
Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens, GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent,…
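The abstract excerpt does not spell out how a gaze token changes the causal attention mask, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes a sequence made of a row-major grid of image tokens followed by text tokens, and a hypothetical gaze event that restricts later text queries to image tokens inside a box. The function name, the box format, and the token layout are all assumptions made for illustration.

```python
import numpy as np


def gaze_causal_mask(grid_h, grid_w, num_text, gaze_step, box):
    """Boolean attention mask (True = may attend) for [image tokens | text tokens].

    Hypothetical layout (an assumption, not taken from the paper):
    grid_h * grid_w image tokens in row-major order, then num_text text tokens.
    From text index `gaze_step` onward, text queries may see only the image
    tokens inside `box` = (r0, c0, r1, c1), with inclusive row/column bounds.
    """
    n_img = grid_h * grid_w
    n = n_img + num_text

    # Start from the standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((n, n), dtype=bool))

    # Recover the (row, col) grid position of every image token.
    rows, cols = np.divmod(np.arange(n_img), grid_w)
    r0, c0, r1, c1 = box
    in_box = (rows >= r0) & (rows <= r1) & (cols >= c0) & (cols <= c1)

    # Text queries at or after the gaze step lose access to
    # image tokens that fall outside the gazed region.
    q_start = n_img + gaze_step
    mask[q_start:, :n_img] &= in_box
    return mask


# Example: 2x2 image grid, 3 text tokens, gaze on the top-left patch
# starting at the second text token.
example = gaze_causal_mask(grid_h=2, grid_w=2, num_text=3, gaze_step=1, box=(0, 0, 0, 0))
```

In this sketch the mask only narrows what later text tokens can read from the image; earlier text tokens and all text-to-text attention keep the ordinary causal pattern. How GazeVLM retains peripheral awareness of the global scene is not described in the excerpt and is not modeled here.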
