GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

Brown Ebouky; Gabriele Carrino; Niccolo Avogaro; Christoph Studer; Andrea Bartezzaghi; Mattia Rigotti

arXiv:2605.07817·cs.CV·May 11, 2026

TL;DR

GazeVLM is a multimodal architecture that controls its own attention internally to improve spatial reasoning and reduce hallucinations, achieving state-of-the-art performance on high-resolution multimodal tasks.

Contribution

The paper proposes GazeVLM, a model that internalizes active-vision principles through autonomous gaze control, enhancing reasoning without external cropping or an expanded context.

Findings

01. GazeVLM surpasses state-of-the-art VLMs in its class by nearly 4%.

02. It outperforms agentic multimodal pipelines by over 5% on the HRBench datasets.

03. The model demonstrates improved spatial reasoning and reduced hallucinations.

Abstract

Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens (<LOOK>), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent,…
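
The idea of a gaze token steering the causal attention mask can be illustrated with a toy sketch. This is not the authors' implementation: the truncated abstract does not say how a gaze token selects a region or how peripheral context is retained, so the `Token` class, the `region` field, and the rule "after a gaze token, later queries see only the image patches in its region" are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; none of these fields come from the paper."""
    kind: str                             # "img", "txt", or "look"
    patch: Optional[int] = None           # patch index, used by "img" tokens
    region: FrozenSet[int] = frozenset()  # selected patches, used by "look" tokens


def build_gaze_mask(tokens: List[Token]) -> List[List[bool]]:
    """Return mask[i][j] == True when query i may attend to key j.

    Assumed rule: attention is causal (j <= i); text and gaze tokens are
    always visible; image patches are all visible until a gaze token has
    been emitted, after which only patches in the most recent gaze region
    remain visible to later queries.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    active_region = None  # None: no gaze issued yet, so every patch is visible

    for i, query in enumerate(tokens):
        for j in range(i + 1):  # causal constraint: keys at or before the query
            key = tokens[j]
            if key.kind == "img" and active_region is not None:
                mask[i][j] = key.patch in active_region
            else:
                mask[i][j] = True
        # The gaze takes effect for the tokens that follow the gaze token.
        if query.kind == "look":
            active_region = query.region

    return mask


# Four image patches, a text token, a gaze token selecting patches 1 and 2,
# then one more text token generated under the narrowed view.
tokens = [Token("img", patch=p) for p in range(4)]
tokens += [Token("txt"), Token("look", region=frozenset({1, 2})), Token("txt")]
mask = build_gaze_mask(tokens)
```

In this sketch the final text token attends to patches 1 and 2 but not to patches 0 and 3, while the sequence length stays the same. That mirrors the claim that focus is redirected without external cropping or expanded context, although the actual mechanism in GazeVLM may differ substantially.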
