Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Junha Song, Byeongho Heo, Geonmo Gu, Jaegul Choo, Dongyoon Han, Sangdoo Yun

TL;DR
This paper introduces Gaze Attention, a mechanism for multimodal large language models that selectively attends to relevant visual regions, reducing computation and improving focus without losing global context.
Contribution
It proposes a novel gaze-based attention method that dynamically focuses on task-relevant visual regions, cutting attention computation by up to 90% while maintaining or improving performance.
Findings
Gaze Attention matches or surpasses dense-attention baselines.
Reduces attention computation by up to 90%.
Maintains holistic visual awareness with learnable context tokens.
Abstract
When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings (stored as key-value caches) into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further…
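The region-selection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract only says each region has a "lightweight descriptor", so the mean-pooled key used here is an assumption, as are the function names (`group_into_regions`, `gaze_attention_step`), the square-window grouping, the dot-product region score, and the `top_k` parameter. The sketch uses a single head and a single query, and it omits the global-context component because the abstract is truncated before describing it.

```python
import numpy as np

def group_into_regions(tokens, grid_h, grid_w, region):
    """Group (grid_h*grid_w, d) row-major tokens into square windows.

    Returns an array of shape (num_regions, region*region, d).
    """
    assert grid_h % region == 0 and grid_w % region == 0
    d = tokens.shape[-1]
    x = tokens.reshape(grid_h // region, region, grid_w // region, region, d)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two window indices together
    return x.reshape(-1, region * region, d)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gaze_attention_step(query, keys, values, grid_h, grid_w, region, top_k):
    """One decoding step: pick top_k regions, then attend only inside them."""
    d = query.shape[-1]
    k_regions = group_into_regions(keys, grid_h, grid_w, region)
    v_regions = group_into_regions(values, grid_h, grid_w, region)

    # Assumption: one descriptor per region, taken as the mean of its keys.
    descriptors = k_regions.mean(axis=1)            # (num_regions, d)
    region_scores = descriptors @ query             # (num_regions,)
    selected = np.argsort(region_scores)[-top_k:]   # highest-scoring regions

    # Standard scaled dot-product attention, restricted to the selected tokens.
    k_sel = k_regions[selected].reshape(-1, d)
    v_sel = v_regions[selected].reshape(-1, d)
    weights = softmax(k_sel @ query / np.sqrt(d))
    return weights @ v_sel, np.sort(selected)

# Usage: a 4x4 grid of visual tokens, 2x2 regions, attend to 1 of 4 regions.
rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 8))
values = rng.normal(size=(16, 8))
query = rng.normal(size=8)
output, chosen = gaze_attention_step(query, keys, values, 4, 4, region=2, top_k=1)
```

In this sketch the token-level attention cost scales with `top_k * region * region` rather than with the full number of visual tokens, at the price of scoring one descriptor per region. Setting `top_k` to the total number of regions recovers ordinary dense attention.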
