VL4Gaze: Unleashing Vision-Language Models for Gaze Following

Shijing Wang; Chaoqun Cui; Yaping Huang; Hyung Jin Chang; Yihua Cheng

arXiv:2512.20735·cs.CV·December 25, 2025

Open Access

TL;DR

VL4Gaze introduces a large-scale benchmark to evaluate and enhance vision-language models' ability to understand gaze, revealing current limitations and the benefits of targeted training for gaze-related tasks.

Contribution

This paper presents VL4Gaze, the first comprehensive benchmark for gaze understanding in vision-language models, and demonstrates the importance of task-specific supervision for improving gaze interpretation.

Findings

01. Large-scale VLMs struggle with gaze semantics without supervision.

02. Training on VL4Gaze significantly improves gaze understanding.

03. Targeted multi-task training enhances VLMs' gaze-related capabilities.

Abstract

Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze…

Taxonomy

Topics: Multimodal Machine Learning Applications · Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection