Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
Hengfei Wang, Anshul Gupta, Pierre Vuillecard, Jean-Marc Odobez

TL;DR
This paper introduces EyeVLM, a systematic evaluation framework that assesses how well vision-language models (VLMs) understand human gaze across multiple tasks and models, and it reveals their current limitations.
Contribution
The work systematically benchmarks VLMs on gaze following and social gaze prediction, highlighting their deficiencies and exploring zero-shot and fine-tuning approaches.
Findings
Current VLMs lack precise gaze understanding.
Standard training narrows the performance gap between VLMs and visual models.
Significant improvements are still needed.
Abstract
Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the…
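The abstract defines gaze following as predicting the 2D location where a person is looking. One common way to score such a prediction is the Euclidean distance between the predicted and annotated points, with each coordinate normalized by the image size. The sketch below illustrates that idea only. The function names `gaze_l2_error` and `mean_gaze_error` are hypothetical, and the source does not confirm that EyeVLM uses this exact metric.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]   # (x, y) in pixels
Size = Tuple[int, int]        # (width, height) in pixels


def gaze_l2_error(pred: Point, target: Point, image_size: Size) -> float:
    """Distance between predicted and annotated gaze points.

    Assumption: coordinates are divided by image width and height so that
    errors are comparable across images of different resolutions.
    """
    width, height = image_size
    dx = (pred[0] - target[0]) / width
    dy = (pred[1] - target[1]) / height
    return math.hypot(dx, dy)


def mean_gaze_error(
    preds: Sequence[Point],
    targets: Sequence[Point],
    image_sizes: Sequence[Size],
) -> float:
    """Average the per-image normalized distance over a set of predictions."""
    if not (len(preds) == len(targets) == len(image_sizes)):
        raise ValueError("preds, targets, and image_sizes must have equal length")
    if not preds:
        raise ValueError("at least one prediction is required")
    errors = [
        gaze_l2_error(p, t, s) for p, t, s in zip(preds, targets, image_sizes)
    ]
    return sum(errors) / len(errors)
```

A lower value means the predicted point lies closer to the annotated gaze target. For a VLM, the predicted point would first have to be parsed from the model's text output, a step this sketch leaves out.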
