Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Hengfei Wang; Anshul Gupta; Pierre Vuillecard; Jean-Marc Odobez

arXiv:2605.19859 · cs.CV · May 20, 2026

TL;DR

This paper introduces EyeVLM, a systematic framework for evaluating how well vision-language models understand human gaze along two dimensions, tasks and models. The evaluation shows that current VLMs still have clear limitations.

Contribution

The work systematically benchmarks VLMs on gaze following and social gaze prediction, highlighting their deficiencies and exploring zero-shot and fine-tuning approaches.

Findings

01 Current VLMs lack precise gaze understanding.

02 Standard training reduces the gap with visual models.

03 Significant improvements are still needed.

Abstract

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the…
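To make the gaze following task concrete, here is a minimal sketch of how a VLM's free-text answer might be scored against an annotated gaze point. It is an illustration only, not the EyeVLM protocol: the visible part of the abstract does not say which prompt format or metric the benchmark uses. The sketch assumes that the model replies with a pair of image-normalized coordinates in [0, 1] and that error is measured as normalized L2 distance, a measure commonly reported in the gaze following literature. The names `parse_point`, `normalized_l2`, and `score_reply` are hypothetical.

```python
import math
import re
from typing import Optional, Tuple

Point = Tuple[float, float]  # (x, y), assumed normalized to [0, 1]

# Matches the first "number, number" pair in a reply, e.g. "(0.30, 0.40)".
_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_point(reply: str) -> Optional[Point]:
    """Extract the first coordinate pair from a model reply, or None."""
    match = _PAIR.search(reply)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def normalized_l2(pred: Point, target: Point) -> float:
    """Euclidean distance between two points in normalized image space."""
    return math.hypot(pred[0] - target[0], pred[1] - target[1])


def score_reply(reply: str, target: Point) -> Optional[float]:
    """Score one reply; None means the reply contained no usable point."""
    pred = parse_point(reply)
    if pred is None:
        return None
    # Assumption: out-of-frame predictions are clamped to the image border.
    clamped = (min(max(pred[0], 0.0), 1.0), min(max(pred[1], 0.0), 1.0))
    return normalized_l2(clamped, target)


if __name__ == "__main__":
    reply = "The person is looking at (0.30, 0.40)."
    print(score_reply(reply, (0.0, 0.0)))
```

Returning `None` for unparseable replies, rather than a large default error, keeps format failures separate from localization errors so the two can be reported independently.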
