Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues
Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, and Dezhi Luo

TL;DR
This paper investigates how Vision-Language Models (VLMs) infer where a person is looking. It finds that they rely on head orientation rather than eye appearance, which leaves them well behind humans at reading this nonverbal communication cue.
Contribution
The paper shows that VLMs predominantly use head orientation rather than eye appearance to infer gaze, a bias that likely stems from training data rather than model architecture.
Findings
VLMs perform substantially worse than humans at inferring gaze targets.
VLMs rely on head orientation cues instead of eye appearance.
A proof-of-concept experiment finetuning a transformer-based vision model suggests the bias comes from data rather than architecture.
Abstract
Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained. We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as resolution and object-naming skills, and identified the main reason for the gap as VLMs inferring gaze direction using head orientation rather than eye appearance. Such a bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision…
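The head-orientation manipulation described in the abstract suggests a simple analysis, sketched below in Python. This is an illustrative sketch, not the authors' evaluation code: the `Trial` record, the condition labels (`"congruent"`, `"incongruent"`, `"unconstrained"`), and both function names are hypothetical, and the toy trials are made up purely to exercise the functions. The idea is that accuracy can be compared across head-orientation conditions, and that on trials where the head points at a distractor, the share of answers that name the head's target indicates how strongly a model follows the head rather than the eyes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Trial:
    # Hypothetical record layout; the paper's actual data format is not given here.
    condition: str               # "congruent", "incongruent", or "unconstrained"
    gaze_target: str             # object the person is actually looking at
    head_target: Optional[str]   # object the head points toward (None if unconstrained)
    prediction: str              # object named by the model


def accuracy_by_condition(trials: List[Trial]) -> Dict[str, float]:
    """Fraction of trials per condition where the prediction matches the gaze target."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for t in trials:
        total[t.condition] += 1
        correct[t.condition] += int(t.prediction == t.gaze_target)
    return {c: correct[c] / total[c] for c in total}


def head_following_rate(trials: List[Trial]) -> Optional[float]:
    """Among incongruent trials, fraction where the prediction matches the head's target."""
    incongruent = [t for t in trials if t.condition == "incongruent"]
    if not incongruent:
        return None
    followed = sum(t.prediction == t.head_target for t in incongruent)
    return followed / len(incongruent)


# Toy trials, invented for illustration only (not results from the paper).
toy_trials = [
    Trial("congruent", "cup", "cup", "cup"),
    Trial("congruent", "book", "book", "book"),
    Trial("incongruent", "cup", "book", "book"),  # prediction follows the head
    Trial("incongruent", "pen", "cup", "pen"),    # prediction follows the eyes
    Trial("unconstrained", "pen", None, "pen"),
]

print(accuracy_by_condition(toy_trials))
print(head_following_rate(toy_trials))
```

A high head-following rate paired with low accuracy on incongruent trials would be the pattern consistent with the bias the abstract describes.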
