Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following
Anshul Gupta, Pierre Vuillecard, Arya Farkhondeh, Jean-Marc Odobez

TL;DR
This paper investigates the zero-shot capabilities of Vision-Language Models for extracting contextual cues to enhance gaze following, demonstrating that leveraging these models improves performance and generalization in gaze prediction tasks.
Contribution
The study evaluates various VLMs and prompting strategies for zero-shot cue recognition and integrates these cues into a state-of-the-art gaze following model, showing improved results.
Findings
BLIP-2 is the overall top-performing VLM for cue extraction
In-context learning (ICL) can improve cue recognition performance
Ensembling prompts enhances robustness
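The prompt-ensembling finding can be illustrated with a minimal sketch. The summary does not say how the authors combine prompts, so the version below simply averages per-prompt scores, which is one common choice and an assumption here. The helper name `ensemble_cue_score`, the prompt wordings, and the scores are all hypothetical; in practice the scores would come from a VLM such as BLIP-2.

```python
from statistics import mean


def ensemble_cue_score(prompt_scores):
    """Average the scores that several prompt phrasings give to one cue.

    Hypothetical helper: averaging is an assumed ensembling rule,
    not necessarily the one used in the paper.
    """
    if not prompt_scores:
        raise ValueError("need at least one prompt score")
    return mean(prompt_scores.values())


# Made-up scores for a single cue ("person is holding an object"),
# one per prompt phrasing. A real pipeline would obtain these from a VLM.
scores = {
    "Is the person holding an object?": 0.9,
    "Does this person have something in their hands?": 0.6,
    "A photo of a person holding something.": 0.3,
}

ensembled = ensemble_cue_score(scores)
print(f"ensembled cue score: {ensembled:.2f}")
```

Averaging damps the effect of any single poorly worded prompt, which is the intuition behind the robustness claim: no one phrasing decides the cue on its own.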
Abstract
Contextual cues related to a person's pose and interactions with objects and other people in the scene can provide valuable information for gaze following. While existing methods have focused on dedicated cue extraction methods, in this work we investigate the zero-shot capabilities of Vision-Language Models (VLMs) for extracting a wide array of contextual cues to improve gaze following performance. We first evaluate various VLMs, prompting strategies, and in-context learning (ICL) techniques for zero-shot cue recognition performance. We then use these insights to extract contextual cues for gaze following, and investigate their impact when incorporated into a state-of-the-art model for the task. Our analysis indicates that BLIP-2 is the overall top-performing VLM and that ICL can improve performance. We also observe that VLMs are sensitive to the choice of the text prompt although…
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Spatial Cognition and Navigation
Methods: Sparse Evolutionary Training
