Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following

Anshul Gupta; Pierre Vuillecard; Arya Farkhondeh; Jean-Marc Odobez

arXiv:2406.03907·cs.CV·June 7, 2024·2 cites

Open Access

TL;DR

This paper investigates the zero-shot capabilities of Vision-Language Models for extracting contextual cues to enhance gaze following, demonstrating that leveraging these models improves performance and generalization in gaze prediction tasks.

Contribution

The study evaluates various VLMs and prompting strategies for zero-shot cue recognition and integrates these cues into a state-of-the-art gaze following model, showing improved results.

Findings

01. BLIP-2 is the top-performing VLM overall for cue extraction

02. In-context learning can improve cue recognition performance

03. Ensembling over text prompts enhances robustness
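
As an illustration of the third finding, the following is a minimal sketch of prompt ensembling for a single contextual cue. It assumes each text prompt yields a score in [0, 1] for the cue and that the ensemble is a plain average; the summary here does not specify the paper's actual aggregation rule, and the prompts, scores, and the function name `ensemble_cue_score` are hypothetical.

```python
from statistics import mean


def ensemble_cue_score(prompt_scores):
    """Average the per-prompt scores for one cue.

    prompt_scores maps each text prompt to the score a VLM assigned
    to the cue under that prompt. Plain averaging is an assumption
    made for illustration, not the paper's confirmed method.
    """
    if not prompt_scores:
        raise ValueError("at least one prompt score is required")
    return mean(prompt_scores.values())


# Hypothetical scores for the cue "person is holding an object".
# The third prompt disagrees with the other two, which mimics the
# sensitivity to prompt wording noted in the abstract.
scores = {
    "Is the person holding an object?": 0.9,
    "Is this person holding something in their hands?": 0.7,
    "A photo of a person holding an object.": 0.2,
}

ensembled = ensemble_cue_score(scores)
```

Averaging dampens the effect of any single poorly worded prompt: the ensembled score stays closer to the majority of prompts than the outlier does.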

Abstract

Contextual cues related to a person's pose and interactions with objects and other people in the scene can provide valuable information for gaze following. While existing methods have focused on dedicated cue extraction methods, in this work we investigate the zero-shot capabilities of Vision-Language Models (VLMs) for extracting a wide array of contextual cues to improve gaze following performance. We first evaluate various VLMs, prompting strategies, and in-context learning (ICL) techniques for zero-shot cue recognition performance. We then use these insights to extract contextual cues for gaze following, and investigate their impact when incorporated into a state-of-the-art model for the task. Our analysis indicates that BLIP-2 is the overall top-performing VLM and that ICL can improve performance. We also observe that VLMs are sensitive to the choice of the text prompt although…

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Spatial Cognition and Navigation

Methods: Sparse Evolutionary Training