CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model

Pengwei Yin; Guanzhong Zeng; Jingjing Wang; Di Xie

arXiv:2403.05124 · cs.CV · March 11, 2024 · 1 citation

TL;DR

CLIP-Gaze introduces a novel vision-language framework for gaze estimation that leverages pre-trained models and prompt tuning to enhance cross-domain generalization, outperforming existing methods across multiple datasets.

Contribution

This work is the first to apply a vision-and-language cross-modality approach to gaze estimation, utilizing prompt optimization and sample relationships for improved domain generalization.

Findings

01. Outperforms existing methods on four cross-domain benchmarks

02. Utilizes a pre-trained vision-language model for gaze feature extraction

03. Employs prompt tuning and sample relationship modeling to enhance generalization

Abstract

Gaze estimation methods often suffer significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue with various domain generalization approaches, but with little success, because gaze datasets have limited diversity in factors such as appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to apply a vision-and-language cross-modality approach to the gaze estimation task. Specifically, we extract the gaze-relevant feature by pushing it away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context…
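The idea of pushing a gaze-relevant feature away from gaze-irrelevant features can be illustrated with a small sketch. The abstract does not give the actual loss, so everything here is an assumption made for illustration: the function names `cosine_similarity` and `irrelevance_loss`, the choice of mean absolute cosine similarity as the penalty, and the toy 3-dimensional vectors standing in for image and text encoder outputs. It is not the authors' formulation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def irrelevance_loss(gaze_feature, irrelevant_features):
    """Hypothetical separation penalty (an assumption, not the paper's loss).

    Averages the absolute cosine similarity between the gaze feature and
    each gaze-irrelevant feature. The value is 0 when the gaze feature is
    orthogonal to all of them and 1 when it is collinear with all of them,
    so minimizing it pushes the gaze feature away from the irrelevant ones.
    """
    sims = [abs(cosine_similarity(gaze_feature, f)) for f in irrelevant_features]
    return sum(sims) / len(sims)


# Toy vectors standing in for encoder outputs (hypothetical values).
gaze_feature = [1.0, 0.0, 0.0]
irrelevant_features = [
    [0.0, 1.0, 0.0],  # e.g. text feature for an appearance description
    [1.0, 0.0, 0.0],  # e.g. text feature for an image-quality description
]

loss = irrelevance_loss(gaze_feature, irrelevant_features)
print(loss)
```

In a real pipeline the vectors would come from a pre-trained vision-language model's image and text encoders, and a penalty of this kind would be combined with a gaze regression objective; how the paper does that is not stated in the text above.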

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Hand Gesture Recognition Systems · Gait Recognition and Analysis