CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model
Pengwei Yin, Guanzhong Zeng, Jingjing Wang, Di Xie

TL;DR
CLIP-Gaze introduces a novel vision-language framework for gaze estimation that leverages pre-trained models and prompt tuning to enhance cross-domain generalization, outperforming existing methods across multiple datasets.
Contribution
This work is the first to apply a vision-and-language cross-modality approach to gaze estimation, utilizing prompt optimization and sample relationships for improved domain generalization.
Findings
Outperforms existing methods on four cross-domain benchmarks
Utilizes a pre-trained vision-language model for gaze feature extraction
Employs prompt tuning and sample relationship modeling to enhance generalization
Abstract
Gaze estimation methods often suffer significant performance degradation when evaluated across different domains because of the gap between training and testing data. Existing methods try to address this issue using various domain generalization approaches, but with little success because gaze datasets offer limited diversity in factors such as appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to leverage the vision-and-language cross-modality approach for the gaze estimation task. Specifically, we extract the gaze-relevant feature by pushing it away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context…
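The idea of pushing a gaze feature away from gaze-irrelevant features can be illustrated with a small sketch. This is not the authors' loss. It assumes, for illustration only, that the irrelevant features are text embeddings of attribute descriptions and that "pushing away" means penalizing the absolute cosine similarity. The function names and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def irrelevance_penalty(gaze_feature, irrelevant_features):
    """Mean absolute cosine similarity to the gaze-irrelevant features.

    Assumed form of the penalty (hypothetical, not taken from the paper):
    minimizing it drives the gaze feature toward orthogonality with
    every irrelevant feature.
    """
    sims = [abs(cosine(gaze_feature, f)) for f in irrelevant_features]
    return sum(sims) / len(sims)


# Hypothetical stand-ins for text embeddings of gaze-irrelevant
# descriptions (for example, one about glasses and one about lighting).
irrelevant = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

entangled = [1.0, 1.0, 0.0]    # overlaps with both irrelevant directions
disentangled = [0.0, 0.0, 1.0]  # orthogonal to both

penalty_entangled = irrelevance_penalty(entangled, irrelevant)
penalty_disentangled = irrelevance_penalty(disentangled, irrelevant)
```

In this toy setup the entangled feature receives a larger penalty than the disentangled one, so gradient-based training on such a term would move features away from the directions described by the prompts. In a real system the vectors would come from the pre-trained text and image encoders rather than being written by hand.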
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Hand Gesture Recognition Systems · Gait Recognition and Analysis
