Recurrent CNN for 3D Gaze Estimation using Appearance and Shape Cues
Cristina Palmero, Javier Selva, Mohammad Ali Bagheri, Sergio Escalera

TL;DR
This paper introduces a multi-modal recurrent CNN for 3D gaze estimation that integrates appearance and shape cues, improving on the state of the art on the EYEDIAP dataset by 14.6%, with a further 4% gain when temporal information is added.
Contribution
It presents a novel multi-modal recurrent CNN architecture that combines face, eyes, and landmarks for person- and head pose-independent 3D gaze estimation.
Findings
The static multi-modal model achieves a 14.6% improvement over the state of the art on the EYEDIAP dataset.
Adding temporal information improves the result by a further 4%.
The approach is evaluated across a wide range of head poses and gaze directions.
Abstract
Gaze behavior is an important non-verbal cue in social signal processing and human-computer interaction. In this paper, we tackle the problem of person- and head pose-independent 3D gaze estimation from remote cameras, using a multi-modal recurrent convolutional neural network (CNN). We propose to combine face, eyes region, and face landmarks as individual streams in a CNN to estimate gaze in still images. Then, we exploit the dynamic nature of gaze by feeding the learned features of all the frames in a sequence to a many-to-one recurrent module that predicts the 3D gaze vector of the last frame. Our multi-modal static solution is evaluated on a wide range of head poses and gaze directions, achieving a significant improvement of 14.6% over the state of the art on the EYEDIAP dataset, further improved by 4% when the temporal modality is included.
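The data flow described in the abstract (per-frame features from three streams, fused, then passed through a many-to-one recurrent module that outputs one 3D gaze vector for the last frame) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the CNN streams are replaced by placeholder feature vectors, fusion is assumed to be simple concatenation, the recurrent unit is a plain tanh cell with random untrained weights, and the unit-length normalization of the output is also an assumption. The names `fuse_streams` and `ManyToOneRNN` are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def fuse_streams(face_feat, eyes_feat, landmark_feat):
    """Assumed fusion: concatenate the three per-frame feature vectors."""
    return list(face_feat) + list(eyes_feat) + list(landmark_feat)


class ManyToOneRNN:
    """Plain tanh recurrent cell, read out once at the last time step.

    Weights are random and untrained; this only illustrates the data flow.
    """

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            scale = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-scale, scale) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_in = matrix(hidden_size, input_size)    # input -> hidden
        self.w_rec = matrix(hidden_size, hidden_size)  # hidden -> hidden
        self.w_out = matrix(3, hidden_size)            # hidden -> 3D gaze

    def forward(self, frames):
        """frames: list of fused feature vectors, oldest first."""
        h = [0.0] * self.hidden_size
        for x in frames:
            # The hidden state carries information from all earlier frames.
            h = [math.tanh(dot(wi, x) + dot(wr, h))
                 for wi, wr in zip(self.w_in, self.w_rec)]
        # Many-to-one: a single output, computed from the final hidden state.
        g = [dot(wo, h) for wo in self.w_out]
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        return [v / norm for v in g]  # assumed: unit-length gaze direction


# Usage: a 4-frame sequence with placeholder features for each stream.
rng = random.Random(1)
sequence = [
    fuse_streams(
        [rng.uniform(-1, 1) for _ in range(4)],  # stand-in for face features
        [rng.uniform(-1, 1) for _ in range(4)],  # stand-in for eye features
        [rng.uniform(-1, 1) for _ in range(2)],  # stand-in for landmark features
    )
    for _ in range(4)
]
model = ManyToOneRNN(input_size=10, hidden_size=6)
gaze = model.forward(sequence)
```

Only the last hidden state is read out, so the sequence yields a single prediction for its final frame, while earlier frames influence it through the recurrent state. In a trained system the placeholder vectors would come from the learned CNN streams, and a gated unit could replace the plain tanh cell.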
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Hand Gesture Recognition Systems · Advanced Computing and Algorithms
