Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation

Alexandre Personnic; Mihai Bâce

arXiv:2512.17673 · cs.CV · December 22, 2025

TL;DR

This paper introduces ST-Gaze, a spatio-temporal model that captures intra-frame spatial context and inter-frame temporal dynamics for video-based gaze estimation, achieving state-of-the-art results on the EVE dataset.

Contribution

The paper presents a spatio-temporal gaze estimation model that combines a CNN backbone, channel attention and self-attention modules, and sequence modeling; it outperforms existing methods on the EVE dataset.

Findings

01  ST-Gaze achieves state-of-the-art performance on the EVE dataset.

02  Preserving intra-frame spatial context improves gaze estimation accuracy.

03  The model captures both spatial and temporal gaze dynamics.

Abstract

Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited by the feature representations both within a frame and across multiple frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing the model to capture intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into…
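The data flow described in the abstract can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: every function name here is hypothetical, the channel gate is a parameter-free stand-in for a learned channel attention module, the self-attention uses identity projections instead of learned ones, and exponential smoothing stands in for whatever learned sequence model ST-Gaze uses. It assumes eye and face feature maps have already been produced by a backbone and share the same number of spatial positions, and it stops at a per-frame descriptor rather than regressing a gaze direction.

```python
import math
import random

def channel_attention(fmap):
    """Reweight each channel by sigmoid(mean activation).
    fmap: C channels, each a list of N spatial values.
    Assumption: a parameter-free gate replaces the learned one."""
    gates = [1.0 / (1.0 + math.exp(-sum(ch) / len(ch))) for ch in fmap]
    return [[g * v for v in ch] for g, ch in zip(gates, fmap)]

def to_spatial_sequence(fmap):
    """Turn a (C, N) map into N tokens of dimension C."""
    return [list(tok) for tok in zip(*fmap)]

def self_attention(tokens):
    """Single-head scaled dot-product attention over spatial tokens.
    Assumption: queries, keys and values are the tokens themselves."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

def temporal_update(state, tokens, alpha=0.5):
    """Carry per-position context to the next frame.
    Assumption: exponential smoothing replaces a learned sequence model."""
    if state is None:
        return tokens
    return [[alpha * s + (1.0 - alpha) * t for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(state, tokens)]

def run_clip(frames):
    """frames: list of (eye_map, face_map) pairs, each map of shape (C, N).
    Returns one pooled descriptor per frame."""
    state, descriptors = None, []
    for eye, face in frames:
        fused = channel_attention(eye + face)  # concatenate along channels
        tokens = self_attention(to_spatial_sequence(fused))
        state = temporal_update(state, tokens)
        dim = len(state[0])
        descriptors.append([sum(tok[j] for tok in state) / len(state)
                            for j in range(dim)])
    return descriptors

def random_map(channels, positions):
    return [[random.uniform(-1.0, 1.0) for _ in range(positions)]
            for _ in range(channels)]

random.seed(0)
clip = [(random_map(2, 6), random_map(2, 6)) for _ in range(3)]
descriptors = run_clip(clip)
print(len(descriptors), len(descriptors[0]))
```

The point of the sketch is the ordering: spatial positions stay separate tokens through the attention step and the temporal update, and pooling happens only at the end, which mirrors the abstract's emphasis on preserving intra-frame context before modeling inter-frame dynamics.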

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Vestibular and auditory disorders