ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation
Jia Li, Wenjie Zhao, Shijian Deng, Bolin Lai, Yuheng Wu, Ruijia Chen, Jon E. Froehlich, Yuhang Zhao, Yapeng Tian

TL;DR
ARGaze introduces an autoregressive transformer model for online egocentric gaze estimation, leveraging temporal continuity and recent gaze history to improve prediction accuracy in first-person videos.
Contribution
The paper presents a novel autoregressive transformer approach for online egocentric gaze estimation, emphasizing sequential prediction with bounded gaze history for improved robustness.
Findings
Achieves state-of-the-art performance on egocentric gaze estimation benchmarks.
Autoregressive modeling with recent gaze history is crucial for robustness.
Enables bounded-resource streaming inference for real-time applications.
Abstract
Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii)…
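The streaming loop described above (predict the current gaze from the current frame plus a bounded window of recent gaze, then feed the prediction back in) can be sketched in a few lines. This is only an illustration under stated assumptions, not the ARGaze model: `predict_gaze` is a hypothetical stand-in that blends a per-frame visual cue with the mean of recent gaze, where the paper uses a learned transformer decoder, and the names `stream_gaze`, `history_len`, and `prior_weight` are invented for this example.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple

Gaze = Tuple[float, float]  # normalized (x, y) image coordinates in [0, 1]


def predict_gaze(visual_cue: Gaze, history: Deque[Gaze],
                 prior_weight: float = 0.5) -> Gaze:
    """Hypothetical stand-in for a learned decoder (NOT the paper's model).

    Blends the current frame's visual cue with the mean of recent gaze,
    which acts as a simple temporal-continuity prior.
    """
    if not history:
        # No past gaze yet: rely on the visual cue alone.
        return visual_cue
    mean_x = sum(g[0] for g in history) / len(history)
    mean_y = sum(g[1] for g in history) / len(history)
    return (
        (1 - prior_weight) * visual_cue[0] + prior_weight * mean_x,
        (1 - prior_weight) * visual_cue[1] + prior_weight * mean_y,
    )


def stream_gaze(visual_cues: Iterable[Gaze], history_len: int = 4) -> List[Gaze]:
    """Online loop: uses only past and current frames, with bounded memory."""
    # deque(maxlen=...) silently drops the oldest entry once full,
    # so memory stays constant however long the stream runs.
    history: Deque[Gaze] = deque(maxlen=history_len)
    outputs: List[Gaze] = []
    for cue in visual_cues:
        gaze = predict_gaze(cue, history)
        history.append(gaze)  # autoregressive: the prediction becomes context
        outputs.append(gaze)
    return outputs
```

Because the history buffer has a fixed maximum length, the per-frame cost and memory footprint do not grow with video length, which is the property the summary refers to as bounded-resource streaming inference.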
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Mind wandering and attention
