ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation

Jia Li; Wenjie Zhao; Shijian Deng; Bolin Lai; Yuheng Wu; Ruijia Chen; Jon E. Froehlich; Yuhang Zhao; Yapeng Tian

arXiv:2602.05132 · cs.CV · February 6, 2026

TL;DR

ARGaze introduces an autoregressive transformer model for online egocentric gaze estimation, leveraging temporal continuity and recent gaze history to improve prediction accuracy in first-person videos.

Contribution

The paper presents a novel autoregressive transformer approach for online egocentric gaze estimation, emphasizing sequential prediction with bounded gaze history for improved robustness.

Findings

01. Achieves state-of-the-art performance on egocentric benchmarks.

02. Autoregressive modeling with recent gaze history is crucial for robustness.

03. Enables bounded-resource streaming inference for real-time applications.

Abstract

Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii)…
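The streaming formulation described above can be sketched in a few lines. The abstract is cut off before it names the second conditioning input, so this sketch relies on the TL;DR and Contribution, which describe recent, bounded gaze history. It is a minimal illustration under stated assumptions, not the authors' implementation: `stream_gaze`, `toy_predict`, and `history_len` are hypothetical names, and `toy_predict` is a simple blend standing in for the transformer decoder.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence, Tuple

Gaze = Tuple[float, float]  # assumed representation: normalized (x, y) image coordinates


def stream_gaze(
    frames: Iterable[Sequence[float]],
    predict: Callable[[Sequence[float], List[Gaze]], Gaze],
    history_len: int = 4,  # hypothetical window size, not taken from the paper
) -> List[Gaze]:
    """Online loop: each prediction sees only the current frame and past predictions."""
    # A deque with maxlen drops the oldest entry automatically, so memory stays bounded.
    history: Deque[Gaze] = deque(maxlen=history_len)
    outputs: List[Gaze] = []
    for features in frames:
        gaze = predict(features, list(history))
        history.append(gaze)  # autoregressive step: feed the prediction back as context
        outputs.append(gaze)
    return outputs


def toy_predict(features: Sequence[float], history: List[Gaze]) -> Gaze:
    """Hypothetical stand-in for the decoder: blend a visual cue with the mean of recent gaze."""
    cue = (features[0], features[1])
    if not history:
        return cue
    mean_x = sum(g[0] for g in history) / len(history)
    mean_y = sum(g[1] for g in history) / len(history)
    return (0.5 * cue[0] + 0.5 * mean_x, 0.5 * cue[1] + 0.5 * mean_y)


if __name__ == "__main__":
    demo_frames = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
    print(stream_gaze(demo_frames, toy_predict, history_len=2))
```

Because the history buffer has a fixed maximum length, per-frame cost and memory do not grow with video length, which is the property the Findings refer to as bounded-resource streaming inference.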

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Mind wandering and attention