Infinite Gaze Generation for Videos with Autoregressive Diffusion
Jenna Kang, Colin Groth, Tong Wu, Finley Torrens, Patsorn Sangkloy, Gordon Wetzstein, Qi Sun

TL;DR
This paper introduces an autoregressive diffusion model for generating continuous, long-range human gaze trajectories in videos, surpassing existing short-term models in accuracy and realism.
Contribution
It presents a novel generative framework for infinite-horizon gaze prediction, capturing long-term dependencies and detailed temporal dynamics in videos.
Findings
Outperforms existing models in long-range accuracy
Produces more realistic gaze trajectories
Handles videos of arbitrary length
Abstract
Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows (3-5 s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach…
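To make the idea of autoregressive, chunk-by-chunk generation concrete, here is a minimal toy sketch. It is not the authors' implementation: the abstract does not specify the architecture, and every name here (`toy_denoiser`, `generate_gaze`, `chunk_len`, `context_len`, `steps`) is a hypothetical chosen for illustration. The hand-written `toy_denoiser` merely stands in for a learned network, and a single 2-D vector per chunk stands in for the saliency-aware visual latent. The sketch only shows the control flow: each chunk starts from noise, is iteratively refined while conditioned on the tail of the previous chunk, and is appended to the trajectory, so the output length grows with the video.

```python
import numpy as np

def toy_denoiser(x, context, feat, t):
    """Hypothetical stand-in for a learned denoiser (assumption, not the paper's model).

    Pulls the noisy chunk toward a target built from the last generated gaze
    point and a per-chunk visual feature. At t == 0 the update lands on the
    target, so this toy produces a constant gaze position per chunk.
    """
    target = 0.5 * context[-1] + 0.5 * feat  # shape (2,)
    return x + (target - x) / (t + 1)

def generate_gaze(features, chunk_len=8, context_len=4, steps=10, seed=0):
    """Generate normalized (x, y) gaze samples, one chunk per feature row."""
    rng = np.random.default_rng(seed)
    context = np.full((context_len, 2), 0.5)  # assume gaze starts at frame centre
    chunks = []
    for feat in features:  # one iteration per video chunk, any number of chunks
        x = rng.standard_normal((chunk_len, 2))  # start each chunk from noise
        for t in reversed(range(steps)):  # iterative refinement
            x = toy_denoiser(x, context, feat, t)
        x = np.clip(x, 0.0, 1.0)  # keep coordinates inside the frame
        chunks.append(x)
        context = x[-context_len:]  # condition the next chunk on this tail
    return np.concatenate(chunks, axis=0)

# Three chunks of hypothetical per-chunk features in normalized image coordinates.
features = np.array([[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]])
trajectory = generate_gaze(features)
print(trajectory.shape)
```

Because each chunk depends only on a fixed-length context and the current chunk's feature, memory use per step stays constant however long the video is, which is the property that makes arbitrary-length generation possible in this kind of loop.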
Taxonomy
Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Multimodal Machine Learning Applications
