Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
Giuseppe Cartella, Vittorio Cuculo, Alessandro D'Amelio, Marcella Cornia, Giuseppe Boccignone, Rita Cucchiara

TL;DR
This paper introduces ScanDiff, which combines diffusion models with Vision Transformers to generate diverse, realistic, and task-adaptive human gaze scanpaths. It reports better results than existing methods at capturing the variability of human visual exploration.
Contribution
The paper presents a novel diffusion-based architecture for scanpath prediction that explicitly models scanpath variability through the stochastic nature of diffusion models and adds textual conditioning for task-specific gaze behavior.
Findings
Outperforms state-of-the-art methods in both free-viewing and task-driven scenarios.
Produces more diverse and realistic scanpaths.
Effectively captures human visual exploration variability.
Abstract
Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses…
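
The idea of obtaining varied scanpaths from the stochasticity of a diffusion model can be illustrated with a small toy example. The sketch below is not the ScanDiff implementation: it runs a standard DDPM-style reverse process with NumPy, and everything specific in it is an assumption made for illustration, including the linear noise schedule, the representation of a fixation as (x, y, duration), the function names, and above all `toy_noise_predictor`, a hypothetical stand-in for a trained network conditioned on an image and a task description. The stand-in simply steers each sample toward the nearest of a few hand-written prototype scanpaths, so different noise draws can end in different viewing strategies.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def toy_noise_predictor(x_t, alpha_bar_t, prototypes):
    """Hypothetical stand-in for a trained, image- and text-conditioned denoiser.

    It takes the prototype scanpath nearest to the noisy input as its
    clean-sample estimate and returns the noise that estimate implies.
    """
    scaled = np.sqrt(alpha_bar_t) * prototypes              # (K, L, 3)
    dists = ((x_t[None] - scaled) ** 2).sum(axis=(1, 2))    # (K,)
    x0_hat = prototypes[np.argmin(dists)]
    return (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)

def sample_scanpaths(prototypes, num_samples=32, num_steps=50, seed=0):
    """Draw scanpaths by ancestral sampling; each one starts from fresh noise."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    samples = []
    for _ in range(num_samples):
        x = rng.standard_normal(prototypes.shape[1:])       # pure noise, shape (L, 3)
        for t in reversed(range(num_steps)):
            eps = toy_noise_predictor(x, alpha_bars[t], prototypes)
            # Posterior mean of the reverse step given the predicted noise.
            mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
            # Fresh noise at every step except the last keeps sampling stochastic.
            noise = rng.standard_normal(x.shape) if t > 0 else 0.0
            x = mean + np.sqrt(betas[t]) * noise
        samples.append(x)
    return np.stack(samples)                                 # (num_samples, L, 3)

# Two hypothetical viewing strategies over three fixations (x, y, duration in seconds):
# a left-to-right sweep and the same sweep in reverse order.
left_to_right = np.array([[0.2, 0.5, 0.25], [0.5, 0.5, 0.30], [0.8, 0.5, 0.25]])
prototypes = np.stack([left_to_right, left_to_right[::-1]])

scanpaths = sample_scanpaths(prototypes)
print(scanpaths.shape)
```

Because the toy predictor knows only two prototypes, every sample ends on one of them, and the variety lies in which one a given noise draw reaches. A learned denoiser would replace the prototype lookup, and conditioning on a task description would change what it predicts for the same image.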
