Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Giuseppe Cartella; Vittorio Cuculo; Alessandro D'Amelio; Marcella Cornia; Giuseppe Boccignone; Rita Cucchiara

arXiv:2507.23021 · cs.CV · August 1, 2025

TL;DR

This paper introduces ScanDiff, an architecture that combines diffusion models with Vision Transformers to generate diverse, realistic, and task-adaptive human gaze scanpaths. It is reported to outperform existing methods, particularly in capturing the variability of human visual exploration.

Contribution

The paper presents a novel diffusion model architecture for scanpath prediction that explicitly models variability and incorporates textual conditioning for task-specific gaze behavior.

Findings

01 Outperforms state-of-the-art methods in both free-viewing and task-driven scenarios.

02 Produces more diverse and realistic scanpaths than existing approaches.

03 Effectively captures the variability of human visual exploration.

Abstract

Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses…
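
The abstract attributes scanpath diversity to the stochastic nature of diffusion sampling. The following is a minimal toy sketch of that idea only, not ScanDiff's architecture: it runs standard DDPM-style ancestral sampling over a short sequence of (x, y) fixations. The names `make_schedule`, `toy_eps_model`, `sample_scanpath`, and `mean_path` are hypothetical, and the closed-form noise predictor (exact for Gaussian data around `mean_path`) is an assumption standing in for the learned, image- and text-conditioned network described in the paper.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.2):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_eps_model(x_t, t, alpha_bars, mean_path, data_std=0.1):
    """Hypothetical stand-in for a learned noise predictor.

    Assumes clean scanpaths are Gaussian around `mean_path` with standard
    deviation `data_std`, for which the optimal noise estimate is closed-form.
    """
    a_bar = alpha_bars[t]
    var_t = a_bar * data_std**2 + (1.0 - a_bar)  # variance of x_t
    return np.sqrt(1.0 - a_bar) * (x_t - np.sqrt(a_bar) * mean_path) / var_t


def sample_scanpath(mean_path, rng, num_steps=50):
    """DDPM ancestral sampling: start from noise, denoise step by step."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(mean_path.shape)  # pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps = toy_eps_model(x, t, alpha_bars, mean_path)
        # Posterior mean of the reverse step
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Fresh noise at every step is what makes each sample different
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x


# Four fixations in normalized image coordinates (illustrative values only)
mean_path = np.array([[0.5, 0.5], [0.3, 0.4], [0.7, 0.6], [0.6, 0.2]])
rng = np.random.default_rng(0)
scanpaths = np.stack([sample_scanpath(mean_path, rng) for _ in range(5)])
print(scanpaths.shape)  # (5, 4, 2): five distinct scanpaths for one input
```

Each call draws a new noise trajectory, so repeated sampling for the same input yields a set of different but plausible paths rather than a single averaged one. In a real model the conditioning (image features, task text) would enter through the noise predictor rather than through a fixed `mean_path`.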
