Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors
Soham Pahari, Sandeep C. Kumain

TL;DR
This paper presents SemGeo-AttentionNet, a novel architecture that combines geometric and semantic cues to model human visual attention on 3D surfaces, outperforming existing methods and incorporating temporal scanpath generation.
Contribution
The paper introduces a dual-stream model with asymmetric cross-modal fusion that integrates semantic priors and geometric features for 3D saliency prediction, including a new temporal scanpath formulation.
Findings
Outperforms existing methods on the SAL3D, NUS3D, and 3DVA datasets.
Effective modeling of human visual attention on 3D surfaces.
First temporal scanpath generation respecting 3D mesh topology.
Abstract
Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning,…
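The asymmetric fusion described in the abstract, where geometric features act as queries over semantic content, can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' implementation: the function name `geometry_queries_semantics`, the feature dimensions, the random projection matrices, and the single-head scaled dot-product form are all assumptions chosen for illustration. The only detail taken from the text is the direction of the attention: queries come from the geometric stream, keys and values from the semantic stream.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def geometry_queries_semantics(geo, sem, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    geo: (N, d_geo) per-point geometric features -> queries
    sem: (M, d_sem) semantic feature tokens      -> keys and values
    Returns the fused per-point features (N, d) and attention weights (N, M).
    """
    q = geo @ w_q                              # (N, d) queries from geometry
    k = sem @ w_k                              # (M, d) keys from semantics
    v = sem @ w_v                              # (M, d) values from semantics
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) scaled dot products
    weights = softmax(scores, axis=-1)         # each point attends over tokens
    return weights @ v, weights


# Toy sizes and random inputs, assumed purely for demonstration.
rng = np.random.default_rng(0)
n_points, n_tokens, d_geo, d_sem, d = 6, 4, 8, 16, 8
geo = rng.normal(size=(n_points, d_geo))
sem = rng.normal(size=(n_tokens, d_sem))
w_q = rng.normal(size=(d_geo, d))
w_k = rng.normal(size=(d_sem, d))
w_v = rng.normal(size=(d_sem, d))

fused, weights = geometry_queries_semantics(geo, sem, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

The asymmetry lies in which stream supplies the queries: each 3D point retrieves a weighted mix of semantic tokens, so the output keeps one row per point. Swapping the roles would instead yield one row per semantic token, which is not the direction the abstract describes.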
Taxonomy
Topics: Visual Attention and Saliency Detection · 3D Shape Modeling and Analysis · Human Pose and Action Recognition
