From Scene to Object: Text-Guided Dual-Gaze Prediction
Zehong Ke, Yanbo Jiang, Jinhao Li, Zhiyuan Liu, Yiqian Tu, Qingwen Meng, Heye Huang, Jianqiang Wang

TL;DR
This paper introduces a framework for fine-grained, object-level driver attention prediction, pairing a new dataset (G-W3DA) with a dual-branch model (DualGaze-VLM) to improve spatial accuracy and interpretability.
Contribution
The paper constructs G-W3DA, an object-level driver attention dataset, and proposes DualGaze-VLM, a model for precise, intent-driven attention prediction that surpasses existing methods.
Findings
DualGaze-VLM outperforms SOTA models with up to 17.8% improvement in Similarity (SIM).
Attention heatmaps from DualGaze-VLM are perceived as authentic by 88.22% of human evaluators.
The dataset and model enable more accurate and interpretable driver attention prediction.
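For reference, Similarity (SIM) is commonly defined in saliency evaluation as the histogram intersection of two maps: each map is normalized to sum to 1, and the element-wise minimum is summed, giving 1 for identical distributions and 0 for non-overlapping ones. The sketch below illustrates that standard definition under the assumption that heatmaps are non-negative NumPy arrays of equal shape. The function name `similarity` and the `eps` guard are illustrative choices, and the paper's exact evaluation protocol is not given in this summary.

```python
import numpy as np


def similarity(pred, gt, eps=1e-12):
    """Histogram-intersection SIM between two non-negative heatmaps.

    Illustrative sketch of the standard saliency metric, not the
    authors' evaluation code. `eps` (an assumption) avoids division
    by zero when a map is entirely zero.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("heatmaps must have the same shape")
    if pred.min() < 0 or gt.min() < 0:
        raise ValueError("heatmaps must be non-negative")

    # Normalize each map so it sums to (approximately) 1.
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)

    # Sum of the element-wise minimum of the two distributions.
    return float(np.minimum(p, q).sum())


if __name__ == "__main__":
    predicted = np.array([[1.0, 1.0], [0.0, 0.0]])
    ground_truth = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(similarity(predicted, ground_truth))
```

In this toy case the two maps share half of their probability mass, so the score is about 0.5. Whether the reported 17.8% improvement is relative or absolute is not stated in this summary.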
Abstract
Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into…
