From Scene to Object: Text-Guided Dual-Gaze Prediction

Zehong Ke; Yanbo Jiang; Jinhao Li; Zhiyuan Liu; Yiqian Tu; Qingwen Meng; Heye Huang; Jianqiang Wang

arXiv:2604.20191 · cs.CV · April 29, 2026

TL;DR

This paper introduces a novel framework for fine-grained, object-level driver attention prediction using a new dataset and a dual-branch model, significantly improving spatial accuracy and interpretability.

Contribution

The paper constructs G-W3DA, an object-level driver attention dataset, and proposes DualGaze-VLM, a model for precise, intent-driven attention prediction that surpasses existing methods.

Findings

01

DualGaze-VLM outperforms SOTA models with up to 17.8% improvement in Similarity (SIM).

02

Attention heatmaps from DualGaze-VLM are perceived as authentic by 88.22% of human evaluators.

03

The dataset and model enable more accurate and interpretable driver attention prediction.
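
For readers unfamiliar with the Similarity (SIM) metric cited in finding 01: in the saliency literature it is commonly defined as the histogram intersection of two attention maps, each normalized to sum to 1. The sketch below follows that common definition and is not the paper's evaluation code. The function name `similarity` and the nested-list heatmap format are assumptions made for illustration, and the authors' exact preprocessing (resizing, blurring, normalization) is not described on this page.

```python
from typing import Sequence


def similarity(pred: Sequence[Sequence[float]],
               gt: Sequence[Sequence[float]]) -> float:
    """Histogram-intersection SIM between two non-negative heatmaps.

    Hypothetical helper: each map is flattened and normalized to sum to 1,
    then the element-wise minima are summed. The result lies in [0, 1];
    1 means identical distributions, 0 means no overlap.
    """
    flat_pred = [float(v) for row in pred for v in row]
    flat_gt = [float(v) for row in gt for v in row]
    if len(flat_pred) != len(flat_gt):
        raise ValueError("heatmaps must contain the same number of cells")

    total_pred, total_gt = sum(flat_pred), sum(flat_gt)
    if total_pred <= 0 or total_gt <= 0:
        raise ValueError("each heatmap must have positive total mass")

    # Normalize each map to a probability distribution, then intersect.
    return sum(min(p / total_pred, g / total_gt)
               for p, g in zip(flat_pred, flat_gt))


# Toy 2x2 maps: half of the predicted mass overlaps the ground truth.
predicted = [[1.0, 1.0], [0.0, 0.0]]
ground_truth = [[1.0, 0.0], [1.0, 0.0]]
score = similarity(predicted, ground_truth)
```

Because both maps are normalized first, SIM compares only where attention is distributed and ignores the absolute scale of the heatmap values.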

Abstract

Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into…
