VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments

Kaiser Hamid; Khandakar Ashrafi Akbar; Nade Liang

arXiv:2508.05852 · cs.CV · August 11, 2025

TL;DR

This paper introduces VISTA, a vision-language framework that predicts and explains driver attention shifts in dynamic driving scenes using natural language, enhancing interpretability and supporting autonomous driving applications.

Contribution

It presents a novel approach combining vision-language modeling with driver attention prediction, utilizing few-shot learning and refined captions for improved interpretability.

Findings

01 Outperforms general-purpose VLMs in attention shift detection

02 Enables natural language descriptions of driver gaze behavior

03 Provides a foundation for explainable AI in autonomous driving

Abstract

Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models the changing landscape of drivers' gaze through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few-shot and one-shot)…

Taxonomy

Topics: Human-Automation Interaction and Safety · Safety Warnings and Signage