From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Kevin Cannons; Saeed Ranjbar Alvar; Mohammad Asiful Hossain; Ahmad Rezaei; Mohsen Gholami; Alireza Heidarikhazaei; Zhou Weimin; Yong Zhang; Mohammad Akbari

arXiv:2512.05277 · cs.CV · December 18, 2025

TL;DR

This paper introduces the TAD benchmark to evaluate vision-language models' ability to understand temporal dynamics in autonomous driving footage, revealing current limitations and proposing training-free solutions to enhance motion understanding.

Contribution

The paper presents the TAD benchmark focused on autonomous driving, evaluates existing models, and proposes two novel training-free methods to improve temporal understanding in this domain.

Findings

01. Current models show substandard accuracy on TAD due to poor motion understanding.

02. Scene-CoT and TCogMap improve accuracy by up to 17.72%.

03. TAD serves as a new benchmark for future research in autonomous driving temporal understanding.

Abstract

Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA…

Code & Models

Datasets

vbdai/TAD
dataset · 105 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Action Observation and Synchronization