From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

TL;DR
This paper introduces the TAD benchmark to evaluate vision-language models' ability to understand temporal dynamics in autonomous driving footage, revealing current limitations and proposing training-free solutions to enhance motion understanding.
Contribution
The paper presents the TAD benchmark focused on autonomous driving, evaluates existing models, and proposes two novel training-free methods to improve temporal understanding in this domain.
Findings
Current models show substandard accuracy on TAD due to poor motion understanding.
The two proposed training-free methods, Scene-CoT and TCogMap, improve accuracy by up to 17.72%.
TAD serves as a new benchmark for future research in autonomous driving temporal understanding.
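As a rough illustration of how accuracy on a QA benchmark of this kind might be tallied per task, here is a minimal sketch. It assumes exact-match scoring of a predicted answer against a reference answer; the record fields (`task`, `answer`, `prediction`) and the task names in the sample data are hypothetical and are not taken from TAD's actual schema or evaluation protocol.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy}, scoring by case-insensitive exact match.

    Exact-match scoring is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Hypothetical QA records; field names and task names are illustrative only.
records = [
    {"task": "action_order", "answer": "B", "prediction": "b"},
    {"task": "action_order", "answer": "A", "prediction": "C"},
    {"task": "action_duration", "answer": "D", "prediction": "D"},
]

scores = per_task_accuracy(records)
for task, acc in scores.items():
    print(f"{task}: {acc:.2%}")
```

Reporting accuracy per task rather than as a single pooled number makes it easier to see which kinds of temporal questions a model struggles with.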
Abstract
Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed covering 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Action Observation and Synchronization
