Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning

Chun-Peng Chang; Chen-Yu Wang; Holger Caesar; Alain Pagani

arXiv:2603.09512 · cs.CV · March 11, 2026

Open Access

TL;DR

This paper critically examines the reliability of Vision-Language Models in autonomous driving, highlighting issues of response inconsistency and limited temporal reasoning, and introduces a new benchmark and tuning method to improve their performance.

Contribution

It introduces FutureVQA, a benchmark dataset for assessing temporal reasoning in driving VLMs, and proposes a self-supervised tuning method to enhance consistency and temporal understanding.

Findings

01

Models with strong visual understanding do not excel in temporal reasoning tasks.

02

Response inconsistency increases with minor input perturbations.

03

The proposed tuning method improves both consistency and temporal reasoning.
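
As a rough illustration of the second finding, response consistency can be probed by asking a model the same question in several slightly reworded forms and measuring how often the answers agree. The sketch below is a minimal example under stated assumptions, not the paper's metric or evaluation protocol: `consistency_rate`, `probe`, `toy_model`, the file name `frame_000.png`, and the question variants are all hypothetical, and the toy model is a stand-in for a real VLM call.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(answers: Sequence[str]) -> float:
    """Fraction of answers that match the most common (normalized) answer."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def probe(model: Callable[[str, str], str], frame: str,
          variants: Sequence[str]) -> float:
    """Ask every perturbed question about the same frame and score agreement."""
    answers = [model(frame, question) for question in variants]
    return consistency_rate(answers)


def toy_model(frame: str, question: str) -> str:
    # Hypothetical stand-in for a VLM: its answer depends on surface wording,
    # which is the kind of brittleness the probe is meant to expose.
    return "yes" if "safe" in question.lower() else "no"


# Hypothetical paraphrases of one driving question (assumed, not from the paper).
variants = [
    "Is it safe to turn left?",
    "Is it SAFE to turn left now?",
    "Can the car turn left?",
    "Is turning left safe?",
]

score = probe(toy_model, "frame_000.png", variants)
print(f"consistency: {score:.2f}")
```

A score of 1.0 means every variant received the same answer; lower values indicate that small changes in wording altered the response. A real evaluation would replace `toy_model` with an actual model call and would likely need a more careful answer-matching step than lowercase string equality.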

Abstract

A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Explainable Artificial Intelligence (XAI)