VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Mohammad Qazim Bhat; Yufan Huang; Niket Agarwal; Hao Wang; Michael Woods; John Kenyon; Tsung-Yi Lin; Xiaodong Yang; Ming-Yu Liu; Kevin Xie

arXiv:2603.18178 · cs.CV · May 19, 2026

TL;DR

VLM-AutoDrive is a modular framework that adapts pretrained vision-language models for high-fidelity detection of safety-critical autonomous driving events, significantly improving collision detection accuracy.

Contribution

It introduces a novel post-training approach combining metadata, LLM-generated descriptions, VQA, and CoT supervision to enhance domain alignment and interpretability in autonomous driving perception.
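As a rough illustration of how the four supervision sources named above could sit side by side for one clip, here is a minimal sketch. Everything in it is an assumption made for illustration: the `ClipRecord` fields, the `build_supervision` helper, the prompt strings, and the label names are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class ClipRecord:
    """One dashcam clip with its annotations (hypothetical schema)."""
    clip_id: str
    label: str                                     # assumed label set, e.g. "collision", "near_collision", "normal"
    metadata: dict = field(default_factory=dict)   # e.g. weather, time of day
    llm_description: str = ""                      # free-text description written by an LLM
    cot_rationale: str = ""                        # step-by-step reasoning text


def build_supervision(record: ClipRecord) -> list[dict]:
    """Turn one clip record into a list of (task, prompt, target) samples."""
    samples = []

    # 1. Metadata-derived caption: a templated string built from structured fields.
    if record.metadata:
        caption = ", ".join(f"{k}: {v}" for k, v in sorted(record.metadata.items()))
        samples.append({"task": "metadata_caption",
                        "prompt": "Describe the recording conditions.",
                        "target": caption})

    # 2. LLM-generated description, used as a captioning target.
    if record.llm_description:
        samples.append({"task": "llm_description",
                        "prompt": "Describe what happens in the clip.",
                        "target": record.llm_description})

    # 3. VQA pair: the event label posed as a question.
    samples.append({"task": "vqa",
                    "prompt": "Which event occurs: collision, near_collision, or normal?",
                    "target": record.label})

    # 4. CoT supervision: the rationale followed by the final answer.
    if record.cot_rationale:
        samples.append({"task": "cot",
                        "prompt": "Reason step by step, then name the event.",
                        "target": f"{record.cot_rationale} Answer: {record.label}"})

    for sample in samples:
        sample["clip_id"] = record.clip_id
    return samples
```

Keeping each source as its own task tag would make it straightforward to include or drop one source at a time, which is the kind of modularity the framework description suggests; how the authors actually mix these sources is not stated on this page.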

Findings

01. Collision F1 improved from 0.00 to 0.69
02. Overall accuracy increased from 35.35% to 77.27%
03. Achieved substantial gains in collision and near-collision detection
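To make the F1 figures concrete, the sketch below computes per-class F1 from true positives, false positives, and false negatives. It uses the standard definition of F1 as the harmonic mean of precision and recall; the counts in the usage lines are invented for illustration and are not the paper's confusion matrix.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        # No true positives at all: F1 is defined as 0 here to avoid dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the paper.
never_detects = f1_score(tp=0, fp=3, fn=40)    # a model that finds no collisions
balanced = f1_score(tp=30, fp=10, fn=10)       # precision and recall both 0.75
print(never_detects, balanced)
```

A model that never correctly flags a collision has zero recall and therefore an F1 of 0.00 for that class, regardless of how well it does on other classes. This is why a Collision F1 of 0.00 can coexist with a nonzero overall accuracy.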

Abstract

The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot…

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications