TL;DR
VLM-AutoDrive is a modular post-training framework that adapts pretrained vision-language models (VLMs) to detect safety-critical driving events, such as collisions and near-collisions, in dashcam footage. It raises Collision F1 from 0.00 to 0.69.
Contribution
It introduces a post-training approach that combines metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to improve domain alignment and interpretability in autonomous driving perception.
Findings
Collision F1 improved from 0.00 to 0.69
Overall accuracy increased from 35.35% to 77.27%
Detection of both collisions and near-collisions improved substantially
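To make the reported metrics concrete, here is a minimal sketch of how a per-class F1 score and overall accuracy can be computed from predicted event labels. The label names ("collision", "near_collision", "normal") and the toy predictions are assumptions for illustration, not data from the paper. The sketch shows why a model that never predicts a collision correctly scores exactly 0.00 on Collision F1.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score for one class, treating every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined),
        # so F1 is reported as 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of clips whose predicted label matches the ground truth."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy labels, NOT results from the paper.
y_true = ["collision", "near_collision", "normal", "collision"]
y_pred_before = ["normal", "normal", "normal", "near_collision"]
y_pred_after = ["collision", "near_collision", "normal", "normal"]

f1_before = f1_for_class(y_true, y_pred_before, "collision")
f1_after = f1_for_class(y_true, y_pred_after, "collision")
acc_before = accuracy(y_true, y_pred_before)
acc_after = accuracy(y_true, y_pred_after)
```

Because F1 is the harmonic mean of precision and recall, it collapses to zero as soon as recall on the class is zero, regardless of how well the model does on the other classes.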
Abstract
The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot…
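The four supervision types named in the abstract can be pictured as different training samples derived from the same clip. The sketch below is only an illustration under stated assumptions: the metadata fields, label set, prompt wording, and output format are all hypothetical and are not taken from the paper, and the LLM-generated description and the reasoning text are passed in as plain strings rather than produced by a model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClipMetadata:
    # All field names are hypothetical; the paper's metadata schema is not given here.
    clip_id: str
    event_type: str  # assumed label set: "collision", "near_collision", "normal"
    ego_speed_kph: float
    time_of_day: str


# Assumed mapping from label to a natural-language answer.
ANSWERS = {"collision": "collision", "near_collision": "near-collision", "normal": "neither"}


def build_samples(meta: ClipMetadata, llm_description: str, rationale: str) -> List[Dict[str, str]]:
    """Return one sample per supervision type for a single clip."""
    answer = ANSWERS[meta.event_type]
    caption = (
        f"Dashcam clip recorded during the {meta.time_of_day}; "
        f"ego vehicle at about {meta.ego_speed_kph:.0f} km/h; event: {answer}."
    )
    question = "Does this clip contain a collision, a near-collision, or neither?"
    return [
        # 1. Caption derived directly from structured metadata.
        {"clip_id": meta.clip_id, "task": "metadata_caption", "text": caption},
        # 2. Free-form description written by an LLM (supplied by the caller here).
        {"clip_id": meta.clip_id, "task": "llm_description", "text": llm_description},
        # 3. Visual question answering pair.
        {"clip_id": meta.clip_id, "task": "vqa", "question": question, "answer": answer},
        # 4. Chain-of-thought target: reasoning first, then the final answer.
        {
            "clip_id": meta.clip_id,
            "task": "cot",
            "question": question,
            "target": f"Reasoning: {rationale}\nAnswer: {answer}",
        },
    ]


meta = ClipMetadata("clip_0001", "near_collision", 42.0, "evening")
samples = build_samples(
    meta,
    llm_description="A car ahead brakes sharply and the ego vehicle swerves to avoid it.",
    rationale="The lead car decelerates abruptly; the gap closes but no contact occurs.",
)
```

Keeping each supervision type as a separate sample is one way to keep such a pipeline modular, since any single type can be dropped or reweighted without touching the others.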
Taxonomy
Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications
