OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Zhenguo Zhang; Haohan Zheng; Yishen Wang; Le Xu; Tianchen Deng; Xuefeng Chen; Qu Chen; Bo Zhang; Wuxiong Huang

arXiv:2512.14044·cs.CV·May 1, 2026


TL;DR

OmniDrive-R1 is an end-to-end vision-language model for autonomous driving that uses reinforcement learning to improve reasoning accuracy and reliability without relying on dense labels.

Contribution

It introduces a novel interleaved multi-modal chain-of-thought framework with reinforcement-driven visual grounding and a process-based reward, enabling joint perception and reasoning.

Findings

01

Significant improvement in reasoning score from 51.77% to 80.35%.

02

Final answer accuracy increased from 37.81% to 73.62%.

03

Eliminates the need for dense localization labels through annotation-free rewards.
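
To put the two reported improvements on a common footing, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each metric. Only the four percentages come from the findings above; the helper name `gain` and the choice of rounding are illustrative assumptions.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 2)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


# Figures taken from the findings listed above.
reasoning = gain(51.77, 80.35)   # reasoning score
answer = gain(37.81, 73.62)      # final answer accuracy

print("reasoning score:", reasoning)
print("final answer accuracy:", answer)
```

Under these numbers the reasoning score rises by about 28.58 points and final answer accuracy by about 35.81 points, so the larger relative change is in answer accuracy.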

Abstract

The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. We therefore introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical…
