CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
Zhaohui Wang, Tengbo Yu, Hao Tang

TL;DR
CoT4AD adds explicit Chain-of-Thought (CoT) reasoning to vision-language-action (VLA) models for autonomous driving, targeting the numerical and causal reasoning needed for decision-making in complex driving scenarios.
Contribution
The paper presents a CoT-based VLA framework that models the reasoning process explicitly, with the aim of enhancing numerical and causal reasoning in autonomous driving tasks.
Findings
Achieves state-of-the-art results on nuScenes and Bench2Drive benchmarks.
Improves reasoning capabilities in complex driving scenarios.
Demonstrates robustness in dynamic environments.
Abstract
Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space…
Taxonomy
Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications
