Accelerating Structured Chain-of-Thought in Autonomous Vehicles
Yi Gu, Yan Wang, Yuxiao Chen, Yurong You, Wenjie Luo, Yue Wang, Wenhao Ding, Boyi Li, Heng Yang, Boris Ivanovic, Marco Pavone

TL;DR
This paper introduces FastDriveCoT, a parallel decoding method that accelerates structured Chain-of-Thought reasoning in autonomous vehicles, significantly reducing inference latency while maintaining reasoning quality.
Contribution
We propose a novel parallel decoding approach for structured CoT that decomposes reasoning into sub-tasks, enabling concurrent generation and faster inference in autonomous driving models.
Findings
Achieved 3-4× speedup in CoT generation
Reduced end-to-end latency significantly
Maintained downstream task performance
Abstract
Chain-of-Thought (CoT) reasoning enhances the decision-making capabilities of vision-language-action models in autonomous driving, but its autoregressive nature introduces significant inference latency, making it impractical for real-time applications. To address this, we introduce FastDriveCoT, a novel parallel decoding method that accelerates template-structured CoT. Our approach decomposes the reasoning process into a dependency graph of distinct sub-tasks, such as identifying critical objects and summarizing traffic rules, some of which can be generated in parallel. By generating multiple independent reasoning steps concurrently within a single forward pass, we significantly reduce the number of sequential computations. Experiments demonstrate a 3-4× speedup in CoT generation and a substantial reduction in end-to-end latency across various model architectures, all while…
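To make the dependency-graph idea in the abstract concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes a hypothetical template (the field names, token lengths, and dependencies are invented for illustration), one token per active field per decoding step, and no limit on how many fields decode at once. Under those assumptions, the number of sequential steps drops from the sum of all field lengths to the length of the longest dependency chain.

```python
from graphlib import TopologicalSorter

# Hypothetical template: field -> (token_length, dependencies).
# Names, lengths, and edges are illustrative assumptions, not from the paper.
TEMPLATE = {
    "weather": (4, []),
    "traffic_rules": (6, []),
    "critical_objects": (8, []),
    "lane_summary": (5, ["traffic_rules"]),
    "decision": (6, ["critical_objects", "lane_summary", "weather"]),
}


def sequential_steps(template):
    """Decoding steps when every field is generated one after another."""
    return sum(length for length, _ in template.values())


def parallel_steps(template):
    """Decoding steps when a field starts as soon as its dependencies finish.

    This is the weighted critical path of the dependency graph, computed
    by visiting fields in topological order.
    """
    graph = {name: deps for name, (_, deps) in template.items()}
    finish = {}
    for name in TopologicalSorter(graph).static_order():
        length, deps = template[name]
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + length
    return max(finish.values())


if __name__ == "__main__":
    print("sequential:", sequential_steps(TEMPLATE))
    print("parallel:  ", parallel_steps(TEMPLATE))
```

With this toy template the sequential count is 29 steps and the parallel count is 17. The actual gain depends on the real template's field lengths and dependency structure, which this sketch does not reproduce.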
Peer Reviews
Decision·Submitted to ICLR 2026
- Good overall novelty for AV: the presented approach shows good novelty in the field of AV reasoning.
- The authors present good technical innovation by combining the structured CoT template and the dynamic programming algorithm, and by maintaining zero extra FLOPs.
- Good empirical results, showing a 3-4× speedup in CoT reasoning and therefore 2× faster E2E inference.
- Good empirical results with different VLMs such as Qwen2, Qwen2.5, and Qwen3.
- Good comprehensive ablation studies for ef
Currently, the paper shows a very good technical approach, but there are strong weaknesses that call the paper's overall impact into question:
- Currently, the method relies on highly structured reasoning templates that are selected by the authors for the driving task. It is unclear to the reader whether this enables generalization at all. My concern here is that this approach will not generalize very well to open-ended reasoning tasks. This is unfortunately something we see in AV every day.
- This dependenc
- The reviewer found the proposed idea of predefining a CoT template and decoding independent fields in parallel using a dependency graph to be interesting.
- Interestingly, parallel decoding even slightly improves template adherence, so trajectory ADE at 3 seconds for Qwen2.5 VL 3B improves to 0.482 from 0.511, showing that structure can help quality.
- Experiments in Table 1 and the ablation-style analysis in Figure 4 show consistent 3× to 4× CoT speedup with only small drops in some l
- Table 1 only compares no CoT and standard autoregressive CoT, but it should also compare against shorter skeleton-of-thought decoding or speculative decoding baselines, which are natural for speed claims.
- Some typos the reviewer could see: Line 115: dependecies -> dependencies; Figure 3 caption: independency -> independence; Line 352: diving -> driving
- See questions below.
- This paper addresses a very practical and significant problem. The robustness and interpretability of LLMs are crucial for AV systems, but their latency is a major obstacle to deployment.
- The method of decomposing CoT into a dependency graph and achieving parallel decoding in a single forward pass via a custom attention mask appears sensible. This approach can effectively utilize the parallel computing capabilities of modern GPUs and fully reuse the KV cache.
- The paper demonstrates signi
- The core contribution heavily relies on a manually designed, highly fixed CoT template. Although the authors mention this template is an "example" (line 185), the entire methodology (including the dependency graph construction) is based on this fixed structure. Furthermore, when handling "multi-instance" fields (like lanes and critical objects), the method depends on a fixed number of slots (e.g., 3 time ranges for lanes, 4 critical objects). This might be fragile in complex, dynamic real-worl
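The reviews above mention a custom attention mask and a fixed number of slots per template field, but the page does not show the paper's actual mask. The following is only an illustrative sketch of what such a visibility pattern could look like, under stated assumptions: a shared prefix (for example, the prompt and image tokens), fixed-length field slots laid out back to back, and a hypothetical `build_mask` helper. Each field token may attend to the prefix, to the fields it depends on, and causally to its own earlier tokens, while independent fields cannot see each other.

```python
def build_mask(prefix_len, fields):
    """Build a boolean attention mask for parallel field decoding.

    fields: list of (name, slot_length, deps) in layout order.
    Returns mask[q][k] == True when query position q may attend to key k.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Assign each field a contiguous span of positions after the prefix.
    spans = {}
    pos = prefix_len
    for name, length, _ in fields:
        spans[name] = range(pos, pos + length)
        pos += length
    total = pos

    mask = [[False] * total for _ in range(total)]

    # The shared prefix uses ordinary causal attention.
    for q in range(prefix_len):
        for k in range(q + 1):
            mask[q][k] = True

    for name, _, deps in fields:
        for q in spans[name]:
            # Every field token sees the whole prefix.
            for k in range(prefix_len):
                mask[q][k] = True
            # It sees all tokens of the fields it depends on.
            for dep in deps:
                for k in spans[dep]:
                    mask[q][k] = True
            # It sees its own field causally.
            for k in spans[name]:
                if k <= q:
                    mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Toy layout: 2 prefix tokens, two independent fields, one dependent field.
    toy_fields = [
        ("objects", 2, []),
        ("rules", 2, []),
        ("decision", 1, ["objects", "rules"]),
    ]
    for row in build_mask(2, toy_fields):
        print("".join("1" if allowed else "." for allowed in row))
```

A real implementation would also need to handle position ids, padding for slots shorter than their fixed length, and KV-cache layout, none of which this sketch covers.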
Taxonomy
Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Reinforcement Learning in Robotics
