DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

Zhide Zhong; Junfeng Li; Junjie He; Haodong Yan; Xin Gong; Guanyi Zhao; Yingjie Cai; Jiantao Gao; Xu Yan; Bingbing Liu; Yingcong Chen; Liuqing Yang; Haoang Li

arXiv:2603.22280·cs.CV·March 24, 2026

TL;DR

DualCoT-VLA runs a visual and a linguistic chain of thought in parallel to improve multi-step task planning and fine-grained spatial understanding in vision-language-action models, while avoiding the inference latency of step-by-step autoregressive CoT decoding.

Contribution

The paper presents a parallel reasoning mechanism that pairs a visual CoT with a linguistic CoT, so that a VLA model can reason over both modalities at once and still run inference efficiently.

Findings

01. Achieves state-of-the-art results on the LIBERO and RoboCasa GR1 benchmarks.

02. Demonstrates improved reasoning and manipulation capabilities in real-world robotic tasks.

03. Reduces inference latency compared to autoregressive CoT methods.

Abstract

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning…
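
The second limitation named in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's implementation; the function names are hypothetical, and it assumes (purely for illustration) that every reasoning token costs one forward pass and that per-step errors are independent. Under those assumptions, autoregressive decoding needs one sequential pass per token, a parallel scheme needs a single pass, and the chance that a dependent chain stays error-free shrinks geometrically with its length.

```python
# Toy cost model (illustrative assumption, not the DualCoT-VLA method).

def autoregressive_passes(num_reasoning_tokens: int) -> int:
    """Sequential forward passes when each token must wait for the previous one."""
    return num_reasoning_tokens


def parallel_passes(num_reasoning_tokens: int) -> int:
    """Sequential forward passes when all reasoning tokens are produced in one step."""
    return 1 if num_reasoning_tokens > 0 else 0


def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step of a dependent chain is correct,
    assuming independent per-step errors (a simplifying assumption)."""
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    tokens = 32  # hypothetical CoT length
    print("autoregressive passes:", autoregressive_passes(tokens))
    print("parallel passes:", parallel_passes(tokens))
    print("chain success at 99% per step:", chain_success_probability(0.99, tokens))
```

Real latency also depends on model size, caching, and hardware, so the pass counts here only indicate how the sequential depth scales, not measured speedups.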

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Domain Adaptation and Few-Shot Learning