DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin

TL;DR
This paper investigates when and why Chain-of-Thought (CoT) reasoning improves vision-language-action (VLA) models, identifies two necessary conditions for it to be effective, and proposes a new model, DeepThinkVLA, that surpasses existing baselines.
Contribution
It systematically diagnoses when CoT is effective in VLA models, identifies two conditions that must hold jointly (Decoding Alignment and Causal Alignment), and develops DeepThinkVLA with novel attention mechanisms and training pipelines.
Findings
DeepThinkVLA achieves 97.0% success on LIBERO.
It outperforms baselines by up to 21.7 points.
The model demonstrates practical effectiveness in real-world robot experiments.
Abstract
Does Chain-of-Thought (CoT) reasoning genuinely improve Vision-Language-Action (VLA) models, or does it merely add overhead? Existing CoT-VLA systems report limited and inconsistent gains, yet no prior work has rigorously diagnosed when and why CoT helps robots act. Through systematic experiments, we identify two necessary conditions that must be jointly satisfied for CoT to be effective in VLA: (1) Decoding Alignment -- CoT and actions must be generated with modality-appropriate mechanisms; forcing both through a single autoregressive decoder is not merely suboptimal but actively harmful, degrading performance by 4.2 percentage points; (2) Causal Alignment -- CoT must be causally linked to task success via outcome-based optimization; without it, supervised CoT is indistinguishable from no reasoning at all under distribution shift, exhibiting a 32.0 pp performance drop nearly identical…
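As a rough illustration of what "modality-appropriate mechanisms" could mean in practice, the sketch below builds an attention mask in which reasoning tokens are decoded causally while action tokens attend to each other in both directions. This is an assumption made for illustration only: the text above does not spell out the model's attention design, and the function name `hybrid_attention_mask` and its parameters are hypothetical.

```python
def hybrid_attention_mask(n_reason, n_action):
    """Return a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): n_reason reasoning (CoT) tokens followed by
    n_action action tokens.
    - Reasoning tokens use causal attention (each sees itself and earlier tokens).
    - Action tokens see every reasoning token and every action token, so the
      action block is bidirectional.
    """
    total = n_reason + n_action
    return [
        [
            (k <= q)                               # causal rule for all tokens
            or (q >= n_reason and k >= n_reason)   # bidirectional inside the action block
            for k in range(total)
        ]
        for q in range(total)
    ]


# Example: 3 reasoning tokens followed by 2 action tokens.
mask = hybrid_attention_mask(n_reason=3, n_action=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under this layout, a reasoning token never attends to a later token, while the first action token can attend to the second. A single autoregressive decoder, by contrast, would apply the causal rule to every position.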
