DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Cheng Yin; Yankai Lin; Wang Xu; Sikyuen Tam; Xiangrui Zeng; Zhiyuan Liu; Zhouping Yin

arXiv:2511.15669 · cs.LG · April 21, 2026

TL;DR

This paper investigates when and why Chain-of-Thought (CoT) reasoning improves vision-language-action (VLA) models, identifies two conditions that must both hold for it to help, and proposes DeepThinkVLA, a model that surpasses existing baselines.

Contribution

It systematically diagnoses when CoT is effective in VLA models, identifies two necessary conditions for success (Decoding Alignment and Causal Alignment), and develops DeepThinkVLA with novel attention mechanisms and training pipelines.

Findings

01. DeepThinkVLA achieves 97.0% success on LIBERO.

02. It outperforms baselines by up to 21.7 points.

03. The model demonstrates practical effectiveness in real-world robot experiments.

Abstract

Does Chain-of-Thought (CoT) reasoning genuinely improve Vision-Language-Action (VLA) models, or does it merely add overhead? Existing CoT-VLA systems report limited and inconsistent gains, yet no prior work has rigorously diagnosed when and why CoT helps robots act. Through systematic experiments, we identify two necessary conditions that must be jointly satisfied for CoT to be effective in VLA: (1) Decoding Alignment: CoT and actions must be generated with modality-appropriate mechanisms; forcing both through a single autoregressive decoder is not merely suboptimal but actively harmful, degrading performance by 4.2 percentage points; (2) Causal Alignment: CoT must be causally linked to task success via outcome-based optimization; without it, supervised CoT is indistinguishable from no reasoning at all under distribution shift, exhibiting a 32.0 pp performance drop nearly identical…
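One way to picture the Decoding Alignment condition is as an attention mask that treats reasoning tokens and action tokens differently. The sketch below is an illustration only, not the authors' implementation: it assumes a sequence laid out as [prefix | CoT tokens | action tokens], applies causal attention to the prefix and CoT positions, and lets action positions attend to the whole sequence so they could be decoded in parallel. The function name `hybrid_attention_mask` and the sequence layout are hypothetical.

```python
def hybrid_attention_mask(num_prefix, num_cot, num_action):
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed layout (hypothetical, for illustration): prefix, then CoT, then actions.
    """
    total = num_prefix + num_cot + num_action
    action_start = num_prefix + num_cot
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i >= action_start:
                # Action positions see the full context, including every other
                # action position (bidirectional within the action block).
                mask[i][j] = True
            else:
                # Prefix and CoT positions are causal: only current and earlier tokens.
                mask[i][j] = j <= i
    return mask


if __name__ == "__main__":
    # 2 prefix tokens, 3 CoT tokens, 2 action tokens.
    for row in hybrid_attention_mask(2, 3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, the CoT rows form the usual lower-triangular pattern, while the action rows are fully filled in. A single autoregressive decoder would correspond to a lower-triangular mask over all positions, which is the configuration the abstract reports as harmful.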

Code & Models

Repositories

OpenBMB/DeepThinkVLA
github

Models

Datasets

yinchenghust/libero_cot
dataset · 3.4k downloads