ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
Linqing Zhong, Yi Liu, Yifei Wei, Ziyu Xiong, Maoqing Yao, Si Liu, Guanghui Ren

TL;DR
This paper introduces ACoT-VLA, a vision-language-action model that improves robot manipulation by reasoning directly in the action space, and reports that it outperforms methods that reason indirectly through language or images.
Contribution
The paper proposes the Action Chain-of-Thought (ACoT) paradigm, with an explicit and an implicit action reasoner, enabling more effective and grounded policy learning for robotic tasks.
Findings
ACoT-VLA outperforms existing models in real-world and simulation environments.
The explicit action reasoner generates coarse reference trajectories for better guidance.
The implicit reasoner extracts latent action priors to enhance policy grounding.
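The coarse-to-fine idea behind the explicit reasoner can be illustrated with a toy sketch. This is not the paper's architecture: the function names `coarse_reference` and `refine`, the linear-interpolation "reasoner", and all shapes are assumptions made purely for illustration, and the implicit reasoner is not modeled here. The sketch only shows how a short sequence of coarse action intents could serve as a reference that a downstream action head densifies into an action chunk.

```python
import numpy as np


def coarse_reference(start, goal, num_intents):
    """Hypothetical stand-in for an explicit action reasoner.

    Returns `num_intents` coarse waypoints between `start` and `goal`.
    A learned model would predict these; here they are interpolated.
    """
    alphas = np.linspace(0.0, 1.0, num_intents)[:, None]  # shape (K, 1)
    return (1.0 - alphas) * start + alphas * goal          # shape (K, D)


def refine(coarse, horizon, residual=None):
    """Hypothetical action head: densify the coarse reference.

    Upsamples the K coarse waypoints to `horizon` steps and optionally
    adds a residual correction (which a learned policy would predict).
    """
    k, dim = coarse.shape
    src = np.linspace(0.0, 1.0, k)
    dst = np.linspace(0.0, 1.0, horizon)
    dense = np.stack(
        [np.interp(dst, src, coarse[:, d]) for d in range(dim)], axis=1
    )
    if residual is not None:
        dense = dense + residual
    return dense


# Assumed 3-D end-effector positions, chosen only for the example.
start = np.array([0.0, 0.0, 0.0])
goal = np.array([0.3, -0.1, 0.2])

coarse = coarse_reference(start, goal, num_intents=4)  # coarse action intents
actions = refine(coarse, horizon=16)                   # dense action chunk
```

In this toy setup the dense chunk simply follows the coarse reference; in a learned system the residual term would carry the fine-grained corrections that the coarse intents cannot express.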
Abstract
Vision-Language-Action models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model embeddings. Recent advancements have introduced explicit intermediary reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, these forms of intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose…
