ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Linqing Zhong; Yi Liu; Yifei Wei; Ziyu Xiong; Maoqing Yao; Si Liu; Guanghui Ren

arXiv:2601.11404 · cs.RO · March 31, 2026

TL;DR

This paper introduces ACoT-VLA, a novel vision-language-action model that uses structured action reasoning to improve robot manipulation, outperforming previous indirect reasoning methods.

Contribution

The paper proposes the Action Chain-of-Thought (ACoT) paradigm, in which explicit and implicit action reasoners enable more effective and better-grounded policy learning for robotic tasks.

Findings

01

ACoT-VLA outperforms existing models in real-world and simulation environments.

02

The explicit action reasoner generates coarse reference trajectories for better guidance.

03

The implicit reasoner extracts latent action priors to enhance policy grounding.
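The coarse-to-fine idea in finding 02 can be illustrated with a toy sketch. This is not the paper's architecture: `coarse_reference` and `refine` are hypothetical stand-ins, and plain linear interpolation is used in place of any learned reasoner or action head. It only shows a short coarse trajectory guiding a denser action sequence.

```python
from typing import List

Action = List[float]  # hypothetical: one action is a flat list of floats


def coarse_reference(start: Action, goal: Action, num_waypoints: int) -> List[Action]:
    """Toy stand-in for an explicit reasoner: evenly spaced coarse waypoints."""
    if num_waypoints < 2:
        raise ValueError("need at least two waypoints")
    return [
        [s + (g - s) * i / (num_waypoints - 1) for s, g in zip(start, goal)]
        for i in range(num_waypoints)
    ]


def refine(reference: List[Action], steps_per_segment: int) -> List[Action]:
    """Toy stand-in for the final policy: densify the coarse reference."""
    dense: List[Action] = []
    for a, b in zip(reference, reference[1:]):
        for k in range(steps_per_segment):
            t = k / steps_per_segment
            dense.append([x + (y - x) * t for x, y in zip(a, b)])
    dense.append(list(reference[-1]))  # include the final waypoint once
    return dense


# Three coarse waypoints guide a nine-step dense action sequence.
reference = coarse_reference([0.0, 0.0], [1.0, 2.0], num_waypoints=3)
actions = refine(reference, steps_per_segment=4)
```

In a learned system, both functions would presumably be replaced by networks conditioned on images and language; the sketch only fixes the data flow from coarse intent to fine actions.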

Abstract

Vision-Language-Action models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally translating multimodal inputs directly into actions via Vision-Language Model embeddings. Recent advances have introduced explicit intermediate reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, these forms of intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose…

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

AgibotTech/ACoT-VLA
github

Datasets

slabhead/ml-papers-mix-2026-04
dataset · 51 dl

Videos

No videos yet.