CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, Baining Guo

TL;DR
CogACT introduces a novel vision-language-action model with a specialized action module and diffusion transformers, significantly improving robotic manipulation success rates and generalization across diverse environments and robot embodiments.
Contribution
The paper proposes a componentized VLA architecture with a diffusion action transformer, enhancing task performance and adaptability over existing models.
Findings
Surpasses existing VLAs by over 35% in success rate in simulated evaluation.
Achieves 55% higher success in real robot experiments compared to similar models.
Outperforms larger RT-2-X model by 18% in simulation.
Abstract
The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from a VLM. Unlike previous works that directly repurpose a VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as…
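To make the idea of "a specialized action module conditioned on VLM output" more concrete, the following is a minimal sketch of diffusion-style action sampling, not the authors' implementation. It assumes a standard DDPM reverse process with a linear noise schedule, and it replaces the diffusion transformer with an untrained linear map. The names `ToyActionDenoiser`, `sample_actions`, and `make_schedule`, along with the horizon of 16 steps, the 7-dimensional action, and the 32-dimensional condition vector, are all hypothetical choices for illustration.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice for this sketch)."""
    if num_steps == 1:
        betas = [beta_start]
    else:
        step = (beta_end - beta_start) / (num_steps - 1)
        betas = [beta_start + i * step for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


class ToyActionDenoiser:
    """Hypothetical stand-in for the action module: an untrained linear map
    that predicts noise from (noisy actions, condition feature, timestep)."""

    def __init__(self, horizon, action_dim, cond_dim, rng):
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = horizon * action_dim + cond_dim + 1
        out_dim = horizon * action_dim
        self.weights = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                        for _ in range(out_dim)]

    def predict_noise(self, noisy_actions, t_frac, cond):
        # Concatenate the flat action sequence, condition, and timestep.
        x = noisy_actions + cond + [t_frac]
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


def sample_actions(denoiser, cond, num_steps, rng):
    """Start from Gaussian noise and iteratively denoise an action sequence,
    conditioning every step on the same feature vector `cond`."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    n = denoiser.horizon * denoiser.action_dim
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for t in reversed(range(num_steps)):
        eps = denoiser.predict_noise(a, t / num_steps, cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(alphas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise at the last step
        a = [scale * (ai - coef * ei) + sigma * rng.gauss(0.0, 1.0)
             for ai, ei in zip(a, eps)]
    d = denoiser.action_dim
    return [a[i * d:(i + 1) * d] for i in range(denoiser.horizon)]


rng = random.Random(0)
cond = [rng.gauss(0.0, 1.0) for _ in range(32)]  # placeholder for a VLM output feature
denoiser = ToyActionDenoiser(horizon=16, action_dim=7, cond_dim=32, rng=rng)
actions = sample_actions(denoiser, cond, num_steps=10, rng=rng)
print(len(actions), len(actions[0]))  # 16 7
```

The point of the sketch is the data flow: the action sequence is generated as a continuous quantity by iterative denoising, and the VLM-derived feature enters only as a conditioning input, in contrast to quantizing actions into discrete tokens predicted directly by the VLM.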
Code & Models
- 🤗 CogACT/CogACT-Base · model · 3.4k dl · ♡ 18
- 🤗 CogACT/CogACT-Large · model · 143 dl · ♡ 5
- 🤗 CogACT/CogACT-Small · model · 221 dl · ♡ 5
- 🤗 Dexmal/libero-db-cogact · model · 184 dl · ♡ 1
- 🤗 Dexmal/simpler-db-cogact · model · 31 dl · ♡ 1
- 🤗 Dexmal/calvin-db-cogact · model · 5 dl · ♡ 1
- 🤗 Dexmal/maniskill2-db-cogact · model · 3 dl
- 🤗 Dexmal/robotwin-db-cogact · model · ♡ 1
Taxonomy
Topics: Robotics and Automated Systems · Robot Manipulation and Learning
Methods: Diffusion
