CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li; Yaobo Liang; Zeyu Wang; Lin Luo; Xi Chen; Mozheng Liao; Fangyun Wei; Yu Deng; Sicheng Xu; Yizhong Zhang; Xiaofan Wang; Bei Liu; Jianlong Fu; Jianmin Bao; Dong Chen; Yuanchun Shi; Jiaolong Yang; Baining Guo

arXiv:2411.19650 · cs.RO · December 2, 2024 · 5 cites

Open Access 8 Models

TL;DR

CogACT introduces a novel vision-language-action model with a specialized action module and diffusion transformers, significantly improving robotic manipulation success rates and generalization across diverse environments and robot embodiments.

Contribution

The paper proposes a componentized VLA architecture with a diffusion action transformer, enhancing task performance and adaptability over existing models.

Findings

01. Surpasses existing VLAs by over 35% in simulated success rates.

02. Achieves 55% higher success in real robot experiments compared to similar models.

03. Outperforms the larger RT-2-X model by 18% in simulation.

Abstract

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by the low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from a VLM. Unlike previous works that directly repurpose a VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on the VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement with diffusion action transformers for action sequence modeling, as…
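To make the idea of an action module conditioned on VLM output more concrete, the following is a minimal sketch of diffusion-style sampling of an action sequence given a conditioning feature. It is not the authors' implementation. Everything in it is an assumption made for illustration: the names `make_toy_denoiser` and `sample_actions`, the random linear map standing in for a diffusion action transformer, the standard DDPM reverse update, the linear noise schedule, and all dimensions.

```python
import numpy as np


def make_toy_denoiser(action_dim, horizon, cond_dim, seed=0):
    """Hypothetical stand-in for a learned denoiser: a fixed random linear map.

    A real action module would be a trained network; this toy version only
    shows how the conditioning feature enters the noise prediction.
    """
    rng = np.random.default_rng(seed)
    in_dim = horizon * action_dim + cond_dim + 1  # actions + condition + timestep
    w = rng.normal(scale=0.01, size=(in_dim, horizon * action_dim))

    def predict_noise(noisy_actions, t, cond):
        x = np.concatenate([noisy_actions.ravel(), cond, [t]])
        return (x @ w).reshape(horizon, action_dim)

    return predict_noise


def sample_actions(predict_noise, cond, horizon, action_dim, num_steps=10, seed=0):
    """Generic DDPM reverse process over an action sequence (assumed schedule)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise and denoise step by step.
    x = rng.normal(size=(horizon, action_dim))
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t / num_steps, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=x.shape) if t > 0 else 0.0  # no noise on the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x


# Illustrative sizes only: 16 future steps of a 7-dimensional action,
# conditioned on an 8-dimensional feature standing in for the VLM output.
denoiser = make_toy_denoiser(action_dim=7, horizon=16, cond_dim=8)
actions = sample_actions(denoiser, cond=np.ones(8), horizon=16, action_dim=7)
```

The point of the sketch is the separation of roles: the conditioning vector is produced once by an upstream model, and the action module iteratively refines a whole continuous action sequence, rather than emitting quantized action tokens one at a time.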

Taxonomy

Topics: Robotics and Automated Systems · Robot Manipulation and Learning

Methods: Diffusion