ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model

Zhongyi Zhou; Yichen Zhu; Minjie Zhu; Junjie Wen; Ning Liu; Zhiyuan Xu; Weibin Meng; Ran Cheng; Yaxin Peng; Chaomin Shen; Feifei Feng

arXiv:2502.14420 · cs.RO · February 24, 2025 · 2 cites

TL;DR

ChatVLA is a unified vision-language-action framework for multimodal understanding and robot control. It tackles two training problems, spurious forgetting and task interference, with Phased Alignment Training and a Mixture-of-Experts architecture, and reports stronger results than prior VLA methods on multimodal benchmarks and real-world robot tasks.

Contribution

The paper presents ChatVLA, a vision-language-action model that combines Phased Alignment Training with a Mixture-of-Experts architecture so that one model can handle both multimodal understanding and robot control, where existing VLA models lose understanding ability during robot training.

Findings

01 Outperforms state-of-the-art VLA methods on multimodal understanding benchmarks.

02 Achieves 6x higher performance on MMMU than existing VLA methods and scores 47.2% on MMStar.

03 Outperforms existing VLA methods on real-world robot tasks.

Abstract

Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can't large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action models (VLA), we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art…

Code & Models

Repositories

tutujingyugang1/ChatVLA_public (PyTorch, official)

Videos

ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · AI in Service Interactions