ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Ran Cheng, Yaxin Peng, Chaomin Shen, Feifei Feng

TL;DR
ChatVLA is a unified vision-language-action framework for multimodal understanding and robot control. It counters spurious forgetting and task interference with Phased Alignment Training and a Mixture-of-Experts architecture, and reports stronger results than prior VLA methods on multimodal benchmarks and real-world robot tasks.
Contribution
The paper presents ChatVLA, a novel multimodal model with phased training and mixture-of-experts architecture, improving upon existing vision-language-action models for robot understanding and control.
Findings
Outperforms state-of-the-art VLA methods on multimodal benchmarks.
Achieves 6x higher performance on MMMU and 47.2% on MMStar.
Demonstrates superior real-world robot task performance.
Abstract
Humans possess a unified cognitive ability to perceive, comprehend, and interact with the physical world. Why can't large language models replicate this holistic understanding? Through a systematic analysis of existing training paradigms in vision-language-action models (VLA), we identify two key challenges: spurious forgetting, where robot training overwrites crucial visual-text alignments, and task interference, where competing control and understanding tasks degrade performance when trained jointly. To overcome these limitations, we propose ChatVLA, a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference. ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art…
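The abstract names a Mixture-of-Experts architecture as the remedy for task interference but does not describe its routing. As a rough illustration only, the sketch below assumes the simplest possible variant: a shared layer followed by one expert per task, selected by an explicit task label. The class name `TaskRoutedBlock`, the task names, and the layer shapes are all hypothetical and are not taken from the paper.

```python
import random


def make_linear(n_in, n_out, rng):
    """Build an n_out x n_in weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class TaskRoutedBlock:
    """Hypothetical block: shared layer, then one expert per task.

    Assumption: routing is a static lookup on the task label, so the
    understanding and control experts never share their own weights.
    """

    def __init__(self, dim, tasks=("understanding", "control"), seed=0):
        rng = random.Random(seed)
        self.shared = make_linear(dim, dim, rng)
        self.experts = {task: make_linear(dim, dim, rng) for task in tasks}

    def forward(self, x, task):
        hidden = apply_linear(self.shared, x)
        return apply_linear(self.experts[task], hidden)

    def trainable_parameters(self, task):
        """Weights that a batch of the given task would update."""
        return {"shared": self.shared, task: self.experts[task]}


block = TaskRoutedBlock(dim=4)
features = [1.0, 0.0, 0.0, 0.0]
understanding_out = block.forward(features, "understanding")
control_out = block.forward(features, "control")
```

In this sketch a control batch touches only the shared weights and the control expert, which is one way to keep the two objectives from overwriting each other. A phased schedule in the spirit of the abstract would train on control data first and mix in multimodal data afterwards; the actual schedule and routing in ChatVLA may differ.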
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · AI in Service Interactions
