ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge

Zhongyi Zhou; Yichen Zhu; Junjie Wen; Chaomin Shen; Yi Xu

arXiv:2505.21906 · cs.RO · June 2, 2025

TL;DR

ChatVLA-2 is a vision-language-action model that retains the knowledge of its pre-trained VLM and carries that reasoning into robot control, surpassing existing methods in open-world understanding and spatial reasoning.

Contribution

We introduce ChatVLA-2, a mixture-of-experts VLA model with a two-stage training pipeline that preserves VLM capabilities while enabling complex reasoning in robotic tasks.
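
To make the term "mixture-of-experts" concrete, here is a minimal, generic sketch of top-k expert routing in plain Python. It is an illustration of the general idea only: this page does not describe ChatVLA-2's routing design, so the names `moe_forward`, `gate_weights`, and the toy experts are hypothetical and should not be read as the authors' architecture.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: int = 1,
) -> Vector:
    """Generic top-k mixture-of-experts forward pass (hypothetical sketch).

    gate_weights holds one row per expert; the gating score of an expert
    is the dot product of its row with the input x.
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)

    # Keep the top_k most probable experts and renormalise their weights.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)

    out: Vector = []
    for i in chosen:
        y = experts[i](x)
        weight = probs[i] / norm
        if not out:
            out = [0.0] * len(y)
        # Accumulate the weighted output of this expert.
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Two toy experts: one doubles the input, the other negates it.
toy_experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
toy_gate = [[1.0, 0.0], [0.0, 1.0]]

result = moe_forward([3.0, 1.0], toy_gate, toy_experts, top_k=1)
```

With `top_k=1` only the highest-scoring expert runs, so the other experts' parameters are untouched for that input. That separation is the general reason mixture-of-experts designs are used when one model must serve several kinds of task.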

Findings

01. Exceptional mathematical reasoning and OCR capabilities without explicit training.

02. Strong spatial reasoning skills for interpreting novel instructions.

03. Outperforms state-of-the-art imitation learning methods.

Abstract

Vision-language-action (VLA) models have emerged as the next generation of models in robotics. However, despite leveraging powerful pre-trained Vision-Language Models (VLMs), existing end-to-end VLA systems often lose key capabilities during fine-tuning as the model adapts to specific robotic tasks. We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) open-world embodied reasoning: the VLA should inherit the VLM's knowledge, i.e., recognize anything that the VLM can recognize, solve math problems, and possess visual-spatial intelligence; and 2) reasoning following: effectively translating the open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-experts VLA model coupled with a specialized two-stage training pipeline designed to preserve the VLM's original…

Code & Models

Repositories

tutujingyugang1/ChatVLA_public
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications