RoboBrain 2.0 Technical Report

BAAI RoboBrain Team: Mingyu Cao; Huajie Tan; Yuheng Ji; Xiansheng Chen; Minglan Lin; Zhiyu Li; Zhou Cao; Pengwei Wang; Enshen Zhou; Yi Han; Yingbo Tang; Xiangqi Xu; Wei Guo; Yaoxu Lyu; Yijie Xu; Jiayu Shi; Mengfei Du; Cheng Chi; Mengdi Zhao; Xiaoshuai Hao; Junkai Zhao; Xiaojie Zhang; Shanyu Rong; Huaihai Lyu; Zhengliang Cai; Yankai Fu; Ning Chen; Bolun Zhang; Lingfeng Zhang; Shuyi Zhang; Dong Liu; Xi Feng; Songjing Wang; Xiaodan Liu; Yance Jiao; Mengsi Lyu; Zhuo Chen; Chenrui He; Yulong Ao; Xue Sun; Zheqi He; Jingshu Zheng; Xi Yang; Donghai Shi; Kunchang Xie; Bochao Zhang; Shaokai Nie; Chunlei Men; Yonghua Lin; Zhongyuan Wang; Tiejun Huang; Shanghang Zhang

arXiv:2507.02029·cs.RO·September 16, 2025

TL;DR

RoboBrain 2.0 is an embodied vision-language foundation model that unifies perception, reasoning, and planning. It comes in two variants, a lightweight 7B model and a full-scale 32B model, and reports state-of-the-art results on complex real-world embodied AI tasks.

Contribution

The report introduces RoboBrain 2.0, describing its heterogeneous architecture (a vision encoder paired with a language model), its multi-stage training, and its practical applications, and it provides open-source resources.

Findings

01. The 32B variant surpasses prior open-source and proprietary models on both spatial and temporal benchmarks.

02. The model supports key embodied AI capabilities, including spatial understanding and temporal decision-making.

03. The compact 7B variant also achieves strong performance.

Abstract

We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open-source and proprietary models. In particular, it supports key real-world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision-making (e.g., closed-loop interaction, multi-agent long-horizon planning, and scene…
