EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence

Ding Zou; Feifan Wang; Mengyu Ge; Siyuan Fan; Zongbing Zhang; Wei Chen; Lingfeng Wang; Zhongyou Hu; Wenrui Yan; Zhengwei Gao; Hao Wang; Weizhao Jin; Yu Zhang; Hainan Zhao; Mingliang Zhang; Xianxian Xi; Yaru Zhang; Wenyuan Li; Zhengguang Gao; Yurui Zhu

arXiv:2510.20578·cs.CV·October 24, 2025

Open Access

TL;DR

EmbodiedBrain is a new vision-language foundation model designed to enhance task planning and execution in embodied AI agents, achieving state-of-the-art performance through innovative training and evaluation methods.

Contribution

The paper introduces EmbodiedBrain, a novel embodied foundation model with a unique training methodology and comprehensive evaluation system, advancing embodied AI capabilities.

Findings

01. Achieves superior performance on all evaluation benchmarks.

02. Establishes a new state-of-the-art for embodied foundation models.

03. Open-sources data, models, and evaluation tools for community use.

Abstract

The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts…

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Advanced Neural Network Applications