RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

Yulin Luo; Chun-Kai Fan; Menghang Dong; Jiayu Shi; Mengdi Zhao; Bo-Wen Zhang; Cheng Chi; Jiaming Liu; Gaole Dai; Rongyu Zhang; Ruichuan An; Kun Wu; Zhengping Che; Shaoxuan Xie; Guocai Yao; Zhongxia Zhao; Pengwei Wang; Guang Liu; Zhongyuan Wang; Tiejun Huang; Shanghang Zhang

arXiv:2510.17801 · cs.RO · October 22, 2025

TL;DR

RoboBench is a new benchmark designed to systematically evaluate multimodal large language models as embodied brains in robotics, covering multiple cognitive dimensions with realistic, diverse datasets and a planning evaluation framework.

Contribution

This work introduces RoboBench, a comprehensive benchmark for assessing multimodal large language models (MLLMs) as the embodied brain of a robot. It targets two limitations of earlier benchmarks that probe high-level reasoning: incomplete coverage of cognitive dimensions and limited task realism.

Findings

01. MLLMs struggle with implicit instruction comprehension.

02. Difficulties observed in spatiotemporal reasoning and planning.

03. Challenges in affordance understanding and failure diagnosis.

Abstract

Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success, or when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Comprehensive Scope and Structure: RoboBench uniquely combines five major cognitive dimensions, directly tracing the manipulation pipeline from instruction to error recovery, surpassing existing embodied benchmarks in breadth and integration.
2. Novel Planning Evaluation Framework: The MLLM-as-world-simulator approach moves beyond text matching by assessing step-by-step plan execution feasibility using ground-truth action lists and annotated DAGs.
3. The paper is easy to follow and well-strut

Weaknesses

1. While the empirical evaluation is rigorous, the paper lacks a deeper theoretical analysis of cognitive failure patterns, limitations in MLLMs’ reasoning processes, or broader learning-theoretic implications. Theoretical insights or analyses, such as attention-based analysis, probing, or architectural examination, are limited.
2. Limited Analysis of Model Differences: The results show performance gaps but do not explore how different model designs, training data, or prompt styles affect them.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The inclusion of the multi-view planning and error analysis tasks is good; it complements some existing evaluations of MLLMs on embodied tasks.

Weaknesses

1. The paper mostly uses existing data from other benchmarks to construct this evaluation, without any notable principled data curation method. The goal seems to be mostly just extending coverage of existing benchmarks and meshing together aspects from previous evals. The engineering effort is useful, yet research contributions are limited.
2. Comparison with relevant baseline benchmarks in Table 1 seems to be subjective and somewhat dubious. For example, it is unclear what is being referred to

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The writing is simple and clear, resulting in good overall readability. The benchmark effectively assesses long-horizon planning ability by simulating whether the generated plans achieve key object-state milestones. It comprehensively covers diverse hardware configurations including bimanual, single-arm, and mobile robot setups as well as multiple task viewpoints, enhancing its generality and applicability.
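
The reviews describe the planning evaluation only in outline: plans are checked step by step against ground-truth action lists with annotated DAGs, and against key object-state milestones. As a rough illustration of that idea, the Python sketch below replaces the MLLM world simulator with a deterministic symbolic check, which is an assumption made here for clarity and not the paper's method. The function names, the action names, and the `prereqs`/`effects`/`milestones` data are all hypothetical.

```python
from typing import Dict, List, Set


def plan_respects_dag(plan: List[str], prereqs: Dict[str, Set[str]]) -> bool:
    """True if every DAG action occurs in the plan, each after all of its prerequisites."""
    done: Set[str] = set()
    for action in plan:
        if action in prereqs:  # actions outside the DAG are ignored as extras
            if not prereqs[action] <= done:
                return False  # a prerequisite has not been completed yet
            done.add(action)
    return done == set(prereqs)


def milestone_score(plan: List[str],
                    effects: Dict[str, Dict[str, str]],
                    milestones: Dict[str, str]) -> float:
    """Fraction of milestone object states that hold after applying the plan's effects in order."""
    state: Dict[str, str] = {}
    for action in plan:
        state.update(effects.get(action, {}))
    if not milestones:
        return 1.0
    hits = sum(1 for obj, value in milestones.items() if state.get(obj) == value)
    return hits / len(milestones)


# Hypothetical task: put a cup into a drawer and close the drawer.
prereqs = {
    "open_drawer": set(),
    "pick_cup": set(),
    "place_cup_in_drawer": {"open_drawer", "pick_cup"},
    "close_drawer": {"place_cup_in_drawer"},
}
effects = {
    "open_drawer": {"drawer": "open"},
    "pick_cup": {"cup": "in_gripper"},
    "place_cup_in_drawer": {"cup": "in_drawer"},
    "close_drawer": {"drawer": "closed"},
}
milestones = {"cup": "in_drawer", "drawer": "closed"}

good_plan = ["pick_cup", "open_drawer", "place_cup_in_drawer", "close_drawer"]
swapped_plan = ["pick_cup", "place_cup_in_drawer", "open_drawer", "close_drawer"]
partial_plan = ["open_drawer", "pick_cup", "place_cup_in_drawer"]
```

In this toy setup `good_plan` passes both checks, `swapped_plan` fails the ordering check because the cup is placed before the drawer is opened, and `partial_plan` reaches only one of the two milestones. A DAG permits several valid orderings of independent steps, which is one reason such a check is more forgiving than matching a plan against a single reference string.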

Weaknesses

The overall work feels engineering-oriented, focusing mainly on data curation rather than proposing new methods to improve MLLM performance, which limits the novelty and conceptual contribution of the paper. The five key capabilities defined in this work appear to have some overlap. For example, embodied instruction comprehension could arguably be considered part of embodied generalized planning, as both involve understanding structured task sequences. The motivation and practical value of the

Code & Models

Datasets

LeoFan01/RoboBench
dataset · 1.0k dl

Taxonomy

Topics

Multimodal Machine Learning Applications · Robot Manipulation and Learning · Action Observation and Synchronization