RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

Songhao Han; Boxiang Qiu; Yue Liao; Siyuan Huang; Chen Gao; Shuicheng Yan; Si Liu

arXiv:2506.06677·cs.RO·October 30, 2025

TL;DR

RoboCerebra is a benchmark for evaluating long-horizon, high-level reasoning in robotic manipulation. It combines a large-scale simulation dataset, a hierarchical planning framework, and an evaluation of vision-language models' reasoning abilities.

Contribution

The paper presents RoboCerebra, a benchmark with extended task horizons, hierarchical VLM-based planning, and structured evaluation protocols for robotic reasoning.

Findings

01. State-of-the-art VLMs show limited performance in long-horizon planning.

02. The benchmark reveals challenges in memory and reflection in robotic systems.

03. Longer, more complex tasks expose gaps in current VLM capabilities.

Abstract

Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol…

Code & Models

Datasets

Yun5/RoboCerebra_TF
dataset· 142 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning

Methods: Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Byte Pair Encoding · Softmax · Linear Layer · Dropout