TL;DR
RoboLab is a high-fidelity simulation benchmark designed to evaluate the true generalization of task-generalist robotic policies through diverse tasks and systematic analysis of policy robustness.
Contribution
The paper introduces RoboLab, a scalable simulation framework with 120 tasks, together with a systematic analysis method for assessing policy performance and robustness.
Findings
Current state-of-the-art models show significant performance gaps in RoboLab.
RoboLab enables analysis of policy sensitivity to controlled perturbations.
The benchmark provides granular metrics for evaluating generalization.
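As a rough illustration of what a sensitivity analysis over controlled perturbations could look like, the sketch below ranks perturbation factors by how much they lower success rate relative to unperturbed runs. This is not RoboLab's API or the authors' metric: the function name `sensitivity_by_factor`, the factor labels (`lighting`, `camera_pose`), and the episode data are all hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict


def sensitivity_by_factor(episodes, baseline="none"):
    """Rank perturbation factors by drop in success rate versus baseline.

    episodes: iterable of (factor, success) pairs, where factor == baseline
    marks an unperturbed run. All names here are illustrative assumptions.
    """
    counts = defaultdict(lambda: [0, 0])  # factor -> [successes, trials]
    for factor, success in episodes:
        counts[factor][0] += int(success)
        counts[factor][1] += 1
    if baseline not in counts:
        raise ValueError("no baseline episodes provided")

    rates = {f: s / n for f, (s, n) in counts.items()}
    base_rate = rates[baseline]
    # Positive drop means the perturbation hurt the policy.
    drops = {f: base_rate - r for f, r in rates.items() if f != baseline}
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical rollouts: 4 episodes per condition.
episodes = [
    ("none", True), ("none", True), ("none", True), ("none", False),
    ("lighting", True), ("lighting", True), ("lighting", False), ("lighting", False),
    ("camera_pose", True), ("camera_pose", False), ("camera_pose", False), ("camera_pose", False),
]
ranking = sensitivity_by_factor(episodes)
for factor, drop in ranking:
    print(f"{factor}: success rate drop {drop:.2f}")
```

With so few episodes per condition the ranking would be noisy; a real analysis would need many more rollouts per factor and some measure of uncertainty.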
Abstract
The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which factors most strongly affect policy behavior. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a high-fidelity simulation environment. We introduce an…
