THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation
Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, Dieter Fox

TL;DR
THE COLOSSEUM is a comprehensive simulation benchmark with diverse manipulation tasks designed to evaluate and improve the generalization of robotic manipulation models across various environmental perturbations, bridging simulation and real-world performance.
Contribution
We introduce THE COLOSSEUM, a novel benchmark with 20 tasks and 14 perturbation axes, enabling systematic evaluation of model robustness and generalization in robotic manipulation.
Findings
Success rates of the 5 compared state-of-the-art models degrade by 30-50% across individual perturbation factors.
Applying multiple perturbations together causes success rate drops of ≥75%.
Certain perturbations like distractor objects and lighting significantly impact performance.
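As a rough illustration of how figures like these can be read, the sketch below computes the percent drop in success rate relative to an unperturbed baseline. This summary does not state whether the reported drops are relative or absolute, so the relative formula is an assumption. The function name `relative_drop`, the axis labels, and all numbers are hypothetical and are not results from the paper.

```python
def relative_drop(baseline: float, perturbed: float) -> float:
    """Percent drop in success rate relative to the unperturbed baseline.

    Assumption: degradation is measured relative to the baseline,
    i.e. 100 * (baseline - perturbed) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline success rate must be positive")
    return 100.0 * (baseline - perturbed) / baseline


# Hypothetical success rates (fraction of successful episodes).
# These values are made up for illustration only.
baseline_success = 0.60
perturbed_success = {
    "distractor_objects": 0.36,
    "lighting": 0.42,
    "all_combined": 0.12,
}

# Percent drop per (hypothetical) perturbation axis.
drops = {
    axis: relative_drop(baseline_success, rate)
    for axis, rate in perturbed_success.items()
}

for axis, drop in drops.items():
    print(f"{axis}: {drop:.1f}% drop")
```

Under an absolute reading, the drop would instead be the plain difference in percentage points (`100 * (baseline - perturbed)`), which gives smaller numbers whenever the baseline is below 100%.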
Abstract
To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark with 20 diverse manipulation tasks that enables systematic evaluation of models across 14 axes of environmental perturbations. These perturbations include changes in the color, texture, and size of objects, table-tops, and backgrounds; we also vary lighting, distractors, physical properties, and camera pose. Using THE COLOSSEUM, we compare 5 state-of-the-art manipulation models and find that their success rate degrades by 30-50% across these perturbation factors. When multiple perturbations are applied in unison, the…
Taxonomy
Topics: Robot Manipulation and Learning
