RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, Yuke Zhu

TL;DR
RoboCasa365 is a large-scale, diverse simulation benchmark for household robot tasks, enabling systematic evaluation of generalist robot policies across various settings and providing insights into factors influencing their performance.
Contribution
The paper introduces RoboCasa365, a comprehensive simulation benchmark with extensive tasks, environments, and demonstration data to evaluate and advance generalist robot learning.
Findings
Task diversity significantly impacts generalization.
Larger datasets improve policy robustness.
Environment variation influences transferability.
Abstract
Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remains difficult to gauge how close we are to this vision. The field lacks a reproducible, large-scale benchmark for systematic evaluation. To fill this gap, we present RoboCasa365, a comprehensive simulation benchmark for household mobile manipulation. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1,600 hours of synthetically generated demonstration data, making it one of the most diverse and large-scale resources for studying generalist policies. RoboCasa365 is designed to support systematic evaluations for different problem settings, including multi-task learning, robot foundation model training, and lifelong…
Peer Reviews
Decision: ICLR 2026 Poster
1. The additional assets and scenes significantly increase the task diversity of the original RoboCasa benchmark. The simulated kitchen environments have highly realistic and diverse layouts, and the authors ensure the task activities come from a diverse set of categories.
2. The paper presents systematic experiments to study key factors in training generalist robot policies. The authors have put dedicated effort into implementing various policy training baselines and compared results on both seen
1. Lack of real-world evaluations. Although a diverse simulation task suite provides many opportunities for studying algorithms, without a real-world digital twin or policy transfer evaluation, the value of the benchmark is limited, and it is unclear whether any assumptions or implementation bugs in the simulation environments would hinder transfer to the real world.
2. Lack of qualitative results. The submission did not include any supplementary materials. For policy learning, it would have been much cle
The benchmark seems quite comprehensive and the experiments are sensible and systematic. One very interesting result is that synthetic demos do not help at all. In the pretraining data study, they show that using only the human demos is good enough. I would like to see additional explanation of why MimicGen is not helpful, since MimicGen itself reports positive gains from using synthetic demos. The paper is easy to read.
This is not really a substantial weakness, but the experiments and results are very straightforward and almost boring. It would be nice to see more interesting or qualitative phenomena, such as why synthetic demos are harmful, why GR00T outperforms other generalist policies (i.e., is hierarchy helpful?), and whether LoRA finetuning affects performance, both in 4.2 and 4.3, etc. Another suggestion is to stratify the tasks, e.g., contact-rich tasks, mobile manipulation tasks, etc., and see performance p
- The authors provide a valuable resource for the community to benchmark generalist robot policies: there is very high diversity in simulated scenes, and a large amount of demonstrations for pre-training and post-training.
- There are a large number (365) of tasks spanning 50 different categories. There is a large number of scenes, with 50 different layouts, each of which has 50 different styles. This is a huge improvement over prior work.
- The authors also provide a good tool to benchmark l
Overall, this is a good paper and a great resource for the community. However, I think the experiments section is still a bit lacking and misses a good opportunity to provide more insights into the dataset collected by the authors.
- The writing in the experiments section is somewhat unclear: (1) It is unclear what the evaluation protocol is and how different models are compared (i.e., do you report success rate? Is there partial success? etc.). It is also unclear what the numbers mean in Table
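Two of the reviews above ask for a clearer success-rate protocol and for results stratified by task type. A minimal sketch of such a breakdown is shown here, assuming each evaluation rollout is recorded as a (category, success) pair. The function name, the category labels, and the rollout data are hypothetical illustrations and do not come from the paper or the RoboCasa codebase.

```python
from collections import defaultdict


def success_rate_by_category(rollouts):
    """Return the fraction of successful rollouts per task category.

    `rollouts` is an iterable of (category, success) pairs, where
    `success` is a bool. This record format is an assumption made for
    illustration, not the benchmark's actual logging format.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, success in rollouts:
        totals[category] += 1
        successes[category] += int(success)
    return {category: successes[category] / totals[category] for category in totals}


# Made-up rollout outcomes, for illustration only.
rollouts = [
    ("contact_rich", True),
    ("contact_rich", False),
    ("mobile_manipulation", True),
    ("mobile_manipulation", True),
    ("mobile_manipulation", False),
    ("mobile_manipulation", False),
]

rates = success_rate_by_category(rollouts)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.2f}")
```

Reporting the per-category rate alongside the number of rollouts in each category would make it easier to judge whether differences between policies are meaningful.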
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Social Robot Interaction and HRI
