What Matters in Learning from Large-Scale Datasets for Robot Manipulation
Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Chul Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu

TL;DR
This paper investigates how the composition of large-scale robot datasets affects imitation learning, revealing key diversity factors like camera poses and object arrangements, and proposes improved retrieval strategies that significantly enhance policy performance.
Contribution
The study systematically analyzes dataset composition effects on robot learning and introduces effective retrieval methods to optimize the use of existing datasets.
Findings
Camera poses and spatial arrangements are critical for dataset diversity and retrieval.
Retrieval strategies based on these insights outperform existing methods by up to 70%.
Simulation results transfer effectively to real-world robot learning scenarios.
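To make the retrieval finding concrete, here is a minimal sketch of selecting prior demonstrations whose camera pose and object placement are closest to those of a target task. It is an illustration under stated assumptions, not the paper's retrieval method: the `Demo` record, its `camera_pos` and `object_xy` fields, the `retrieve` function, and the weights `w_cam` and `w_obj` are all hypothetical names introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """Hypothetical per-demonstration metadata (not a real dataset schema)."""
    demo_id: str
    camera_pos: tuple  # assumed (x, y, z) camera position
    object_xy: tuple   # assumed (x, y) position of the manipulated object


def retrieve(demos, target_camera_pos, target_object_xy, k, w_cam=1.0, w_obj=1.0):
    """Return the k demos with the smallest weighted distance to the target setup."""
    def score(d):
        # Weighted sum of Euclidean distances in camera position and object placement.
        cam_gap = math.dist(d.camera_pos, target_camera_pos)
        obj_gap = math.dist(d.object_xy, target_object_xy)
        return w_cam * cam_gap + w_obj * obj_gap

    return sorted(demos, key=score)[:k]


# Toy prior dataset with made-up values.
demos = [
    Demo("a", (1.0, 0.0, 0.5), (0.1, 0.1)),
    Demo("b", (0.0, 1.0, 0.5), (0.4, -0.2)),
    Demo("c", (1.0, 0.1, 0.5), (0.1, 0.0)),
]

selected = retrieve(demos, (1.0, 0.0, 0.5), (0.1, 0.1), k=2)
print([d.demo_id for d in selected])
```

In this toy setup, demo `b` is dropped because both its camera position and its object placement are far from the target. A real pipeline would likely compare full camera poses (including orientation) and might use learned features rather than raw metadata.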
Abstract
Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a systematic understanding of what data should be collected to improve the utility of a robotics dataset and facilitate downstream policy learning. In this work, we conduct a large-scale dataset composition study to answer this question. We develop a data generation framework to procedurally emulate common sources of diversity in existing datasets (such as sensor placements and object types and arrangements), and use it to generate large-scale robot datasets with controlled compositions, enabling a suite of dataset composition studies that would be prohibitively expensive in the real world. We…
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper collects a large-scale robot manipulation dataset in simulation with diverse dimensions of variations to enhance understanding of each factor’s impact.
2. This paper provides practical suggestions on how to collect and leverage the data, and conducts real-world experiments on the DROID dataset to verify its effectiveness.
1. In section 5.1, from the collector’s perspective, there’s only one task, “clear table”, which seems insufficient to draw conclusions (especially given that more tasks are actually used in the following sections).
2. As illustrated in the paper, object geometry is vital to robot data. However, it is a pity that it is not included in the DVs of the experiments.
3. Although the MimicLabs dataset generates different textures for the experiments, the textures are still primarily pure colors acco
The paper addresses the crucial question of dataset composition in large-scale robotic manipulation through a systematic analysis. The two-perspective approach (collector and retriever) offers practical insights for both dataset collection and utilization. The authors conduct comprehensive experiments and clear validation in both simulation and real-world settings. The definitions and methodology based on "DV" are novel.
1. Results heavily rely on MimicGen's simplified environments with basic textures and lighting. The conclusion that "texture alignment is less important" may not hold in real-world scenarios like those in DROID with complex lighting, shadows, and material properties (in computer vision, lighting effects are often considered a special kind of texture). Scene diversity is limited compared to real datasets like DROID (8 simulated scenes vs. 52 real buildings).
2. Limited Policy Evaluation: Exper
1. This work tackles a popular yet significant problem concerning the use of large-scale robotics datasets.
2. The authors conduct extensive experiments from the collector and retriever perspectives with various DVs.
3. The paper is well-written and easy to follow.
1. The experiments in Section 5.1 only consider one task, *clear table*, making the conclusions less convincing. Results on more simulation tasks or clarification on the representativeness of this task should be added.
2. The co-train setting in this paper largely focuses on using generated/retrieved data from the same task as the test task (except for some results in Table 3). However, aligning the task setting between the target task and the tasks in a public large-scale dataset is difficult in
Taxonomy
Topics: Robot Manipulation and Learning
