Grounding Robot Generalization in Training Data via Retrieval-Augmented VLMs
Jensen Gao, Dorsa Sadigh, Sandy Huang, Dhruv Shah

TL;DR
RADAR is a scalable framework that uses retrieval and vision-language models to analyze and classify the type of policy generalization needed in robot manipulation tasks, improving evaluation precision.
Contribution
Introduces RADAR, a two-stage retrieval and analysis pipeline that characterizes policy generalization in robotics using interpretable data comparisons and large-scale datasets.
Findings
VLMs are effective at analyzing data to characterize the generalization required.
The retrieval step accurately identifies relevant training examples.
RADAR scales to large datasets and agrees with human benchmarks.
Abstract
Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize how different evaluation settings actually represent generalization from the training distribution of a given policy. To work towards more precise evaluation of generalization in robotics, we propose RADAR, a scalable framework for directly comparing test-time evaluation tasks to policy training data, to determine what form of policy generalization is required. RADAR consists of a two-stage pipeline: first, retrieval using generalist policy embeddings identifies which training examples are relevant for a given evaluation task. Next, vision-language models (VLMs) analyze the evaluation task against the retrieved data, outputting interpretable analysis on how they compare along a variety of axes, and an overall classification of what type of policy…
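The retrieval stage described in the abstract can be illustrated with a minimal sketch. The abstract only states that generalist policy embeddings are used to find relevant training examples; the similarity measure (cosine), the value of k, the function names, and the toy two-dimensional embeddings below are all assumptions made for illustration, not details from the paper. The second stage (VLM analysis of the retrieved data) is not sketched here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    train_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k training embeddings closest to the query.

    Hypothetical helper: cosine similarity is an assumed choice of metric.
    """
    scored = [
        (i, cosine_similarity(query, emb))
        for i, emb in enumerate(train_embeddings)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep index order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for policy embeddings of training episodes.
train = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
# Toy embedding standing in for an evaluation task.
query = [1.0, 0.0]

top = retrieve_top_k(query, train, k=2)
print([index for index, _ in top])
```

In a pipeline of this shape, the retrieved indices would select the training episodes passed on to the analysis stage; at dataset scale, an approximate nearest-neighbor index would likely replace this exhaustive scan.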
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning
