Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
Ricardo Garcia, Shizhe Chen, Cordelia Schmid

TL;DR
This paper introduces GemBench, a benchmark for evaluating how well vision-language robotic manipulation policies generalize, and proposes 3D-LOTUS++, which combines 3D information, LLMs, and VLMs to improve performance on novel tasks.
Contribution
The paper presents GemBench, a generalization benchmark covering seven general action primitives and four levels of generalization, and introduces 3D-LOTUS++, which augments the 3D-LOTUS policy with LLMs and VLMs to handle novel tasks.
Findings
3D-LOTUS performs well on seen tasks but struggles with novel tasks.
3D-LOTUS++ achieves state-of-the-art results on novel tasks.
GemBench provides a standardized platform for evaluating generalization in robotic manipulation.
Abstract
Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with…
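To make the four-level evaluation concrete, here is a minimal sketch of how per-level success rates could be tallied. It is not the GemBench evaluation code: the level names paraphrase the abstract, and the `Episode` class, the `success_rate_by_level` function, and the demo episodes are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

# Level names paraphrase the abstract; they are not official GemBench identifiers.
LEVELS = {
    1: "novel placements",
    2: "novel rigid objects",
    3: "novel articulated objects",
    4: "long-horizon tasks",
}


@dataclass(frozen=True)
class Episode:
    """One evaluation rollout (hypothetical record format)."""
    level: int     # generalization level, 1 to 4
    success: bool  # whether the policy completed the task


def success_rate_by_level(episodes):
    """Return {level: fraction of successful episodes} for levels that have data."""
    counts = defaultdict(lambda: [0, 0])  # level -> [successes, total]
    for ep in episodes:
        if ep.level not in LEVELS:
            raise ValueError(f"unknown level: {ep.level}")
        counts[ep.level][0] += int(ep.success)
        counts[ep.level][1] += 1
    return {lvl: s / n for lvl, (s, n) in sorted(counts.items())}


# Placeholder episodes for illustration only; these are NOT results from the paper.
demo = [
    Episode(1, True),
    Episode(1, True),
    Episode(2, True),
    Episode(2, False),
    Episode(4, False),
]
rates = success_rate_by_level(demo)
for lvl, rate in rates.items():
    print(f"Level {lvl} ({LEVELS[lvl]}): {rate:.0%}")
```

Reporting one rate per level, rather than a single average, keeps a policy's performance on seen-like settings from masking weak performance on the harder levels.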
