Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy

Ricardo Garcia; Shizhe Chen; Cordelia Schmid

arXiv:2410.01345 · cs.RO · March 4, 2025

TL;DR

This paper introduces GemBench, a new benchmark for evaluating how well vision-language robotic manipulation policies generalize, and proposes 3D-LOTUS++, which combines 3D information, LLMs, and VLMs to improve performance on novel tasks.

Contribution

The paper presents GemBench, a benchmark covering seven general action primitives and four levels of generalization, and introduces 3D-LOTUS++, which pairs the 3D-LOTUS policy with LLMs and VLMs to handle novel tasks.

Findings

1. 3D-LOTUS performs well on seen tasks but struggles with novel tasks.
2. 3D-LOTUS++ achieves state-of-the-art results on novel tasks.
3. GemBench provides a standardized platform for evaluating generalization in robotic manipulation.

Abstract

Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with…

Code & Models

Repositories

vlc-robot/robot-3dlotus
PyTorch · Official

Datasets

rjgpinel/GEMBench
dataset · 1.0k downloads
