Learning Efficient and Robust Language-conditioned Manipulation using Textual-Visual Relevancy and Equivariant Language Mapping

Mingxi Jia; Haojie Huang; Zhewen Zhang; Chenghao Wang; Linfeng Zhao; Dian Wang; Jason Xinyu Liu; Robin Walters; Robert Platt; Stefanie Tellex

arXiv:2406.15677·cs.RO·June 27, 2025

TL;DR

This paper presents GEM, a method that combines pretrained vision-language models with equivariant language mapping to enable efficient, robust, and generalizable language-conditioned robot manipulation in both simulation and real-world settings.

Contribution

GEM is a novel approach that improves robustness and sample efficiency in language-conditioned manipulation by combining equivariant language mapping with pretrained vision-language models.

Findings

01. GEM outperforms baselines in sample efficiency and generalization.

02. GEM maintains high performance on unseen objects and poses.

03. GEM demonstrates robustness over large vision-language models.

Abstract

Controlling robots through natural language is pivotal for enhancing human-robot collaboration and synthesizing complex robot behaviors. Recent methods trained on large robot datasets show impressive generalization abilities. However, such pretrained methods are (1) often fragile in unseen scenarios, and (2) expensive to adapt to new tasks. This paper introduces Grounded Equivariant Manipulation (GEM), a robust yet efficient approach that leverages pretrained vision-language models with equivariant language mapping for language-conditioned manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and generalization ability across diverse tasks in both simulation and the real world. GEM achieves similar or higher performance with orders of magnitude less robot data compared with major data-efficient baselines such as CLIPort and VIMA. Finally, our approach…

Taxonomy

Topics: Geographic Information Systems Studies · Semantic Web and Ontologies · Speech and dialogue systems