Improving Generalization of Language-Conditioned Robot Manipulation
Chenglin Cui, Chaoran Zhu, Changjae Oh, Andrea Cavallaro

TL;DR
This paper introduces a two-stage framework for language-conditioned robot manipulation that learns from few demonstrations, improving generalization and enabling zero-shot transfer in real-world environments.
Contribution
The paper proposes an instance-level semantic fusion module and a two-stage task decomposition that together improve generalization in language-conditioned robot manipulation from limited data.
Findings
Improves generalization in unseen environments
Enables zero-shot manipulation in real robots
Performs well with few demonstrations
Abstract
The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of environments. However, existing methods require a large amount of data to fine-tune VLMs for operating in unseen environments. In this paper, we present a framework that learns object-arrangement tasks from just a few demonstrations. We propose a two-stage framework that divides object-arrangement tasks into a target localization stage, for picking the object, and a region determination stage for placing the object. We present an instance-level semantic fusion module that aligns the instance-level image crops with the text embedding, enabling the model to identify the target objects defined by the natural language instructions. We validate our method…
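The idea of matching instance-level image crops against a text embedding can be illustrated with a minimal sketch. This is not the authors' fusion module: it assumes each crop and the instruction have already been embedded into the same vector space by some encoder, and it substitutes plain cosine similarity for the learned alignment. The function names `cosine` and `select_target` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_target(crop_embeddings, text_embedding):
    """Score every instance crop against the instruction embedding.

    Returns the index of the best-matching crop and the full score list.
    Hypothetical stand-in for a learned instance-level fusion step.
    """
    scores = [cosine(crop, text_embedding) for crop in crop_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings (assumed values, not real encoder outputs):
# three detected object crops and one instruction.
crops = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
instruction = [0.0, 2.0, 0.0]

best, scores = select_target(crops, instruction)
print("target crop index:", best)
```

In a two-stage setup like the one described, a selection of this kind would correspond to the target localization stage; a second scoring pass over candidate regions would play the role of the region determination stage.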
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Advanced Neural Network Applications
