Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation

Teli Ma; Jiaming Zhou; Zifan Wang; Ronghe Qiu; Junwei Liang

arXiv:2406.09738·cs.RO·June 17, 2024

TL;DR

This paper introduces Sigma-Agent, a novel imitation learning framework for multi-task robotic manipulation guided by language and vision, utilizing contrastive learning and a multi-view transformer to improve task understanding and performance.

Contribution

The paper presents Sigma-Agent, combining contrastive imitation learning modules and a multi-view transformer for enhanced multi-task robotic manipulation from language and visual inputs.

Findings

01 Outperforms state-of-the-art methods on 18 RLBench tasks.

02 Achieves a 62% success rate in real-world manipulation with a single policy.

03 Surpasses RVT by an average of 5.2% and 5.9% across the evaluated settings on 18 RLBench tasks.

Abstract

Developing robots capable of executing various manipulation tasks, guided by natural language instructions and visual observations of intricate real-world environments, remains a significant challenge in robotics. Such robot agents need to understand linguistic commands and distinguish between the requirements of different tasks. In this work, we present Sigma-Agent, an end-to-end imitation learning agent for multi-task robotic manipulation. Sigma-Agent incorporates contrastive Imitation Learning (contrastive IL) modules to strengthen vision-language and current-future representations. An effective and efficient multi-view querying Transformer (MVQ-Former) for aggregating representative semantic information is introduced. Sigma-Agent shows substantial improvement over state-of-the-art methods under diverse settings in 18 RLBench tasks, surpassing RVT by an average of 5.2% and 5.9% in 10…
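
To make the idea of a contrastive objective concrete, here is a minimal sketch of a symmetric InfoNCE loss between paired embeddings. It is an illustration under assumptions, not the paper's implementation: the abstract does not specify the loss form, so the function names (`info_nce`, `symmetric_info_nce`), the temperature value, and the toy vectors are all hypothetical. The same pairing could in principle be applied to vision-language pairs or to current-future state pairs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy in which positives[i] is the match for anchors[i]
    and every other row of `positives` acts as a negative (assumed setup)."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(a)

def symmetric_info_nce(x, y, temperature=0.1):
    """Average the loss over both matching directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))

# Toy batch of two (vision, language) pairs; all values are made up.
vision = [[1.0, 0.0], [0.0, 1.0]]
language_aligned = [[0.9, 0.1], [0.1, 0.9]]   # row i matches vision[i]
language_swapped = [[0.1, 0.9], [0.9, 0.1]]   # row i matches the other sample

loss_aligned = symmetric_info_nce(vision, language_aligned)
loss_swapped = symmetric_info_nce(vision, language_swapped)
print(loss_aligned < loss_swapped)  # prints True
```

The loss is small when each embedding is closest to its own partner and large when partners are mismatched, which is the pressure that pulls matched representations together and pushes unmatched ones apart.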

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics

Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer