SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Junjie Zhang; Chenjia Bai; Haoran He; Wenke Xia; Zhigang Wang; Bin Zhao; Xiu Li; Xuelong Li

arXiv:2405.19586 · cs.CV · May 31, 2024 · 1 citation

Open Access

TL;DR

SAM-E introduces a novel robot manipulation architecture that leverages a vision foundation model and sequence imitation, significantly improving generalization, efficiency, and performance in multi-task 3D manipulation scenarios.

Contribution

The paper presents SAM-E, combining a pre-trained vision foundation model with sequence imitation and a new multi-channel heatmap for efficient long-term action prediction.
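
To make the multi-channel heatmap idea concrete, here is a minimal sketch in plain Python. It assumes that each channel of a heatmap scores candidate 2D image locations for one future step of the action sequence, and that a step is decoded by taking the peak of its channel. The function names, the peak-picking rule, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # one channel: rows x cols of scores


def peak_location(channel: Heatmap) -> Tuple[int, int]:
    """Return the (row, col) of the highest score in one heatmap channel."""
    best = (0, 0)
    best_score = channel[0][0]
    for r, row in enumerate(channel):
        for c, score in enumerate(row):
            if score > best_score:
                best_score = score
                best = (r, c)
    return best


def decode_action_sequence(heatmaps: List[Heatmap]) -> List[Tuple[int, int]]:
    """Decode one 2D location per channel, i.e. one per future time step.

    Assumption: channel k corresponds to step k of the predicted sequence.
    """
    return [peak_location(channel) for channel in heatmaps]


# Toy example: 3 channels (3 future steps) over a 3x3 image grid.
toy_heatmaps = [
    [[0.1, 0.2, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.8], [0.1, 0.1, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.2, 0.0], [0.7, 0.0, 0.1]],
]
sequence = decode_action_sequence(toy_heatmaps)
print(sequence)  # [(1, 1), (0, 2), (2, 0)]
```

Predicting all channels in one forward pass is what would let a policy emit several steps at once rather than one pose per inference call, which is consistent with the efficiency claim above.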

Findings

1. Outperforms baselines in instruction-following tasks
2. Achieves higher execution efficiency
3. Shows improved generalization in few-shot learning

Abstract

Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a…
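
The abstract mentions parameter-efficient fine-tuning of the foundation model but, in this excerpt, does not name the technique. As one illustration of the general idea, the sketch below applies a low-rank adapter to a single frozen linear layer in plain Python. The choice of low-rank adaptation, the helper names, and the `scale` parameter are assumptions made for illustration only.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def adapted_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                    scale: float = 1.0) -> Vector:
    """Compute y = W x + scale * B (A x).

    W is the frozen pre-trained weight; only the small matrices A and B
    would be trained (hypothetical setup for this sketch).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen 2x2 weight
A = [[0.5, -0.5]]              # rank-1 down-projection (1x2)
B_zero = [[0.0], [0.0]]        # up-projection (2x1), zero at initialization
x = [1.0, 1.0]

# With B at zero the adapter is a no-op: the output equals the frozen layer's.
y_initial = adapted_forward(W, A, B_zero, x)

# After (hypothetical) training, B is non-zero and shifts the output.
B_trained = [[1.0], [2.0]]
y_adapted = adapted_forward(W, A, B_trained, [2.0, 1.0])
print(y_initial, y_adapted)
```

Here the adapter adds 4 trainable numbers next to a 4-entry frozen weight, so the saving is not visible at this toy size; it appears when the rank is much smaller than the layer width.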

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Advanced Vision and Imaging

Methods: Heatmap