VIP: Vision Instructed Pre-training for Robotic Manipulation

Zhuoling Li; Liangliang Ren; Jinrong Yang; Yong Zhao; Xiaoyang Wu; Zhenhua Xu; Xiang Bai; Hengshuang Zhao

arXiv:2410.07169·cs.RO·February 12, 2025

TL;DR

This paper introduces VIP, a vision instructed pre-training method for robotic manipulation that leverages visual guidance and sparse point flows to enhance task understanding and improve performance across diverse tasks.

Contribution

The paper proposes using vision instructions and sparse point flows for pre-training policies, addressing limitations of text-based instructions in robotic manipulation.

Findings

01 VIP significantly improves task performance.

02 Policies can complete complex tasks like opening sealed bottles.

03 Vision instructions outperform text instructions in training policies.

Abstract

The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge is that manipulation tasks are diverse, and a trained policy becomes confused if the task targets are not specified clearly. Existing works primarily rely on text instructions to describe targets. However, we reveal that current robotic data cannot train policies to understand text instructions effectively, and vision is much more comprehensible. Therefore, we propose using vision instructions to specify targets. A straightforward implementation is to train a policy to predict the intermediate actions linking the current observation and a future image. Nevertheless, a single future image does not describe the task target in sufficient detail. To handle this problem, we propose using sparse point flows to provide more detailed information. Extensive tasks are designed…
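The training setup described in the abstract can be illustrated with a minimal sketch of how one training sample might be assembled from a recorded demonstration. This is not the authors' implementation. The function name `make_vision_instructed_sample`, the array layouts, the assumption that point tracks are already available, and the encoding of a point flow as a start position plus a displacement are all hypothetical choices made for illustration.

```python
import numpy as np


def make_vision_instructed_sample(frames, actions, tracks, t, horizon, num_points, rng):
    """Assemble one training sample from a demonstration (illustrative only).

    frames:  (T, H, W, C) image observations
    actions: (T, A) action executed at each step
    tracks:  (T, N, 2) pixel coordinates of N tracked points; assumed to be
             precomputed by some point tracker (hypothetical input)
    """
    T = frames.shape[0]
    if not 0 <= t < T - horizon:
        raise ValueError("t + horizon must stay inside the trajectory")
    goal = t + horizon

    # Keep only a sparse random subset of the tracked points.
    idx = rng.choice(tracks.shape[1], size=num_points, replace=False)
    start_xy = tracks[t, idx]
    displacement = tracks[goal, idx] - start_xy

    return {
        # Policy inputs: current observation plus the vision instruction.
        "observation": frames[t],
        "future_image": frames[goal],
        # Assumed flow encoding: (x, y, dx, dy) per selected point.
        "point_flow": np.concatenate([start_xy, displacement], axis=-1),
        # Supervision: the actions linking step t to the future image.
        "target_actions": actions[t:goal],
    }
```

Under these assumptions, a policy would receive `observation`, `future_image`, and `point_flow` as input and be trained to regress `target_actions`. The point flow adds motion detail that the single future image alone does not carry.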

Code & Models

Models

Zhuoling98/VIRT_droid_pretrain
model

Datasets

Zhuoling98/VIRT_data
dataset · 25 dl

Videos

VIP: Vision Instructed Pre-training for Robotic Manipulation · SlidesLive

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Teleoperation and Haptic Systems