VIP: Vision Instructed Pre-training for Robotic Manipulation
Zhuoling Li, Liangliang Ren, Jinrong Yang, Yong Zhao, Xiaoyang Wu, Zhenhua Xu, Xiang Bai, Hengshuang Zhao

TL;DR
This paper introduces VIP, a vision instructed pre-training method for robotic manipulation that leverages visual guidance and sparse point flows to enhance task understanding and improve performance across diverse tasks.
Contribution
The paper proposes using vision instructions and sparse point flows for pre-training policies, addressing limitations of text-based instructions in robotic manipulation.
Findings
VIP significantly improves task performance.
Policies can complete complex tasks like opening sealed bottles.
Vision instructions outperform text instructions in training policies.
Abstract
The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge in manipulation is that the tasks are diverse, and a trained policy becomes confused if the task targets are not specified clearly. Existing works primarily rely on text instructions to describe targets. However, we reveal that current robotic data cannot train policies to understand text instructions effectively, and vision is much more comprehensible. Therefore, we propose using vision instructions to specify targets. A straightforward implementation is to train a policy to predict the intermediate actions linking the current observation and a future image. Nevertheless, a single future image does not describe the task target in sufficient detail. To handle this problem, we propose to use sparse point flows to provide more detailed information. Extensive tasks are designed…
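The abstract's "straightforward implementation" can be pictured as slicing a recorded trajectory into (current observation, future image, intermediate actions) triples, with a sparse point flow attached as extra guidance. The sketch below is only an illustration under stated assumptions: the function name, the fixed `horizon`, the random choice of points, and the idea that per-frame point tracks are already available (for example from an off-the-shelf tracker) are all hypothetical and are not taken from the paper.

```python
import random


def make_vision_instructed_sample(frames, actions, tracks, t, horizon,
                                  num_points, seed=0):
    """Build one hypothetical training sample from a recorded trajectory.

    frames  : sequence of T images (any object per frame)
    actions : sequence of T actions, actions[k] being taken at frame k
    tracks  : tracks[k][i] is the (x, y) pixel position of point i in frame k
              (assumed to be precomputed; the tracker is not specified here)
    """
    last = len(frames) - 1
    goal = min(t + horizon, last)  # index of the future image used as the instruction

    # Pick a sparse subset of tracked points and follow the same points
    # from frame t to the goal frame.
    rng = random.Random(seed)
    point_ids = sorted(rng.sample(range(len(tracks[t])), num_points))
    point_flow = [[tracks[k][i] for i in point_ids] for k in range(t, goal + 1)]

    return {
        "observation": frames[t],            # what the policy currently sees
        "future_image": frames[goal],        # vision instruction
        "point_flow": point_flow,            # (goal - t + 1) steps x num_points
        "target_actions": actions[t:goal],   # actions linking frame t to the goal
    }
```

A policy trained on such samples would take `observation`, `future_image`, and `point_flow` as inputs and regress `target_actions`; how the paper actually encodes and fuses these inputs is not described in this summary.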
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Teleoperation and Haptic Systems
