VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang

TL;DR
VERM introduces a virtual eye mechanism using foundation models to improve 3D robotic manipulation by filtering redundant information, leading to faster training and inference, and better task performance.
Contribution
The paper proposes VERM, a novel virtual view generation method leveraging foundation models for efficient 3D manipulation with reduced computational costs.
Findings
Achieves 1.89x faster training
Achieves 1.54x faster inference
Outperforms previous state-of-the-art methods
Abstract
When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both simulation benchmark RLBench…
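The core geometric step the abstract describes, viewing the constructed point cloud from a single virtual camera instead of several fixed ones, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the view is rendered, and in VERM the camera pose would be proposed by a foundation model, whereas here it is a fixed, hand-picked pose. The helper names `look_at` and `render_depth`, the pinhole model, the image size, and the focal length are all assumptions made for illustration.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """Hypothetical helper: camera axes (right, down, forward) in world
    coordinates. Assumes the viewing direction is not parallel to `up`."""
    forward = _unit(_sub(target, eye))
    right = _unit(_cross(forward, up))
    down = _cross(forward, right)
    return right, down, forward

def render_depth(points, eye, target, size=8, focal=8.0):
    """Hypothetical helper: project 3D points into a size x size depth image
    seen from a virtual pinhole camera, keeping the nearest point per pixel."""
    right, down, forward = look_at(eye, target)
    center = size / 2.0
    depth = [[math.inf] * size for _ in range(size)]
    for p in points:
        d = _sub(p, eye)
        x, y, z = _dot(d, right), _dot(d, down), _dot(d, forward)
        if z <= 1e-6:
            continue  # point is behind the virtual camera
        u = math.floor(focal * x / z + center)
        v = math.floor(focal * y / z + center)
        # z-buffer test: a nearer point overwrites a farther one
        if 0 <= u < size and 0 <= v < size and z < depth[v][u]:
            depth[v][u] = z
    return depth

# Toy cloud: two points on the same viewing ray, one behind the camera,
# and one offset to the side.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-2.0, 0.0, 0.0), (0.0, -0.25, 0.0)]
depth = render_depth(cloud, eye=(-1.0, 0.0, 0.0), target=(0.0, 0.0, 0.0))
print(depth[4][4])  # nearest depth along the central ray
```

In this toy scene the point at the origin hides the one directly behind it, so only the nearer depth survives at the central pixel. A single well-placed view therefore yields one compact image rather than several overlapping ones, which is the kind of redundancy reduction the abstract attributes to the virtual view.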
Taxonomy
Topics: Robot Manipulation and Learning · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
