VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation

Yixiang Chen; Yan Huang; Keji He; Peiyan Li; Liang Wang

arXiv:2512.16724·cs.RO·December 19, 2025

TL;DR

VERM introduces a virtual eye mechanism using foundation models to improve 3D robotic manipulation by filtering redundant information, leading to faster training and inference, and better task performance.

Contribution

The paper proposes VERM, a novel virtual view generation method leveraging foundation models for efficient 3D manipulation with reduced computational costs.

Findings

01. Achieves 1.89x faster training

02. Achieves 1.54x faster inference

03. Outperforms previous state-of-the-art methods

Abstract

When performing 3D manipulation tasks, robots must plan actions from the observations of multiple fixed cameras. This multi-camera setup introduces substantial redundant and irrelevant information, which increases computational cost and forces the model to spend extra training time extracting the task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose VERM (Virtual Eye for Robotic Manipulation), a method that leverages the knowledge in foundation models to imagine a virtual, task-adaptive view from the constructed 3D point cloud. This view efficiently captures the necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both the simulation benchmark RLBench…
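To make the idea of a virtual view rendered from a point cloud concrete, here is a minimal sketch that projects 3D points into a pinhole camera placed at an arbitrary pose and keeps the nearest depth per pixel. It is a generic z-buffer projection, not the authors' renderer: the function names (`look_at`, `render_depth`), the focal length, and the image size are all hypothetical, and the camera pose is set by hand here, whereas the abstract describes choosing the view with foundation models.

```python
import math

def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """Camera axes (right, down, forward) for a camera at `eye` facing `target`.
    Assumes the viewing direction is not parallel to `up`."""
    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])
    def unit(a):
        n = math.sqrt(sum(x * x for x in a))
        return tuple(x / n for x in a)
    forward = unit(tuple(t - e for t, e in zip(target, eye)))
    right = unit(cross(forward, up))
    down = cross(forward, right)  # image rows grow downward
    return right, down, forward

def render_depth(points, eye, target, size=5, focal=2.0):
    """Project points into a size x size depth image, keeping the nearest hit."""
    right, down, forward = look_at(eye, target)
    depth = [[math.inf] * size for _ in range(size)]
    for p in points:
        d = tuple(pi - ei for pi, ei in zip(p, eye))
        x = sum(a * b for a, b in zip(d, right))
        y = sum(a * b for a, b in zip(d, down))
        z = sum(a * b for a, b in zip(d, forward))
        if z <= 0:  # point is behind the virtual camera
            continue
        u = math.floor(focal * x / z + size / 2)
        v = math.floor(focal * y / z + size / 2)
        if 0 <= u < size and 0 <= v < size:
            depth[v][u] = min(depth[v][u], z)  # z-buffer: nearest point wins
    return depth

# Hypothetical toy cloud: two points on the optical axis, one off-axis,
# and one behind the camera.
cloud = [(0, 0, 0), (0, 1, 0), (0.5, 0, 0), (0, -2, 0)]
depth = render_depth(cloud, eye=(0, -1, 0), target=(0, 0, 0))
```

In this toy setup the farther on-axis point is hidden by the nearer one, which is the same occlusion effect that makes the choice of viewpoint matter: a different `eye` would expose a different subset of the cloud.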

Taxonomy

Topics: Robot Manipulation and Learning · Robotics and Sensor-Based Localization · Advanced Vision and Imaging