Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Yongjie Bai, Zhouxia Wang, Yang Liu, Kaijun Luo, Yifan Wen, Mingtong Dai, Weixing Chen, Ziliang Chen, Lingbo Liu, Guanbin Li, Liang Lin

TL;DR
This paper introduces TVVE (Task-aware Virtual View Exploration), a framework for robotic manipulation that selects task-relevant virtual camera viewpoints, re-renders observations from those viewpoints, and encodes them with a task-aware mixture-of-experts visual encoder, improving robustness and transferability.
Contribution
The paper proposes a novel framework combining task-aware viewpoint selection and a mixture-of-experts encoder to improve multi-task robot manipulation under occlusions and distribution shifts.
Findings
TVVE outperforms baselines in success rates on RLBench tasks.
It demonstrates robustness to visual disturbances and unseen instructions.
The approach improves transferability in multi-task robotic manipulation.
Abstract
Recent vision-language-action (VLA) models for multi-task robot manipulation often rely on fixed camera setups and shared visual encoders, which limit their performance under occlusions and during cross-task transfer. To address these challenges, we propose Task-aware Virtual View Exploration (TVVE), a framework that learns to select task-relevant virtual camera viewpoints and dynamically re-render observations from a reconstructed scene representation using the selected viewpoints. To enable efficient view selection, we train an exploration policy in a pseudo-environment. In addition, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder that routes visual features to task-specialized experts, mitigating interference in multi-task learning. To evaluate robustness under distribution shifts, we construct RLBench-OG, an out-of-distribution benchmark with visual…
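The abstract says TaskMoE routes visual features to task-specialized experts but does not spell out the gating mechanism. The following is a minimal sketch of one way such routing could look, not the authors' implementation. It assumes that the gate is conditioned only on a task embedding, that the top-k experts are mixed with softmax weights, and that each expert is a single linear map. The class `TaskRoutedMoE` and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()


class TaskRoutedMoE:
    """Toy task-conditioned mixture-of-experts layer (illustrative assumption)."""

    def __init__(self, feat_dim, task_dim, num_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Each expert is one linear map here; real experts would be deeper networks.
        self.experts = rng.normal(0.0, 0.1, size=(num_experts, feat_dim, feat_dim))
        # The gate sees only the task embedding, so routing is decided per task.
        self.gate = rng.normal(0.0, 0.1, size=(task_dim, num_experts))

    def route(self, task_emb):
        # Score every expert for this task, keep the top-k, renormalize weights.
        logits = task_emb @ self.gate
        top = np.argsort(logits)[-self.top_k:]
        return top, softmax(logits[top])

    def __call__(self, feats, task_emb):
        # feats: (num_tokens, feat_dim) visual features for one observation.
        top, weights = self.route(task_emb)
        out = np.zeros(feats.shape, dtype=float)
        for idx, w in zip(top, weights):
            out += w * (feats @ self.experts[idx])
        return out


moe = TaskRoutedMoE(feat_dim=8, task_dim=4, num_experts=4, top_k=2)
feats = np.ones((5, 8))
task_emb = np.array([1.0, 0.0, 0.0, 0.0])
mixed = moe(feats, task_emb)
```

Because the gate in this sketch depends only on the task embedding, every visual token from the same task passes through the same experts, which is one simple way to keep unrelated tasks from sharing all of their parameters.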
