Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
Junjin Xiao, Dongyang Li, Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Feng Xiong, Mu Xu, Xing Wei, Zhiheng Ma, Qing Zhang, Wei-Shi Zheng

TL;DR
This paper combines a pre-trained multi-view diffusion model, a Geometry-Guided Gated Transformer (G3T), and Action Manifold Learning (AML) to improve spatial perception and manipulation in Vision-Language-Action (VLA) models.
Contribution
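It proposes a framework that synthesizes latent novel views from monocular input, aligns the resulting multi-view features under 3D geometric guidance, and predicts actions directly on the valid action manifold, with the aim of improving the efficiency and robustness of robotic manipulation.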
Findings
Achieves higher success rates than state-of-the-art baselines on the LIBERO and RoboTwin 2.0 benchmarks.
Demonstrates robustness and efficiency improvements over state-of-the-art methods.
Validates effectiveness on real-robot manipulation tasks.
Abstract
This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address depth ambiguity from monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views and propose a Geometry-Guided Gated Transformer (G3T) that aligns multi-view features under 3D geometric guidance while adaptively filtering occlusion noise. To improve action learning efficiency, we introduce Action Manifold Learning (AML), which directly predicts actions on the valid action manifold, bypassing inefficient regression of unstructured targets like noise or velocity. Experiments on LIBERO, RoboTwin 2.0, and real-robot tasks show our method achieves superior success rate and robustness over SOTA baselines. Project page: https://junjxiao.github.io/Multi-view-VLA.github.io/.
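The abstract contrasts direct action prediction with regressing "unstructured targets like noise or velocity" but does not give AML's formulation. The sketch below is only an illustration of that contrast, assuming the straight-line noise-to-action path commonly used in flow matching; the function names, the 7-dimensional action, and the path itself are hypothetical and are not taken from the paper.

```python
import random


def interpolate(noise, action, t):
    """Point on an assumed straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def regression_targets(noise, action):
    """Three possible regression targets for the same training sample."""
    return {
        "noise": list(noise),                                # noise prediction
        "velocity": [a - n for n, a in zip(noise, action)],  # velocity prediction
        "action": list(action),                              # direct action prediction
    }


def velocity_from_action(x_t, action_pred, t):
    """Velocity implied by an action prediction on the assumed path (needs t < 1)."""
    return [(a - x) / (1.0 - t) for x, a in zip(x_t, action_pred)]


rng = random.Random(0)
action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]  # hypothetical 7-dimensional action
noise = [rng.gauss(0.0, 1.0) for _ in action]
t = 0.4

x_t = interpolate(noise, action, t)
targets = regression_targets(noise, action)

# With a perfect action prediction, the implied velocity equals the velocity target,
# so the two parameterizations describe the same path; they differ in what the
# network is asked to regress.
implied_v = velocity_from_action(x_t, targets["action"], t)
```

Under this assumed path, the noise and velocity targets change with every random draw, while the action target is always the demonstrated action itself. That is one plausible reading of why the abstract calls the former "unstructured"; the paper's actual loss and path may differ.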
