Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

Junjin Xiao; Dongyang Li; Yandan Yang; Shuang Zeng; Tong Lin; Xinyuan Chang; Feng Xiong; Mu Xu; Xing Wei; Zhiheng Ma; Qing Zhang; Wei-Shi Zheng

arXiv:2605.11832·cs.RO·May 13, 2026


TL;DR

This paper combines latent novel views synthesized by a pre-trained multi-view diffusion model, a Geometry-Guided Gated Transformer (G3T), and Action Manifold Learning (AML) to improve spatial perception and manipulation in Vision-Language-Action (VLA) models.

Contribution

The paper proposes a framework that integrates latent multi-view synthesis, 3D geometry-guided feature alignment, and direct action prediction to improve action learning efficiency and robustness in robotic manipulation.

Findings

01. Achieves higher success rates than SOTA baselines on the LIBERO and RoboTwin 2.0 benchmarks.

02. Demonstrates robustness and efficiency improvements over state-of-the-art methods.

03. Validates effectiveness on real-robot manipulation tasks.

Abstract

This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address depth ambiguity from monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views and propose a Geometry-Guided Gated Transformer (G3T) that aligns multi-view features under 3D geometric guidance while adaptively filtering occlusion noise. To improve action learning efficiency, we introduce Action Manifold Learning (AML), which directly predicts actions on the valid action manifold, bypassing inefficient regression of unstructured targets such as noise or velocity. Experiments on LIBERO, RoboTwin 2.0, and real-robot tasks show that our method achieves superior success rates and robustness over SOTA baselines. Project page: https://junjxiao.github.io/Multi-view-VLA.github.io/
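
The abstract contrasts direct action prediction with regression of noise or velocity, but it does not give the AML formulation. The sketch below is therefore only an illustration under stated assumptions: it assumes a generic linear noise-to-action interpolation path, as is common in flow-matching policies, and all function names, shapes, and values are hypothetical rather than taken from the paper. It shows the three candidate regression targets for the same noisy sample, and that a predicted clean action can be converted back to a velocity when a sampler needs one.

```python
import numpy as np


def interpolate(action, noise, t):
    """Linear path from pure noise (t=0) to the clean action (t=1). Assumed, not from the paper."""
    return (1.0 - t) * noise + t * action


def regression_targets(action, noise):
    """Three possible training targets for the same noisy sample."""
    return {
        "noise": noise,              # noise prediction: unstructured Gaussian target
        "velocity": action - noise,  # velocity of the linear path: still mixes in noise
        "action": action,            # direct action prediction: target is a valid action
    }


def velocity_from_action(x_t, action_pred, t):
    """Velocity implied by a predicted clean action on the linear path (requires t < 1)."""
    return (action_pred - x_t) / (1.0 - t)


rng = np.random.default_rng(0)
# Hypothetical chunk of 4 actions with 7 dimensions each, bounded to [-1, 1].
action = np.clip(rng.normal(size=(4, 7)), -1.0, 1.0)
noise = rng.normal(size=action.shape)
t = 0.3

x_t = interpolate(action, noise, t)
targets = regression_targets(action, noise)

# With a perfect action prediction, the implied velocity matches the velocity target.
v = velocity_from_action(x_t, targets["action"], t)
print(np.allclose(v, targets["velocity"]))
```

In this toy setup only the "action" target stays inside the bounded action range; the noise and velocity targets are unbounded, which is one way to read the abstract's description of them as unstructured.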


Code & Models

Repositories

https://junjxiao.github.io/Multi-view-VLA.github.io

Models

junjin0/Multi-view-VLA

Datasets

junjin0/libero_mv_feats (49 downloads)
