Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

Boyang Cai; Qiwei Liang; Jiawei Li; Shihang Weng; Zhaoxin Zhang; Tao Lin; Xiangyu Chen; Wenjie Zhang; Jiaqi Mao; Weisheng Xu; Bin Yang; Jiaming Liang; Junhao Cai; Renjing Xu

arXiv:2603.26757·cs.RO·March 31, 2026

TL;DR

This paper systematically studies the benefits of multi-view demonstrations for robot manipulation: it quantifies the performance improvements, analyzes the underlying mechanisms, and proposes RoboNVS to synthesize multi-view data from monocular inputs.

Contribution

The paper quantifies the benefits of multi-view data, analyzes the mechanisms behind them, and introduces RoboNVS, which synthesizes multi-view videos to enhance robot manipulation.

Findings

01

Multi-view demonstrations improve single-view policy success and generalization under both fixed and randomized backgrounds.

02

Performance varies non-monotonically with view coverage: adding more views does not always help.

03

RoboNVS effectively synthesizes multi-view data from monocular inputs, boosting downstream performance.

Abstract

Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces…
