RayZer: A Self-supervised Large View Synthesis Model
Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, Georgios Pavlakos

TL;DR
RayZer is a self-supervised 3D vision model that synthesizes novel views from unposed images without requiring ground-truth camera data, leveraging a transformer-based architecture and 3D-aware auto-encoding.
Contribution
It introduces a self-supervised framework for multi-view 3D scene reconstruction and novel view synthesis without any 3D supervision or camera annotations.
Findings
Achieves comparable or superior view synthesis performance to pose-supervised methods.
Effectively disentangles camera and scene representations in a self-supervised manner.
Demonstrates emerging 3D awareness from unposed images.
Abstract
We present RayZer, a self-supervised multi-view 3D vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel,…
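To make the "ray structure" idea concrete, here is a minimal NumPy sketch of how a camera can be turned into one ray per pixel. It is an illustration under stated assumptions, not the paper's implementation: the pinhole intrinsics `K`, the 4x4 camera-to-world matrix, the function names `pixel_rays` and `plucker`, and the choice of a Plücker (direction, moment) encoding are all assumptions that the abstract does not specify.

```python
import numpy as np

def pixel_rays(K, cam_to_world, height, width):
    """Return per-pixel ray origins and unit directions in world space.

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4
    camera-to-world pose (hypothetical conventions for this sketch).
    """
    # Pixel centers: u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)
    # Back-project through the inverse intrinsics to camera-space directions.
    dirs_cam = pix @ np.linalg.inv(K).T
    # Rotate into world space and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Every ray of one camera starts at the camera center.
    origins = np.broadcast_to(t, dirs.shape)
    return origins, dirs

def plucker(origins, dirs):
    """Encode each ray as 6 numbers: direction and moment (origin x direction)."""
    moment = np.cross(origins, dirs)
    return np.concatenate([dirs, moment], axis=-1)  # (H, W, 6)
```

In a pipeline like the one the abstract describes, the pose passed to `pixel_rays` would be the model's own prediction rather than a ground-truth annotation, and the resulting per-pixel ray map is one plausible way to tie a camera to the pixels it observes.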
