TL;DR
E-RayZer introduces a self-supervised 3D vision model that learns geometrically grounded representations directly from unlabeled multi-view images, outperforming prior methods and existing pre-trained models on 3D tasks.
Contribution
E-RayZer is presented as the first model to perform self-supervised 3D reconstruction directly in 3D space with explicit geometry, yielding 3D-aware representations without supervision.
Findings
E-RayZer outperforms RayZer on pose estimation.
It matches or surpasses supervised models such as VGGT.
Its representations outperform leading visual pre-training models on 3D tasks.
Abstract
Self-supervised pre-training has driven rapid progress in foundation models for language, 2D images, and video, yet remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised 3D vision model that learns geometrically grounded representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer, which infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are 3D-aware. To ensure convergence and scalability, we introduce a fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources without any supervision. Experiments show…
