MuM: Multi-View Masked Image Modeling for 3D Vision
David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman

TL;DR
MuM is a scalable multi-view masked autoencoding approach for 3D vision that outperforms DINOv3 and CroCo v2 on downstream tasks such as feedforward reconstruction, dense matching, and pose estimation.
Contribution
The paper extends masked autoencoding to multiple views of the same scene, offering a simpler and more scalable method for 3D visual feature learning.
Findings
Outperforms DINOv3 and CroCo v2 on downstream tasks
Effective for feedforward reconstruction, dense matching, and pose estimation
Simpler and more scalable than previous multi-view methods
Abstract
Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM,…
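To make the masking idea in the abstract concrete, here is a minimal sketch of one way to mask every view of a scene at the same rate. It reads "uniformly masking all views" as drawing an independent random mask with an identical ratio for each view; that reading, the function name `uniform_multiview_mask`, the 196-patch grid, and the 0.75 ratio are all assumptions for illustration and are not taken from the paper.

```python
import random


def uniform_multiview_mask(num_views, num_patches, mask_ratio, seed=0):
    """Return one boolean list per view; True marks a masked patch.

    Hypothetical helper: every view gets the same number of masked
    patches, sampled independently per view.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masks = []
    for _ in range(num_views):
        # Sample the masked patch indices for this view without replacement.
        masked_idx = set(rng.sample(range(num_patches), num_masked))
        masks.append([i in masked_idx for i in range(num_patches)])
    return masks


# Assumed example values: 4 views, a 14x14 patch grid, 75% of patches masked.
masks = uniform_multiview_mask(num_views=4, num_patches=196, mask_ratio=0.75)
```

In a masked-autoencoding setup, only the unmasked patches of each view would be passed to the encoder, and a decoder would then predict the masked ones. Treating all views symmetrically means no view needs a special role as an unmasked reference.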
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Face recognition and analysis
