MuM: Multi-View Masked Image Modeling for 3D Vision

David Nordström; Johan Edstedt; Fredrik Kahl; Georg Bökman

arXiv:2511.17309·cs.CV·November 24, 2025

Open Access

TL;DR

MuM is a scalable multi-view masked autoencoding approach for 3D vision that outperforms DINOv3 and CroCo v2 on downstream tasks including feedforward reconstruction, dense matching, and pose estimation.

Contribution

The paper extends masked autoencoding to multiple views of the same scene, offering a simpler and more scalable method for 3D visual feature learning.

Findings

1. Outperforms DINOv3 and CroCo v2 on downstream tasks
2. Effective for feedforward reconstruction, dense matching, and pose estimation
3. Simpler and more scalable than previous multi-view methods

Abstract

Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance, and the resulting trained models, such as DINOv3, have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion (CroCo), a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM,…
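
To make the phrase "uniformly masking all views" concrete, here is a minimal sketch of how per-view patch masks might be sampled. It reads "uniformly" as every view having the same fraction of patches hidden, which is an interpretation rather than something the abstract spells out. The function name `sample_multiview_masks`, the 196-patch grid, the 0.75 mask ratio, and the view count are all hypothetical values chosen for illustration. The sketch covers only mask sampling, not the encoder or the decoder with inter-frame attention.

```python
import random


def sample_multiview_masks(num_views, num_patches, mask_ratio, seed=0):
    """Return one boolean mask per view; True marks a hidden patch.

    Assumption: every view hides the same number of patches, so no view
    is kept fully visible as a reference.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * num_patches)
    masks = []
    for _ in range(num_views):
        # Draw a fresh random subset of patch indices for each view.
        hidden = set(rng.sample(range(num_patches), num_masked))
        masks.append([p in hidden for p in range(num_patches)])
    return masks


# Illustrative values only: 4 views, a 14x14 patch grid, 75% masking.
masks = sample_multiview_masks(num_views=4, num_patches=196, mask_ratio=0.75)
hidden_per_view = [sum(m) for m in masks]
```

Because each view draws its own random subset, the hidden regions differ between views while the masking rate stays identical. This is one plausible reading of how information from other views could help reconstruct a patch that is hidden in a given view.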

Taxonomy

Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Face recognition and analysis