M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding
Muhammad Abdullah Jamal, Omid Mohareri

TL;DR
M$^{3}$3D introduces a multi-modal masked autoencoder pre-training strategy that leverages 3D priors and cross-modal representations in RGB-D data, improving performance across various 2D and 3D vision tasks.
Contribution
The paper proposes a novel pre-training method combining masked image modeling and contrastive learning to enhance 3D and cross-modal feature representations.
Findings
Outperforms state-of-the-art methods on the ScanNet, NYUv2, UCF-101, and OR-AR datasets.
Achieves +1.3% mIoU improvement on ScanNet semantic segmentation.
Demonstrates superior data efficiency in low-data regimes.
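For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed for semantic segmentation. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the evaluation protocol used in the paper may differ.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for flat lists of per-pixel class labels.

    Assumption: classes that appear in neither pred nor target are skipped
    rather than counted as IoU 0 or 1.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.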
Abstract
We present a new pre-training strategy called M$^{3}$3D (Multi-Modal Masked 3D), built on multi-modal masked autoencoders, that can leverage 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks, Masked Image Modeling (MIM) and contrastive learning, aiming to effectively embed masked 3D priors and modality-complementary features and to enhance the correspondence between modalities. In contrast to recent approaches that either focus on specific downstream tasks or require multi-view correspondence, we show that our pre-training strategy is ubiquitous, enabling improved representation learning that can transfer into improved performance on various downstream tasks such as video action recognition, video action detection, 2D semantic segmentation and depth…
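The abstract describes combining masked image modeling with contrastive learning across the RGB and depth modalities. The sketch below shows one generic way such a joint objective can be written: a reconstruction error computed only on masked patches, plus an InfoNCE term that pulls each RGB embedding toward the depth embedding of the same sample. It is not the authors' implementation. The function names, the `temperature` value, and the `weight` that balances the two terms are all hypothetical, since this summary does not state the paper's exact loss.

```python
import math

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches (each a list of floats).
    mask: list of 0/1 flags; 1 means the patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for p_patch, t_patch, m in zip(pred, target, mask):
        if m:
            total += sum((p - t) ** 2 for p, t in zip(p_patch, t_patch)) / len(p_patch)
            count += 1
    return total / count if count else 0.0

def _normalize(v):
    # L2-normalize so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(rgb_emb, depth_emb, temperature=0.07):
    """InfoNCE from RGB to depth: sample i's RGB embedding should match
    sample i's depth embedding and not those of other samples in the batch."""
    rgb = [_normalize(v) for v in rgb_emb]
    depth = [_normalize(v) for v in depth_emb]
    loss = 0.0
    for i, r in enumerate(rgb):
        logits = [sum(a * b for a, b in zip(r, d)) / temperature for d in depth]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(rgb)

def combined_loss(pred, target, mask, rgb_emb, depth_emb, weight=1.0):
    # Hypothetical weighting: reconstruction term plus weighted contrastive term.
    return masked_mse(pred, target, mask) + weight * info_nce(rgb_emb, depth_emb)
```

In this sketch the contrastive term is small when paired RGB and depth embeddings are aligned and large when they are swapped, while the reconstruction term ignores visible patches entirely.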
Videos
M$^{3}$3D: Learning 3D Priors Using Multi-Modal Masked Autoencoders for 2D Image and Video Understanding · YouTube
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
