M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding

Muhammad Abdullah Jamal; Omid Mohareri

arXiv:2309.15313·cs.CV·September 28, 2023

TL;DR

M$^{3}$3D introduces a multi-modal masked autoencoder pre-training strategy that leverages 3D priors and cross-modal representations in RGB-D data, improving performance across various 2D and 3D vision tasks.

Contribution

The paper proposes a novel pre-training method combining masked image modeling and contrastive learning to enhance 3D and cross-modal feature representations.

Findings

01. Outperforms state-of-the-art on ScanNet, NYUv2, UCF-101, and OR-AR datasets.

02. Achieves +1.3% mIoU improvement on ScanNet semantic segmentation.

03. Demonstrates superior data efficiency in low-data regimes.

Abstract

We present a new pre-training strategy called M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$), built on multi-modal masked autoencoders, that can leverage 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks, Masked Image Modeling (MIM) and contrastive learning, aiming to effectively embed masked 3D priors and modality-complementary features and to enhance the correspondence between modalities. In contrast to recent approaches, which either focus on specific downstream tasks or require multi-view correspondence, we show that our pre-training strategy is ubiquitous, enabling improved representation learning that transfers into improved performance on various downstream tasks such as video action recognition, video action detection, 2D semantic segmentation and depth…
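To make the combination of masked image modeling and contrastive learning concrete, the following is a minimal NumPy sketch of a joint objective on paired RGB and depth features. It is an illustration under stated assumptions, not the paper's implementation: the mean-squared reconstruction error, the symmetric InfoNCE form, the `temperature` value, the `contrastive_weight` parameter, and all function names are hypothetical choices, since this summary does not specify the actual loss terms.

```python
import numpy as np


def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error computed over masked patches only.

    target, prediction: arrays of shape (num_patches, patch_dim).
    mask: boolean array of shape (num_patches,), True where a patch was masked.
    Assumption: an MSE reconstruction target, as in typical masked autoencoders.
    """
    if not mask.any():
        return 0.0  # nothing was masked, so there is nothing to reconstruct
    per_patch = ((prediction - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())


def _diagonal_cross_entropy(logits):
    """Cross-entropy where the correct class for row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def info_nce_loss(rgb_emb, depth_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input forms a positive pair.

    Assumption: a generic cross-modal contrastive term with a hypothetical
    temperature, not necessarily the formulation used in the paper.
    """
    rgb = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    depth = depth_emb / np.linalg.norm(depth_emb, axis=1, keepdims=True)
    logits = rgb @ depth.T / temperature
    loss = 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits.T))
    return float(loss)


def combined_loss(target, prediction, mask, rgb_emb, depth_emb,
                  contrastive_weight=1.0):
    """Weighted sum of both terms; the weight is a hypothetical hyperparameter."""
    reconstruction = masked_reconstruction_loss(target, prediction, mask)
    contrastive = info_nce_loss(rgb_emb, depth_emb)
    return reconstruction + contrastive_weight * contrastive
```

In this sketch the reconstruction term only penalizes errors on masked patches, while the contrastive term pulls together RGB and depth embeddings of the same sample and pushes apart embeddings of different samples.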

Videos

M$^{3}$3D: Learning 3D Priors Using Multi-Modal Masked Autoencoders for 2D Image and Video Understanding (YouTube)

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · 3D Shape Modeling and Analysis