# Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors

**Authors:** Ji Hou, Xiaoliang Dai, Zijian He, Angela Dai, Matthias Nießner

arXiv: 2302.14746 · 2023-03-01

## TL;DR

Mask3D is a self-supervised pre-training method for 2D vision transformers that uses RGB-D data to embed 3D priors into the learned features, improving performance on downstream scene understanding tasks.

## Contribution

The paper presents a simple masking-based pre-training approach that uses individual RGB-D frames to incorporate 3D priors into 2D vision transformers, without requiring 3D reconstructions or multi-view correspondences.

## Key findings

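- Outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks.
- Improves ScanNet image semantic segmentation by +6.5% mIoU over the previous state of the art, Pri3D.
- Improves representation learning for semantic segmentation, instance segmentation, and object detection.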
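The mIoU metric cited above is the per-class intersection-over-union averaged over classes. As a minimal sketch (the function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions, not the paper's evaluation code), it can be computed for a single label map as follows. Benchmark protocols typically accumulate counts over the whole dataset before averaging, which this per-image sketch does not do.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes for one pair of label maps.

    pred, target: integer arrays of identical shape holding class indices.
    Classes that appear in neither array are skipped (an assumption of this sketch).
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; leave it out of the mean
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 example: class 0 has IoU 1/2, class 1 has IoU 2/3.
target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```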

## Abstract

Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets, are trained to perceive the world from 2D images. However, to more effectively understand 3D structural priors in 2D backbones, we propose Mask3D to leverage existing large-scale RGB-D data in a self-supervised pre-training scheme to embed these 3D priors into 2D learned feature representations. In contrast to traditional 3D contrastive learning paradigms requiring 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pretext reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate that Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation, and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU against the state-of-the-art Pri3D on ScanNet image semantic segmentation.
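The masking step of the pretext task described in the abstract can be illustrated with a short NumPy sketch. Everything specific here is an assumption made for illustration: the helper name `mask_patches`, the 16-pixel patch size, the 224x224 frame size, the mask ratios, and the choice to fill masked patches with zeros. The abstract does not state these settings, and the sketch covers only the input masking, not the network or the reconstruction loss.

```python
import numpy as np

def mask_patches(image, patch_size, mask_ratio, rng):
    """Zero out a random subset of non-overlapping square patches.

    image: array of shape (H, W, C) with H and W divisible by patch_size.
    Returns the masked copy and a boolean patch grid (True = masked).
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    num_masked = int(round(mask_ratio * gh * gw))
    # Choose which patches to hide, sampling without replacement.
    flat = np.zeros(gh * gw, dtype=bool)
    flat[rng.choice(gh * gw, size=num_masked, replace=False)] = True
    grid = flat.reshape(gh, gw)
    # Expand the patch grid to pixel resolution.
    pixel_mask = np.repeat(np.repeat(grid, patch_size, axis=0), patch_size, axis=1)
    masked = image.copy()
    masked[pixel_mask] = 0  # zero-fill is an assumption of this sketch
    return masked, grid

# Synthetic RGB-D frame standing in for real data.
rng = np.random.default_rng(0)
rgb = rng.random((224, 224, 3), dtype=np.float32)
depth = rng.random((224, 224, 1), dtype=np.float32)

# RGB and depth are masked independently; the ratios are hypothetical.
masked_rgb, rgb_grid = mask_patches(rgb, 16, 0.5, rng)
masked_depth, depth_grid = mask_patches(depth, 16, 0.75, rng)
```

In a pre-training loop, the masked RGB and depth inputs would be fed to the encoder and a reconstruction objective would supervise the output; that part depends on details in the full paper and is left out here.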

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14746/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14746/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/2302.14746/full.md

---
Source: https://tomesphere.com/paper/2302.14746