A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders
Muhammad Abdullah Jamal, Omid Mohareri

TL;DR
This paper introduces a two-stage progressive pre-training method for image understanding using multi-modal contrastive masked autoencoders, improving performance on RGB-D datasets through contrastive learning, masked autoencoding, and denoising.
Contribution
It presents a novel two-stage pre-training approach combining contrastive learning and masked autoencoding with denoising for RGB-D data, enhancing model robustness and performance.
Findings
Achieves a +1.3% mIoU gain on ScanNet
Effective in low-data regimes for semantic segmentation
Outperforms state-of-the-art methods on multiple datasets
Abstract
In this paper, we propose a new progressive pre-training method for image understanding tasks which leverages RGB-D datasets. The method combines a Multi-Modal Contrastive Masked Autoencoder with denoising techniques. Our proposed approach consists of two stages. In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations. In the second stage, we further pre-train the model using masked autoencoding and the denoising (noise prediction) objective used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches in the input modality using local spatial correlations, while denoising learns high-frequency components of the input data. Moreover, our approach incorporates global distillation in the second stage by leveraging the knowledge acquired in stage one. Our approach is scalable, robust, and suitable for pre-training on RGB-D datasets. Extensive…
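The two stages described in the abstract can be sketched as a set of loss functions. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the cosine-distance distillation term, the temperature value, and the loss weights `w_mae`, `w_noise`, and `w_distill` are all hypothetical choices made here for illustration, and all function names are invented for this sketch.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(rgb_emb, depth_emb, temperature=0.07):
    """Stage one (assumed form): symmetric InfoNCE over a batch.

    Row i of rgb_emb and row i of depth_emb come from the same scene and
    form the positive pair; every other row in the batch is a negative.
    """
    z_r, z_d = l2_normalize(rgb_emb), l2_normalize(depth_emb)
    logits = z_r @ z_d.T / temperature  # (B, B) similarity matrix

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return float(0.5 * (cross_entropy_on_diagonal(logits)
                        + cross_entropy_on_diagonal(logits.T)))

def masked_recon_loss(pred, target, mask):
    """Stage two: mean squared error over masked patches only.

    pred, target: (B, N, D) patch values; mask: (B, N), 1 where hidden.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (B, N)
    return float((per_patch * mask).sum() / max(mask.sum(), 1.0))

def noise_pred_loss(pred_noise, true_noise):
    """Stage two: diffusion-style objective, regress the added noise."""
    return float(np.mean((pred_noise - true_noise) ** 2))

def distill_loss(student_global, teacher_global):
    """Stage two (assumed form): cosine distance between the student's
    global embedding and the frozen stage-one encoder's embedding."""
    s, t = l2_normalize(student_global), l2_normalize(teacher_global)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def stage_two_loss(pred, target, mask, pred_noise, true_noise,
                   student_global, teacher_global,
                   w_mae=1.0, w_noise=1.0, w_distill=1.0):
    """Weighted sum of the three stage-two terms (weights are hypothetical)."""
    return (w_mae * masked_recon_loss(pred, target, mask)
            + w_noise * noise_pred_loss(pred_noise, true_noise)
            + w_distill * distill_loss(student_global, teacher_global))
```

In this sketch, stage one would minimize `info_nce` alone, and stage two would initialize from the stage-one weights and minimize `stage_two_loss`, with the stage-one encoder kept frozen as the distillation teacher.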
Taxonomy
Topics: Human Pose and Action Recognition
Methods: Diffusion · Contrastive Learning
