A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders

Muhammad Abdullah Jamal; Omid Mohareri

arXiv:2408.02245·cs.CV·September 17, 2024

TL;DR

This paper introduces a two-stage progressive pre-training method for image understanding using multi-modal contrastive masked autoencoders, improving performance on RGB-D datasets through contrastive learning, masked autoencoding, and denoising.

Contribution

It presents a novel two-stage pre-training approach combining contrastive learning and masked autoencoding with denoising for RGB-D data, enhancing model robustness and performance.

Findings

01. Achieved +1.3% mIoU improvement on ScanNet

02. Effective in low-data regimes for semantic segmentation

03. Outperforms state-of-the-art methods on multiple datasets

Abstract

In this paper, we propose a new progressive pre-training method for image understanding tasks that leverages RGB-D datasets. The method combines a Multi-Modal Contrastive Masked Autoencoder with denoising techniques and consists of two stages. In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations. In the second stage, we further pre-train the model using masked autoencoding and the denoising (noise prediction) objective used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches in the input modality using local spatial correlations, while denoising learns high-frequency components of the input data. The second stage also incorporates global distillation, leveraging the knowledge acquired in stage one. Our approach is scalable, robust and suitable for pre-training on RGB-D datasets. Extensive…
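The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, the mean-squared-error form of each stage-two term, the loss weights, and all function and parameter names (`info_nce`, `random_patch_mask`, `stage_two_loss`, `w_denoise`, `w_distill`) are assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def info_nce(rgb_emb, depth_emb, temperature=0.07):
    """Stage one (assumed form): symmetric InfoNCE over paired RGB/depth embeddings."""
    # L2-normalise so that dot products are cosine similarities.
    rgb = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    depth = depth_emb / np.linalg.norm(depth_emb, axis=1, keepdims=True)
    logits = rgb @ depth.T / temperature  # (B, B); diagonal holds matching pairs

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the RGB-to-depth and depth-to-RGB directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask that is True for patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def stage_two_loss(patches, recon, noise, noise_pred, mask,
                   student_feat, teacher_feat, w_denoise=1.0, w_distill=1.0):
    """Stage two (assumed form): reconstruction + noise prediction + distillation."""
    # Masked autoencoding: reconstruction error on the masked patches only.
    recon_loss = np.mean((recon[mask] - patches[mask]) ** 2)
    # Denoising: error between the predicted and the injected noise.
    denoise_loss = np.mean((noise_pred - noise) ** 2)
    # Global distillation: match the global feature of the stage-one model.
    distill_loss = np.mean((student_feat - teacher_feat) ** 2)
    return recon_loss + w_denoise * denoise_loss + w_distill * distill_loss
```

In this sketch the stage-one encoder would act as the frozen teacher whose global feature is passed as `teacher_feat`; how the paper actually weights and schedules the three stage-two terms is not stated in the abstract.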

Taxonomy

Topics: Human Pose and Action Recognition

Methods: Diffusion · Contrastive Learning