CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets
Jiange Yang, Sheng Guo, Gangshan Wu, Limin Wang

TL;DR
CoMAE introduces a unified self-supervised pre-training framework for RGB and depth data that enhances scene recognition performance on small datasets by combining contrastive learning and masked autoencoding.
Contribution
This work proposes a novel single-model hybrid pre-training approach for RGB-D data, integrating contrastive learning and masked autoencoding with curriculum learning.
Findings
Effective on SUN RGB-D and NYUDv2 datasets
Data-efficient: performs well with small unlabeled datasets
Competitive with large-scale supervised pre-training
Abstract
Current RGB-D scene recognition approaches often train two standalone backbones for RGB and depth modalities with the same Places or ImageNet pre-training. However, the pre-trained depth network is still biased by RGB-based models, which may result in a suboptimal solution. In this paper, we present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE. CoMAE uses a curriculum learning strategy to unify two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling. Specifically, we first build a patch-level alignment task to pre-train a single encoder shared by the two modalities via cross-modal contrastive learning. Then, the pre-trained contrastive encoder is passed to a multi-modal masked autoencoder to capture finer context features from a generative perspective. In…
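To make the patch-level alignment task in the abstract more concrete, the following is a minimal sketch of a cross-modal contrastive (InfoNCE-style) loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `patch_alignment_loss`, the temperature value, the RGB-to-depth-only direction, and the choice to treat the depth patch at the same position as the positive and all other depth patches as negatives are hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def patch_alignment_loss(rgb_patches, depth_patches, temperature=0.1):
    """InfoNCE-style loss over patch embeddings (RGB -> depth direction only).

    Assumption for this sketch: rgb_patches[i] and depth_patches[i] come from
    the same spatial location, so they form the positive pair; every other
    depth patch serves as a negative for that RGB patch.
    """
    if not rgb_patches or len(rgb_patches) != len(depth_patches):
        raise ValueError("need the same non-zero number of RGB and depth patches")

    rgb = [l2_normalize(p) for p in rgb_patches]
    depth = [l2_normalize(p) for p in depth_patches]

    total = 0.0
    for i, anchor in enumerate(rgb):
        # Cosine similarity to every depth patch, scaled by the temperature.
        logits = [dot(anchor, d) / temperature for d in depth]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the same-position depth patch as the target.
        total += log_denom - logits[i]
    return total / len(rgb)


if __name__ == "__main__":
    rgb = [[1.0, 0.0], [0.0, 1.0]]
    aligned_depth = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_depth = [[0.0, 1.0], [1.0, 0.0]]
    print(patch_alignment_loss(rgb, aligned_depth))
    print(patch_alignment_loss(rgb, shuffled_depth))
```

In this toy setup the loss is small when each RGB patch embedding matches the depth embedding at the same position and large when the depth patches are shuffled, which is the behavior an alignment objective of this kind is meant to reward. In practice the embeddings would come from the shared encoder rather than being hand-written vectors.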
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Vision and Imaging · Image Processing Techniques and Applications
Methods: Contrastive Learning
