RevColV2: Exploring Disentangled Representations in Masked Image Modeling

Qi Han; Yuxuan Cai; Xiangyu Zhang

arXiv:2309.01005 · cs.CV · September 6, 2023 · 6 cites

TL;DR

RevColV2 introduces a reversible autoencoder architecture for masked image modeling that maintains disentangled features during pre-training and fine-tuning, leading to strong performance across various vision tasks.

Contribution

It proposes a novel reversible autoencoder architecture that preserves disentangled representations throughout training and inference, improving downstream task performance.

Findings

01

Achieves 88.4% top-1 accuracy on ImageNet-1K classification.

02

Attains 58.6 mIoU on ADE20K semantic segmentation.

03

Reaches 62.1 box AP on COCO detection.

Abstract

Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, which results in inconsistent representations between pre-training and fine-tuning and can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. This design gives the architecture a useful property: it maintains disentangled low-level and semantic information at the end of the network in MIM pre-training. Our experimental results suggest that a…
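The phrase "reversibly propagated" can be made concrete with a small sketch. The code below shows the generic additive coupling used in reversible networks, where the inputs of a block can be recomputed exactly from its outputs. It is an illustration under stated assumptions, not the RevColV2 column design: the function names `f`, `g`, `forward`, and `inverse` are hypothetical, and the two toy transforms stand in for learned sub-networks.

```python
from typing import List, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sub(a: Vector, b: Vector) -> Vector:
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def f(v: Vector) -> Vector:
    """Hypothetical stand-in for a learned sub-network (not invertible itself)."""
    return [max(0.0, 2.0 * x - 1.0) for x in v]


def g(v: Vector) -> Vector:
    """Second hypothetical stand-in for a learned sub-network."""
    return [0.5 * x * x for x in v]


def forward(x1: Vector, x2: Vector) -> Tuple[Vector, Vector]:
    """Additive coupling: each half is updated using only the other half."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def inverse(y1: Vector, y2: Vector) -> Tuple[Vector, Vector]:
    """Undo the coupling by subtracting the same terms in reverse order."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2


if __name__ == "__main__":
    x1, x2 = [1.0, -2.0], [3.0, 0.5]
    y1, y2 = forward(x1, x2)
    r1, r2 = inverse(y1, y2)
    print("outputs:", y1, y2)
    print("recovered inputs:", r1, r2)
```

Note that `f` and `g` never need to be inverted: the inverse only re-evaluates them and subtracts. This is why reversible designs can discard intermediate activations and recompute them on demand, whatever the sub-networks look like.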

Code & Models

Repositories

megvii-research/revcol (PyTorch, official)

Videos

RevColV2: Exploring Disentangled Representations in Masked Image Modeling · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis

Methods: Mutual Information Machine / Mask Image Modeling