RevColV2: Exploring Disentangled Representations in Masked Image Modeling
Qi Han, Yuxuan Cai, Xiangyu Zhang

TL;DR
RevColV2 introduces a reversible autoencoder architecture for masked image modeling that maintains disentangled features during pre-training and fine-tuning, leading to strong performance across various vision tasks.
Contribution
The paper proposes a reversible autoencoder architecture that is kept whole during both pre-training and fine-tuning, so the disentangled representations learned in pre-training carry over to downstream tasks.
Findings
Achieves 88.4% top-1 accuracy on ImageNet-1K classification.
Attains 58.6 mIoU on ADE20K semantic segmentation.
Reaches 62.1 box AP on COCO detection.
Abstract
Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, which results in inconsistent representations between pre-training and fine-tuning and can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. This design gives our architecture a useful property: it maintains disentangled low-level and semantic information at the end of the network in MIM pre-training. Our experimental results suggest that a…
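The "reversibly propagated" wording in the abstract can be made concrete with a generic additive coupling in the style of reversible residual networks. The abstract does not state RevColV2's actual update rule, so this is only a sketch under that assumption: `f` and `g` are hypothetical stand-ins for learned sub-networks, and the two input lists stand in for two feature streams. It shows that the inputs can be recovered exactly from the outputs, which is what lets a reversible network pass information forward without discarding it.

```python
import math

def f(x):
    # Hypothetical stand-in for a learned sub-network (assumption, not from the paper).
    return [math.tanh(v) for v in x]

def g(x):
    # Second hypothetical sub-network; it does not need to be invertible itself.
    return [0.5 * v * v for v in x]

def forward(x1, x2):
    # Additive coupling: each stream is updated using only the other stream.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order to recover the inputs.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2, 2.0], [1.5, 0.0, -0.7]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Largest reconstruction error across both streams (floating-point noise only).
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print(max_err < 1e-9)
```

Because the inverse only subtracts what the forward pass added, neither `f` nor `g` has to be invertible; that property is what makes this family of couplings convenient for deep reversible architectures.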
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
Methods: Mutual Information Machine/Mask Image Modeling
