ConvMAE: Masked Convolution Meets Masked Autoencoders
Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, Yu Qiao

TL;DR
ConvMAE introduces a multi-scale hybrid convolution-transformer architecture with a novel masking strategy for efficient masked auto-encoding, leading to improved performance on vision tasks.
Contribution
The paper proposes a new ConvMAE framework that combines masked convolution with a block-wise masking strategy to enhance representation learning and efficiency.
Findings
ConvMAE-Base improves ImageNet-1K fine-tuning accuracy by 1.4% over MAE-Base.
ConvMAE-Base outperforms MAE-Base on object detection while being fine-tuned for fewer epochs.
Block-wise masking keeps pretraining computationally efficient, and the encoder's multi-scale features are supervised more directly.
Abstract
Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle these issues, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to more directly supervise the…
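The two ideas named in the abstract, block-wise masking and masked convolution, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a 14x14 coarse token grid, a 75% mask ratio, and finer stages at 2x and 4x the coarse resolution, and it uses a toy box filter in place of a learned convolution. The function names `blockwise_masks` and `masked_box_conv` are hypothetical.

```python
import numpy as np


def blockwise_masks(grid=14, mask_ratio=0.75, scales=(1, 2, 4), seed=0):
    """Sample a visibility mask on the coarsest grid and upsample it.

    Returns {scale: mask}, where mask is 1.0 for visible and 0.0 for masked.
    All default values are illustrative assumptions, not values from the page.
    """
    rng = np.random.default_rng(seed)
    num_tokens = grid * grid
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    keep = rng.permutation(num_tokens)[:num_keep]
    coarse = np.zeros(num_tokens, dtype=np.float32)
    coarse[keep] = 1.0
    coarse = coarse.reshape(grid, grid)
    # Each coarse token expands to an s x s block at the finer resolution,
    # so a token is either fully visible or fully masked at every scale.
    return {s: np.kron(coarse, np.ones((s, s), dtype=np.float32)) for s in scales}


def masked_box_conv(x, mask, k=3):
    """Toy k x k box filter that only reads visible positions.

    Masked inputs are zeroed before filtering and masked outputs are zeroed
    afterwards, so values under the mask cannot leak into visible features.
    """
    xm = x * mask
    pad = k // 2
    xp = np.pad(xm, pad)
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out * mask


masks = blockwise_masks()
x = np.random.default_rng(1).normal(size=(56, 56)).astype(np.float32)
y = masked_box_conv(x, masks[4])
```

Because every mask is derived from the same coarse sample, the set of visible regions is identical across scales, and perturbing `x` only at masked positions leaves `y` unchanged.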
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: Convolution · Masked Convolution
