ConvMAE: Masked Convolution Meets Masked Autoencoders

Peng Gao; Teli Ma; Hongsheng Li; Ziyi Lin; Jifeng Dai; Yu Qiao

arXiv:2205.03892 · cs.CV · May 20, 2022 · 52 cites

TL;DR

ConvMAE introduces a multi-scale hybrid convolution-transformer architecture with a novel masking strategy for efficient masked auto-encoding, leading to improved performance on vision tasks.

Contribution

The paper proposes a new ConvMAE framework that combines masked convolution with a block-wise masking strategy to enhance representation learning and efficiency.
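The block-wise masking idea can be illustrated with a minimal sketch: sample a random mask once on the coarsest token grid, then upsample it so that the finer stages mask exactly the same image regions. This is an illustration under stated assumptions, not the authors' implementation: the 14x14 grid, the 75% mask ratio, the upsampling factors 2 and 4, and the function names `blockwise_mask` and `upsample_mask` are all hypothetical choices made for the example.

```python
import random

def blockwise_mask(grid=14, mask_ratio=0.75, seed=0):
    """Sample a random binary mask on the coarsest token grid (1 = masked).

    grid and mask_ratio are assumed values for illustration only.
    """
    rng = random.Random(seed)
    n = grid * grid
    n_masked = int(n * mask_ratio)
    flat = [1] * n_masked + [0] * (n - n_masked)
    rng.shuffle(flat)
    return [flat[r * grid:(r + 1) * grid] for r in range(grid)]

def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling: each coarse cell becomes a factor x factor block."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

coarse = blockwise_mask()          # 14 x 14 mask at the coarsest stage
mid = upsample_mask(coarse, 2)     # 28 x 28 mask for an intermediate stage
fine = upsample_mask(coarse, 4)    # 56 x 56 mask for the finest stage
```

Because every finer mask is derived from the same coarse mask, a region is either visible at all scales or masked at all scales, so only the visible tokens need to be processed by the later transformer stage.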

Findings

01. ConvMAE-Base improves ImageNet-1K fine-tuning accuracy by 1.4% over MAE-Base.

02. ConvMAE-Base outperforms MAE-Base on object detection while being fine-tuned for fewer epochs.

03. The block-wise masking strategy reduces computational cost, and the encoder's multi-scale features are supervised more directly.

Abstract

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle these issues, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to more directly supervise the…
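The information-leakage point can be made concrete with a small sketch of one way to realize a masked convolution: zero out masked positions before the convolution and re-apply the mask afterwards, so that outputs at visible positions never depend on masked inputs. This is a single-channel, pure-Python illustration under assumptions, not the paper's code: the helper names `conv2d_same` and `masked_conv2d`, the 3x3 all-ones kernel, and the 4x4 toy input are hypothetical.

```python
def conv2d_same(x, kernel):
    """Plain 2D cross-correlation with zero padding (odd square kernel, one channel)."""
    h, w, k = len(x), len(x[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - pad, j + dj - pad
                    if 0 <= r < h and 0 <= c < w:
                        acc += kernel[di][dj] * x[r][c]
            out[i][j] = acc
    return out

def masked_conv2d(x, keep, kernel):
    """Convolve only visible content; keep[i][j] is 1 for visible, 0 for masked."""
    h, w = len(x), len(x[0])
    # Zero masked inputs so they cannot contribute to any output.
    visible = [[x[i][j] * keep[i][j] for j in range(w)] for i in range(h)]
    y = conv2d_same(visible, kernel)
    # Zero masked outputs so masked positions carry no features forward.
    return [[y[i][j] * keep[i][j] for j in range(w)] for i in range(h)]

kernel = [[1.0] * 3 for _ in range(3)]
keep = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
# b differs from a only at masked positions.
b = [[a[i][j] if keep[i][j] else 99.0 for j in range(4)] for i in range(4)]
out_a = masked_conv2d(a, keep, kernel)
out_b = masked_conv2d(b, keep, kernel)
```

Here `out_a` and `out_b` are identical even though the inputs differ at masked positions, whereas an unmasked `conv2d_same` would give different results for `a` and `b` because the kernel overlaps masked neighbours.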

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors

Methods: Convolution · Masked Convolution