Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality
Xiang Li, Wenhai Wang, Lingfeng Yang, Jian Yang

TL;DR
Uniform Masking (UM) makes MAE pre-training workable for Pyramid-based Vision Transformers by sampling patches uniformly across local windows, which lowers pre-training cost while keeping downstream performance competitive.
Contribution
The paper proposes Uniform Masking (UM), which enables effective MAE pre-training for Pyramid-based ViTs with locality and reduces computational cost while preserving downstream task performance.
Findings
UM-MAE speeds up pre-training by ~2x and reduces GPU memory usage.
Pre-trained Swin-Large with UM-MAE outperforms supervised models on ImageNet.
UM-MAE maintains competitive fine-tuning results across various vision tasks.
Abstract
Masked AutoEncoder (MAE) has recently led the trend in visual self-supervision with an elegant asymmetric encoder-decoder design, which significantly improves both pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of the vanilla Vision Transformer (ViT), whose self-attention mechanism can reason over an arbitrary subset of discrete image patches. However, it is still unclear how advanced Pyramid-based ViTs (e.g., PVT, Swin) can be adopted in MAE pre-training, as they commonly introduce operators within "local" windows, making it difficult to handle a random sequence of partial vision tokens. In this paper, we propose Uniform Masking (UM), which successfully enables MAE pre-training for Pyramid-based ViTs with locality (termed "UM-MAE" for short). Specifically, UM includes a Uniform Sampling (US) that…
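The idea of sampling patches uniformly across local windows can be illustrated with a small sketch. The abstract is truncated before it gives the details, so the specifics here are assumptions for illustration only: a window of 2×2 patches, exactly one visible patch kept per window, and the hypothetical helper names `uniform_sample` and `to_compact_grid`. It is not the authors' implementation.

```python
import random


def uniform_sample(grid_h, grid_w, window=2, seed=None):
    """Keep one random patch from every window x window block of a patch grid.

    Returns a sorted list of flat (row-major) indices of the visible patches.
    Assumption: one patch per block; the window size is illustrative.
    """
    if grid_h % window or grid_w % window:
        raise ValueError("grid size must be divisible by the window size")
    rng = random.Random(seed)
    visible = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            # Pick a random offset inside the current block.
            dr = rng.randrange(window)
            dc = rng.randrange(window)
            visible.append((top + dr) * grid_w + (left + dc))
    return sorted(visible)


def to_compact_grid(visible, grid_h, grid_w, window=2):
    """Pack the visible patches into a dense (grid_h/window) x (grid_w/window) grid.

    Each cell holds the flat index of the patch kept from the matching block.
    """
    out_h, out_w = grid_h // window, grid_w // window
    compact = [[None] * out_w for _ in range(out_h)]
    for idx in visible:
        r, c = divmod(idx, grid_w)
        compact[r // window][c // window] = idx
    return compact


if __name__ == "__main__":
    # A 14 x 14 patch grid, e.g. a 224-pixel image cut into 16-pixel patches.
    vis = uniform_sample(14, 14, window=2, seed=0)
    grid = to_compact_grid(vis, 14, 14, window=2)
    print(len(vis), "visible patches in a", len(grid), "x", len(grid[0]), "grid")
```

Because every block contributes exactly one patch in this sketch, the visible patches always fill a smaller dense 2D grid. That regular layout is what an operator working on "local" windows needs, in contrast to the irregular token sequence left behind by fully random masking.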
Taxonomy
Topics: Advanced Neural Network Applications · Image and Signal Denoising Methods · Image Enhancement Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Byte Pair Encoding · Dropout · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Label Smoothing
