Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Xiang Li; Wenhai Wang; Lingfeng Yang; Jian Yang

arXiv:2205.10063 · cs.CV · May 23, 2022 · 36 cites

Open Access · 1 Repo

TL;DR

This paper introduces Uniform Masking, a novel pre-training method for Pyramid-based Vision Transformers that improves efficiency and maintains performance by sampling patches uniformly across local windows.

Contribution

It proposes Uniform Masking, enabling effective MAE pre-training for Pyramid-based ViTs with locality, reducing computational costs while preserving downstream task performance.

Findings

1. UM-MAE speeds up pre-training by ~2x and reduces GPU memory usage.
2. Pre-trained Swin-Large with UM-MAE outperforms supervised models on ImageNet.
3. UM-MAE maintains competitive fine-tuning results across various vision tasks.

Abstract

Masked AutoEncoder (MAE) has recently led the trend in visual self-supervision with an elegant asymmetric encoder-decoder design, which significantly improves both pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of the vanilla Vision Transformer (ViT), whose self-attention mechanism can reason over an arbitrary subset of discrete image patches. However, it is still unclear how advanced Pyramid-based ViTs (e.g., PVT, Swin) can be adopted in MAE pre-training, as they commonly introduce operators within "local" windows, making it difficult to handle a random sequence of partial vision tokens. In this paper, we propose Uniform Masking (UM), which enables MAE pre-training for Pyramid-based ViTs with locality (termed "UM-MAE" for short). Specifically, UM includes a Uniform Sampling (US) that…
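
To make the idea of sampling uniformly across local windows concrete, here is a minimal sketch in plain Python. It is not taken from the paper's code: the function name `uniform_sample_mask`, the 2x2 window, the one-patch-per-window keep rate, and the 14x14 patch grid are all illustrative assumptions.

```python
import random


def uniform_sample_mask(grid_h, grid_w, window=2, keep_per_window=1, seed=None):
    """Return a grid_h x grid_w grid of booleans; True marks a visible patch.

    Hypothetical helper: window size and keep rate are assumptions,
    chosen only to illustrate window-wise uniform sampling.
    """
    if grid_h % window or grid_w % window:
        raise ValueError("grid size must be divisible by the window size")
    rng = random.Random(seed)
    keep = [[False] * grid_w for _ in range(grid_h)]
    # Visit every non-overlapping window and keep the same number of patches in each.
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            cells = [(top + dy, left + dx)
                     for dy in range(window) for dx in range(window)]
            for row, col in rng.sample(cells, keep_per_window):
                keep[row][col] = True
    return keep


mask = uniform_sample_mask(14, 14, seed=0)
visible = sum(map(sum, mask))
print(visible, "of", 14 * 14, "patches kept")  # 49 of 196 patches kept
```

Under these assumptions every window contributes exactly one visible patch, so the visible patches can be packed into a dense 7x7 grid. Purely random masking gives no such guarantee, which is why window-based operators struggle with it.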

Code & Models

Repositories

implus/um-mae (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Image and Signal Denoising Methods · Image Enhancement Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Byte Pair Encoding · Dropout · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Label Smoothing