BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers

Zhiliang Peng; Li Dong; Hangbo Bao; Qixiang Ye; Furu Wei

arXiv:2208.06366 · cs.CV · October 4, 2022 · 114 cites

TL;DR

BEiT v2 introduces a semantic-rich visual tokenizer and a patch aggregation strategy to enhance masked image modeling, significantly improving vision Transformer performance on classification and segmentation tasks.

Contribution

The paper proposes vector-quantized knowledge distillation to train a semantic-rich visual tokenizer, together with a patch aggregation strategy, so that MIM reconstructs semantic-level targets rather than raw pixels.

Findings

01 Base-size BEiT v2 achieves 85.5% top-1 accuracy on ImageNet-1K.

02 Outperforms the compared MIM methods on image classification and semantic segmentation.

03 Large-size BEiT v2 reaches 87.3% top-1 accuracy on ImageNet-1K and 56.7% mIoU on ADE20K semantic segmentation.

Abstract

Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments…
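The two steps the abstract describes, discretizing continuous patch features into codes and then predicting those codes at masked positions, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `quantize` and `masked_token_loss`, the cosine-similarity lookup, the three-entry codebook, and the 2-D features are all assumptions chosen for illustration, and the patch aggregation strategy is not modeled.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def quantize(patch_features, codebook):
    """Map each continuous patch feature to the index of its nearest code.

    Nearest is measured by cosine similarity (an assumption of this sketch).
    """
    codes = [l2_normalize(c) for c in codebook]
    tokens = []
    for feature in patch_features:
        f = l2_normalize(feature)
        sims = [sum(a * b for a, b in zip(f, c)) for c in codes]
        tokens.append(max(range(len(codes)), key=sims.__getitem__))
    return tokens


def masked_token_loss(logits, tokens, mask):
    """Mean cross-entropy between predicted logits and target tokens,
    computed over masked positions only."""
    losses = []
    for row, target, is_masked in zip(logits, tokens, mask):
        if not is_masked:
            continue
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical toy data: 3 codes, 4 patches, 2-D features.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
features = [[0.9, 0.1], [0.2, 2.0], [-3.0, 0.5], [0.5, 0.4]]
tokens = quantize(features, codebook)

mask = [True, False, True, False]    # patches hidden from the encoder
uniform_logits = [[0.0, 0.0, 0.0]] * 4  # an untrained predictor
loss = masked_token_loss(uniform_logits, tokens, mask)
```

With uniform logits over three codes the loss equals ln 3, the value expected from a predictor that has learned nothing; pretraining would drive it lower by making the logits at masked positions favor the tokenizer's codes.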

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Linear Layer · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer · Knowledge Distillation · Masked Image Modeling (MIM)