BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei

TL;DR
BEiT v2 introduces a semantic-rich visual tokenizer and a patch aggregation strategy to enhance masked image modeling, significantly improving vision Transformer performance on classification and segmentation tasks.
Contribution
The paper proposes a novel vector-quantized knowledge distillation method and patch aggregation strategy to elevate MIM from pixel-level to semantic-level understanding.
Findings
Base-size BEiT v2 achieves 85.5% top-1 accuracy on ImageNet-1K fine-tuning.
Outperforms the compared MIM methods on image classification and semantic segmentation.
Large-size BEiT v2 reaches 87.3% top-1 accuracy on ImageNet-1K fine-tuning and 56.7% mIoU on ADE20K semantic segmentation.
Abstract
Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments…
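The two steps the abstract describes, discretizing continuous patch features into codes and then predicting those codes at masked positions, can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-similarity lookup, the tiny 2-D codebook, the uniform logits, and the function names `quantize` and `masked_token_loss` are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def quantize(patch_features, codebook):
    """Map each patch feature to the index of its nearest codebook entry.

    "Nearest" here means highest cosine similarity, computed as a dot
    product after l2-normalizing both sides (an assumption of this sketch).
    """
    codes = [l2_normalize(c) for c in codebook]
    token_ids = []
    for feat in patch_features:
        f = l2_normalize(feat)
        sims = [sum(a * b for a, b in zip(f, c)) for c in codes]
        token_ids.append(max(range(len(codes)), key=sims.__getitem__))
    return token_ids


def masked_token_loss(logits, token_ids, mask):
    """Mean cross-entropy between predicted logits and target token ids,
    averaged over masked positions only."""
    losses = []
    for row, target, is_masked in zip(logits, token_ids, mask):
        if not is_masked:
            continue
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


# Toy data (hypothetical): 3 codebook entries, 4 patches, 2-D features.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
patch_features = [[2.0, 0.1], [0.2, 3.0], [-0.5, 0.1], [0.9, 0.8]]
token_ids = quantize(patch_features, codebook)

# Pretend patches 0 and 2 were masked and the model predicts uniform logits.
mask = [True, False, True, False]
logits = [[0.0, 0.0, 0.0] for _ in patch_features]
loss = masked_token_loss(logits, token_ids, mask)
print(token_ids, round(loss, 4))
```

With uniform logits over three codes, the loss equals log 3, the value an untrained predictor would start from in this toy setting; a real tokenizer would use a learned encoder and a far larger codebook.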
Code & Models
- 🤗 timm/beitv2_base_patch16_224.in1k_ft_in22k · model · 111k dl · ♡ 1
- 🤗 timm/beitv2_base_patch16_224.in1k_ft_in22k_in1k · model · 3.1k dl
- 🤗 timm/beitv2_large_patch16_224.in1k_ft_in22k · model · 506 dl
- 🤗 timm/beitv2_large_patch16_224.in1k_ft_in22k_in1k · model · 1.6k dl · ♡ 2
- 🤗 timm/beitv2_base_patch16_224.in1k_ft_in1k · model · 212 dl · ♡ 1
- 🤗 timm/beitv2_large_patch16_224.in1k_ft_in1k · model · 69 dl
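A minimal sketch of how one of these checkpoints might be loaded, assuming the `timm` and `torch` packages are installed. The helper names `split_timm_name` and `load_beitv2` are hypothetical; `timm.create_model` is the library call used. The part of each name after the dot is read here as the pretrained tag describing the pretraining and fine-tuning data, which is an interpretation of the naming scheme rather than something this page states.

```python
def split_timm_name(model_name):
    """Split 'architecture.pretrained_tag' into its two parts."""
    arch, _, tag = model_name.partition(".")
    return arch, tag


def load_beitv2(model_name, pretrained=True):
    """Create the model through timm and switch it to inference mode.

    With pretrained=True the weights are downloaded on first use.
    """
    import timm  # imported lazily so split_timm_name works without timm

    model = timm.create_model(model_name, pretrained=pretrained)
    return model.eval()


name = "beitv2_base_patch16_224.in1k_ft_in22k_in1k"
arch, tag = split_timm_name(name)
print(arch)
print(tag)
```

Calling `load_beitv2(name)` would then return the classification model; the architecture part of the name suggests 16x16 patches and 224x224 inputs.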
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Linear Layer · Layer Normalization · Residual Connection · Attention Is All You Need · Dense Connections · Multi-Head Attention · Vision Transformer · Knowledge Distillation · Masked Image Modeling
