BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao, Li Dong, Songhao Piao, Furu Wei

TL;DR
BEiT is a self-supervised masked image modeling approach inspired by BERT: a vision Transformer is pretrained to recover discrete visual tokens from images whose patches have been masked, which yields strong results on image classification and semantic segmentation.
Contribution
This paper presents a novel masked image modeling pretraining method for vision Transformers, inspired by BERT, which improves downstream task performance.
Findings
Achieves 83.2% top-1 accuracy on ImageNet-1K with base-size BEiT.
Large-size BEiT, pretrained only on ImageNet-1K, surpasses ViT-L pretrained with supervision on ImageNet-22K.
Pretraining with masked image modeling enhances vision Transformer performance.
Abstract
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show…
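The pre-training objective described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_tokenize` is a hypothetical stand-in for the learned image tokenizer, the 40% masking ratio, 8x8 patch size and 4-entry vocabulary are chosen only for the example, and uniform logits stand in for the output of the backbone Transformer.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch blocks, returned in row-major order as flat pixel lists."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly"
    return [
        [image[top + r][left + c] for r in range(patch) for c in range(patch)]
        for top in range(0, h, patch)
        for left in range(0, w, patch)
    ]


def toy_tokenize(patches, vocab_size):
    """Hypothetical tokenizer: bucket each patch by mean intensity.
    Assumes pixel values in [0, 1). The real model uses a learned tokenizer."""
    return [min(int(sum(p) / len(p) * vocab_size), vocab_size - 1) for p in patches]


def random_mask(num_patches, ratio, rng):
    """Pick the indices of the patches to mask, uniformly at random."""
    return sorted(rng.sample(range(num_patches), int(num_patches * ratio)))


def masked_token_loss(logits, targets, masked):
    """Mean cross-entropy between predicted and true visual tokens,
    computed over the masked positions only."""
    total = 0.0
    for i in masked:
        row = logits[i]
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[targets[i]]
    return total / len(masked)


rng = random.Random(0)
vocab_size = 4
image = [[rng.random() for _ in range(32)] for _ in range(32)]

patches = patchify(image, 8)                 # view 1: image patches
targets = toy_tokenize(patches, vocab_size)  # view 2: visual tokens
masked = random_mask(len(patches), 0.4, rng)

# Corrupted input: masked patches are replaced by a placeholder
# (None here; a real model would substitute a mask embedding).
corrupted = [None if i in masked else p for i, p in enumerate(patches)]

# Stand-in for the Transformer output: uniform logits per patch position.
logits = [[0.0] * vocab_size for _ in corrupted]
loss = masked_token_loss(logits, targets, masked)
```

With uniform logits the loss equals the log of the vocabulary size, which is the baseline an actual encoder would be trained to beat by predicting the tokens of the masked patches from the visible ones.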
Code & Models
- 🤗 YaminiDevikaKanakam/Facial-Emotion-Detection-FER-RAFDB-AffectNet-BEIT-Large · model · 14 dl · ♡ 1
- 🤗 microsoft/beit-base-finetuned-ade-640-640 · model · 3.4k dl · ♡ 13
- 🤗 microsoft/beit-base-patch16-224-pt22k-ft22k · model · 485k dl · ♡ 81
- 🤗 microsoft/beit-base-patch16-224-pt22k · model · 2.8k dl · ♡ 3
- 🤗 microsoft/beit-base-patch16-224 · model · 26k dl · ♡ 9
- 🤗 microsoft/beit-base-patch16-384 · model · 386 dl · ♡ 5
- 🤗 microsoft/beit-large-finetuned-ade-640-640 · model · 4.6k dl · ♡ 15
- 🤗 microsoft/beit-large-patch16-224-pt22k-ft22k · model · 711 dl · ♡ 6
- 🤗 microsoft/beit-large-patch16-224-pt22k · model · 505 dl · ♡ 3
- 🤗 microsoft/beit-large-patch16-224 · model · 461 dl · ♡ 2
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Feedforward Network · Byte Pair Encoding · Adam · Label Smoothing
