BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao; Li Dong; Songhao Piao; Furu Wei

arXiv:2106.08254 · cs.CV · September 7, 2022 · 924 citations

TL;DR

BEiT introduces a self-supervised masked image modeling approach inspired by BERT: a vision Transformer is pretrained to recover the visual tokens of the original image from corrupted image patches, then fine-tuned for image classification and semantic segmentation.

Contribution

This paper presents a novel masked image modeling pretraining method for vision Transformers, inspired by BERT, which improves downstream task performance.

Findings

01 Achieves 83.2% top-1 accuracy on ImageNet-1K with base-size BEiT.

02 Large-size BEiT, pretrained using only ImageNet-1K, surpasses ViT-L with supervised pretraining on ImageNet-22K.

03 Pretraining with masked image modeling enhances vision Transformer performance.

Abstract

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show…
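The pre-training steps described in the abstract (patchify, tokenize, mask, predict the original tokens) can be sketched roughly as below. This is a toy illustration, not the authors' implementation: `toy_tokenize` is a hypothetical stand-in for the learned image tokenizer, the 0.4 mask ratio and the vocabulary size of 8 are assumed values, masking is uniform over patches for simplicity, and the backbone Transformer is replaced by constant logits so the block runs with the standard library alone.

```python
import math
import random

PATCH = 16  # patch side in pixels, following the abstract's 16x16 example
VOCAB = 8   # assumed toy vocabulary size for the visual tokens


def patchify(image, patch=PATCH):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def toy_tokenize(patches, vocab_size=VOCAB):
    """Hypothetical tokenizer: bucket each patch by its mean intensity.
    A placeholder for a learned tokenizer, used only to get discrete ids."""
    tokens = []
    for p in patches:
        mean = sum(p) / len(p)  # pixel values are assumed to lie in [0, 1)
        tokens.append(min(int(mean * vocab_size), vocab_size - 1))
    return tokens


def random_mask(num_patches, ratio, rng):
    """Choose patch positions to mask uniformly at random (a simplification)."""
    k = max(1, int(num_patches * ratio))
    return sorted(rng.sample(range(num_patches), k))


def masked_token_loss(logits, tokens, masked):
    """Mean cross-entropy between predictions and original tokens,
    computed at masked positions only."""
    total = 0.0
    for i in masked:
        row = logits[i]
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[tokens[i]]
    return total / len(masked)


rng = random.Random(0)
image = [[rng.random() for _ in range(64)] for _ in range(64)]  # 64x64 toy image
patches = patchify(image)        # 16 patches of 256 pixels each
tokens = toy_tokenize(patches)   # targets come from the uncorrupted image
masked = random_mask(len(patches), ratio=0.4, rng=rng)

# Corrupt the input: masked patches are replaced by a placeholder of zeros.
corrupted = [[0.0] * len(p) if i in masked else p for i, p in enumerate(patches)]

# Stand-in for the backbone Transformer: uniform logits over the vocabulary.
logits = [[0.0] * VOCAB for _ in corrupted]
loss = masked_token_loss(logits, tokens, masked)
```

With uniform logits the loss equals the log of the vocabulary size, which is the baseline an untrained predictor would start from; a real encoder would be trained to push this value down on the masked positions.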

Code & Models

Datasets

HichTala/tw-2008-ygo-dataset
dataset · 9 downloads

Videos

BEiT: BERT Pre-Training of Image Transformers · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Feedforward Network · Byte Pair Encoding · Adam · Label Smoothing