iBOT: Image BERT Pre-Training with Online Tokenizer

Jinghao Zhou; Chen Wei; Huiyu Wang; Wei Shen; Cihang Xie; Alan Yuille; Tao Kong

arXiv:2111.07832 · cs.CV · January 28, 2022 · 209 citations

Open Access · 2 Repos · 1 Model

TL;DR

iBOT introduces a self-supervised masked image modeling framework using an online learnable tokenizer, achieving state-of-the-art results in image classification and dense vision tasks without pre-training a separate tokenizer.

Contribution

The paper proposes a novel self-distillation approach with an online tokenizer for masked image modeling, eliminating the need for pre-training the tokenizer separately.

Findings

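01 Achieved 82.3% linear probing accuracy on ImageNet-1K.

02 Achieved 87.8% fine-tuning accuracy on ImageNet-1K.

03 Demonstrated robustness against common corruptions and leading results on dense downstream tasks such as object detection, instance segmentation, and semantic segmentation.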

Abstract

The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8%…
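
The objective described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the teacher sees the unmasked image and its per-patch output distribution serves as the "online token" target, while the student sees the masked image and is trained to match that distribution at the masked positions only. The same cross-entropy can be applied to the class token across views. All function names, the temperatures `tau_t` and `tau_s`, and the `momentum` value are hypothetical choices for illustration, and the exponential-moving-average teacher update is an assumption carried over from common self-distillation setups rather than something stated in the text above.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """H(p_t, p_s) = -sum_k p_t[k] * log p_s[k]; eps guards against log(0)."""
    return -sum(
        pt * math.log(max(ps, eps))
        for pt, ps in zip(teacher_probs, student_probs)
    )


def masked_patch_loss(teacher_logits, student_logits, mask,
                      tau_t=0.04, tau_s=0.1):
    """Average distillation loss over masked patch positions.

    teacher_logits: per-patch logits from the teacher on the unmasked image
                    (the teacher plays the role of the online tokenizer).
    student_logits: per-patch logits from the student on the masked image.
    mask:           one bool per patch, True where the student input was masked.
    tau_t, tau_s:   illustrative temperatures (assumed values).
    """
    losses = [
        cross_entropy(softmax(t, tau_t), softmax(s, tau_s))
        for t, s, is_masked in zip(teacher_logits, student_logits, mask)
        if is_masked
    ]
    return sum(losses) / len(losses) if losses else 0.0


def ema_update(teacher_params, student_params, momentum=0.996):
    """Assumed teacher update: exponential moving average of student weights."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_params, student_params)
    ]


if __name__ == "__main__":
    # Two patches, two-way "vocabulary"; only the first patch is masked.
    teacher = [[2.0, 0.0], [0.0, 2.0]]
    student = [[2.0, 0.0], [2.0, 0.0]]
    print(masked_patch_loss(teacher, student, mask=[True, False]))
```

Because unmasked positions are skipped, the student's disagreement with the teacher on the second patch does not contribute to the loss in this example. The tokenizer is "online" in the sense that the targets come from a teacher that changes during training, so no separately pre-trained tokenizer is needed.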

Code & Models

Models

Hugging Face: birder-project/rdnet_t_ibot-bioscan5m (model, 97 downloads, 1 like)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques