iBOT: Image BERT Pre-Training with Online Tokenizer
Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille,, Tao Kong

TL;DR
iBOT introduces a self-supervised masked image modeling framework using an online learnable tokenizer, achieving state-of-the-art results in image classification and dense vision tasks without pre-training a separate tokenizer.
Contribution
The paper proposes a novel self-distillation approach with an online tokenizer for masked image modeling, eliminating the need for pre-training the tokenizer separately.
Findings
Achieved 82.3% linear probing accuracy on ImageNet-1K.
Achieved 87.8% fine-tuning accuracy on ImageNet-1K.
Demonstrated robustness and superior performance on dense vision tasks.
Abstract
The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8%…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
