TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu

TL;DR
This paper investigates distillation techniques to effectively transfer knowledge from large Masked Image Modeling pre-trained vision Transformers to smaller models, significantly improving their fine-tuning accuracy and establishing new benchmarks.
Contribution
The paper systematically studies distillation strategies for MIM pre-trained models and identifies which ones most improve small vision Transformers.
Findings
Distilling token relations outperforms CLS-token and feature-based distillation.
When the student's depth differs from the teacher's, an intermediate teacher layer is a more effective target than the last layer.
Weak regularization yields better results.
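To make the first finding concrete, here is a minimal sketch of one way a token relation loss could be written. It is not the authors' implementation: the choice of a softmax over scaled dot products as the "relation", the KL divergence as the loss, and the names token_relations and relation_distillation_loss are all assumptions made for illustration. The point it shows is that a relation matrix has shape N x N for N tokens, so teacher and student can be compared even when their feature dimensions differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_relations(tokens):
    """Return an N x N matrix of token-to-token relations.

    Assumption: a relation is the row-wise softmax of scaled dot
    products between token features (an attention-style similarity).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    return [
        softmax([scale * sum(a * b for a, b in zip(ti, tj)) for tj in tokens])
        for ti in tokens
    ]


def relation_distillation_loss(teacher_tokens, student_tokens, eps=1e-12):
    """Mean per-token KL(teacher relations || student relations).

    Hypothetical loss for illustration; only the number of tokens must
    match, not the feature dimension.
    """
    t_rel = token_relations(teacher_tokens)
    s_rel = token_relations(student_tokens)
    kl = 0.0
    for t_row, s_row in zip(t_rel, s_rel):
        kl += sum(
            t * (math.log(t + eps) - math.log(s + eps))
            for t, s in zip(t_row, s_row)
        )
    return kl / len(t_rel)


# Toy data: 3 tokens each; teacher features have dim 4, student dim 2.
teacher = [[1.0, 0.0, 2.0, 1.0], [0.5, 1.5, 0.0, 1.0], [2.0, 1.0, 1.0, 0.0]]
student = [[1.0, 0.5], [0.0, 1.0], [1.5, 0.5]]

loss = relation_distillation_loss(teacher, student)
self_loss = relation_distillation_loss(teacher, teacher)
print(f"teacher vs student: {loss:.4f}")
print(f"teacher vs itself:  {self_loss:.4f}")
```

A feature-based loss would instead need a projection layer to align the student's feature dimension with the teacher's, which this relation-based form avoids. In a real training setup the token features would come from a chosen Transformer layer of each network rather than from hand-written lists.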
Abstract
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot benefit, or benefit only marginally, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant…
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Dropout
