TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models

Sucheng Ren; Fangyun Wei; Zheng Zhang; Han Hu

arXiv:2301.01296 · cs.CV · January 4, 2023 · 1 citation

TL;DR

This paper investigates distillation techniques to effectively transfer knowledge from large Masked Image Modeling pre-trained vision Transformers to smaller models, significantly improving their fine-tuning accuracy and establishing new benchmarks.

Contribution

It systematically studies distillation strategies for MIM pre-trained models and identifies which choices most improve small vision Transformers.

Findings

1. Distilling token relations outperforms CLS token- and feature-based distillation.

2. An intermediate teacher layer is a more effective target than the last layer when the student's depth differs from the teacher's.

3. Weak regularization yields better results.

Abstract

Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models, which are critical for real-world applications, benefit only marginally or not at all from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant…
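To make the first finding concrete, here is a minimal sketch of what a token-relation distillation loss could look like. The abstract does not say how relations are computed, so the sketch assumes softmax-normalized scaled dot products between a layer's token features and a KL-divergence loss; the function names (`token_relations`, `relation_kl`) and the toy inputs are hypothetical and not taken from the paper or its code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def token_relations(a, b):
    """Return an N x N relation matrix: row-wise softmax of scaled dot
    products between token features a and b (assumed relation form)."""
    scale = 1.0 / math.sqrt(len(a[0]))
    return [
        softmax([scale * sum(x * y for x, y in zip(ta, tb)) for tb in b])
        for ta in a
    ]

def relation_kl(teacher_rel, student_rel, eps=1e-12):
    """Mean KL(teacher || student) over the rows of two relation matrices."""
    total = 0.0
    for t_row, s_row in zip(teacher_rel, student_rel):
        total += sum(
            t * (math.log(t + eps) - math.log(s + eps))
            for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_rel)

# Toy example: 3 tokens; the teacher is wider (dim 4) than the student (dim 2).
teacher_tokens = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [0.5, 0.5, 0.0, 0.0]]
student_tokens = [[0.2, -0.1], [0.0, 0.3], [0.1, 0.1]]

teacher_rel = token_relations(teacher_tokens, teacher_tokens)
student_rel = token_relations(student_tokens, student_tokens)
loss = relation_kl(teacher_rel, student_rel)
```

Because each relation matrix has one row and one column per token, its shape does not depend on feature width. Under these assumptions, a narrow student can be compared with a wide teacher directly, whereas feature-based distillation would need a projection to align dimensions.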

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Masked autoencoder · Linear Layer · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Dropout