# Generic-to-Specific Distillation of Masked Autoencoders

**Authors:** Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao, Qixiang Ye

arXiv: 2302.14771 · 2023-03-01

## TL;DR

This paper introduces generic-to-specific distillation (G2SD), a two-stage method that enhances small vision Transformer models by transferring both task-agnostic and task-specific knowledge from large pre-trained models, improving performance across tasks.

## Contribution

The paper proposes G2SD, a novel two-stage distillation framework that effectively transfers comprehensive knowledge from large masked autoencoder pre-trained models to small ViT models.

## Key findings

- Small ViT models achieve over 98% of large model performance in classification.
- G2SD improves small ViT performance in object detection and segmentation.
- Code will be publicly available for reproducibility.

## Abstract

Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms achieved unprecedented progress. Lightweight ViT models limited by the model capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, the conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD), to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, decoder of the small model is encouraged to align feature predictions with hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are constrained to be consistent with those of the large model, to transfer task-specific features which guarantee task performance. With G2SD, the vanilla ViT-Small model respectively achieves 98.7%, 98.1% and 99.3% the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.14771/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2302.14771/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/2302.14771/full.md

---
Source: https://tomesphere.com/paper/2302.14771