Generic-to-Specific Distillation of Masked Autoencoders

Wei Huang; Zhiliang Peng; Li Dong; Furu Wei; Jianbin Jiao; Qixiang Ye

arXiv:2302.14771·cs.CV·March 1, 2023


TL;DR

This paper introduces generic-to-specific distillation (G2SD), a two-stage method that enhances small vision Transformer models by transferring both task-agnostic and task-specific knowledge from large pre-trained models, improving performance across tasks.

Contribution

The paper proposes G2SD, a novel two-stage distillation framework that effectively transfers comprehensive knowledge from large masked autoencoder pre-trained models to small ViT models.

Findings

01. Small ViT models achieve over 98% of large model performance in classification.

02. G2SD improves small ViT performance in object detection and segmentation.

03. Code will be publicly available for reproducibility.


Tables (12)

Table 1: Top-1 accuracy on ImageNet-1k.

Table 2: Object detection and instance segmentation results on the MS COCO dataset.

Table 3: ADE20K validation results using UperNet [51]. The input image resolution is $512\times 512$.

Method	#Param(M)	mIoU
ViT-Adapter-Ti [9]	36.1	42.6
Swin-T [28]	59.9	44.5
ConvNeXt-T [29]	60	46.0
ViT-Adapter-S [9]	57.6	46.6
DINO-S [5]	42.0	44.0
iBOT-S [59]	42.0	45.4
G2SD-Ti (ours)	11.0	44.5
G2SD-S (ours)	42.0	48.0

Table 4: Ablation study on single-stage and two-stage distillation methods, where G2SD w/o S.D denotes only performing generic distillation (i.e., without specific distillation) and MAE with specific distillation means performing task-specific distillation during the fine-tuning phase of MAE [13].

Table 5: Ablation study on generic distillation targets. ${\bm{e}}^{t}_{i}$, $\hat{\bm{z}}^{t}_{i}$ and ${\mathcal{R}}({\bm{e}}^{t}_{i})$ respectively denote teacher encoder features, teacher decoder features, and the relation among teacher encoder features. #5 is the default setting.

Target	${\bm{e}}^{t}_{i}$, $i\in{\mathcal{V}}$	${\mathcal{R}}({\bm{e}}^{t}_{i})$, $i\in{\mathcal{V}}$	$\hat{\bm{z}}^{t}_{i}$, $i\in{\mathcal{V}}$	$\hat{\bm{z}}^{t}_{i}$, $i\in{\mathcal{M}}$	Accuracy (%)	mIoU (%)
#1	✓				81.60	43.69
#2		✓			81.45	43.64
#3				✓	81.96	45.20
#4	✓			✓	81.85	44.12
#5			✓	✓	81.99	46.19

Table 6: Ablation on the mask ratio (top) and target layer of the teacher model used for distillation (bottom).

Mask ratio	0.05	0.25	0.55	0.75	0.9
Top-1 Acc(%)	81.7	81.7	81.6	82.0	81.8
Layer Index	1	2	4	6	8
Top-1 Acc(%)	81.6	81.8	82.0	81.8	81.7

Table 7: Ablation study on the width and depth (D) of the student decoder. The depth and width of the teacher’s decoder are 8 and 512, respectively.

Width	Acc(%) at D=2	Acc(%) at D=4	Acc(%) at D=8
128	81.9	81.8	81.7
256	81.7	82.0	81.7
512	81.8	81.7	80.3

Table 8: Robustness evaluation. “IN” is short for ImageNet.

Table 9: Hyperparameters for distilling on ImageNet-1K.

Hyperparameters	Value (Fine-tuning)	Value (From scratch)
Training epochs	200	500
Base learning rate	1e-3	2.5e-4
Layer decay	0.75	1.0
Warm up epochs	5
Label smoothing	0.1
Mixup	0.8
Cutmix	1.0
Drop path	0.0
Batch size	1024
Weight decay	0.05
Optimizer	AdamW
Learning rate schedule	Cosine decay
Augmentation	RandAug(0,0.5)
Optimizer momentum	$β_{1}$ , $β_{2}$ = 0.9, 0.999

Table 10: G2SD vs. DeiT. The total number of training epochs is 500.

Table 11: Performance on MS COCO using the ViTDet framework [25], which is trained for 100 epochs with single-scale input ($1024\times 1024$).

Table 12: Ablation study of distillation targets on ImageNet-1k. ‘S.D’ is short for specific distillation.

Distillation targets	W/O S.D Acc (%)	W S.D Acc (%)
Our default settings	82.0	82.5
MAE’s reconstructions	81.4	81.8
MAE’s reconstructions + GT	81.5	81.7

Equations

$${\bm{h}}_{i}={\bm{e}}_{[\text{M}]}\odot\delta(i\in{\mathcal{M}})+{\bm{e}}_{i}\odot(1-\delta(i\in{\mathcal{M}})),\tag{1}$$

$${\mathcal{L}}_{\text{MAE}}=\sum_{i\in{\mathcal{M}}}\|\text{LN}({\bm{x}}^{p}_{i})-{\bm{z}}_{i}\|_{2},\tag{2}$$

$${\mathcal{L}}_{\rm{GD}}=\sum_{i\in\{{\mathcal{V}}\cup{\mathcal{M}}\}}\text{Smooth-}\ell_{1}(\text{LN}(\hat{\bm{z}}^{t}_{i})-{\bm{z}}^{s}_{i}),\tag{3}$$

$${\mathcal{L}}_{\rm{SD}}={\mathcal{L}}_{\text{Task}}(f^{s}({\bm{x}}),Y)+\beta{\mathcal{L}}_{\text{KD}}(f^{s}({\bm{x}}),f^{t}({\bm{x}})),\tag{4}$$

$$\underset{\theta^{s}}{\arg\max}\;\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid\hat{X}),\tag{5}$$

$$\underset{\theta^{s}}{\arg\max}\;\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid X)+\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid\hat{X})-\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid(X,\hat{X})),\tag{6}$$


Code & Models

Repositories

pengzhiliang/g2sd
PyTorch · Official


Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: ALIGN · Knowledge Distillation

Full text

Generic-to-Specific Distillation of Masked Autoencoders

Wei Huang1,*, Zhiliang Peng1,§,*, Li Dong2, Furu Wei2, Jianbin Jiao1,†, Qixiang Ye1,†

University of Chinese Academy of Sciences1

Microsoft Research2

* Equal contribution. § Contribution during internship at Microsoft Research. † Corresponding authors.

Abstract

Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms have achieved unprecedented progress. Lightweight ViT models, limited by their model capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, the decoder of the small model is encouraged to align its feature predictions with hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are constrained to be consistent with those of the large model, to transfer task-specific features which guarantee task performance. With G2SD, the vanilla ViT-Small model respectively achieves 98.7%, 98.1% and 99.3% of the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.

1 Introduction

Vision transformers (ViTs) [11, 55] have been promising representation models, particularly when trained upon large-scale datasets using self-supervised learning methods [5]. The masked image modeling (MIM) methods [13, 4], which train representation models by reconstructing pixels [13, 52, 57], tokens [4, 35, 8] or features [47, 2], promoted the performance of large ViT models to a new height.

However, while large ViT models deliver promising performance, we notice that small ViT models, e.g., ViT-Tiny and ViT-Small, unfortunately benefit little from either big training data or self-supervised learning methods. For example, the ViT-Large model trained by MAE [13] outperforms the CNN model [29] by 1.6 points on ImageNet-1k, while the ViT-Small model is inferior to its CNN counterpart [29]. In most scenarios with limited computational resources, e.g., front-end recognition systems, CNNs [15, 19] remain the preferred models.

Do vanilla small ViT models really have no future? We attempt to answer this question from the perspective of knowledge distillation. To this end, the first step is to revisit the conventional knowledge distillation methods [18, 38, 41] from the age of supervised learning. It is observed that task-oriented distillation [41] reports unsatisfactory performance (Fig. 1). One reason could be that this kind of task-oriented distillation only focuses on task-specific knowledge while missing the task-agnostic knowledge that benefits generalization and can be effectively provided by a self-supervised teacher model. In natural language processing, two-stage distillation methods, e.g., TinyBERT [22], were exploited to overcome this limitation and transfer the generic knowledge embedded in teacher models to student models. Nevertheless, whether this paradigm is applicable to vision tasks remains unexplored.

In this study, we aim to establish a generic-to-specific distillation baseline for vision tasks based on sophisticated self-supervised learning (e.g., MAE [13]), so that lightweight ViTs can absorb both task-agnostic and task-specific representations from teacher models for greater generalization and higher task performance (Fig. 1). Specifically, at the generic distillation stage, a student model is encouraged to obtain task-agnostic knowledge from the teacher model. The encoder and decoder of the pre-trained MAE constitute the teacher model, while a lightweight decoder is attached to the lightweight vision Transformer to form the student model (Fig. 2). The input image is randomly partitioned into visible and masked patches. The visible patches are fed to the encoders. The hidden features output by an intermediate layer of the teacher decoder are used to guide the training of the student model. For task-specific distillation, the fine-tuned MAE model equipped with task layers [41, 14, 51] teaches the student model task-specific knowledge (e.g., classification scores). The student backbone is initialized from the previous distillation stage while the task layers are randomly initialized. Predictions of the student are constrained to be consistent with those of the teacher as well as with ground-truth labels. Such a task-specific distillation phase guarantees the performance on downstream tasks, e.g., image classification, object detection and semantic segmentation.

With G2SD, the vanilla ViT-Small model, with 26% of the parameters and 2.6$\times$ the throughput of the ViT-Base teacher, obtains 1) 98.7% (82.5% vs. 83.6%) of its teacher’s top-1 accuracy on ImageNet-1k [39] for image classification, 2) 98.1% (50.6 vs. 51.6) of its teacher’s mAP on MS COCO [26] for object detection, and 3) 99.3% (48.0 vs. 48.3) of its teacher’s mIoU on ADE20k [58] for semantic segmentation. Furthermore, G2SD demonstrates better generalization ability than its single-stage distillation counterparts in terms of occlusion invariance and robustness.
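The student-to-teacher ratios quoted above can be re-derived from the raw scores. The short sketch below only repeats that arithmetic; the dictionary keys and the helper name `relative_performance` are illustrative and not taken from the paper or its code.

```python
# Student (G2SD ViT-Small) and teacher (ViT-Base) scores quoted in the text.
scores = {
    "ImageNet-1k top-1": (82.5, 83.6),
    "MS COCO box AP": (50.6, 51.6),
    "ADE20k mIoU": (48.0, 48.3),
}

def relative_performance(student, teacher):
    """Student score expressed as a percentage of the teacher score."""
    return 100.0 * student / teacher

ratios = {task: relative_performance(s, t) for task, (s, t) in scores.items()}
for task, ratio in ratios.items():
    print(f"{task}: {ratio:.2f}% of the teacher")
```

The computed ratios agree with the quoted percentages to within 0.1 point; the small differences come down to how the last digit is rounded.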

The contributions are summarized as follows:

• We propose generic-to-specific distillation (G2SD) to transfer task-agnostic and task-specific knowledge from masked autoencoders to lightweight ViTs, setting a solid baseline for two-stage vision model distillation.

• We design a simple yet effective generic distillation strategy by aligning the student’s predictions with hidden features of the pre-trained masked autoencoder at visible and masked patches.

• Experiments show that the lightweight student model with G2SD achieves competitive results across vision tasks, improving the performance of lightweight ViT models to a new height.

2 Related Work

Vision Transformers.

ViTs [11] have achieved impressive performance across vision tasks [55, 36, 4, 13, 35, 56, 25]. Furthermore, ViTs have demonstrated superior robustness and generalization [32, 13, 35] compared to their CNN counterparts. However, due to the lack of inductive bias, ViTs report unsatisfactory performance in the limited model capacity regime [11, 41]. One solution is to explicitly introduce convolutional operators into ViTs [31, 50] to make them more competitive with lightweight CNNs [19]. The other is to use large models as teachers that transfer inductive bias to ViTs through knowledge distillation [41, 7, 50]. This study focuses on the latter.

Self-supervised Learning.

To exploit big data without high-quality labels, self-supervised learning has been the preferred paradigm to construct representation models [5]. Masked language modeling [10] achieved great success in the natural language processing (NLP) field. Inspired by it, BEiT [4] introduced the mask-then-predict paradigm to the computer vision field and exploited the great potential of masked image modeling (MIM) on various tasks. BEiT v2 [35] constructed a semantic-rich visual tokenizer in order to obtain better targets. MAE [13] set a new baseline for MIM by reconstructing pixels at masked patches with a decoupled encoder-decoder architecture. Meanwhile, feature masking and reconstruction methods [47, 2] demonstrated advantages over the pixel-reconstruction approach. Those methods, however, while exploring the performance upper bound by finding better supervision to pre-train large ViTs, ignored the adaptability of lightweight models with limited capacity. In this study, G2SD develops a two-stage knowledge distillation baseline for lightweight ViT models to enjoy the advantages of MIM.

Knowledge Distillation.

The pioneering work [18] compressed the “dark knowledge” from a large (teacher) model to a small (student) model by minimizing the KL divergence between the output logit distributions of the two models. NKD [53] rethought the relation between the knowledge distillation loss and the original cross-entropy and proposed a new KD loss. FitNet [38] pioneered feature distillation by utilizing the intermediate layers’ features from the teacher model. To find better feature layers for distillation, subsequent works [6, 20] studied connection paths across multiple intermediate layers between teacher and student networks. Besides distilling the knowledge contained in samples, inter-sample relations, as structural information, were transferred to student models [43, 33, 34]. Knowledge distillation has also been elaborately studied for ViTs [41, 54, 7, 21, 27]. SSTA [49] simultaneously learned from a supervised teacher and a self-supervised teacher, the latter regarded as a teaching assistant. However, those methods are designed and evaluated on specific tasks, such as classification, detection [12] or segmentation [40]. Such task-oriented methods experience difficulty in transferring task-agnostic knowledge, which is crucial to guarantee the generalization ability of lightweight models. To overcome this limitation, TinyBERT [22] pioneered two-stage knowledge distillation in natural language processing. Nevertheless, the problem remains to be explored in vision tasks. In this study, we focus on excavating the task-agnostic knowledge embedded in masked autoencoders to establish a solid baseline for vision model distillation in the era of self-supervised learning.

3 Preliminary

Transformer Representations.

To learn visual representations, ViT [11] converts each image to a sequence of ‘words’ (vectors) by partitioning it into a patch grid. Specifically, the input image ${\bm{x}}\in\mathbb{R}^{H\times W\times C}$ is divided into $N=(H\times W)/P^{2}$ non-overlapping patches $\{{\bm{x}}^{p}_{i}\}_{i=1}^{N}$, where $H$, $W$, $C$ and $P$ respectively denote the image height, width, number of channels and patch stride, and ${\bm{x}}^{p}_{i}\in\mathbb{R}^{P^{2}C}$. In this study, a $224\times 224\times 3$ image is reshaped to a $14\times 14$ grid of image patches, each of size $16\times 16\times 3$. Meanwhile, positional information is embedded into the patches (the positional embeddings are omitted for simplicity). By passing the vectors through stacked Transformer blocks, each of which consists of a multi-head self-attention [44] layer and a fully connected feed-forward network, the input vectors are converted to image representations.
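As a quick check of the patch arithmetic above, the following sketch computes the grid size, the number of patches and the flattened patch dimension for the image and patch sizes used in this study. The function name `patch_grid` is illustrative only.

```python
def patch_grid(height, width, channels, patch):
    """Return (grid_h, grid_w, num_patches, patch_dim) for non-overlapping patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    grid_h, grid_w = height // patch, width // patch
    return grid_h, grid_w, grid_h * grid_w, patch * patch * channels

grid_h, grid_w, num_patches, patch_dim = patch_grid(224, 224, 3, 16)
print(grid_h, grid_w, num_patches, patch_dim)  # 14 14 196 768
```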

Masked Autoencoders.

The MAE model contains an encoder $f_{e}$ and a decoder $f_{d}$ , where both $f_{e}$ and $f_{d}$ are stacked Transformer blocks. The input tokens $\{{\bm{x}}^{p}_{i}\}_{i=1}^{N}$ are grouped to the visible token set $\{{\bm{x}}^{p}_{i}\}_{i\in{\mathcal{V}}}$ and the masked token set $\{{\bm{x}}^{p}_{i}\}_{i\in{\mathcal{M}}}$ . While the visible tokens are fed to the encoder $f_{e}$ to extract features, the masked tokens act as the learning targets, which are required to be reconstructed during self-supervised learning (MIM). In the MAE method [13], a high mask ratio (e.g., 75%) is adopted, to prevent information leakage (i.e., simply extrapolating masked pixels from the neighbors) in the pre-training phase.
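The split into visible and masked token sets can be sketched as a random permutation of patch indices. This is a minimal illustration, not the authors’ code: the function name, the fixed seed, and the choice to truncate the visible count with `int(...)` are assumptions.

```python
import random

def random_split(num_patches, mask_ratio, seed=0):
    """Randomly split patch indices into visible and masked index lists."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_visible = int(num_patches * (1 - mask_ratio))  # assumed truncation
    visible = sorted(order[:num_visible])
    masked = sorted(order[num_visible:])
    return visible, masked

visible, masked = random_split(196, 0.75)
print(len(visible), len(masked))  # 49 147
```

With a 75% mask ratio on a 196-patch image, the encoder only processes 49 tokens, while the remaining 147 positions serve as reconstruction targets.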

Specifically, $\{{\bm{x}}^{p}_{i}\}_{i\in{\mathcal{V}}}$ are fed to $f_{e}$ to obtain latent features $\{{\bm{e}}_{i}\}_{i\in{\mathcal{V}}}$ , where ${\bm{e}}_{i}=f_{e}({\bm{x}}^{p}_{i})$ for each $i\in{\mathcal{V}}$ . A shared learnable mask token ${\bm{e}}_{[\text{M}]}$ is considered as the placeholder of tokens in ${\mathcal{M}}$ . After that, we have the input tokens $\{{\bm{h}}_{i}\}_{i=1}^{N}$ for the decoder $f_{d}$ , where

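$${\bm{h}}_{i}={\bm{e}}_{[\text{M}]}\odot\delta(i\in{\mathcal{M}})+{\bm{e}}_{i}\odot(1-\delta(i\in{\mathcal{M}})),\tag{1}$$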

and $\delta(\cdot)$ denotes an indicator function. $\{{\bm{h}}_{i}\}_{i=1}^{N}$ are then fed to $f_{d}$ to generate predictions at all positions $\{{\bm{z}}_{i}\}_{i=1}^{N}$ . The loss is calculated by comparing the normalized pixels with predictions at masked positions ${\mathcal{M}}$ , as

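$${\mathcal{L}}_{\text{MAE}}=\sum_{i\in{\mathcal{M}}}\|\text{LN}({\bm{x}}^{p}_{i})-{\bm{z}}_{i}\|_{2},\tag{2}$$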

where $\text{LN}(\cdot)$ is the layer normalization without affine transformation, a.k.a. the per-patch normalization in MAE. After pre-training, the encoder acts as the backbone to extract representations for various tasks and the decoder is discarded. As the model does not access any label in the pre-training stage, it is assumed that the features extracted by the encoder are general to downstream tasks.
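The decoder input construction and the masked-position loss described above can be illustrated on toy data. The sketch below is a plain-Python stand-in under stated assumptions: the encoder features and the mask token are made-up numbers, `eps` is an assumed stabilizer, and the loss follows the norm as written in the text rather than any particular implementation.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one patch to zero mean and unit variance (no affine parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def decoder_inputs(encoded, mask_token, masked, num_patches):
    """h_i is the shared mask token at masked positions and e_i elsewhere."""
    return [mask_token if i in masked else encoded[i] for i in range(num_patches)]

def mae_loss(patches, predictions, masked):
    """Sum over masked positions of the L2 distance to the normalized pixels."""
    total = 0.0
    for i in sorted(masked):
        target = layer_norm(patches[i])
        total += math.sqrt(sum((t - z) ** 2 for t, z in zip(target, predictions[i])))
    return total

# Toy data: 4 patches of 3 pixel values each; patches 1 and 3 are masked.
patches = [[0.0, 1.0, 2.0], [1.0, 3.0, 5.0], [2.0, 2.0, 5.0], [4.0, 0.0, 2.0]]
masked = {1, 3}
encoded = {0: [0.1, 0.2], 2: [0.3, 0.4]}  # stand-in encoder features e_i
mask_token = [0.0, 0.0]                   # stand-in for the learnable e_[M]

h = decoder_inputs(encoded, mask_token, masked, len(patches))
zero_predictions = [[0.0, 0.0, 0.0] for _ in patches]
loss = mae_loss(patches, zero_predictions, masked)
print(h, loss)
```

Only the masked positions contribute to the loss, so predictions at visible positions are free to take any value during MAE pre-training.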

4 Generic-to-Specific Distillation

Our generic-to-specific distillation (G2SD) emphasizes transferring the task-agnostic knowledge embedded in large pre-trained masked autoencoders [13]. In conjunction with task-specific distillation, G2SD endows lightweight models with favorable generalization ability and competitive results.

4.1 Generic Distillation: Task-agnostic knowledge Transfer

In each training iteration, generic distillation consists of a feed-forward pass of the teacher model and a feed-forward pass and a back-propagation pass of the student model (Fig. 2, left). In the feed-forward pass, outputs from an intermediate layer of the teacher decoder and the final layer of the student decoder are compared to calculate the generic distillation loss.

Denote the encoder and decoder of the teacher model pre-trained with the MAE method as $f^{t}_{e}$ and $f^{t}_{d}$, and the encoder and decoder of the student model as $f^{s}_{e}$ and $f^{s}_{d}$, respectively. Input tokens $\{{\bm{x}}^{p}_{i}\}_{i=1}^{N}$ are randomly split into visible ones $\{{\bm{x}}^{p}_{i}\}_{i\in{\mathcal{V}}}$ and masked ones $\{{\bm{x}}^{p}_{i}\}_{i\in{\mathcal{M}}}$. The visible tokens $\{{\bm{x}}^{p}_{i}\}_{i\in{\mathcal{V}}}$ are simultaneously fed to $f^{t}_{e}$ and $f^{s}_{e}$ to extract features $\{{\bm{e}}^{t}_{i}\}_{i\in{\mathcal{V}}}$ and $\{{\bm{e}}^{s}_{i}\}_{i\in{\mathcal{V}}}$. According to Eq. 1, we have the input token set $\{{\bm{h}}^{t}_{i}\}_{i=1}^{N}$ for the teacher decoder and $\{{\bm{h}}^{s}_{i}\}_{i=1}^{N}$ for the student decoder. In general, a flexible decoder consists of multiple Transformer blocks. We respectively mark the depth of the teacher decoder and that of the student decoder as $L$ and $l$, where $l\leq L$ in our experiments. Denote the features output by the $l$-th layer of the teacher decoder $f^{t}_{d}$ as $\{\hat{\bm{z}}^{t}_{i}\}_{i=1}^{N}$, where $\hat{\bm{z}}^{t}_{i}=f^{t}_{d_{l}}({\bm{h}}^{t}_{i})$. The student decoder $f^{s}_{d}$ applies $l$ Transformer blocks to $\{{\bm{h}}^{s}_{i}\}_{i=1}^{N}$ and calculates the output features as $\{f^{s}_{d}({\bm{h}}^{s}_{i})\}_{i=1}^{N}$. Subsequently, a linear layer ${\bm{W}}$ is applied to $f^{s}_{d}({\bm{h}}^{s}_{i})$ to match the channel dimension of $\hat{\bm{z}}^{t}_{i}$ and generate predictions ${\bm{z}}^{s}_{i}$, i.e., ${\bm{z}}^{s}_{i}={\bm{W}}f^{s}_{d}({\bm{h}}^{s}_{i})$.

According to the above definitions, a generic distillation loss is defined as

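$${\mathcal{L}}_{\rm{GD}}=\sum_{i\in\{{\mathcal{V}}\cup{\mathcal{M}}\}}\text{Smooth-}\ell_{1}(\text{LN}(\hat{\bm{z}}^{t}_{i})-{\bm{z}}^{s}_{i}),\tag{3}$$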

where $\text{Smooth-}\ell_{1}(\cdot)$ is a trade-off function between $\ell_{1}$ and $\ell_{2}$. By minimizing ${\mathcal{L}}_{\rm{GD}}$ on the visible tokens ${\mathcal{V}}$, the student encoder is optimized to extract features in the same way as the teacher encoder, i.e., to mimic its feature extraction behavior. By minimizing ${\mathcal{L}}_{\rm{GD}}$ on the masked tokens ${\mathcal{M}}$, the student encoder and decoder are optimized to learn the context modeling ability of the teacher model. Optimizing ${\mathcal{L}}_{\rm{GD}}$ on all tokens thus transfers task-agnostic knowledge.
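A minimal plain-Python sketch of the generic distillation loss may help make the pieces concrete: the teacher’s hidden features are normalized, the student decoder output is projected by a linear layer to the teacher width, and a Smooth-$\ell_{1}$ penalty is summed over all tokens. The toy numbers, the `beta=1.0` threshold, the `eps` value and the sum reduction are assumptions made for illustration; the released code may differ.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance (no affine)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def smooth_l1(diff, beta=1.0):
    """Quadratic for |diff| < beta, linear otherwise."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def project(weight, feature):
    """Apply the linear layer W (rows = teacher channels) to one student feature."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weight]

def generic_distillation_loss(teacher_hidden, student_decoded, weight):
    """Sum Smooth-l1(LN(teacher feature) - W * student feature) over all tokens."""
    total = 0.0
    for z_t, f_s in zip(teacher_hidden, student_decoded):
        target = layer_norm(z_t)
        prediction = project(weight, f_s)
        total += sum(smooth_l1(t - p) for t, p in zip(target, prediction))
    return total

# Toy setup: 2 tokens, teacher width 3, student decoder width 2.
teacher_hidden = [[1.0, 2.0, 3.0], [0.0, 4.0, 2.0]]
student_decoded = [[0.5, -0.5], [1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = generic_distillation_loss(teacher_hidden, student_decoded, weight)
print(loss)
```

Because the loop runs over every token position, both visible and masked tokens contribute, which is the property the text credits for transferring feature extraction and context modeling together.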

4.2 Specific Distillation: Task-specific Representation Configuration

After generic distillation, lightweight models are able to generalize to downstream tasks and reach competitive performance, which has been validated by comprehensive experiments (See Tab. 4). Nevertheless, limited by a relatively small model size and number of parameters, lightweight models still have a performance gap with their teachers. To bridge the gap, specific distillation is performed so that compact yet discriminative features can be configured for downstream tasks, such as image classification, object detection, and semantic segmentation.

For specific distillation, the teacher model $f^{t}$ is first pre-trained with the MAE method and then fine-tuned on the specific task. A lightweight ViT model after generic distillation is set as the student $f^{s}$. As the concrete loss function depends on the specific task, we denote ${\mathcal{L}}_{\text{Task}}$ as the task loss function and ${\mathcal{L}}_{\text{KD}}$ as the task-specific distillation loss function. Combining the task loss with the task-specific distillation loss, we have a joint loss to optimize the student model, as

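$${\mathcal{L}}_{\rm{SD}}={\mathcal{L}}_{\text{Task}}(f^{s}({\bm{x}}),Y)+\beta{\mathcal{L}}_{\text{KD}}(f^{s}({\bm{x}}),f^{t}({\bm{x}})),\tag{4}$$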

where $Y$ is the ground truth and $\beta$ is the regularization factor (Refer to Appendix A for details).

4.3 Analysis

The proposed two-stage approach is more plausible than commonly used single-stage methods, which can be justified from the perspective of mutual information [1]. Knowledge distillation can be generally interpreted as a procedure to maximize the mutual information $\mathcal{I}$ of a teacher model ($f^{t}$) and a student model ($f^{s}$). Denote the parameters of the student model as $\theta^{s}$, the pre-training dataset as $X$ and the fine-tuning dataset as $\hat{X}$. The single-stage task-specific distillation is interpreted as

$$\underset{\theta^{s}}{\arg\max}\;\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid\hat{X}),\tag{5}$$

which maximizes the mutual information between the teacher model $f^{t}$ and the student model $f^{s}$ conditional on the fine-tuning dataset $\hat{X}$ . The proposed G2SD is interpreted as

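$$\underset{\theta^{s}}{\arg\max}\;\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid X)+\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid\hat{X})-\mathcal{I}_{\theta^{s},\theta^{t}}(f^{t},f^{s}\mid(X,\hat{X})),\tag{6}$$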

which maximizes the mutual information between the teacher model $f^{t}$ and the student model $f^{s}$ conditional on both the pre-training data $X$ and the fine-tuning dataset $\hat{X}$ . Obviously, the mutual information defined by Eq. 6 is larger than that by Eq. 5, which implies more information can be transferred by our G2SD approach.

5 Experiments

5.1 Setting

Datasets.

The generic distillation is conducted on ImageNet-1k [39] training set with 1.2M images. Following self-supervised recipes [13], we do not use the label information, so that lightweight models focus on soaking up the task-agnostic representations. In specific distillation, the models are fine-tuned from the previous stage on ImageNet-1k [39], MS COCO [26] and ADE20K [58] datasets.

Implementation details.

In the generic distillation stage, the MAE pre-trained ViT-Base model [13] is employed as the teacher. The student model is trained for 300 epochs using the AdamW optimizer [30], with learning rate 2.4e-3, weight decay 0.05, batch size 4096, and image resolution 224×224. Unless specified, the mask ratio is set to 75% and the student decoder contains 4 Transformer blocks with 128 and 256 dimensions for ViT-Tiny and ViT-Small, respectively.

In the task-specific distillation stage, the student decoder is discarded while the encoder is utilized as the backbone to extract features for various tasks, as done in MAE [13]. We use the official or re-implemented MAE fine-tuned model as the teacher. To avoid deteriorating the general representations obtained from the previous stage, a layer decay schedule is adopted to train the student model for all downstream tasks.
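A layer decay schedule of this kind is commonly implemented as layer-wise learning-rate decay, where earlier layers receive geometrically smaller learning rates so that pre-trained features change slowly. The sketch below shows one common formulation and is not taken from the paper; the indexing convention (0 for the patch embedding, 1 to 12 for the Transformer blocks, 13 for the task head) is an assumption, while the decay value 0.75 is the fine-tuning setting listed in Tab. 9.

```python
def layerwise_lr_scales(num_layers, decay):
    """Learning-rate multipliers for layer ids 0..num_layers+1.

    Assumed convention: id 0 is the patch embedding, ids 1..num_layers are
    the Transformer blocks, and id num_layers + 1 is the task head.
    """
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]

scales = layerwise_lr_scales(num_layers=12, decay=0.75)
base_lr = 1e-3  # base learning rate for fine-tuning in Tab. 9
for layer_id, scale in enumerate(scales):
    print(f"layer {layer_id:2d}: lr = {base_lr * scale:.2e}")
```

Under this convention the task head trains at the full learning rate, while the patch embedding is updated with a multiplier of $0.75^{13}$.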

For image classification, we take a fine-tuned ViT-Base model as the teacher, which is officially released by MAE [13] and achieves 83.6% top-1 accuracy. Following the DeiT [41] distillation recipe, we append a distillation token to the student model for token-based distillation and use the hard decision of the teacher as the distillation label. The student model is trained for 200 epochs.
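For classification, the joint objective of task loss plus distillation loss can be sketched with hard teacher labels: the class-token logits are trained against the ground truth and the distillation-token logits against the teacher’s argmax prediction. This is an illustrative plain-Python sketch under assumptions; the function names, the toy logits and the weight `beta=1.0` are placeholders, not values from the paper.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of `label`, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def hard_label(teacher_logits):
    """Hard decision of the teacher: index of its largest logit."""
    return max(range(len(teacher_logits)), key=teacher_logits.__getitem__)

def specific_distillation_loss(cls_logits, dist_logits, teacher_logits, label, beta=1.0):
    """Task loss on the class token plus beta times the hard-label KD loss."""
    task = cross_entropy(cls_logits, label)
    kd = cross_entropy(dist_logits, hard_label(teacher_logits))
    return task + beta * kd

cls_logits = [2.0, 0.5, -1.0]      # student class-token logits
dist_logits = [0.2, 1.5, 0.1]      # student distillation-token logits
teacher_logits = [0.1, 3.0, -2.0]  # fine-tuned teacher logits
loss = specific_distillation_loss(cls_logits, dist_logits, teacher_logits, label=0)
print(hard_label(teacher_logits), loss)
```

In this toy case the teacher disagrees with the ground truth, so the two tokens are pulled toward different classes, which is the situation the separate distillation token is meant to handle.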

For object detection and instance segmentation, we follow the ViTDet [25] framework, where the official ViTDet-Base [25] model is used as the teacher. The Feature-Richness Score method [12] is adopted to stress important features that are distilled from the teacher to the student model. Student models are trained with batch size 64 for 100 epochs. The input image resolution is $1024\times 1024$.

For semantic segmentation, we use UperNet [51] task layers and distill the model for 160K iterations. Due to the absence of officially released model weights, we fine-tune the MAE pre-trained ViT-Base model on ADE20k using the BEiT [4] semantic segmentation codebase to obtain the teacher model, which achieves 48.3 mIoU, comparable to the official MAE report. During specific distillation, besides the supervision from the ground truth, activation maps from the student and the teacher are aligned w.r.t. the channel dimension [40].
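One common reading of aligning activation maps along the channel dimension is channel-wise distillation: each channel’s activation map is turned into a probability distribution over spatial positions with a softmax, and the student is trained to match the teacher’s distribution under a KL divergence. The sketch below follows that reading and is an assumption about the details; the temperature, the averaging over channels and the toy maps are illustrative.

```python
import math

def spatial_softmax(channel_map, temperature=1.0):
    """Softmax over the spatial positions of one flattened channel map."""
    m = max(channel_map)
    exps = [math.exp((v - m) / temperature) for v in channel_map]
    total = sum(exps)
    return [e / total for e in exps]

def channelwise_kd(teacher_maps, student_maps, temperature=1.0):
    """Mean over channels of KL(teacher || student) between spatial distributions."""
    total = 0.0
    for t_map, s_map in zip(teacher_maps, student_maps):
        p_t = spatial_softmax(t_map, temperature)
        p_s = spatial_softmax(s_map, temperature)
        total += sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return (temperature ** 2) * total / len(teacher_maps)

# Toy activation maps: 2 channels, each flattened from a 2 x 2 spatial grid.
teacher_maps = [[2.0, 0.0, 0.0, 1.0], [0.5, 0.5, 3.0, 0.0]]
student_maps = [[1.0, 0.5, 0.0, 1.0], [0.0, 1.0, 2.0, 0.0]]
loss = channelwise_kd(teacher_maps, student_maps)
print(loss)
```

Because the softmax is taken per channel, the loss depends on where each channel responds most strongly rather than on the absolute scale of its activations.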

5.2 Main Results

Image Classification.

In Tab. 1, G2SD is compared with 1) supervised methods including MobileNet-v3 [19], ResNet [15, 48], DeiT [41, 42], Swin Transformer [28] and ConvNeXt [29]; 2) self-supervised methods upon ViT-Small, like BEiT [4] and CAE [8]; and 3) distillation methods upon vanilla ViTs, like DeiT [41], DearKD [7], Manifold [21], MKD [27], SSTA [49] and DMAE [3]. G2SD achieves 82.5% top-1 accuracy, outperforming the CNN-based ConvNeXt by 0.4% while using fewer parameters (22M vs. 29M). G2SD consistently outperforms the self-supervised methods BEiT and CAE, by 0.8% and 0.5%, respectively. G2SD also outperforms the other distillation methods. Remarkably, with limited parameters ($\sim$6M), G2SD reports a substantial gain over DeiT-Ti and the carefully designed MobileNet-v3.

Object Detection and Instance Segmentation.

In Tab. 2, we report APbbox for object detection and APmask for instance segmentation. We compare G2SD with popular methods on various backbone networks: 1) vanilla ViTs, like CAE [8], ViT-Adapter [9] and imTED [56]; 2) elaborately designed architectures, like the CNN-based ConvNeXt [29] and the hierarchically designed Swin Transformer [28]. One can see that G2SD-S, with fewer parameters, obtains gains of more than 4.4 APbbox over ConvNeXt-T and Swin-T, which build in strong inductive biases. Compared with CAE-S, which benefits from masked image modeling, G2SD also shows clear superiority. Moreover, G2SD-S outperforms imTED-S by 2.6 APbbox on object detection and 2 APmask on instance segmentation, where imTED-S uses the pre-trained MAE encoder as the backbone and the pre-trained MAE decoder as task layers.

Semantic Segmentation.

In Tab. 3, G2SD is compared with ViT-Adapter [9], ConvNeXt [29] and Swin Transformer [28]. G2SD-S outperforms all the compared methods by at least 1.4 mIoU, even though ViT-Adapter elaborately modifies the model architecture to adapt to dense prediction tasks. Remarkably, using only 11M parameters, G2SD-Ti achieves 44.5 mIoU, which pushes the performance of lightweight ViT models to a new height.

5.3 Ablation Studies: Single-stage vs. Two-stage

In Tab. 4, comprehensive experiments are conducted to compare single-stage and two-stage distillation methods. The teacher models include: 1) the pre-trained MAE ViT-Base model for generic distillation; 2) fine-tuned MAE models on ImageNet-1k, MS COCO and ADE20k for specific distillation, which respectively reach 83.6% top-1 accuracy, 51.6 APbbox, 45.9 APmask and 48.3 mIoU. The student models are vanilla ViT-Tiny and ViT-Small, which are initialized from the self-supervised method MAE [13] or from G2SD. MAE with specific distillation denotes a model pre-trained with MAE and fine-tuned with task-specific distillation. The MAE ViT-Small model is pre-trained for 300 epochs using the official codebase.

When specific distillation is not used, G2SD w/o S.D outperforms MAE by a large margin, e.g., 49.9 vs. 45.3 APbbox and 46.2 vs. 41.1 mIoU, which benefits from the transferred task-agnostic knowledge. After activating specific distillation, both MAE and G2SD boost their performance, e.g., G2SD-S gains 0.7 APbbox and 1.8 mIoU, which is attributed to discriminative representation configuration. In conclusion, G2SD outperforms MAE, with and without specific distillation, across model sizes and datasets, validating the superiority of our two-stage distillation approach.

5.4 Ablation Studies: Generic Distillation

Target Configuration.

We investigate the impact of target feature selection in generic distillation stage and report the results in Tab. 5. All models are trained under the same recipe and evaluated on ImageNet-1k and ADE20k. From Tab. 5 (#5), one can see that aligning student decoder features with teacher decoder’s hidden features at both visible and masked patches achieves the best results, e.g., 81.99% top-1 accuracy on ImageNet-1k and 46.19 mIoU on ADE20k.

Transferring from the teacher encoder to the student encoder is the most straightforward method, as shown in Tab. 5 (#1), but it only reaches 43.69 mIoU on ADE20k. The reason is that it overlooks the context understanding ability, which is beneficial for dense prediction tasks. Distilling the relation among tokens is popular and effective in NLP [46]. We thus conduct experiments using the self-attention relation of the teacher encoder as the distillation target, and find that the student only obtains 81.45% top-1 accuracy on ImageNet-1k, Tab. 5 (#2).

In Tab. 5 (#3), we align the student decoder features with those of the teacher decoder at the masked positions. In this way, the student respectively gains 0.36% accuracy and 1.51 mIoU on ImageNet and ADE20k compared to Tab. 5 (#1), which verifies the benefit of learning the context understanding capacity. Furthermore, we simultaneously calculate the alignment loss on encoder features at visible patches and on decoder features at masked patches in Tab. 5 (#4), which is a more direct way to let the student inherit the feature extraction and context understanding capability of the teacher than Tab. 5 (#5). Unfortunately, the student performs worse than when the alignment loss is calculated only on decoder features at masked patches (#3), e.g., 44.12 vs. 45.20 mIoU on ADE20k, and remains well below the default setting (#5, 46.19 mIoU).

Mask Ratio.

A high mask ratio (75%) works well in MAE [13], but the suitable mask ratio in generic distillation still needs to be explored. In general, predicting masked features is more challenging than predicting pixels. However, the observations are consistent with the teacher MAE, as illustrated in Tab. 6 (top), where a high mask ratio tends to generate good results. The reason may be that the teacher model can express itself to the greatest extent when the mask ratio is consistent with the MAE pre-training phase.
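To make the mask ratio concrete, the sketch below draws a random patch mask in the style commonly used for masked autoencoders. It is an illustration under assumptions: a 224×224 input split into 16×16 patches (196 patches) is assumed, and `random_mask` is a hypothetical helper, not a function from the paper's code.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask of shape (num_patches,), True = masked patch."""
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)       # random patch order
    mask = np.ones(num_patches, dtype=bool)   # start with everything masked
    mask[perm[:num_keep]] = False             # reveal the first num_keep patches
    return mask

rng = np.random.default_rng(0)
mask = random_mask(196, 0.75, rng)
print(int((~mask).sum()), "visible patches,", int(mask.sum()), "masked patches")
```

With a 75% ratio, only 49 of the assumed 196 patches remain visible to the encoder.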

Target Layer.

A sufficiently deep decoder is essential for the fine-tuning performance in MAE [13]. We study which decoder layer is the best target layer in Tab. 6 (bottom). One can see that using features of the 4-th layer as distillation targets for G2SD yields better accuracy. A possible explanation is that the last several layers in the decoder are more specialized for reconstructing low-level information (e.g., pixel values), while the first several layers cannot produce sufficiently general representations.

Decoder Design.

We study how the performance varies with decoder depth and width, where depth and width respectively denote the number of Transformer blocks and the embedding dimension of each Transformer block. As demonstrated in Tab. 7, the student decoder of width 256 and depth 4 yields optimal results, in terms of image classification. When the student decoder is heavy, the student encoder can be “lazy” to pursue good features as the decoder is competent for both feature extraction and image reconstruction.

5.5 Analysis

G2SD is compared with MAE and DeiT in terms of occlusion invariance, representation similarity and robustness, which indicate that it learns representations general to downstream tasks. Here, DeiT denotes task-specific distillation in which the original teacher is replaced with the fine-tuned MAE-Base model. For a fair comparison, we set the total training epochs of the three methods to be the same (500 epochs). The major difference between the three methods is initialization, i.e., G2SD is initialized from generic distillation, MAE from MAE pre-training, and DeiT from scratch.

Centered Kernel Alignment [24] (CKA) is a preferred metric evaluating normalized similarity between two feature maps or representations, and it is invariant to the orthogonal transformation of representations and isotropic scaling. We calculate CKA scores to analyze the occlusion invariance and representation similarity in the following.
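For reference, the linear variant of CKA can be sketched in a few lines. This is a minimal illustration, assuming a linear kernel and features arranged as (samples, dimensions). The source does not state which kernel is used in the experiments, and `linear_cka` is a hypothetical name.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two representations of the same n samples.

    x: array of shape (n, d1); y: array of shape (n, d2).
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

Rotating one representation by an orthogonal matrix or rescaling it by a constant leaves the score unchanged, which is the invariance property mentioned above.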

Occlusion Invariance.

Masked autoencoders are verified to learn occlusion invariant features in [23]. In Fig. 3 (left), we directly evaluate the performance of DeiT and G2SD under various mask ratios on the ImageNet-1k validation set. G2SD decreases about 24% while DeiT decreases about 51% when the mask ratio is 80%. In Fig. 3 (right), we calculate the CKA similarity between masked image representations and complete image representations, and find that G2SD can obtain higher CKA scores than DeiT. These observations suggest that G2SD preserves more occlusion invariance than the single-stage method (e.g., DeiT).

Representation Similarity.

In Fig. 4 (a) and (b), representations generated by G2SD w/o S.D are more similar to pre-trained MAE-B than pre-trained MAE-S, indicating that generic distillation enables better features than simply reconstructing pixels. Furthermore, after task-specific distillation, G2SD consistently obtains higher CKA scores than MAE and DeiT, as illustrated in Fig. 4 (c) and (d), implying that generic distillation provides a favorable initialization for specific distillation.

Robustness.

This is evaluated by testing the trained classifier on several ImageNet variants including ImageNet-A [17], ImageNet-R [16], ImageNet-S [45] and ImageNet-V2 [37]. From Tab. 8, one can see that G2SD outperforms the compared methods, which implies better generalization capability. In other words, the proposed G2SD encourages the small student model to maintain the generalization capability of teacher model endowed by the generic self-supervised method, as much as possible.

6 Conclusion

We proposed a two-stage distillation approach, termed generic-to-specific distillation (G2SD), to tap the potential of lightweight ViTs under the supervision of pre-trained large models. For generic distillation, we further designed a simple-yet-effective distillation strategy by aligning students’ predictions with latent features of large masked autoencoders at both masked and visible patches. With two-stage distillation, the task-agnostic and task-specific knowledge of large models was transferred to lightweight ones. Extensive experiments on image classification, object detection, and semantic segmentation validated the performance of the proposed G2SD approach, in striking contrast to state-of-the-art methods. This study has built a solid baseline for two-stage vision model distillation.

Acknowledgement. This work was supported by National Natural Science Foundation of China (NSFC) under Grant 62225208, 62171431 and 61836012, and the Strategic Priority Research Program of Chinese Academy of Sciences under Grant No. XDA27000000.

Appendix A Hyperparameters

A.1 Image Classification

For distillation, as in [41], we added a learnable distillation token, which is combined with the cls token to produce final predictions in the inference phase. In experiments, the data augmentation and optimizer follow the fine-tuning recipe of MAE [13], while the learning rate, training epochs and layer-wise learning-rate decay are specified. For models training from scratch (e.g., DeiT), we set the layer decay value as 1.0, which means no layer decay is adopted. For pre-trained models (e.g., MAE [13], G2SD), we set the layer decay value to 0.75 and training epochs to 200.
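The layer decay values above can be read as per-layer learning-rate multipliers. The sketch below follows the scheme commonly used in BEiT- and MAE-style fine-tuning, which is assumed here rather than confirmed by the source. A 12-block ViT is also an assumption, and `layer_lr_scales` is a hypothetical helper.

```python
def layer_lr_scales(num_blocks, layer_decay):
    """Return one learning-rate multiplier per layer id.

    Layer id 0 is the patch embedding, ids 1..num_blocks are the
    Transformer blocks, and id num_blocks + 1 is the head.
    """
    num_layers = num_blocks + 1
    return [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]

scales = layer_lr_scales(num_blocks=12, layer_decay=0.75)
print(f"embedding: {scales[0]:.4f}, last block: {scales[-2]:.2f}, head: {scales[-1]:.1f}")
```

With a decay of 0.75, earlier layers receive progressively smaller learning rates, whereas a decay of 1.0 gives every layer the same rate, i.e., no layer decay.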

A.2 Object Detection and Instance Segmentation

In the experiments, we adopt the official codebase (https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet) and follow the settings used in ViTDet [25]. The total batch size is set to 64 (8 images per GPU). The learning rate is set to $1\times10^{-4}$, the backbone’s drop path rate is $0.1$, and the distill warm step is 500. The overall training target is the same as [12]: $L=L_{GT}+\alpha L_{FPN}+\beta L_{head}$, where $\alpha$ and $\beta$ are respectively set to $0.001$ and $0.1$.

A.3 Semantic Segmentation

In this experiment, we adopt BEiT’s segmentation codebase (https://github.com/microsoft/unilm/beit) and set the total batch size to 32 (4 images per GPU). The backbone’s drop path rate is $0.1$. The layer decay rate is 0.75. The learning rates of ViT-Small and ViT-Tiny are respectively set to $2\times10^{-4}$ and $5\times10^{-4}$. We set the temperature parameter $\tau=1$ and the loss weight $\alpha=3$ for the logits map distillation.
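As an illustration of how $\tau$ and $\alpha$ could enter the logits map distillation, the sketch below computes a per-pixel KL divergence between temperature-softened teacher and student class distributions. The exact loss form is not given in this section, so the standard knowledge-distillation formulation (including the $\tau^2$ scaling) is an assumption, and `logits_map_kd` is a hypothetical name.

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def logits_map_kd(student_logits, teacher_logits, tau=1.0, alpha=3.0):
    """Weighted per-pixel KL(teacher || student) over class logits.

    Both inputs have shape (num_classes, height, width).
    """
    log_p_s = log_softmax(student_logits / tau, axis=0)
    log_p_t = log_softmax(teacher_logits / tau, axis=0)
    p_t = np.exp(log_p_t)
    kl_per_pixel = (p_t * (log_p_t - log_p_s)).sum(axis=0)
    # tau**2 keeps gradient magnitudes comparable across temperatures
    # (an assumption taken from the standard distillation formulation).
    return float(alpha * tau ** 2 * kl_per_pixel.mean())
```

The loss is zero when the student reproduces the teacher's logits map exactly and grows as the two per-pixel distributions diverge.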

Appendix B Training Time and Efficiency

As shown in Table 10, G2SD outperforms both DeiT [41] variants (supervised, and supervised with distillation), which are trained with a longer schedule (500 epochs). The teacher of the distilled DeiT is the same as G2SD’s. In the generic distillation stage, since the input of G2SD is a masked image (75% of the patches are discarded), the training time per epoch is less than that of DeiT (which processes the whole image).

Appendix C Detection Performance with ViTDet

Owing to the lack of official Mask R-CNN [14] results and checkpoints for MAE [13], we choose ViTDet [25] as the detector. In Table 11, the backbone models are initialized from various forms of supervision, e.g., supervised methods (DeiT [41]), distilled methods (DeiT [41] and G2SD) and self-supervised methods (DINO [5] and iBOT [59]). From Table 11, G2SD significantly outperforms competitors in both performance and convergence speed.

Appendix D More Ablations on Target Configuration

In Table 5, we have conducted ablation studies on intermediate features as generic distillation targets. Besides using intermediate features as distillation targets, taking the teacher’s predictions as the distillation objective [18, 41] is also a popular alternative. Therefore, we take the MAE’s predictions as the generic distillation targets in Table 12. When taking the MAE’s predictions as the targets for masked positions, the performance drops to 81.4% (without specific distillation) and 81.8% (with specific distillation). This observation is consistent with the results in Table 6 (bottom), where the last several layers in the decoder are more specialized for the low-level information reconstruction task.

Bibliography

[1] Sungsoo Ahn, Shell Xu Hu, Andreas C. Damianou, Neil D. Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. IEEE CVPR, pages 9155–9163, 2019.
[2] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022.
[3] Yutong Bai, Zeyu Wang, Junfei Xiao, Chen Wei, Huiyu Wang, Alan Loddon Yuille, Yuyin Zhou, and Cihang Xie. Masked autoencoders enable efficient knowledge distillers. arXiv preprint arXiv:2208.12256, 2022.
[4] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
[6] Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. IEEE CVPR, pages 5006–5015, 2021.
[7] Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. Dearkd: Data-efficient early knowledge distillation for vision transformers. IEEE CVPR, pages 12042–12052, 2022.
[8] Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022.

Table 1: Top-1 accuracy on ImageNet-1k.
Method	Teacher	#Param(M)	Acc (%)
DeiT-Ti [41]	N/A	5	72.2
MobileNet-v3 [19]	N/A	5	75.2
ResNet-18 [15]	N/A	12	69.8
DeiT-S [41]	N/A	22	79.8
BEiT-S [4]	N/A	22	81.7
CAE-S [8]	N/A	22	82.0
DINO-S [5]	N/A	22	82.0
iBOT-S [59]	N/A	22	82.3
ResNet-50 [15]	N/A	25	76.2
Swin-T [28]	N/A	28	81.3
ConvNeXt-T [29]	N/A	29	82.1
DeiT-Ti [41]	RegNetY-16GF	6	74.5
DeiT-S [41]	RegNetY-16GF	22	81.2
DearKD-Ti [7]	RegNetY-16GF	6	74.8
DearKD-S [7]	RegNetY-16GF	22	81.5
Manifold-Ti [21]	CaiT-S24	6	75.1
Manifold-S [21]	CaiT-S24	22	81.5
MKD-Ti [27]	CaiT-S24	6	76.4
MKD-S [27]	CaiT-S24	22	82.1
SSTA-Ti [49]	DeiT-S	6	75.2
SSTA-S [49]	DeiT-B	22	81.4
DMAE-Ti [3]	MAE-B	6	70.0
DMAE-S [3]	MAE-B	22	79.3
G2SD-Ti (ours)	MAE-B	6	77.0
G2SD-S (ours)	MAE-B	22	82.5

Table 2: Object detection and instance segmentation results on the MS COCO dataset.
Method	#Param(M)	AP^bbox	AP^mask
Mask R-CNN [14], 36 epochs + Multi-Scale
CAE-S [8]	46.1	44.1	39.2
ViT-Adapter-T [9]	28.1	46.0	41.0
Swin-T [28]	47.8	46.0	41.6
ConvNeXt-T [29]	48.1	46.2	41.7
imTED-S [56]	30.1	48.0	42.8
ViT-Adapter-S [9]	47.8	48.2	42.8
ViTDet [25], 100 epochs + Single-Scale
DeiT-S [41]	44.5	47.2	41.9
DINO-S [5]	44.5	49.1	43.3
iBOT-S [59]	44.5	49.7	44.0
G2SD-Ti (ours)	27.7	46.3	41.6
G2SD-S (ours)	44.5	50.6	44.8

Method	Params (M)	Throughput (Images/s)	Generic Distillation	Specific Distillation	ImageNet-1k Top-1 Acc (%)	MS COCO AP^bbox	MS COCO AP^mask	ADE20k mIoU
Teacher: ViT-Base	86.57	1.0×	N/A	N/A	83.6	51.6	45.9	48.3
Student: ViT-Tiny
MAE [13]	5.72	5.84×	✗	✗	75.2	37.9	34.9	36.9
MAE [13]	5.91	5.74×	✗	✓	75.9	43.5	39.0	42.0
G2SD w/o S.D (ours)	5.72	5.84×	✓	✗	76.3	44.0	39.6	41.4
G2SD (ours)	5.91	5.74×	✓	✓	77.0	46.3	41.3	44.5
Student: ViT-Small
MAE [13]	22.05	2.62×	✗	✗	81.5	45.3	40.8	41.1
MAE [13]	22.44	2.58×	✗	✓	81.9	48.9	43.5	44.9
G2SD w/o S.D (ours)	22.05	2.62×	✓	✗	82.0	49.9	44.5	46.2
G2SD (ours)	22.44	2.58×	✓	✓	82.5	50.6	44.8	48.0

Methods	IN	IN-A	IN-R	IN-S	IN-V2
Teacher: ViT-Base	83.6	35.9	48.3	34.5	73.2
Student: ViT-Tiny
DeiT [41]	75.3	9.5	36.2	23.4	63.3
MAE [13]	75.9	10.9	38.7	26.3	64.7
G2SD (ours)	77.0	12.9	39.0	25.9	65.6
Student: ViT-Small
DeiT [41]	81.8	24.2	45.9	32.1	71.1
MAE [13]	81.9	26.6	46.8	34.3	71.1
G2SD (ours)	82.5	29.4	46.8	33.6	72.1

Methods	1-st stage	2-nd stage	Time	Top-1 Acc (%)
G2SD	G.D 300 epochs	S.D 200 epochs	71 h	82.5
DeiT	Supervised+Distillation 500 epochs		112 h	81.7 (-0.8)
DeiT	Supervised 500 epochs		53 h	81.4 (-1.1)

Methods (Supervision)	ImageNet Acc (%)	AP^bbox	AP^mask
DeiT-S (sup., 300e)	79.9	45.7	40.7
DeiT-S (sup.&distill., 300e)	81.2	47.2	41.9
DeiT-S (sup., 500e)	81.4	46.9	41.6
DINO-S (self-sup., 3200e)	82.0	49.1	43.3
iBOT-S (self-sup., 3200e)	82.3	49.7	44.0
G2SD-S (w/o S.D, 300e)	82.0	49.9	44.5
G2SD-S (300e)	82.5	50.6	44.8
