Leveraging Angular Distributions for Improved Knowledge Distillation

Eun Som Jeon; Hongjun Choi; Ankita Shukla; Pavan Turaga

arXiv:2302.14130·cs.CV·March 1, 2023

TL;DR

This paper introduces a novel angular margin-based distillation loss that leverages the angular distribution of features to improve knowledge transfer from teacher to student models in computer vision.

Contribution

It proposes a new AMD loss function that uses angular distances and margins on a hypersphere to enhance the discrimination of features during knowledge distillation.

Findings

01. AMD loss improves student model performance across datasets

02. Method is compatible with other distillation techniques

03. Angular distribution enhances feature discrimination

Abstract

Knowledge distillation as a broad class of methods has led to the development of lightweight and memory efficient models, using a pre-trained model with a large capacity (teacher network) to train a smaller model (student network). Recently, additional variations for knowledge distillation, utilizing activation maps of intermediate layers as the source of knowledge, have been studied. Generally, in computer vision applications, it is seen that the feature activation learned by a higher capacity model contains richer knowledge, highlighting complete objects while focusing less on the background. Based on this observation, we leverage the dual ability of the teacher to accurately distinguish between positive (relevant to the target object) and negative (irrelevant) areas. We propose a new loss function for distillation, called angular margin-based distillation (AMD) loss. AMD loss uses…

Tables

Table 1: Description of experiments and their corresponding sections.

Description	Section
1. Does AMD work to distill a better student?	5.3
• Comparison with various attention based distillation methods. • Investigating the effect of each component of the proposed method.	5.3
2. What is the effect of learning with AMD from various teachers?	5.4
• Exploring with different capacity of teachers.	5.4
3. What is the effect of different hyperparameters?	5.5
• Ablation study with $γ$ and $m$ .	5.5
4. What are the visualized results for the area of interest?	5.6
• Visualized results of activation maps from intermediate layers with or without local feature distillation.	5.6
5. Is AMD able to perform with existing methods?	5.7
• Evaluation with various methods such as fine-grained feature distillation, augmentation, and other distillation methods. • Generalizability analysis with ECE and reliability diagrams.

Table 2: Architecture of WRN used in experiments. Downsampling is performed in the first layers of conv3 and conv4. 16 and 28 denote the depth and $k$ is the width (channel multiplication) of the network.

Group Name	Output Size	WRN16- $k$	WRN28- $k$
conv1	32 $\times$ 32	3 $\times$ 3, 16	3 $\times$ 3, 16
conv2	32 $\times$ 32	$[\begin{matrix} 3 \times 3, 16 k \\ 3 \times 3, 16 k \end{matrix}]$ $\times$ 2	$[\begin{matrix} 3 \times 3, 16 k \\ 3 \times 3, 16 k \end{matrix}]$ $\times$ 4
conv3	16 $\times$ 16	$[\begin{matrix} 3 \times 3, 32 k \\ 3 \times 3, 32 k \end{matrix}]$ $\times$ 2	$[\begin{matrix} 3 \times 3, 32 k \\ 3 \times 3, 32 k \end{matrix}]$ $\times$ 4
conv4	8 $\times$ 8	$[\begin{matrix} 3 \times 3, 64 k \\ 3 \times 3, 64 k \end{matrix}]$ $\times$ 2	$[\begin{matrix} 3 \times 3, 64 k \\ 3 \times 3, 64 k \end{matrix}]$ $\times$ 4
	1 $\times$ 1	average pool, 10-d fc, softmax

Table 3: Details of teacher and student network architectures. ResNet [39] and WideResNet [33] are denoted by ResNet (depth) and WRN (depth)-(channel multiplication), respectively.

DB	Setup	Compression type	Teacher	Student	FLOPs (teacher)	FLOPs (student)	# of params (teacher)	# of params (student)	Compression ratio
CIFAR-10	(a)	Channel	WRN16-3	WRN16-1	224.63M	27.24M	1.50M	0.18M	11.30%
CIFAR-10	(b)	Depth	WRN28-1	WRN16-1	56.07M	27.24M	0.37M	0.18M	47.38%
CIFAR-10	(c)	Depth+Channel	WRN16-3	WRN28-1	224.63M	56.07M	1.50M	0.37M	23.85%
CIFAR-10	(d)	Different architecture	ResNet44	WRN16-1	99.34M	27.24M	0.66M	0.18M	26.47%
CINIC-10	(a)	Channel	WRN16-3	WRN16-1	224.63M	27.24M	1.50M	0.18M	11.30%
CINIC-10	(b)	Depth	WRN28-1	WRN16-1	56.07M	27.24M	0.37M	0.18M	47.38%
CINIC-10	(c^a)	Depth+Channel	WRN28-3	WRN16-1	480.98M	27.24M	3.29M	0.18M	5.31%
CINIC-10	(d)	Different architecture	ResNet44	WRN16-1	99.34M	27.24M	0.66M	0.18M	26.47%
Tiny-ImageNet	(a)	Channel	WRN16-3	WRN16-1	898.55M	108.98M	1.59M	0.19M	11.82%
Tiny-ImageNet	(b^b)	Depth	WRN40-1	WRN16-1	339.60M	108.98M	0.58M	0.19M	32.52%
Tiny-ImageNet	(c^b)	Depth+Channel	WRN40-2	WRN16-1	1,323.10M	108.98M	2.27M	0.19M	8.26%
Tiny-ImageNet	(d)	Different architecture	ResNet44	WRN16-1	397.36M	108.98M	0.67M	0.19M	27.82%

Table 4: Accuracy (%) on CIFAR-10 with various knowledge distillation methods. The methods denoted by “*” are attention based distillation. “g” and “l” denote using global and local feature distillation, respectively.

Setup	Teacher	Student	KD	AT*	SP	RKD	VID	AFDS*	AFD*	AMD (g)	AMD (g+l)
(a)	87.76 $\pm$ 0.12	84.11 $\pm$ 0.12	85.29 $\pm$ 0.15	85.79 $\pm$ 0.14	85.69 $\pm$ 0.11	85.45 $\pm$ 0.09	85.40 $\pm$ 0.14	–	86.23 $\pm$ 0.13	86.28 $\pm$ 0.06	86.36 $\pm$ 0.10
(b)	85.59 $\pm$ 0.13	84.11 $\pm$ 0.12	85.48 $\pm$ 0.12	85.79 $\pm$ 0.12	85.77 $\pm$ 0.07	85.47 $\pm$ 0.12	84.92 $\pm$ 0.13	85.53 $\pm$ 0.13	85.84 $\pm$ 0.11	86.04 $\pm$ 0.12	86.10 $\pm$ 0.10
(c)	87.76 $\pm$ 0.12	85.59 $\pm$ 0.12	86.57 $\pm$ 0.16	86.77 $\pm$ 0.11	86.56 $\pm$ 0.09	86.38 $\pm$ 0.22	86.64 $\pm$ 0.24	–	87.24 $\pm$ 0.03	87.13 $\pm$ 0.14	87.35 $\pm$ 0.10
(d)	86.41 $\pm$ 0.20	84.11 $\pm$ 0.21	85.44 $\pm$ 0.06	85.95 $\pm$ 0.05	85.41 $\pm$ 0.12	85.50 $\pm$ 0.06	85.17 $\pm$ 0.11	85.14 $\pm$ 0.13	85.78 $\pm$ 0.09	86.22 $\pm$ 0.07	86.34 $\pm$ 0.05

Table 5: Accuracy (%) on CINIC-10 with various knowledge distillation methods. The methods denoted by “*” are attention based distillation. AMD outperforms RKD [34]. “g” and “l” denote using global and local feature distillation, respectively.

Setup	Teacher	Student	KD	AT*	SP	VID	AFDS*	AFD*	AMD (g)	AMD (g+l)
(a)	75.40 $\pm$ 0.12	72.05 $\pm$ 0.12	74.31 $\pm$ 0.10	74.63 $\pm$ 0.13	74.43 $\pm$ 0.14	74.35 $\pm$ 0.05	–	74.13 $\pm$ 0.12	75.04 $\pm$ 0.11	75.18 $\pm$ 0.09
(b)	75.59 $\pm$ 0.15	72.05 $\pm$ 0.12	74.66 $\pm$ 0.08	74.73 $\pm$ 0.02	74.94 $\pm$ 0.11	73.85 $\pm$ 0.08	74.54 $\pm$ 0.08	74.36 $\pm$ 0.04	75.14 $\pm$ 0.06	75.21 $\pm$ 0.04
(c^a)	76.97 $\pm$ 0.05	72.05 $\pm$ 0.12	74.26 $\pm$ 0.06	74.19 $\pm$ 0.11	75.05 $\pm$ 0.10	74.06 $\pm$ 0.15	–	74.20 $\pm$ 0.12	74.72 $\pm$ 0.07	75.17 $\pm$ 0.07
(d)	74.30 $\pm$ 0.15	72.05 $\pm$ 0.12	74.47 $\pm$ 0.09	74.67 $\pm$ 0.05	74.46 $\pm$ 0.17	74.43 $\pm$ 0.10	74.64 $\pm$ 0.12	73.31 $\pm$ 0.13	74.93 $\pm$ 0.07	75.10 $\pm$ 0.10

Table 6: Accuracy (%) on Tiny-ImageNet with various knowledge distillation methods. The methods denoted by “*” are attention based distillation. AMD outperforms VID [35] and RKD [34]. “g” and “l” denote using global and local feature distillation, respectively.

Setup	Teacher	Student	KD	AT*	SP	AFDS*	AFD*	AMD (g)	AMD (g+l)
(a)	58.16 $\pm$ 0.30	49.45 $\pm$ 0.20	49.99 $\pm$ 0.15	49.72 $\pm$ 0.15	49.27 $\pm$ 0.19	–	50.00 $\pm$ 0.23	50.32 $\pm$ 0.07	49.92 $\pm$ 0.04
(b^b)	54.74 $\pm$ 0.24	49.45 $\pm$ 0.20	49.56 $\pm$ 0.17	49.79 $\pm$ 0.22	49.89 $\pm$ 0.20	49.46 $\pm$ 0.28	50.04 $\pm$ 0.27	50.15 $\pm$ 0.10	49.97 $\pm$ 0.18
(c^b)	59.92 $\pm$ 0.15	49.45 $\pm$ 0.20	49.67 $\pm$ 0.13	49.62 $\pm$ 0.16	49.59 $\pm$ 0.25	–	49.78 $\pm$ 0.24	49.88 $\pm$ 0.20	50.07 $\pm$ 0.10
(d)	54.66 $\pm$ 0.14	49.45 $\pm$ 0.20	49.52 $\pm$ 0.16	49.45 $\pm$ 0.28	49.13 $\pm$ 0.20	49.55 $\pm$ 0.13	49.44 $\pm$ 0.27	49.92 $\pm$ 0.09	50.08 $\pm$ 0.16

Table 7: Top-1 and Top-5 accuracy (%) on ImageNet with various knowledge distillation methods. The methods denoted by “*” are attention based distillation. “g” and “l” denote using global and local feature distillation, respectively.

	Teacher	Student	KD	AT*	RKD	SP	CC	AFD*	CRD(+KD)	AMD (g)	AMD (g+l)
Top-1	73.31	69.75	70.66	70.70	70.59	70.79	69.96	71.38	71.17(71.38)	71.58	71.47
Top-5	91.42	89.07	89.88	90.00	89.68	89.80	89.17	–	90.13(90.49)	90.50	90.49

Table 8: Accuracy (%) with various knowledge distillation methods for different combinations of teachers and students. “Teacher” and “Student” denote results of the model used to train the distillation methods and trained from scratch, respectively. “g” and “l” denote using global and local feature distillation, respectively.

Dataset	CIFAR-10	CIFAR-10	CIFAR-10	CIFAR-10	CINIC-10	CINIC-10	CINIC-10	CINIC-10	CINIC-10	CINIC-10	CINIC-10	CINIC-10	CINIC-10	Tiny-ImageNet	Tiny-ImageNet	Tiny-ImageNet
Teacher	WRN28-1 (0.4M, 85.84)	WRN40-1 (0.6M, 86.39)	WRN16-3 (1.5M, 88.15)	WRN16-8 (11.0M, 89.50)	WRN16-3 (1.5M, 75.65)	WRN16-8 (11.0M, 77.97)	WRN28-1 (0.4M, 73.91)	WRN40-1 (0.6M, 74.49)	WRN28-3 (3.3M, 77.14)	WRN40-2 (2.2M, 76.66)	WRN16-3 (1.5M, 75.65)	WRN28-3 (3.3M, 77.14)	M.NetV2 (0.6M, 80.98)	WRN40-1 (0.6M, 55.28)	WRN40-2 (2.3M, 60.18)	WRN16-3 (1.6M, 58.78)
Student	WRN16-1 (0.2M, 84.11 $\pm$ 0.21)	WRN16-1 (0.2M, 84.11 $\pm$ 0.21)	WRN28-1 (0.4M, 85.59 $\pm$ 0.13)	WRN28-1 (0.4M, 85.59 $\pm$ 0.13)	WRN16-1 (0.2M, 72.05 $\pm$ 0.12)	WRN16-1 (0.2M, 72.05 $\pm$ 0.12)	WRN16-1 (0.2M, 72.05 $\pm$ 0.12)	WRN16-1 (0.2M, 72.05 $\pm$ 0.12)	WRN16-1 (0.2M, 72.05 $\pm$ 0.12)	WRN16-1 (0.2M, 72.05 $\pm$ 0.12)	ResNet20 (0.3M, 72.74 $\pm$ 0.09)	ResNet20 (0.3M, 72.74 $\pm$ 0.09)	ResNet20 (0.3M, 72.74 $\pm$ 0.09)	WRN16-1 (0.2M, 49.45 $\pm$ 0.20)	WRN16-1 (0.2M, 49.45 $\pm$ 0.20)	ResNet20 (0.3M, 51.75 $\pm$ 0.19)
KD	85.48 $\pm$ 0.12	85.42 $\pm$ 0.11	86.57 $\pm$ 0.16	86.68 $\pm$ 0.08	74.31 $\pm$ 0.10	74.17 $\pm$ 0.16	74.66 $\pm$ 0.08	74.45 $\pm$ 0.03	74.26 $\pm$ 0.06	74.29 $\pm$ 0.09	75.12 $\pm$ 0.11	74.97 $\pm$ 0.07	76.69 $\pm$ 0.06	49.56 $\pm$ 0.17	49.67 $\pm$ 0.13	51.72 $\pm$ 0.13
AT	85.79 $\pm$ 0.12	85.79 $\pm$ 0.11	86.77 $\pm$ 0.11	87.00 $\pm$ 0.05	74.63 $\pm$ 0.13	74.23 $\pm$ 0.14	74.73 $\pm$ 0.02	74.55 $\pm$ 0.06	74.19 $\pm$ 0.11	74.48 $\pm$ 0.08	75.33 $\pm$ 0.11	75.18 $\pm$ 0.09	77.34 $\pm$ 0.10	49.79 $\pm$ 0.22	49.62 $\pm$ 0.16	51.65 $\pm$ 0.05
SP	85.77 $\pm$ 0.07	85.90 $\pm$ 0.11	86.56 $\pm$ 0.09	86.94 $\pm$ 0.08	74.43 $\pm$ 0.11	74.34 $\pm$ 0.13	74.94 $\pm$ 0.11	74.86 $\pm$ 0.07	75.04 $\pm$ 0.10	74.81 $\pm$ 0.09	75.29 $\pm$ 0.10	75.50 $\pm$ 0.09	73.71 $\pm$ 0.10	49.89 $\pm$ 0.20	49.59 $\pm$ 0.25	51.87 $\pm$ 0.09
AMD (g)	86.04 $\pm$ 0.12	86.03 $\pm$ 0.09	87.13 $\pm$ 0.14	87.22 $\pm$ 0.17	75.04 $\pm$ 0.11	74.93 $\pm$ 0.09	75.14 $\pm$ 0.06	75.12 $\pm$ 0.07	74.72 $\pm$ 0.07	74.95 $\pm$ 0.20	75.66 $\pm$ 0.08	75.61 $\pm$ 0.06	78.45 $\pm$ 0.03	50.15 $\pm$ 0.11	49.88 $\pm$ 0.20	51.89 $\pm$ 0.25
AMD (g+l)	86.10 $\pm$ 0.10	86.15 $\pm$ 0.06	87.35 $\pm$ 0.10	87.31 $\pm$ 0.15	75.18 $\pm$ 0.09	75.20 $\pm$ 0.05	75.21 $\pm$ 0.04	75.10 $\pm$ 0.04	75.22 $\pm$ 0.07	75.04 $\pm$ 0.06	75.75 $\pm$ 0.08	75.76 $\pm$ 0.11	78.62 $\pm$ 0.04	49.97 $\pm$ 0.18	50.07 $\pm$ 0.10	52.12 $\pm$ 0.15

Table 9: Accuracy (%) with various knowledge distillation methods for different structures of teachers and students on CIFAR-10. “Teacher” and “Student” denote results of the model used to train the distillation methods and trained from scratch, respectively. “g” and “l” denote using global and local feature distillation, respectively.

Teacher	WRN28-1 (0.4M, 85.84)	WRN16-8 (11.0M, 89.50)	vgg13 (9.4M, 88.56)	M.NetV2 (0.6M, 89.61)
Student	vgg8 (3.9M, 85.41 $\pm$ 0.06)	vgg8 (3.9M, 85.41 $\pm$ 0.06)	ResNet20 (0.3M, 85.20 $\pm$ 0.17)	ResNet26 (0.4M, 85.65 $\pm$ 0.20)
KD	86.93 $\pm$ 0.11	86.74 $\pm$ 0.13	85.39 $\pm$ 0.07	87.74 $\pm$ 0.08
AT	87.16 $\pm$ 0.09	87.29 $\pm$ 0.10	85.63 $\pm$ 0.20	88.61 $\pm$ 0.04
SP	87.29 $\pm$ 0.00	86.82 $\pm$ 0.07	85.00 $\pm$ 0.07	85.78 $\pm$ 0.10
AMD (g)	87.43 $\pm$ 0.04	87.61 $\pm$ 0.11	86.18 $\pm$ 0.14	88.70 $\pm$ 0.03
AMD (g+l)	87.56 $\pm$ 0.03	87.63 $\pm$ 0.07	86.41 $\pm$ 0.04	88.42 $\pm$ 0.08

Table 10: ECE (%) and NLL (%) for various knowledge distillation methods with Mixup on CIFAR-10. “g” and “l” denote using global and local feature distillation, respectively. The results (ECE, NLL) for WRN16-3 and WRN28-1 teachers are (1.469%, 44.42%) and (2.108%, 64.38%), respectively.

Setup	Method	ECE (w/o Mixup)	NLL (w/o Mixup)	ECE (w/ Mixup)	NLL (w/ Mixup)
	Student	2.273	70.49	7.374 (+5.101)	90.58 (+20.09)
(a)	KD [8]	2.065	63.34	1.818 (-0.247)	55.62 (-7.71)
	AT [21]	1.978	60.48	1.652 (-0.326)	50.84 (-9.64)
	AFD [23]	1.890	56.71	1.651 (-0.240)	50.22 (-6.49)
	AMD (g)	1.933	59.67	1.645 (-0.288)	50.33 (-9.34)
	AMD (g+l)	1.895	57.60	1.592 (-0.304)	49.68 (-7.92)
(b)	KD [8]	2.201	68.75	1.953 (-0.249)	58.81 (-9.93)
	AT [21]	2.156	67.14	1.895 (-0.261)	56.51 (-10.62)
	AFDS [22]	2.197	68.53	1.978 (-0.219)	58.86 (-9.68)
	AFD [23]	2.143	66.05	1.900 (-0.243)	57.68 (-8.37)
	AMD (g)	2.117	66.47	1.869 (-0.248)	56.05 (-10.42)
	AMD (g+l)	2.123	67.51	1.853 (-0.270)	55.15 (-12.36)

Table 11: ECE (%) and NLL (%) for various knowledge distillation methods with SP on CIFAR-10. “g” and “l” denote using global and local feature distillation, respectively. The results (ECE, NLL) for WRN16-3 and WRN28-1 teachers are (1.469%, 44.42%) and (2.108%, 64.38%), respectively.

Setup	Method	ECE (w/o SP)	NLL (w/o SP)	ECE (w/ SP)	NLL (w/ SP)
(a)	AT [21]	1.978	60.48	1.861 (-0.118)	56.22 (-4.26)
	AFD [23]	1.890	56.71	1.881 (-0.010)	56.73 (-0.02)
	AMD (g)	1.933	59.67	1.808 (-0.125)	54.74 (-4.93)
	AMD (g+l)	1.895	57.60	1.803 (-0.092)	53.80 (-3.80)
(b)	AT [21]	2.156	67.14	2.095 (-0.060)	65.38 (-1.75)
	AFDS [22]	2.197	68.53	2.128 (-0.069)	66.61 (-1.92)
	AFD [23]	2.143	66.05	2.118 (-0.024)	65.39 (-0.66)
	AMD (g)	2.117	66.47	2.058 (-0.059)	63.37 (-3.10)
	AMD (g+l)	2.123	67.51	2.043 (-0.080)	63.23 (-4.28)

Equations

$$\mathcal{L}=(1-\lambda)\mathcal{L_{C}}+\lambda\mathcal{L_{K}},$$

$$\mathcal{L_{C}}=\mathcal{H}(softmax(a_{S}),y),$$

$$\mathcal{L_{K}}=\tau^{2}KL(z_{T},z_{S}),$$

$$f^{l}_{T}=\sum_{j=1}^{c}|A^{l}_{T,j}|^{2}.$$

$$G^{i}=\log\left(\frac{e^{s\cdot\cos(m\cdot\theta_{y_{i}})}}{e^{s\cdot\cos(m\cdot\theta_{y_{i}})}+\sum_{j=1,j\neq y_{i}}^{J}e^{s\cdot\cos(\theta_{j})}}\right),$$

$$G^{l}(Q_{p},Q_{n})=\log\left(\frac{e^{s\cdot\cos(m\cdot\theta_{p_{l}})}}{e^{s\cdot\cos(m\cdot\theta_{p_{l}})}+e^{s\cdot\cos(\theta_{n_{l}})}}\right),$$

$$\begin{aligned}\mathcal{L}_{AM}(Q_{Tp},Q_{Tn},Q_{Sp},Q_{Sn})=\frac{1}{3|L|}\sum_{(l,l^{\prime})\in L}\Bigl(&\underbrace{\bigl\|\hat{G}^{l}(Q_{Tp},Q_{Tn})-\hat{G}^{l^{\prime}}(Q_{Sp},Q_{Sn})\bigr\|^{2}_{F}}_{A}\\&+\underbrace{\bigl\|\hat{Q}^{l}_{Tp}-\hat{Q}^{l^{\prime}}_{Sp}\bigr\|^{2}_{F}}_{P}+\underbrace{\bigl\|\hat{Q}^{l}_{Tn}-\hat{Q}^{l^{\prime}}_{Sn}\bigr\|^{2}_{F}}_{N}\Bigr).\end{aligned}$$

$$\mathcal{L}_{AMD}=\lambda_{1}\mathcal{L_{C}}+\lambda_{2}\mathcal{L_{K}}+\gamma\mathcal{L}_{A},$$

$$\mathcal{L}_{A_{global}}=\mathcal{L}_{AM}(Q_{T},Q_{S}),\qquad\mathcal{L}_{A_{local}}=\frac{1}{K}\sum_{k=1}^{K}\mathcal{L}_{AM}(Q_{T}^{k},Q_{S}^{k}),$$

Full text

Leveraging Angular Distributions for Improved Knowledge Distillation

Eun Som Jeon, Hongjun Choi, Ankita Shukla, and Pavan Turaga

Geometric Media Lab, School of Arts, Media and Engineering and School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, 85281, USA

Abstract

Knowledge distillation as a broad class of methods has led to the development of lightweight and memory efficient models, using a pre-trained model with a large capacity (teacher network) to train a smaller model (student network). Recently, additional variations for knowledge distillation, utilizing activation maps of intermediate layers as the source of knowledge, have been studied. Generally, in computer vision applications, it is seen that the feature activation learned by a higher-capacity model contains richer knowledge, highlighting complete objects while focusing less on the background. Based on this observation, we leverage the teacher’s dual ability to accurately distinguish between positive (relevant to the target object) and negative (irrelevant) areas.

We propose a new loss function for distillation, called angular margin-based distillation (AMD) loss. AMD loss uses the angular distance between positive and negative features by projecting them onto a hypersphere, motivated by the near angular distributions seen in many feature extractors. Then, we create a more attentive feature that is angularly distributed on the hypersphere by introducing an angular margin to the positive feature. Transferring such knowledge from the teacher network enables the student model to harness the teacher’s higher discrimination of positive and negative features, thus distilling superior student models. The proposed method is evaluated for various student-teacher network pairs on four public datasets. Furthermore, we show that the proposed method has advantages in compatibility with other learning techniques, such as using fine-grained features, augmentation, and other distillation methods.

Keywords: Knowledge distillation, Angular distribution, Angular margin, Image classification

Journal: Neurocomputing

1 Introduction

In the past decade, convolutional neural networks (CNNs) have been widely deployed in many commercial applications. Various architectures that go beyond convolutional methods have also been developed. However, a core challenge shared by all of them is their high computational complexity and large storage requirements [1, 2]. For this reason, the application of deep networks is still limited to environments with massive computational support. In emerging applications, there is growing demand for applying deep nets on edge, mobile, and IoT devices [3, 4, 5, 6]. To move beyond these limitations, many studies have developed lightweight neural models that maintain performance while ‘lightening’ the network scale [2, 3, 4, 5, 6, 7, 8].

Knowledge distillation (KD) is one of the promising solutions that can reduce the network size and develop an efficient network model [1, 2, 9] for various fields including wearable sensor data [10], sound [11, 12], and image classification [13, 14]. Knowledge distillation involves two networks, a larger one called the teacher and a smaller one called the student [8]. While the student is being trained, the teacher transfers its knowledge to the student, using the logits from the final layer. As a result, the student can retain the teacher model’s classification performance.

Recent insights have shown that features learned in deep networks often exhibit an angular distribution, usually leveraged via a hyperspherical embedding [15, 16, 17]. Such embeddings lead to improved discriminative power and feature separability. In terms of loss functions, these can be implemented by using angular features that correspond to the geodesic distance on the hypersphere and incorporating a preset constant margin. In this work, we show that leveraging such spherical embeddings also improves knowledge distillation. First, to get more activated features, spatial attention maps are computed and decoupled into two parts: positive and negative maps. Second, we construct a new form of knowledge by projecting the features onto the hypersphere to reflect the angular distance between them. Then, we introduce an angular margin to the positive feature to get a more attentive feature representation. Finally, during the distillation, the student tries to mimic the more separated decision regions of the teacher to improve the classification performance. Therefore, the proposed method effectively regularizes the feature representation of the student network to learn informative knowledge of the teacher network.

The contributions of this paper are:

• We propose an angular margin based distillation loss (named AMD), which performs knowledge distillation by transferring the angular distribution of attentive features from the teacher network to the student network.

• We experimentally show that the proposed method results in significant improvements with different combinations of networks and outperforms other attention-based methods across four datasets of different complexities, corroborating that the performance of a higher capacity teacher model is not necessarily better.

• We rigorously validate the advantages of the proposed distillation method with various aspects using visualization of activation maps, classification accuracy, and reliability diagrams.

The rest of the paper is organized as follows. In sections 2 and 3, we describe related work and background, respectively. In section 4, we provide an overview of the proposed method. In section 5, we describe our experimental results and analysis. In section 6, we discuss our findings and conclusions.

2 Related Work

Knowledge distillation. Knowledge distillation, a transfer learning method, trains a smaller model by transferring knowledge from a larger model. KD was first introduced by Buciluǎ et al. [18] and further explored by Hinton et al. [8]. The main concept of KD is using soft labels produced by a trained teacher network. That is, mimicking soft probabilities helps students acquire the knowledge of teachers, which improves performance beyond using hard labels (training labels) alone. Cho et al. [2] explore which student-teacher combinations yield better performance. They show that using a teacher trained with early stopping improves the efficacy of KD. KD can be categorized into two approaches that use the outputs of the teacher [1]. One is response-based KD, which uses the posterior probabilities with softmax loss. The other is feature-based KD, which uses the intermediate features with normalization. Feature-based methods can be combined with the response-based method to complement traditional KD [1]. Recently, feature-based distillation methods for KD have been studied to learn richer information from the teacher for better mimicking and performance improvement [1, 13, 19]. Romero et al. [20] first introduced the use of intermediate representations in FitNets using feature-based distillation. This method enables the student to mimic the teacher’s feature maps in intermediate layers.

Attention transfer. To better capture the knowledge of a teacher network, attention transfer [1, 21, 22, 23], one of the popular methods for feature-based distillation, has been utilized. Zagoruyko et al. [21] suggest activation-based attention transfer (AT), which uses a sum-of-squares attention mapping function that computes statistics across the channel dimension. Even when the depths of the teacher and student differ, knowledge can be transferred through the attention mapping function, which reduces the depth dimension to one. The activation-based spatial attention maps are used as the source of knowledge for distillation with intermediate layers, where the maps are created as $f^{d}_{sum}(A)=\sum_{j=1}^{c}|A_{j}|^{d}$, where $f$ is a computed attention map, $A$ is an output of a layer, $c$ is the number of channels for the output, $j$ is the channel index, and $d>1$. A higher value of $d$ corresponds to a heavier weight on the most discriminative parts defined by activation level. AT (a feature-based distillation method) is more effective when used with traditional KD (response-based KD) [21]. The method encourages the student to generate normalized maps similar to those of the teacher. However, these studies have only focused on mimicking the teacher’s activation from a layer [19], without considering the teacher’s dual ability to accurately distinguish between positive (relevant to the target object) and negative (irrelevant) areas. The teacher can not only generate and transfer its knowledge directly as an activation map, but can also transfer its separability in distinguishing between positive and negative features. We refer to this as a dual ability, which we consider for improved distillation. The emphasized positive feature regions that encapsulate regions of the target object are crucial to predicting the correct class. In general, a higher-capacity model shows better performance, producing those regions with more attention and precision compared to the smaller network. This suggests that the transfer of distinct regions of the positive and negative pairs from teacher to student could significantly improve performance. This motivates us to focus on utilizing positive and negative pairs for extracting more attentive features, implying better separability, for distillation.

Spherical feature embeddings. The majority of existing methods [24, 25] rely on Euclidean distance for feature distinction. These approaches could not solve the problem that classification under the open-set protocol shows a meaningful result only when the maximal intra-class distance is successfully narrowed. To solve this problem, an angular-softmax (A-softmax) function was proposed to distinguish the features by increasing the angular margins between features [17]. According to its geometric interpretation, using the A-softmax function is equivalent to projecting features onto the hypersphere manifold, which intrinsically matches the preliminary condition that features also lie on a manifold. Applying the angular margin penalty corresponds to the geodesic distance margin penalty on the hypersphere [17]. The A-softmax function encourages learned features to be discriminative on the hypersphere manifold. For this reason, the A-softmax function shows superior performance to the original softmax function when tested on several classification problems [17]. On the other hand, Choi et al. [15] introduced the angular margin based contrastive loss (AMC-loss) as an auxiliary loss, employing the discriminative angular distance metric that corresponds to geodesic distance on a hypersphere manifold. AMC-loss increases inter-class separability and intra-class compactness, improving performance in classification. The method can be combined with other deep techniques, because it easily encodes the angular distributions obtained from many types of deep feature learners [15].

The previous methods work with logits only or work with an auxiliary loss, such as a contrastive loss. We focus on features modeled as coming from angular distributions, and on their separability. These observations suggest that high-quality features for knowledge distillation can be obtained by projecting the feature pairs onto a hypersphere. For better distillation, we derive a new type of implicit knowledge with positive and negative pairs from intermediate layers. The details are explained in section 4.

3 Background

3.1 Traditional knowledge distillation

In standard knowledge distillation [8], the loss for training a student is:

$$\mathcal{L}=(1-\lambda)\mathcal{L_{C}}+\lambda\mathcal{L_{K}},\tag{1}$$

where, $\mathcal{L_{C}}$ denotes the standard cross entropy loss, $\mathcal{L_{K}}$ is KD loss, and $\lambda$ is a hyperparameter; $0<\lambda<1$ . The error between the output of the softmax layer of a student network and the ground-truth label is penalized by the cross-entropy loss:

$$\mathcal{L_{C}}=\mathcal{H}(softmax(a_{S}),y),\tag{2}$$

where $\mathcal{H(\cdot)}$ is a cross entropy loss function, $a_{S}$ is the logits of a student (inputs to the final softmax), and $y$ is a ground truth label. The outputs of student and teacher are matched by KL-divergence loss:

$$\mathcal{L_{K}}=\tau^{2}KL(z_{T},z_{S}),\tag{3}$$

where, $z_{T}=softmax(a_{T}/\tau)$ is a softened output of a teacher network, $z_{S}=softmax(a_{S}/\tau)$ is a softened output of a student, and $\tau$ is a hyperparameter; $\tau>1$ . Feature distillation methods using intermediate layers can be used with the standard knowledge distillation that uses output logits. When they are used together, in general, it is beneficial to guide the student network towards inducing more similar patterns of teachers and getting a better classification performance. Thus, we also utilize the standard knowledge distillation with our proposed method.
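The three equations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the default values `lam=0.9` and `tau=4.0` are assumptions chosen for the example, and a single sample is processed instead of a batch.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau > 1 softens the distribution."""
    scaled = [a / tau for a in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """H(softmax(a_S), y) for an integer class label y."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kd_loss(student_logits, teacher_logits, label, lam=0.9, tau=4.0):
    """L = (1 - lam) * L_C + lam * L_K; lam and tau defaults are assumed."""
    l_c = cross_entropy(softmax(student_logits), label)
    z_t = softmax(teacher_logits, tau)  # softened teacher output
    z_s = softmax(student_logits, tau)  # softened student output
    l_k = tau ** 2 * kl_divergence(z_t, z_s)
    return (1.0 - lam) * l_c + lam * l_k
```

When the student logits equal the teacher logits, the KL term vanishes and only the weighted cross-entropy term remains, which is a quick sanity check for any implementation of this loss.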

3.2 Attention map

Denote an output as $A\in\mathbb{R}^{c\times{h}\times{w}}$, where $c$ is the number of output channels, $h$ is the height of the output, and $w$ is the width of the output. The attention map for the teacher is given as follows:

$$f^{l}_{T}=\sum_{j=1}^{c}|A^{l}_{T,j}|^{2}.\tag{4}$$

Here, $A_{T}$ is an output of a layer from a teacher model, $l$ is a specific layer, $c$ is the number of channels, $j$ is the index of the output channel, and $T$ denotes a teacher network. The attention map for the student is $f^{l^{\prime}}_{S}=\sum_{j^{\prime}=1}^{c^{\prime}}|A^{l^{\prime}}_{S,j^{\prime}}|^{2}$, where $A_{S}^{l^{\prime}}$ is an output of a layer from a student, $l^{\prime}$ is the layer corresponding to $l$, $c^{\prime}$ is the number of channels for the output, $j^{\prime}$ is the index of the output channel, and $S$ denotes a student network. If the student and teacher use the same depth for transfer, $l^{\prime}$ can be the layer at the same depth as $l$; if not, $l^{\prime}$ can be the end of the same block for the teacher. From the attention map, we obtain positive and negative maps, and we project the features onto a hypersphere to calculate the angular distance for distillation. The details are explained in section 4.
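As a small illustration of the attention map defined above, the sketch below sums squared activations over the channel axis of a nested-list tensor. The helper name `attention_map` and the toy activations are hypothetical; the point is only that a teacher with $c=3$ channels and a student with $c^{\prime}=1$ channel both yield maps of the same $h\times w$ size.

```python
def attention_map(activation):
    """Sum of squared activations over channels: (c, h, w) -> (h, w)."""
    channels = len(activation)
    height = len(activation[0])
    width = len(activation[0][0])
    return [
        [sum(activation[j][r][col] ** 2 for j in range(channels))
         for col in range(width)]
        for r in range(height)
    ]

# Toy teacher output: c = 3 channels, 2 x 2 spatial size (made-up values).
teacher_act = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
    [[0.0, 3.0], [0.0, 1.0]],
]
# Toy student output: c' = 1 channel, same spatial size (made-up values).
student_act = [
    [[1.0, 2.0], [0.0, -2.0]],
]

f_teacher = attention_map(teacher_act)  # 2 x 2 map
f_student = attention_map(student_act)  # 2 x 2 map, directly comparable
```

Because the channel axis is summed out, no extra projection layer is needed to compare the two maps, which matches the motivation given for using attention maps.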

3.3 Spherical feature with angular margin

In order to promote the learned features to have an angular distribution, [17, 26] proposed to introduce the angular distance between weights $W$ and features $x$. For example, $W^{T}x=\|W\|\|x\|cos(\theta)$, where the bias is set as [math] for simplicity, and $\theta$ is the angle between $W$ and $x$. Then, normalizing the features and weights makes the outputs depend only on the angle between them; further, $\|x\|$ is replaced by a constant $s$ such that the features are distributed on a hypersphere with a radius of $s$. To enhance the discrimination power, an angular margin $m$ is applied to the angle of the target. Finally, output logits are used to formulate the probability with angular margin $m$ as below [17, 26]:

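$$G^{i}=\log\left(\frac{e^{s\cdot\cos(m\cdot\theta_{y_{i}})}}{e^{s\cdot\cos(m\cdot\theta_{y_{i}})}+\sum_{j=1,j\neq y_{i}}^{J}e^{s\cdot\cos(\theta_{j})}}\right),\tag{5}$$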

where, $y_{i}$ is a label and $\theta_{y_{i}}$ is a target angle for class $i$ , $\theta_{j}$ is an angle obtained from $j$ -th element of output logits, $s$ is a constant, and $J$ is the class number. Liu et al. [17] and Wang et al. [26] utilized output logits to obtain more discriminative features for classification on a hypersphere manifold, which performs better than using original softmax function. We use Equation (5) to create the new type of feature-knowledge in the intermediate layers instead of output logits in the final classifier, thereby more attentive feature maps are transferred to the student model.
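To make the role of $s$ and $m$ concrete, here is a hedged sketch of the log-probability in Equation (5), computed directly from a list of angles. The function name and the default values `s=30.0` and `m=2.0` are assumptions for illustration only, and the sketch applies $\cos(m\cdot\theta)$ exactly as written, without the piecewise extensions some angular-margin losses use for large angles.

```python
import math

def angular_margin_log_prob(thetas, target, s=30.0, m=2.0):
    """Log-probability of the target class under an angular-margin softmax.

    thetas[j] is the angle (radians) between the feature and the class-j
    weight. The margin m multiplies only the target angle; s is the radius
    of the hypersphere. Default values of s and m are assumed.
    """
    target_logit = s * math.cos(m * thetas[target])
    other_logits = [s * math.cos(t) for j, t in enumerate(thetas) if j != target]
    all_logits = [target_logit] + other_logits
    peak = max(all_logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in all_logits))
    return target_logit - log_denominator
```

With $m=1$ the expression reduces to an ordinary softmax over $s\cdot\cos(\theta_{j})$. For a target angle in $(0,\pi/m)$ and $m>1$, the target logit shrinks, so the same angle earns a lower log-probability, which is what pushes features toward tighter angular clusters.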

4 Proposed Method

The proposed method utilizes features from intermediate layers of deep networks for extracting angular-margin based knowledge as illustrated in Figure 1. The resultant angular margin loss is computed at various depths of the student and teacher as illustrated in Figure 2. To obtain the angular distance between positive and negative features, we first generate attention maps from the outputs of intermediate layers. We then decouple the maps into positive and negative features. The features are projected onto a hypersphere to extract angularly distributed features. For effective distillation, more attentive features are obtained by introducing an angular margin to the positive feature, and the probability forms for distillation are computed. Finally, the knowledge of the teacher, which better discriminates positive and negative features, is transferred to the student. The details for obtaining the positive and negative maps and the angular margin based knowledge are explained in the following sections.

4.1 Generating attention maps

To transfer activated features from teacher to student, the outputs of intermediate layers are used. To match the dimension size between teacher and student models, we create normalized attention maps [21], which are beneficial for generating maps that discriminate between positive and negative features. This reduces the need for any additional training procedure for matching the channel dimension sizes between teacher and student. We use the power value $d=2$ for generating the attention maps, which shows the best results as reported in previous methods [21].

4.2 Angular margin computation

Although the activation map-based distillation provides additional context information for student model learning, there is still room to craft an attentive activation map that can distill a superior student model in KD. To further refine the original attention map, we propose an angular margin-based distillation (AMD) that encodes new knowledge using the angular distance between positive (relevant to the target object) and negative features (irrelevant) on the hypersphere.

We denote the normalized positive map as $Q_{p}=f/\|f\|$ where $f$ is the output map extracted from the intermediate layer in networks. Further, we can obtain the normalized negative map by $Q_{n}=1-Q_{p}$ .

Then, to make the positive map more attentive, we insert an angular margin $m$ into the positive features. In this way, a new feature-knowledge encoding attentive feature can be defined as follows:

[TABLE]

where $\theta_{p_{l}}=\cos^{-1}(Q_{p})$ and $\theta_{n_{l}}=\cos^{-1}(Q_{n})$ for the $l^{th}$ layer in the networks, and $m$ is a scalar angular margin. $G^{l}\in\mathbb{R}^{1\times{h}\times{w}}$ reflects the angular distance between positive and negative features in the $l^{th}$ layer. For transferring knowledge, we aim to make the student’s $G^{l}(Q_{Sp},Q_{Sn})$ approximate the teacher’s $G^{l}(Q_{Tp},Q_{Tn})$ by minimizing the angular distance between feature maps.
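The steps of this section can be sketched as follows. The sketch assumes that $\|f\|$ is the L2 norm over the whole map and that $G^{l}$ takes a two-way softmax form, $\log\big(e^{s\cos(m\theta_{p})}/(e^{s\cos(m\theta_{p})}+e^{s\cos(\theta_{n})})\big)$, evaluated at each spatial position. Both choices, as well as the function name, are assumptions made for illustration; the defaults $s$ = 64 and $m$ = 1.35 are the values reported in section 5.2.

```python
import math

def angular_margin_map(f, s=64.0, m=1.35):
    """Return (Q_p, Q_n, G) for a non-negative 2-D attention map f (nested lists).

    Assumed form of G: log-probability of the positive entry in a two-way
    softmax over s*cos(m*theta_p) and s*cos(theta_n).
    """
    norm = math.sqrt(sum(v * v for row in f for v in row))
    # Q_p = f / ||f||, clamped to [0, 1] to guard acos against rounding error.
    q_p = [[min(1.0, max(0.0, v / norm)) for v in row] for row in f]
    q_n = [[1.0 - v for v in row] for row in q_p]          # Q_n = 1 - Q_p
    g = []
    for row_p, row_n in zip(q_p, q_n):
        g_row = []
        for p, n in zip(row_p, row_n):
            a = s * math.cos(m * math.acos(p))   # positive logit, margin applied
            b = s * math.cos(math.acos(n))       # negative logit, no margin
            top = max(a, b)                      # log-sum-exp for numerical stability
            g_row.append(a - (top + math.log(math.exp(a - top) + math.exp(b - top))))
        g.append(g_row)
    return q_p, q_n, g

q_p, q_n, g = angular_margin_map([[1.25, 5.0], [5.0, 2.0]])
```

Under this assumed form, setting `m=1.0` reduces the computation to an ordinary two-way log-softmax, which matches the description of $m$ = 1.0 as applying no additional margin.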

4.3 Angular margin based distillation loss

With redesigned knowledge as above, we finally define the angular margin based distillation loss that accounts for the knowledge gap between the teacher and student activations as:

[TABLE]

Here, $\hat{G}$ denotes the normalized output of the function $G$ , and $\hat{Q}$ is a normalized map. $L$ collects the layer pairs ( $l$ and $l^{\prime}$ ), and $\|\cdot\|_{F}$ is the Frobenius norm [27]. We will verify the performance of each component (A, P, and N) in section 5.3.
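A minimal sketch of how the three terms could be computed is given below. It assumes that (A) compares the normalized $G$ maps, (P) the positive maps, and (N) the negative maps, that normalization means scaling each map to unit Frobenius norm, and that the terms are averaged over the layer pairs and over the three components. These choices, the function names, and the dictionary layout are illustrative assumptions rather than the authors' implementation.

```python
import math

def l2_normalize(feature_map):
    """Scale a non-zero 2-D nested list to unit Frobenius norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for row in feature_map for v in row))
    return [[v / norm for v in row] for row in feature_map]

def squared_frobenius_gap(teacher_map, student_map):
    """Squared Frobenius norm of the difference between two normalized maps."""
    t, s = l2_normalize(teacher_map), l2_normalize(student_map)
    return sum((a - b) ** 2 for row_t, row_s in zip(t, s) for a, b in zip(row_t, row_s))

def angular_margin_loss(layer_pairs):
    """layer_pairs: one dict per (l, l') pair, mapping "G", "Qp", "Qn" to a
    (teacher_map, student_map) tuple of equally sized 2-D nested lists.

    Returns the overall loss and the per-component terms. Averaging over
    layer pairs and over the three terms is an assumption.
    """
    totals = {"G": 0.0, "Qp": 0.0, "Qn": 0.0}
    for pair in layer_pairs:
        for key in totals:
            teacher_map, student_map = pair[key]
            totals[key] += squared_frobenius_gap(teacher_map, student_map)
    n_layers = len(layer_pairs)
    components = {key: value / n_layers for key, value in totals.items()}
    return sum(components.values()) / 3.0, components
```

Returning the per-component terms alongside the total makes it straightforward to switch individual components on or off when studying their separate contributions.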

The final loss ( $\mathcal{L}_{AMD}$ ) of our proposed method combines all the distillation losses, including the conventional logit distillation (Equation 3). Thus, our overall learning objective can be written as:

$$\mathcal{L}_{AMD}=\lambda_{1}\mathcal{L_{C}}+\lambda_{2}\mathcal{L_{K}}+\gamma\mathcal{L_{A}}\qquad(8)$$

where, $\mathcal{L_{C}}$ is a cross-entropy loss, $\mathcal{L_{K}}$ is a knowledge distillation loss, $\mathcal{L_{A}}$ denotes the angular margin based loss from $\mathcal{L}_{AM}$ , and $\lambda_{1}$ , $\lambda_{2}$ , and $\gamma$ are hyperparameters to control the balance between different losses.

Global and local feature distillation. So far, we have only considered the global feature (i.e., preserving its dimension and size). However, the global feature alone does not always transfer the most informative knowledge or the rich spatial information across the contexts of an input. Therefore, we also suggest utilizing local features during distillation. Specifically, the global feature is the original feature without a map division. Local features are determined by the division of the global feature. We split the global feature map from each layer in half along both its width and height to create four ( $2\times 2$ ) local feature maps. That is, each local map has size $h/2\times w/2$ , where $h$ and $w$ are the height and width of the global map. As before, local features encoding the attentive angle can be extracted for both teacher and student. Then, the losses considering global and local features for our method are:

[TABLE]

where $Q_{T}$ and $Q_{S}$ are global features of the teacher and student for distillation, and $Q^{k}_{T}$ and $Q^{k}_{S}$ are local features of the teacher and student, respectively, for $k$ -th element of $K$ , where $K$ is the total number of local maps from a map; $K$ = 4. When $\mathcal{L_{A_{\text{global}}}}$ and $\mathcal{L_{A_{\text{local}}}}$ are used together, we applied weights of 0.2 for local and 0.8 for global features to make a balance for learning.
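The division into local maps and the 0.8/0.2 weighting can be sketched as follows. The sketch assumes even height and width, and it averages the $K$ = 4 local losses before weighting; the averaging and the function names are assumptions made for illustration.

```python
def split_into_local_maps(feature_map):
    """Split an (h, w) nested list into four (h/2, w/2) quadrants,
    ordered top-left, top-right, bottom-left, bottom-right."""
    h, w = len(feature_map), len(feature_map[0])
    half_h, half_w = h // 2, w // 2
    return [
        [row[x0:x0 + half_w] for row in feature_map[y0:y0 + half_h]]
        for y0 in (0, half_h)
        for x0 in (0, half_w)
    ]

def combine_global_local(loss_global, local_losses, w_global=0.8, w_local=0.2):
    """Weighted combination of the global loss and the mean of the local losses."""
    return w_global * loss_global + w_local * sum(local_losses) / len(local_losses)

# Hypothetical 4x4 global map with entries 0..15.
global_map = [[float(4 * y + x) for x in range(4)] for y in range(4)]
local_maps = split_into_local_maps(global_map)
```

Each local map would then be passed through the same angular margin computation as the global map before the two losses are combined.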

5 Experiments

In this section, we present experimental validation of the proposed method. We evaluate the proposed method, AMD, with various combinations of teacher and student, which have different architectural styles. We run experiments on four public datasets that have different complexities. We examine the sensitivity with several different hyperparameters ( $\gamma$ and $m$ ) for the proposed distillation and discuss which setting is the best. To demonstrate the detailed contribution, we report the results with various aspects, using classification accuracy as well as activation maps extracted by Grad-CAM [28]. Finally, we investigate performance enhancement by combining previous methods including filtered feature based distillation. Each experiment and its corresponding section are described in Table 1.

5.1 Datasets

CIFAR-10. The CIFAR-10 dataset [29] includes 10 classes with 5000 training images per class and 1000 testing images per class. Each image is an RGB image of size 32 $\times$ 32. We use the 50000 images as the training set and 10000 as the testing set. The experiments on CIFAR-10 help validate the efficacy of our models with less time consumption.

CINIC-10. We extend our experiments to CINIC-10 [30]. CINIC-10 is an augmented extension in the style of CIFAR-10, but the dataset contains 270,000 images, a scale closer to that of ImageNet. The images are split equally into ‘train’, ‘test’, and ‘validate’ sets. The size of the images is 32 $\times$ 32. There are ten classes with 9000 images per class in each set.

Tiny-ImageNet / ImageNet. To extend our experiments to a larger scale dataset with more complexity, we use Tiny-ImageNet [31]. The size of the images for Tiny-ImageNet is 64 $\times$ 64. We pad them to 68 $\times$ 68, randomly crop them to 64 $\times$ 64, and horizontally flip them for augmentation, to account for the complexity of the dataset. The training and testing sets are of size 100k and 10k respectively. The dataset includes 200 classes. For ImageNet [32], the dataset has 1k categories with 1.2M training images. The images are randomly cropped and then resized to 224 $\times$ 224 and horizontally flipped.
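The Tiny-ImageNet augmentation (pad to 68 $\times$ 68, random 64 $\times$ 64 crop, horizontal flip) can be sketched for a single channel as below. Zero padding, symmetric padding of 2 pixels per side, and a flip probability of 0.5 are assumptions, since the text does not specify them; the function name is hypothetical.

```python
import random

def pad_crop_flip(image, pad=2, rng=random):
    """image: (h, w) nested list for one channel. Zero-pad by `pad` pixels on
    every side, take a random h x w crop, and flip it horizontally with
    probability 0.5."""
    h, w = len(image), len(image[0])
    padded_w = w + 2 * pad
    padded = (
        [[0.0] * padded_w for _ in range(pad)]
        + [[0.0] * pad + list(row) + [0.0] * pad for row in image]
        + [[0.0] * padded_w for _ in range(pad)]
    )
    top = rng.randint(0, 2 * pad)    # inclusive bounds, so the crop stays inside
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop

# Hypothetical 64x64 single-channel image, augmented with a seeded generator.
image = [[float(x + 64 * y) for x in range(64)] for y in range(64)]
augmented = pad_crop_flip(image, rng=random.Random(0))
```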

5.2 Settings for experiments

For experiments on CIFAR-10, CINIC-10, and Tiny-ImageNet, we set the batch size to 128 and train for 200 epochs using SGD with momentum 0.9, a weight decay of $1\times 10^{-4}$ , and an initial learning rate $lr$ of 0.1, which is decayed by a factor of 0.2 at epochs 40, 80, 120, and 160. For ImageNet, we use SGD with momentum of 0.9 and a batch size of 256. We train for 100 epochs. The initial learning rate $lr$ is 0.1, decayed by a factor of 0.1 at epochs 30, 60, and 90.
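The two step schedules can be written as a single function of the epoch. The sketch assumes 0-indexed epochs and that each decay takes effect at the start of the listed epoch; the function name is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, factor=0.2, milestones=(40, 80, 120, 160)):
    """Learning rate in effect at a 0-indexed epoch under step decay."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** passed

# Schedule for CIFAR-10, CINIC-10, and Tiny-ImageNet (200 epochs).
small_scale_lrs = [step_lr(epoch) for epoch in range(200)]

# Schedule for ImageNet (100 epochs, decay factor 0.1 at epochs 30, 60, 90).
imagenet_lrs = [step_lr(epoch, factor=0.1, milestones=(30, 60, 90)) for epoch in range(100)]
```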

In experiments, we use the proposed method with WideResNet (WRN) [33] for teacher and student models to evaluate the classification accuracy, which is popularly used for KD [2, 9, 21, 27]. Their network architectures are described in Table 2.

To determine optimal parameters $\lambda_{1}$ and $\lambda_{2}$ for KD, we tested with different values for $\lambda_{1}$ and $\lambda_{2}$ for training based on KD on CIFAR-10 dataset. As shown in Figure 3, when $\lambda_{1}$ is 0.1 and $\lambda_{2}$ is 0.9 ( $\tau$ = 4) with KD, the accuracy of a student (WRN16-1) trained with WRN16-3 as a teacher is the best. If $\lambda_{1}$ is small and $\lambda_{2}$ is large, the distillation effect of KD is increased. Since the accuracy depends on $\lambda_{1}$ and $\lambda_{2}$ , we referred to previous studies [2, 23, 27] to choose the popular parameters for experiments. The parameters of ( $\lambda_{1}$ = 0.1, $\lambda_{2}$ = 0.9, $\tau$ = 4), ( $\lambda_{1}$ = 0.4, $\lambda_{2}$ = 0.6, $\tau$ = 16), ( $\lambda_{1}$ = 0.7, $\lambda_{2}$ = 0.3, $\tau$ = 16), and ( $\lambda_{1}$ = 1.0, $\lambda_{2}$ = 1.0, $\tau$ = 4) are used for KD on CIFAR-10, CINIC-10, Tiny-ImageNet, and ImageNet, respectively.

We perform baseline comparisons with traditional KD [8], attention transfer (AT) [21], relational knowledge distillation (RKD) [34], variational information distillation (VID) [35], similarity-preserving knowledge distillation (SP) [27], correlation congruence for knowledge distillation (CC) [36], contrastive representation distillation (CRD) [37], attentive feature distillation and selection (AFDS) [22], and attention-based feature distillation (AFD) [23] that is a new feature linking method considering similarities between the teacher and student features, including state-of-the-art approaches. Note that, for fair comparison, the distillation methods are performed with traditional KD to see if they enhance standard KD, keeping the same setting as the proposed method. The hyperparameters of the methods follow their respective papers. For the proposed method, the constant parameter $s$ and margin parameter $m$ are 64 and 1.35, respectively. The loss weight $\gamma$ of the proposed method is 5000. We determine the hyperparameters empirically, considering the distillation effects by the capacity of models. A more detailed description of parameters appears in section 5.5. All experiments were repeated five times, and the averaged best accuracy and the standard deviation of performance are reported.

No augmentation method is applied for CIFAR-10 and CINIC-10. For the proposed method, additional techniques, such as using the other hidden layers for generating better distillation effects from teachers or reshaping the dimension size of the feature maps, are not applied. All of our experiments are run on a 3.50 GHz CPU (Intel® Xeon(R) CPU E5-1650 v3), 48 GB memory, and an NVIDIA TITAN Xp (3840 NVIDIA® CUDA® cores and 12 GB memory) graphics card [38].

To obtain the best performance, we adopt early-stopped KD (ESKD) [2] for training teacher and student models, leveraging its effects across the board in improving the efficacy of knowledge distillation. As shown in Figure 4, the early stopped model of a teacher tends to train student models better than Full KD that uses a fully trained teacher.

5.3 Attention-based distillation

In this section, we explore the performance of attention based distillation approaches with different types of combinations for teacher and student. We set four types of combinations for teacher and student that consist of the same or different structure of networks. The four types of combinations are described in Table 3. Since the proposed method is relevant to using attention maps, we implemented various baselines that are state-of-the-art attention based distillation methods, including AT [21], AFDS [22], and AFD [23]. As described in section 2, AT [21] uses activation-based spatial attention maps for transferring from teacher to student. AFDS [22] includes attentive feature distillation and accelerates the transfer-learned model by feature selection. Additional layers are used to calculate a transfer importance predictor used to measure the importance of the source activation maps and enforce a different penalty for training a student. AFD [23] extracts channel and spatial attention maps and identifies similar features between teacher and student, which are used to control the distillation intensities for all possible pairs and compensate for the limitation of learning to transfer (L2T) [40] using manually selected links. We implemented AFDS [22] when the dimension size of features for intermediate layers from the student is the same as the one from the teacher to concentrate on the distillation effects. We use four datasets that have varying degrees of difficulty in a classification problem. These baselines are used in the following experiments as well.

Table 4 presents the accuracy of various knowledge distillation methods for all setups in Table 3 on the CIFAR-10 dataset. The proposed method, AMD (global+local), has the best performing results in all cases. Table 5 describes the CINIC-10 results. In most cases, AMD (global+local) achieves the best results. For experiments on Tiny-ImageNet, as illustrated in Table 6, AMD outperforms previous methods, and AMD (global) shows better results in the (a) and ( $b^{b}$ ) setups. For the ( $c^{b}$ ) and (d) setups, AMD (global+local) provides better results. For experiments on ImageNet, standard KD is not applied to baselines and Full KD is utilized. Teacher and student networks are ResNet34 and ResNet18, respectively. The results of the baselines are taken from prior works [23, 37]. As described in Table 7, AMD (global) outperforms other distillation methods, increasing the top-1 and top-5 accuracy by 1.83% and 1.43% over the results of learning from scratch, respectively.

Compared to KD, AT obtains better performance in most cases across datasets. That is, the attention map helps the teacher to transfer its knowledge. Even though there is a case in which AT shows lower performance than KD in Table 6, AMD outperforms KD in all cases. This verifies that applying the discriminative angular distance metric for knowledge distillation maximizes the attention map’s efficacy in transferring knowledge and complements traditional KD for various combinations of teacher and student. The accuracies of SP with setups (a) and (d), and of AFD with setup (d), are even lower than the accuracy of learning from scratch, while AMD performs better than other methods as shown in Table 6. When the classification problem is harder, AMD (global) can perform better than AMD (global+local) in some cases. When the teacher and student have different channels or architectural styles, AMD (global+local) can generate a better student than AMD (global).

Components of AMD loss function. As described in Equation 7, the angular margin distillation loss function ( $\mathcal{L}_{AM}(Q_{Tp},Q_{Tn},Q_{Sp},Q_{Sn})$ ) includes three components (A, P, N). To verify the performance of each component in AMD loss, we experiment with each component separately. As shown in Figure 5, among all components, (A) provides the strongest contribution. Each component in AMD transfers different knowledge and contributes to improvements in performance. Adding one component to another provides richer information, which leads to better performance. The combination of all the components (AMD) shows a much higher performance. This result indicates that all components (AMD) are critical to distilling the best student model.

In Figure 6 we show $\mathcal{L_{A}}$ vs. accuracy, when using KD, SP, and AMD (global), for WRN16-1 students trained with WRN16-3, WRN28-1, and ResNet44 teachers, on CIFAR-10 testing set. As shown in Figure 6, when the loss value is smaller, the accuracy is higher. Thus, these plots verify that $\mathcal{L_{A}}$ and performance are correlated.

t-SNE visualization and cluster metrics. To measure the clustering performance, we plot t-SNE [41] and calculate the V-Score [42] of outputs from the penultimate layers of KD and the proposed method on CIFAR-10, where V-Score is a clustering metric for which a higher value indicates better clustering. As shown in Figure 7, compared to KD, AMD helps get tighter clusters and better separation between classes, as reflected in the higher V-Score.

5.4 Effect of teacher capacity

To understand the effect of the capacity of the teacher, we implemented various combinations of teacher and student, where the teacher has a different capacity. We use well-known benchmarks for image classification which are WRN [33], ResNet [39], and MobileNetV2 (M.NetV2) [43]. We applied the same settings as in the experiments of the previous section.

The results in classification accuracy for the student models are described in Table 8 across three datasets, trained with attention based and non-attention based methods [8, 21, 27]. The number of trainable parameters is noted in brackets. For all cases, the proposed method, AMD, shows the highest accuracy. When the complexity of the dataset is higher and the depth of the teacher is largely different from that of the student, AMD (global) tends to generate a better student than AMD (global+local). When a larger capacity of students is used, the accuracy observed is higher. This is seen in the results from WRN16-1 and ResNet20 students with WRN16-3 and WRN28-3 teachers on the CINIC-10 dataset. For these combinations, ResNet20 students, having a larger capacity than WRN16-1, generate better results. Furthermore, on CIFAR-10, when a WRN16-3 teacher is used, a WRN28-1 student achieves 87.35 $\%$ for AMD (global+local), whereas a WRN16-1 student achieves 86.36 $\%$ for AMD (global+local). On Tiny-ImageNet, when AMD (global+local) is used, the accuracy of a ResNet20 student is 52.12 $\%$ , which is higher than the accuracy of a WRN16-1 student, which is 49.92 $\%$ .

Compared to KD, in most cases, AT achieves better performance. However, when the classification problem is difficult, such as when using Tiny-ImageNet, and when WRN40-2 teacher and WRN16-1 student are used, both AT and SP show worse performance than KD. When the WRN16-3 teacher and ResNet20 student are used, KD and AT perform worse than the model trained from scratch. The result of AT is even lower than that of KD. So, there are cases where AT and SP cannot complement the performance of the traditional KD. On the other hand, for the proposed method, the results are better than the baselines in all the cases. Interestingly, on CIFAR-10 and CINIC-10, the result of a WRN16-1 student trained by AMD with a WRN28-1 teacher is even better than the result of the teacher. Therefore, we conclude that the proposed method maximizes the attention map’s efficacy of transferring the knowledge and complements traditional KD.

Also, when applying the larger teacher model and the smaller student model, the performance degradation of AMD can occur. For example, on CINIC-10, WRN16-1 student trained with WRN40-1 (0.6M) teacher outperforms the one trained with WRN40-2 (2.3M) teacher. Both AMD and other methods produce some cases with lower performance when a better (usually larger) teacher is used. This is consistent with prior findings [2, 19, 44] that a better teacher does not always guarantee a better student.

Heterogeneous teacher-student. In Table 8, we present the results of the teacher-student combinations from similar architecture styles. Tian et al. [37] found that feature distillation methods such as SP sometimes struggled to find the optimal solution in different architecture styles. In this regard, we implemented heterogeneous teacher-student combinations, where the teacher and student have very different network structures. We use the vgg [45] network to compose heterogeneous combinations.

As described in Table 9, we observe similar findings, showing degraded performance in using SP when a vgg13 teacher and a ResNet20 student are used, while AMD consistently outperforms all baselines we explored. Also, in most cases, the WRN16-8 teacher distills a better student (vgg8) than the WRN28-1 teacher. However, KD and SP show better performance with the WRN28-1 teacher, which corroborates that a better teacher does not always distill a better student.

5.5 Ablations and sensitivity analysis

In this section, we investigate sensitivity for hyperparameters ( $\gamma$ and $m$ ) used for the angular margin based attention distillation.

5.5.1 Effect of angular distillation hyperparameter $\gamma$

The results of a student model (WRN16-1) for AMD (global) trained with teachers (WRN16-3 and WRN28-1) by using various $\gamma$ on CIFAR-10 (the first row) and CINIC-10 (the second row) are depicted in Figure 8 ( $m$ = 1.35). When $\gamma$ is 5000, all results show the best accuracy. For CIFAR-10, when WRN16-3 is used as a teacher, the accuracy of $\gamma$ = 3000 is higher than that of $\gamma$ = 7000. However, for WRN28-1 as a teacher, the accuracy of $\gamma$ = 7000 is higher than that of $\gamma$ = 3000. When $\gamma$ is 1000, the accuracy is lower than KD, implying that it does not complement KD and adversely affects the performance. On the other hand, for CINIC-10, when the WRN16-3 teacher is used, the result of $\gamma$ = 7000 is better than that of $\gamma$ = 3000. But, for the WRN28-1 teacher, $\gamma$ = 3000 is higher than that of $\gamma$ = 7000. Therefore, $\gamma$ values between 3000 and 7000 achieve good performance, while too small or large $\gamma$ values do not help much with improvement. Therefore, setting the proper $\gamma$ value is important for performance. We recommend using $\gamma$ as 5000, which produces the best results across datasets and combinations of teacher and student.

5.5.2 Effect of angular margin $m$

The results of a student model (WRN16-1) for AMD (global) trained with teachers (WRN16-3 and WRN28-1) by various angular margin $m$ on CIFAR-10 (the first row) and CINIC-10 (the second row) are illustrated in Figure 9 ( $\gamma$ = 5000). As described in section 4.2, using the large value of $m$ corresponds to producing more distinct positive features in the attention map and making a large gap between positive and negative features for distillation. When $m$ is 1.35 for the WRN16-3 teacher, the WRN16-1 student shows the best performance of 86.28 $\%$ on CIFAR-10. When $m$ = 1.5 for CINIC-10, the student’s accuracy is 75.13 $\%$ , which is higher than when $m$ = 1.35. When the teacher is WRN28-1, the student produces the best accuracy with $m$ = 1.35 on both datasets. The student model with $m$ = 1.35 performs better than the one with $m$ = 1.1 and 2.0. When the complexity of the dataset is higher, using $m$ (1.5) which is larger than 1.35 can produce a good performance. When $m$ = 1.0 (no additional margin applied to the positive feature) for CIFAR-10 and CINIC-10 with setup ( $b$ ), the results are 85.81 $\%$ and 74.83 $\%$ , which are better than those of 85.31 $\%$ and 74.75 $\%$ from $m$ = 2.0, respectively. This result indicates that it is important to set an appropriate $m$ value for our method. We believe that angular margin plays a key role in determining the gap between positive and negative features. As angular margin increases, the positive features are further emphasized, and in this case of over-emphasis by a much larger $m$ , the performance is worse than that of the smaller $m$ . We recommend using a margin $m$ of around 1.35 ( $m>1.0$ ), which generates the best results in most cases.

5.6 Analysis with activation maps

To analyze results with intermediate layers, we adopt Grad-CAM [28] which uses class-specific gradient information to visualize the coarse localization map of the important regions in the image. In this section, we present the activation maps from intermediate layers and the high level of the layer with various methods. The red region is more crucial for the model prediction than the blue one.

5.6.1 Activation maps for the different levels of layers

The activation maps from intermediate layers with various methods are shown in Figure 10. The proposed method, AMD, shows intuitively similar activated regions to the traditional KD [8] at the low level. However, at the mid level and high level, the proposed method shows higher activations around the region of a target object, which is different from the previous methods [8, 21]. Thus, the proposed method can classify positive and negative areas more discriminatively, compared to the previous methods [8, 21]. The high-level activation maps with various input images are described in Figure 11. The activation from the proposed method is seen to be more centered on the target. The result shows that the proposed method performs better in focusing on the foreground object distinctly with high weight, while being less distracted by the background compared to other methods [8, 21]. With higher weight over regions of interest, the student from the proposed method has a stronger discrimination ability. Therefore, the proposed method guides student models to increase class separability.

5.6.2 Activation maps for global and local distillation of AMD

To investigate the impact of using global and local features for AMD, we illustrate relevant results in Figure 12. When both global and local features are used for distillation, the activated area is located and shaped more similarly to that of the teacher than when using the global feature only. Also, AMD (global+local) focuses more on the foreground object with higher weights than AMD (global). AMD (global+local) guides the student to focus more on the target regions and finds discriminative regions. Thus, using global and local features is better than using global features alone for the proposed method.

5.7 Combinations with existing methods

Even if a model shows good performance in classification, it may have miscalibration problems [46] and may not always obtain improved results from combining with other robust methods. In this section, to evaluate the generalizability of models trained by each method and to explore if the method can complement other methods, we implement experiments with various existing methods. We use the method in various ways to demonstrate how easily it can be combined with any previous learning tasks. We trained students with fine-grained features [47, 48], augmentation methods, and one of the baselines such as SP [27] that is not based on the attention feature based KD. WRN16-1 students were trained with WRN16-3 and WRN28-1 teachers. We examine whether the proposed method can be combined with other techniques and compare the results to baselines.

5.7.1 Fine-grained feature-based distillation

If the features of teacher and student are compatible, it results in a student achieving ‘minor gains’ [47]. To perform better distillation and to overcome the problem of learning minor gains, a technique for generating a fine-grained feature has been used [47, 48]. For distillation with AMD and creating the fine-grained (masked) feature, a binary mask is adopted when the negative feature is created. For example, if the probability of the point for the negative map is higher than 0.5, the point is multiplied by 1, otherwise by 0. Then, compared to non-masking, it boosts the difference between teacher and student, where the difference can be more focused on loss function for training. The results for AMD with or without using masked feature-based distillation are presented in Figure 13. The parameter $\gamma$ for training a student based on AMD without masked features is 5000 for all setups across datasets. When masked features are used for AMD, to generate the best results, $\gamma$ of 3000 is applied to setup (b) on CIFAR-10, setup ( $c^{a}$ ) on CINIC-10, and all setups on Tiny-ImageNet. For CIFAR-10, AMD (global+local) without masked features has the best performing result in most cases. AMD (global+local) with masked features shows the best with setup (d). For CINIC-10, the results of AMD with masked features for setup (d) show the best. For Tiny-ImageNet, in most cases, AMD with masked features performs the best. Therefore, when the complexity of a dataset is high, fine-grained features can help more effectively improve the performance, and the smaller parameter of $\gamma$ , 3000, generates better accuracy. Also, AMD (global+local) with masked features produces better performance than AMD (global) with the one. For setup (d) – different architectures for teacher and student – with/without masked features, AMD (global+local) outperforms AMD (global). 
This could be due to the fact that the teacher’s features differ from the student’s because the two networks have different architectures, resulting in different distributions. So, masked features with both global and local distillation have more influence on setup (d) than on other setups. The difference between AMD (global) and AMD (global+local) with masked features is also more clearly visible on the harder classification problem. If the student’s and teacher’s architectural styles are similar, the student is more likely to achieve plausible results [19].
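The binary masking of the negative map described in this section can be sketched as follows: entries above 0.5 are multiplied by 1 and all others by 0. The function name and the toy values are illustrative assumptions.

```python
def mask_negative_map(q_n, threshold=0.5):
    """Keep entries of the negative map that exceed the threshold and zero the rest."""
    return [[value if value > threshold else 0.0 for value in row] for row in q_n]

# Hypothetical 2x2 negative map.
masked = mask_negative_map([[0.9, 0.5], [0.2, 0.51]])
```

Note that an entry exactly equal to 0.5 is zeroed, following the "higher than 0.5" wording.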

5.7.2 Applying augmentation methods

In this section, we investigate the compatibility of AMD with different types of augmentation methods.

Mixup. Mixup [49] is one of the most commonly used augmentation methods. We demonstrate here that AMD complements Mixup. Mixup’s parameter is set to $\alpha_{\text{Mixup}}=0.2$ . A teacher is trained with the original training set and learns from scratch. A student is trained with Mixup and the teacher model is implemented as a pre-trained model.
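For reference, the standard Mixup blending of one pair of training samples can be sketched as below, with the mixing coefficient drawn from a Beta( $\alpha$ , $\alpha$ ) distribution and $\alpha$ = 0.2 as stated above. The function name, the flattened-list inputs, and the one-hot labels are illustrative assumptions; how the blended samples enter the distillation losses is not shown.

```python
import random

def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two flattened inputs and their one-hot labels with a single
    coefficient lam drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

# Hypothetical two-pixel inputs from two different classes.
mixed_x, mixed_y, lam = mixup_pair(
    [0.0, 1.0], [1.0, 0.0],
    [1.0, 0.0], [0.0, 1.0],
    rng=random.Random(0),
)
```

With a small $\alpha$ such as 0.2, most draws of the coefficient fall close to 0 or 1, so the majority of blended samples stay near one of the two originals.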

As described in Figure 14, with Mixup, most of the methods generate better results. However, KD shows slight degradation when a WRN16-3 teacher is used. This degradation might be related to the labels artificially blended by Mixup. Conventional KD succeeds by transferring concise logit knowledge. However, with Mixup in KD, the knowledge from a teacher is affected by the mixed labels and is no longer concise, which can hurt distillation quality [50]. So, the knowledge for separating different classes can be better encoded by traditional KD (without Mixup) [50]. Even though KD degrades with Mixup, all other baselines and the proposed methods, which transfer features from intermediate layers, show improvement. Thus, the feature based distillation methods help to reduce the negative effects of noisy logits. When a WRN28-1 teacher is used, the performance of the student from AFD is degraded. AFD utilizes the similarity of features for all possible pairs of the teacher and student. For this combination, Mixup produces noisy features, which can cause mismatched pairs for distillation and thereby degrade performance. Compared to the baselines, AMD obtains more gains from Mixup. To study the generalizability and regularization effects of Mixup, we measured expected calibration error (ECE) [46, 51] and negative log likelihood (NLL) [46] for each method. ECE is a metric to measure calibration, representing the reliability of the model [46]. A probabilistic model’s quality can be measured by using NLL [46]. The results of training from scratch with Mixup show a higher ECE and NLL than the results of training without Mixup, as seen in Table 10. However, the methods that include knowledge distillation generate lower ECE and NLL. This implies that knowledge distillation from teacher to student helps produce a better model not only in accuracy but also in reliability.
In both (a) and (b), with Mixup, AMD (global+local) shows robust calibration performance. Therefore, we confirm that an augmentation method such as Mixup gets the benefits from AMD in generating better calibrated performance. As can be seen in Figure 15, WRN16-1 trained from scratch with Mixup produces underconfident predictions [49], compared to KD [8] with Mixup. AMD (global+local) with Mixup achieves the best calibration performance. These results support the advantage of AMD, that it can be easily combined with common augmentation methods to improve the performance in classification with good calibration.
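The two calibration metrics used above can be sketched as follows. ECE is computed here with equal-width confidence bins as the weighted average gap between accuracy and mean confidence per bin, and NLL as the mean negative log probability of the true class. The number of bins (10), the half-open bin boundaries, and the function names are assumptions, since the settings used in the experiments are not stated here.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: top-class probabilities in [0, 1]; correct: 0/1 flags per sample."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 falls in the last bin
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(conf for conf, _ in members) / len(members)
            accuracy = sum(hit for _, hit in members) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

def negative_log_likelihood(true_class_probs):
    """Mean of -log p(true class) over samples; probabilities must be positive."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)

# Hypothetical predictions: two confident and correct, two uncertain with one error.
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
nll = negative_log_likelihood([0.5, 0.5])
```

Lower values of both metrics indicate better calibrated and more reliable predictions.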

CutMix. CutMix [52] is one of the most popular augmentation methods and is a more advanced method than Mixup. We evaluate AMD with CutMix, setting the CutMix parameters by following the previous study [52]. As illustrated in Figure 16, all methods improve with CutMix. Compared to the other baselines, AFD gains less improvement. Both AMD (global) and AMD (global+local) perform better with CutMix, and these results also show that the proposed method can be easily combined with advanced augmentation methods.
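For readers unfamiliar with CutMix, a minimal NumPy sketch of the operation follows. It uses the general recipe of pasting a random box from one image into another and mixing the labels in proportion to the pasted area. The function names, the Beta(α, α) sampling with α = 1, and the (batch, channels, height, width) layout are assumptions for illustration and are not the settings used in these experiments.

```python
import numpy as np

def rand_bbox(height, width, lam, rng):
    """Sample a box whose area is roughly (1 - lam) of the image."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    cy, cx = rng.integers(height), rng.integers(width)   # box center
    y1, y2 = np.clip(cy - cut_h // 2, 0, height), np.clip(cy + cut_h // 2, 0, height)
    x1, x2 = np.clip(cx - cut_w // 2, 0, width), np.clip(cx + cut_w // 2, 0, width)
    return int(y1), int(y2), int(x1), int(x2)

def cutmix(images, labels_onehot, alpha=1.0, rng=None):
    """Paste a box from a shuffled copy of the batch and mix the labels.

    alpha=1.0 is an assumed value for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(images))
    _, _, h, w = images.shape
    y1, y2, x1, x2 = rand_bbox(h, w, lam, rng)
    mixed = images.copy()
    mixed[:, :, y1:y2, x1:x2] = images[perm][:, :, y1:y2, x1:x2]
    # Recompute the mixing weight from the box actually pasted (after clipping).
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_labels = lam_adj * labels_onehot + (1.0 - lam_adj) * labels_onehot[perm]
    return mixed, mixed_labels
```

Unlike Mixup, which blends every pixel of two images, this operation keeps each pixel from exactly one source image, so local features remain unblended.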

MoEx. To test a latent space augmentation method, we adopt MoEx [53], one of the state-of-the-art augmentation techniques, for training with AMD. We use the same parameters as the prior study [53]. We apply MoEx to a layer before stage 3 in the student network (WRN16-1), which achieves the best results with KD.

As shown in Figure 17, most KD-based methods perform better with MoEx than without it, whereas AFD shows degradation. Since AFD transfers knowledge by considering all pairs of features from the teacher and student, MoEx in AFD hinders the pair matching and the transfer of high-quality knowledge. Both AMD (global) and AMD (global+local) outperform the baselines. These results verify that latent space augmentation methods can be combined with the proposed method. Therefore, the proposed method can be used with various augmentation methods to improve performance.

Additionally, we explore the effect of applying MoEx at different layers. As shown in Figure 18, AMD performs best when MoEx is applied to the layer before stage 3 of the student model. KD also performs best when MoEx is applied to a layer before stage 3. This differs from the result of learning from scratch, which performs best when MoEx is applied to a layer before stage 1 [53]. Thus, when latent space augmentation is combined with a KD-based method, including the baselines and the proposed method, the layer at which augmentation is applied has to be chosen carefully. These results also imply that a layer before stage 3 plays a key role in knowledge distillation.
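The moment-exchange operation behind MoEx can be illustrated with a short NumPy sketch. It assumes the variant in which the mean and standard deviation are computed across channels at each spatial position, and in which the loss interpolates between the two samples' targets with a weight `lam`. The function names, `eps`, and `lam=0.9` are assumptions for this example and are not the settings used in these experiments.

```python
import numpy as np

def moex(features, perm, eps=1e-5):
    """Swap per-position channel moments between paired samples.

    features: array of shape (batch, channels, height, width), e.g. the
    activations entering a chosen stage of the student network.
    perm:     index array pairing sample i with sample perm[i].
    """
    mean = features.mean(axis=1, keepdims=True)
    std = np.sqrt(features.var(axis=1, keepdims=True) + eps)
    normalized = (features - mean) / std          # remove own moments
    return normalized * std[perm] + mean[perm]    # inject partner's moments

def moex_loss(loss_own, loss_partner, lam=0.9):
    """Interpolate the losses for the sample's own label and the partner's label."""
    return lam * loss_own + (1.0 - lam) * loss_partner
```

Because the exchange happens on intermediate features, the layer at which `moex` is called is a free design choice, which is the variable studied in the layer comparison above.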

5.7.3 Combination with other distillation methods

To demonstrate how AMD performs with other distillation methods, we adopt SP [27], which is not an attention-based distillation method. The teacher is trained from scratch on the original training set, and SP [27] is applied while the student is being trained. We compare with the baselines, as depicted in Figure 19. In all cases, accuracy increases with SP. Compared to the other attention-based methods, AMD gains more from SP. Therefore, AMD can be enhanced by, and performs well with, other distillation methods such as SP. We additionally analyzed reliability, as reported in Table 11. AMD (global+local) with SP shows the lowest ECE and NLL values, which verifies that AMD with SP can produce a model with higher reliability and better accuracy. Thus, the proposed method can be used with an additional distillation method, and with SP it performs well across different teacher-student combinations while producing well-calibrated results. As illustrated in Figure 20, with SP [27], AT [21] and AFD [23] produce more overconfident predictions than AMD (global+local) with SP [27], which gives the best calibration performance. In conclusion, our empirical findings show that AMD can be combined with other distillation methods such as SP [27] to generate more informative features for distillation from teacher to student.
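As a reminder of what SP adds, the sketch below illustrates a similarity-preserving loss in NumPy: each network's activations are turned into a batch-by-batch similarity matrix, the rows are L2-normalized, and the student is penalized for deviating from the teacher's matrix. The function names and the small constant guarding against division by zero are assumptions for this example. How this term is weighted against the AMD loss is not specified in this section, so no combined objective is shown.

```python
import numpy as np

def similarity_matrix(features):
    """Row-normalized pairwise similarity of the samples in a batch."""
    b = features.shape[0]
    q = features.reshape(b, -1)                   # flatten each sample
    g = q @ q.T                                   # (b, b) similarities
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / np.maximum(norms, 1e-12)

def sp_loss(student_feats, teacher_feats):
    """Mean squared difference between teacher and student similarity matrices."""
    b = student_feats.shape[0]
    diff = similarity_matrix(teacher_feats) - similarity_matrix(student_feats)
    return float((diff ** 2).sum() / (b * b))
```

Since only batch-level similarity matrices are compared, the teacher and student activations may have different channel counts or spatial sizes, which makes this kind of term straightforward to add alongside an attention-based loss.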

6 Conclusion

In this paper, we proposed a new type of distillation loss function, AMD loss, which uses the angular distribution of features. We validated the effectiveness of distillation with this loss across multiple teacher-student architecture combinations for KD in image classification. Furthermore, we confirmed that the proposed method can be combined with previous methods such as fine-grained feature-based distillation, various augmentation methods, and other types of distillation methods.

In future work, we aim to extend the proposed method to explore the distillation effects with different hypersphere feature embedding methods [54, 55]. We also plan to extend AMD to different approaches in image classification, such as vision transformer [56] and MLP-mixer [57], which are not based on convolutional neural networks. In addition, our approach could provide insights for further advancement in other applications such as object detection and semantic segmentation.

Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement No. HR00112290073. Approved for public release; distribution is unlimited.

Bibliography

[1] J. Gou, B. Yu, S. J. Maybank, D. Tao, Knowledge distillation: A survey, International Journal of Computer Vision (IJCV) 129 (6) (2021) 1789–1819.
[2] J. H. Cho, B. Hariharan, On the efficacy of knowledge distillation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 4794–4802.
[3] H. Li, K. Ota, M. Dong, Learning IoT in edge: Deep learning for the internet of things with edge computing, IEEE Network 32 (1) (2018) 96–101.
[4] G. Plastiras, M. Terzi, C. Kyrkou, T. Theocharides, Edge intelligence: Challenges and opportunities of near-sensor machine learning applications, in: Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors, 2018, pp. 1–7.
[5] I. Jang, S. Kim, H. Kim, C.-W. Park, J. H. Park, An experimental study on reinforcement learning on IoT devices with distilled knowledge, in: Proceedings of the International Conference on Information and Communication Technology Convergence, 2020, pp. 869–871.
[6] J. Wu, C. Leng, Y. Wang, Q. Hu, J. Cheng, Quantized convolutional neural networks for mobile devices, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4820–4828.
[7] S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, in: Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[8] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, in: NeurIPS Deep Learning and Representation Learning Workshop, Vol. 2, 2015.
