Pruning-Aware Merging for Efficient Multitask Inference

Xiaoxi He; Dawei Gao; Zimu Zhou; Yongxin Tong; Lothar Thiele

arXiv:1905.09676·cs.LG·June 1, 2021


TL;DR

This paper introduces Pruning-Aware Merging (PAM), a method to merge and prune neural networks for multitask inference on resource-limited devices, significantly reducing computation costs across task combinations.

Contribution

The paper proposes a novel heuristic merging scheme, PAM, that considers future pruning to optimize multitask network efficiency, outperforming existing merging methods.

Findings

01 PAM achieves up to 4.87x less computation than the no-merging baseline.

02 PAM outperforms state-of-the-art merging schemes by up to 2.01x.

03 The method is effective across different datasets and architectures.


Tables

Table 1. Test accuracy and computation cost of all task combinations with LeNet-5 on Fashion-MNIST pruned by P1/P2.

Pruning	Tasks	Accuracy (B1)	Accuracy (B2)	Accuracy (PAM)	FLOPs $\times 10^{6}$ (B1)	FLOPs $\times 10^{6}$ (B2)	FLOPs $\times 10^{6}$ (PAM)
P1	A	95.42%	95.30%	94.67%	28.34	52.58	28.49
	B	96.30%	96.40%	95.70%	28.34	52.58	26.16
	A&B	95.86%	95.85%	95.19%	56.69	52.58	48.68
P2	A	95.82%	95.73%	95.70%	18.64	31.19	18.65
	B	96.46%	96.72%	96.38%	18.64	31.19	18.65
	A&B	96.14%	96.22%	96.04%	37.27	31.19	26.48

Table 2. Test accuracy and computation cost of all task combinations with VGG-16 on CelebA pruned by P1/P2.

Pruning	Tasks	Accuracy (B1)	Accuracy (B2)	Accuracy (PAM)	FLOPs $\times 10^{6}$ (B1)	FLOPs $\times 10^{6}$ (B2)	FLOPs $\times 10^{6}$ (PAM)
P1	A	89.45%	89.09%	89.60%	4.52	7.3	4.48
	B	87.81%	87.69%	88.00%	4.32	7.3	4.49
	A&B	88.63%	88.39%	88.80%	8.85	7.3	4.70
P2	A	90.34%	90.27%	90.36%	153.13	243.20	155.82
	B	88.84%	88.74%	88.76%	152.65	243.20	155.84
	A&B	89.59%	89.51%	89.56%	305.78	243.20	156.74

Table 3. Test accuracy and computation cost of all task combinations with VGG-16 on LFW pruned by P1/P2.

Pruning	Tasks	Accuracy (B1)	Accuracy (B2)	Accuracy (PAM)	FLOPs $\times 10^{6}$ (B1)	FLOPs $\times 10^{6}$ (B2)	FLOPs $\times 10^{6}$ (PAM)
P1	A	89.77%	89.49%	89.87%	7.96	12.66	7.94
	B	82.81%	82.82%	82.14%	7.91	12.66	7.95
	C	83.20%	82.68%	83.30%	7.94	12.66	7.94
	D	85.74%	86.45%	86.03%	7.58	12.66	7.93
	E	87.10%	86.52%	86.90%	7.87	12.66	7.93
	A&B	86.29%	86.16%	86.00%	15.87	12.66	7.98
	A&C	86.48%	86.09%	86.59%	15.90	12.66	7.97
	A&D	87.75%	87.97%	87.95%	15.54	12.66	7.97
	A&E	88.44%	88.01%	88.39%	15.84	12.66	7.96
	B&C	83.00%	82.75%	82.72%	15.85	12.66	7.98
	B&D	84.28%	84.64%	84.09%	15.49	12.66	7.97
	B&E	84.95%	84.67%	84.52%	15.79	12.66	7.97
	C&D	84.47%	84.57%	84.66%	15.52	12.66	7.96
	C&E	85.15%	84.60%	85.10%	15.81	12.66	7.96
	D&E	86.42%	86.49%	86.47%	15.45	12.66	7.96
	A&B&C	85.26%	85.00%	85.10%	23.81	12.66	8.01
	A&B&D	86.11%	86.25%	86.01%	23.45	12.66	8.01
	A&B&E	86.56%	86.28%	86.30%	23.75	12.66	8.00
	A&C&D	86.24%	86.21%	86.40%	23.48	12.66	8.00
	A&C&E	86.69%	86.23%	86.69%	23.78	12.66	7.99
	A&D&E	87.54%	87.49%	87.60%	23.42	12.66	7.99
	B&C&D	83.92%	83.98%	83.82%	23.43	12.66	8.01
	B&C&E	84.37%	84.01%	84.11%	23.73	12.66	8.00
	B&D&E	85.22%	85.26%	85.02%	23.37	12.66	8.00
	C&D&E	85.35%	85.22%	85.41%	23.39	12.66	7.99
	A&B&C&D	85.38%	85.36%	85.34%	31.39	12.66	8.04
	A&B&C&E	85.72%	85.38%	85.55%	31.69	12.66	8.03
	A&B&D&E	86.35%	86.32%	86.23%	31.33	12.66	8.03
	A&C&D&E	86.45%	86.29%	86.53%	31.36	12.66	8.02
	B&C&D&E	84.71%	84.62%	84.59%	31.31	12.66	8.03
	A&B&C&D&E	85.72%	85.59%	85.65%	39.27	12.66	8.06
P2	A	89.57%	89.38%	89.24%	22.91	36.33	23.28
	B	81.96%	83.15%	83.39%	23.16	36.33	23.29
	C	82.96%	81.61%	82.10%	22.93	36.33	23.28
	D	85.04%	85.12%	85.29%	21.16	36.33	23.27
	E	86.43%	85.81%	85.57%	21.29	36.33	23.27
	A&B	85.76%	86.27%	86.31%	46.07	36.33	23.32
	A&C	86.26%	85.50%	85.67%	45.84	36.33	23.31
	A&D	87.31%	87.25%	87.27%	44.07	36.33	23.30
	A&E	88.00%	87.60%	87.41%	44.20	36.33	23.30
	B&C	82.46%	82.38%	82.75%	46.08	36.33	23.31
	B&D	83.50%	84.14%	84.34%	44.31	36.33	23.31
	B&E	84.19%	84.48%	84.48%	44.45	36.33	23.31
	C&D	84.00%	83.37%	83.69%	44.09	36.33	23.30
	C&E	84.69%	83.71%	83.83%	44.22	36.33	23.30
	D&E	85.74%	84.47%	85.43%	42.45	36.33	23.29
	A&B&C	84.83%	84.71%	84.91%	68.99	36.33	23.34
	A&B&D	85.52%	85.88%	85.97%	67.22	36.33	23.34
	A&B&E	85.99%	86.11%	86.07%	67.36	36.33	23.34
	A&C&D	85.86%	85.37%	85.54%	67.00	36.33	23.33
	A&C&E	86.32%	85.60%	85.64%	67.13	36.33	23.32
	A&D&E	87.01%	86.77%	86.70%	65.36	36.33	23.32
	B&C&D	83.32%	83.29%	83.59%	67.24	36.33	23.34
	B&C&E	83.78%	83.52%	83.69%	67.37	36.33	23.33
	B&D&E	84.48%	84.69%	84.75%	65.60	36.33	23.33
	C&D&E	84.81%	84.18%	84.32%	65.38	36.33	23.32
	A&B&C&D	84.88%	84.82%	85.00%	90.15	36.33	23.37
	A&B&C&E	85.23%	84.99%	85.07%	90.28	36.33	23.36
	A&B&D&E	85.75%	85.87%	85.87%	88.51	36.33	23.36
	A&C&D&E	86.00%	85.48%	85.55%	88.29	36.33	23.35
	B&C&D&E	84.10%	83.92%	84.09%	88.53	36.33	23.36
	A&B&C&D&E	85.19%	85.01%	85.12%	111.44	36.33	23.39

Table 4. Test accuracy and computation cost with ResNet-18/ResNet-34 on CelebA pruned by P1.

Model	Tasks	Accuracy (B1)	Accuracy (B2)	Accuracy (PAM)	FLOPs $\times 10^{6}$ (B1)	FLOPs $\times 10^{6}$ (B2)	FLOPs $\times 10^{6}$ (PAM)
ResNet-18	A	89.83%	89.30%	89.93%	5.72	8.84	4.78
	B	88.25%	88.20%	88.36%	5.72	8.84	4.83
	A&B	89.04%	88.75%	89.15%	11.44	8.84	6.40
ResNet-34	A	89.99%	89.70%	90.05%	8.43	12.11	6.94
	B	88.44%	88.98%	88.42%	8.43	12.11	6.94
	A&B	89.22%	89.34%	89.24%	16.86	12.11	10.29

Table 5. Decomposition of $\mathcal{R}_{A,B}(\mathbf{L}_{i}^{A,B})$ .

$$\begin{aligned}
\mathcal{R}_{A,B}(\mathbf{L}_{i}^{A,B})
&=\sum_{L_{i,j}^{A,B}\in\mathbf{L}_{i}^{A,B}}H(L_{i,j}^{A,B})-H(\widetilde{\mathbf{L}}_{i}^{A},\widetilde{\mathbf{L}}_{i}^{B})+H(\widetilde{\mathbf{L}}_{i}^{A},\widetilde{\mathbf{L}}_{i}^{B}\mid\mathbf{Y}^{A},\mathbf{Y}^{B}) && (27)\\
&=\sum_{L_{i,j}^{A}\in\widetilde{\mathbf{L}}_{i}^{A}}H(L_{i,j}^{A})-I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A},\mathbf{Y}^{B})+\sum_{L_{i,j}^{B}\in\widetilde{\mathbf{L}}_{i}^{B}}H(L_{i,j}^{B})-I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{A},\mathbf{Y}^{B})\\
&\quad+I(\widetilde{\mathbf{L}}_{i}^{A};\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{A},\mathbf{Y}^{B})-\sum_{L_{i,j}\in\mathbf{L}_{i}^{\prime A,B}}H(L_{i,j}) && (28)\\
&=\sum_{L_{i,j}^{A}\in\widetilde{\mathbf{L}}_{i}^{A}}H(L_{i,j}^{A})-I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})-I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{B}\mid\mathbf{Y}^{A})+\sum_{L_{i,j}^{B}\in\widetilde{\mathbf{L}}_{i}^{B}}H(L_{i,j}^{B})-I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})\\
&\quad-I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{A}\mid\mathbf{Y}^{B})+I(\widetilde{\mathbf{L}}_{i}^{A};\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{A},\mathbf{Y}^{B})-\sum_{L_{i,j}\in\mathbf{L}_{i}^{\prime A,B}}H(L_{i,j}) && (29)\\
&=\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})+\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})-I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{B}\mid\mathbf{Y}^{A})\\
&\quad-I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{A}\mid\mathbf{Y}^{B})+I(\widetilde{\mathbf{L}}_{i}^{A};\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{A},\mathbf{Y}^{B})-\sum_{L_{i,j}\in\mathbf{L}_{i}^{\prime A,B}}H(L_{i,j}) && (30)
\end{aligned}$$

Table 6. Test accuracy and computation cost of pre-trained single-task networks.

Model/Dataset	Task	Accuracy	FLOPs ( $\times 10^{6}$ )
LeNet-5/Fashion-MNIST	A	96.05%	106.42
LeNet-5/Fashion-MNIST	B	96.37%	106.42
VGG-16/CelebA	A	90.28%	3112.20
VGG-16/CelebA	B	89.03%	3112.20
VGG-16/LFW	A	90.23%	3110.12
	B	84.15%	3110.12
	C	85.03%	3110.12
	D	86.62%	3110.12
	E	87.44%	3110.12
ResNet-18/CelebA	A	90.56%	994.00
ResNet-18/CelebA	B	88.91%	994.00
ResNet-34/CelebA	A	90.42%	1115.06
ResNet-34/CelebA	B	88.70%	1115.06

Equations

$\text{minimise}\ \sum_{i=1}^{N_{A}+1}\big(\mathcal{R}_{A}(\mathbf{L}_{i}^{A})-\xi_{i}\cdot I(\mathbf{L}_{i}^{A};\mathbf{Y}^{A})\big)$

$\sum_{i=1}^{N_{A}+1}\big(\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})-\tilde{\xi}_{i}^{A}\cdot I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})\big),$

$\sum_{i=1}^{N_{B}+1}\big(\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})-\tilde{\xi}_{i}^{B}\cdot I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})\big),$

$\sum_{i=1}^{N_{A}}\big(\mathcal{R}_{A,B}(\mathbf{L}_{i}^{A,B})-\xi_{i}^{A}\cdot I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})-\xi_{i}^{B}\cdot I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})\big)$

$I(\mathbf{L}_{i}^{\prime A};\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A};\mathbf{Y}^{B})=0$

$I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}\mid\mathbf{L}_{i}^{\prime A},\mathbf{Y}^{B})=0$

$I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{B}\mid\mathbf{L}_{i}^{\prime B},\mathbf{Y}^{A})=0$

$\text{minimise}\ \sum_{i=1}^{N_{A}+1}\big(\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})-\tilde{\xi}_{i}^{A}\cdot I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})\big),$

$\text{minimise}\ \sum_{i=1}^{N_{B}+1}\big(\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})-\tilde{\xi}_{i}^{B}\cdot I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})\big)$

$A_{i}$

$B_{i}$

$M_{i}$

$I(A_{i};B_{i};\mathbf{Y}^{\tau_{A}};\mathbf{Y}^{\tau_{B}})=0$

$I(M_{i};\mathbf{Y}^{\tau_{A}}\mid A_{i},\mathbf{Y}^{\tau_{B}})=0$

$I(M_{i};\mathbf{Y}^{\tau_{B}}\mid B_{i},\mathbf{Y}^{\tau_{A}})=0$

$\text{For every } t\in\upsilon:\ \text{minimise}\ \sum_{i=1}^{N+1}\big(\mathcal{R}_{t}(\mathbf{L}_{i}^{t})-\tilde{\xi}_{i}^{t}\cdot I(\mathbf{L}_{i}^{t};\mathbf{Y}^{t})\big)$

$\mathbf{L}_{i}^{\prime\tau}=\bigcap_{t\in\tau}\mathbf{L}_{i}^{t}\setminus\bigcup_{t\notin\tau}\mathbf{L}_{i}^{t}$

$I(\mathbf{L}_{i}^{A};\mathbf{Y}^{A})=I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{A})+I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}\mid\mathbf{L}_{i}^{\prime A},\mathbf{Y}^{B})+I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A};\mathbf{Y}^{B}\mid\mathbf{L}_{i}^{\prime A})$

$\mathcal{R}_{B}(\mathbf{L}_{i}^{B})=\sum_{L_{i,j}^{B}\in\mathbf{L}_{i}^{B}}H(L_{i,j}^{B})-I(\mathbf{L}_{i}^{B};\mathbf{Y}^{B})=\sum_{L_{i,j}^{B}\in\mathbf{L}_{i}^{B}}H(L_{i,j}^{B})-H(\mathbf{L}_{i}^{B})+H(\mathbf{L}_{i}^{B}\mid\mathbf{Y}^{B})$

$H(\mathbf{L}_{i}^{B}\mid\mathbf{Y}^{B})$

$H(\mathbf{L}_{i}^{\prime A,B}\mid\mathbf{Y}^{A},\mathbf{Y}^{B})+H(\mathbf{L}_{i}^{\prime B}\mid\mathbf{L}_{i}^{\prime A,B},\mathbf{Y}^{B})$

$I(\mathbf{L}_{i}^{A};\mathbf{Y}^{A})=I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{A})+I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A};\mathbf{Y}^{B}\mid\mathbf{L}_{i}^{\prime A})$

$\mathcal{R}_{A,B}(\mathbf{L}^{A,B})-\big(\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})+\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})\big)$

$I(\mathbf{L}_{i}^{\prime A};\mathbf{L}_{i}^{\prime B})$

$0\leq I(\mathbf{L}_{i}^{\prime A};\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A};\mathbf{Y}^{B})\leq\min\{I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{B}),I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A})\}$

$I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}\mid\mathbf{L}_{i}^{\prime A},\mathbf{Y}^{B})$


Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Anomaly Detection Techniques and Applications

Methods: Pruning

Full text

Pruning-Aware Merging for Efficient Multitask Inference

Xiaoxi He, ETH Zürich, Zürich, Switzerland

Dawei Gao, SKLSDE & BDBC, Beihang University, Beijing, China

Zimu Zhou, Singapore Management University, Singapore, Singapore

Yongxin Tong, SKLSDE & BDBC, Beihang University, Beijing, China

Lothar Thiele, ETH Zürich, Zürich, Switzerland

Abstract.

Many mobile applications demand selective execution of multiple correlated deep learning inference tasks on resource-constrained platforms. Given a set of deep neural networks, each pre-trained for a single task, it is desired that executing arbitrary combinations of tasks yields minimal computation cost. Pruning each network separately yields suboptimal computation cost due to task relatedness. A promising remedy is to merge the networks into a multitask network to eliminate redundancy across tasks before network pruning. However, pruning a multitask network combined by existing network merging schemes cannot minimise the computation cost of every task combination because they do not consider such a future pruning. To this end, we theoretically identify the conditions such that pruning a multitask network minimises the computation of all task combinations. On this basis, we propose Pruning-Aware Merging (PAM), a heuristic network merging scheme to construct a multitask network that approximates these conditions. The merged network is then ready to be further pruned by existing network pruning methods. Evaluations with different pruning schemes, datasets, and network architectures show that PAM achieves up to $4.87\times$ less computation against the baseline without network merging, and up to $2.01\times$ less computation against the baseline with a state-of-the-art network merging scheme.

Deep Learning; Network Pruning; Multitask Inference

Published in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21), August 14–18, 2021, Singapore, Singapore. CCS concepts: Computing methodologies → Neural networks.

1. Introduction

Deep neural networks that can run locally on resource-constrained devices hold potential for various emerging applications such as autonomous drones and social robots (Fang et al., 2018; Lee and Nirjon, 2020). These applications often simultaneously perform a set of correlated inference tasks based on the current context to deliver accurate and adaptive services. Although deep neural networks pre-trained for individual tasks are readily available (LeCun et al., 1998; Simonyan and Zisserman, 2014), deploying multiple such networks easily overwhelms the resource budget.

To support these applications on low-resource platforms, we investigate efficient multitask inference. Given a set of correlated inference tasks and deep neural networks (each network pre-trained for an individual task), we aim to minimise the computation cost when any subset of tasks is performed at inference time.

One naive solution to efficient multitask inference is to prune each network for individual tasks separately. A deep neural network is typically over-parameterised (Denil et al., 2013). Network pruning (Dai et al., 2018; Deng et al., 2020; Gao et al., 2020; Molchanov et al., 2019; Sze et al., 2017) can radically reduce the number of operations within a network without accuracy loss in the inference task. This solution, however, is only optimal if a single task is executed at a time. When multiple correlated tasks run concurrently, it cannot save computation cost by exploiting task relatedness and sharing intermediate results among networks.

A more promising solution framework is “merge & prune”, which merges multiple networks into a multitask network before pruning it (Fig. 1). A few pioneering studies (Chou et al., 2018; He et al., 2018) have explored network merging schemes to eliminate the redundancy among multiple networks pre-trained for correlated tasks. However, pruning a multitask network merged via these schemes can only minimise computation cost when all tasks are executed at the same time.

In this paper, we propose Pruning-Aware Merging (PAM), a new network merging scheme for efficient multitask inference. By applying existing network pruning methods on the multitask network merged by PAM, the computation cost when performing any subset of tasks can be reduced. Extensive experiments show that “PAM & Prune” consistently achieves solid advantages over the state-of-the-art network merging scheme across tasks, datasets, network architectures and pruning methods.

Our main contributions and results are as follows:

• We theoretically show that pruning a multitask network may not simultaneously minimise the computation cost of all task combinations in the network. We then identify conditions under which minimising the computation of all task combinations via network pruning becomes feasible. To the best of our knowledge, this is the first explicit analysis of the applicability of network pruning in multitask networks.

• We propose Pruning-Aware Merging (PAM), a heuristic network merging scheme to construct a multitask network that approximately meets the conditions in our analysis and enables “merge & prune” for efficient multitask inference.

• We evaluate PAM with various pruning schemes, datasets and architectures. PAM achieves up to $4.87\times$ less computation cost against the baseline without network merging, and up to $2.01\times$ less computation cost against the baseline with the state-of-the-art network merging scheme (He et al., 2018).
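As a quick sanity check, the two headline ratios appear to correspond to individual rows of the result tables: the all-five-task row of Table 3 under P1 and the task-B row of Table 1 under P1. The sketch below recomputes them. It assumes that B1 is the baseline without merging and B2 the baseline with the state-of-the-art merging scheme, and that these are the rows behind the reported maxima; neither is stated explicitly in this excerpt.

```python
# FLOPs (x10^6) copied from Tables 1 and 3. Which rows produce the
# headline numbers is an assumption inferred from the tabulated values.
b1_lfw_all_tasks = 39.27    # Table 3, P1, A&B&C&D&E, column B1
pam_lfw_all_tasks = 8.06    # Table 3, P1, A&B&C&D&E, column PAM
b2_fmnist_task_b = 52.58    # Table 1, P1, task B, column B2
pam_fmnist_task_b = 26.16   # Table 1, P1, task B, column PAM

ratio_vs_b1 = b1_lfw_all_tasks / pam_lfw_all_tasks
ratio_vs_b2 = b2_fmnist_task_b / pam_fmnist_task_b
print(f"{ratio_vs_b1:.2f}x vs. B1, {ratio_vs_b2:.2f}x vs. B2")
```

Rounded to two decimals, the ratios come out as 4.87 and 2.01, matching the figures quoted above.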

In the rest of this paper, we review related work in Sec. 2, introduce our problem statement in Sec. 3, theoretical analysis in Sec. 4 and our solution in Sec. 5. We present the evaluations of our methods in Sec. 6 and finally conclude in Sec. 7.

2. Related Work

Our work is related to the following categories of research.

Network Pruning. Network pruning reduces the number of operations in a deep neural network without loss in accuracy (Deng et al., 2020; Sze et al., 2017). Unstructured pruning removes unimportant weights (Dong et al., 2017; Gao et al., 2020; Guo et al., 2016). However, customised hardware (Han et al., 2016) is required to exploit such irregular sparse connections for acceleration. Structured pruning enforces sparsity at the granularity of channels/filters/neurons (Dai et al., 2018; Li et al., 2017; Molchanov et al., 2019; Wen et al., 2016). The resulting sparsity is suitable for acceleration on general-purpose processors. Prior pruning proposals implicitly assume a single task in the given network. We identify the challenges of pruning a multitask network and propose a network merging scheme such that pruning the merged multitask network minimises the computation cost of all task combinations in the network.

Multitask Networks. A multitask network can be either constructed from scratch via Multi-Task Learning (MTL) or merged from multiple networks pre-trained for individual tasks. MTL jointly trains multiple tasks for better generalisation (Zhang and Yang, 2017), while we focus on the computation cost of running multiple tasks at inference time. Network merging schemes (Chou et al., 2018; He et al., 2018) aim to construct a compact multitask network from networks pre-trained for individual tasks. Both MTZ (He et al., 2018) and NeuralMerger (Chou et al., 2018) enforce weight sharing among networks to reduce their overall storage. In contrast, we account for the computation cost of a multitask network. Although constructing a multitask network using these schemes (Chou et al., 2018; He et al., 2018) and pruning it via existing pruning methods can reduce the computation when all tasks are concurrently executed, they cannot minimise the computation cost for every combination of tasks.

3. Problem Statement

We define and analyse our problem based on the graph representation of neural networks. The graph representation reflects the computation cost of neural networks (see below) and facilitates an information-theoretic understanding of network pruning (see Sec. 4). Fig. 2 shows important notations used throughout this paper. For ease of illustration, we explain our analysis using two tasks. Extensions to more than two tasks are in Sec. 5.4.

3.1. Graph Representation of Neural Networks

Task. Consider three sets of random variables $\mathbf{X}\in\mathcal{X}$ , $\mathbf{Y}^{A}\in\mathcal{Y}^{A}$ , and $\mathbf{Y}^{B}\in\mathcal{Y}^{B}$ . Task $A$ outputs $\widehat{\mathbf{Y}}^{A}$ , a prediction of $\mathbf{Y}^{A}$ , by learning the conditional distribution $\text{Pr}\{\mathbf{Y}^{A}=\mathbf{y}|\mathbf{X}=\mathbf{x}\}$ . Task $B$ outputs $\widehat{\mathbf{Y}}^{B}$ , a prediction of $\mathbf{Y}^{B}$ , by learning $\text{Pr}\{\mathbf{Y}^{B}=\mathbf{y}|\mathbf{X}=\mathbf{x}\}$ .

Single-Task Network. For task $A$ , a neural network without feedback loops can be represented by an acyclic directed graph $G_{A}=\{V^{A},E^{A}\}$ . Each vertex represents a neuron. There is an edge between two vertices if the two neurons are connected. The vertex set $V^{A}$ can be categorised into three types of nodes: source, internal and sink nodes. $\text{deg}^{-}(v)$ / $\text{deg}^{+}(v)$ is the indegree/outdegree of a vertex $v$ .

• Source node set $\mathbf{v}^{A}_{X}=\{v|v\in V^{A}\wedge\text{deg}^{-}(v)=0\}$ represents the input layer. Each source node represents an input neuron and outputs a random variable $X_{i}\in\mathbf{X}$ . The output of the input layer is the input random variable set $\mathbf{X}$ .

• Internal nodes $v_{i}\in\{v|v\in V^{A}\wedge\ \text{deg}^{-}(v)\neq 0\wedge\text{deg}^{+}(v)\neq 0\}$ represent the hidden neurons. The output of each hidden neuron is generated by calculating the weighted sum of its inputs and then applying an activation function.

• Sink node set $\mathbf{v}^{A}_{Y}=\{v|v\in V^{A}\wedge\text{deg}^{+}(v)=0\}$ represents the output layer. Each sink node represents an output neuron and the output is calculated in the same way as the hidden neurons. The output of the output layer is the prediction $\widehat{\mathbf{Y}}^{A}$ of ground-truth labels $\mathbf{Y}^{A}$ .

We organise the hidden neurons $v_{i}$ of $G_{A}$ into layers $\mathbf{v}_{i}^{A}$ by Algorithm 1. $N^{+}(\mathbf{v})$ denotes the out-neighbours of the vertex set $\mathbf{v}$ . Algorithm 1 can organise any acyclic single-task network into layers, and the layer outputs satisfy the Markov property.
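The node categories above can be illustrated on a toy graph. The sketch below classifies vertices by in/out-degree exactly as in the set definitions, and then groups them into layers. Algorithm 1 is not reproduced in this excerpt, so the layering rule used here (depth equals the longest path from any source) is an assumption for illustration, not the authors' algorithm; the function names and the toy edge list are likewise hypothetical.

```python
from collections import defaultdict

def classify_nodes(edges):
    """Split the vertices of a DAG into source, internal and sink sets by degree."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    vertices = set()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        vertices.update((u, v))
    sources = {v for v in vertices if indeg[v] == 0}   # deg^-(v) = 0
    sinks = {v for v in vertices if outdeg[v] == 0}    # deg^+(v) = 0
    internal = vertices - sources - sinks
    return sources, internal, sinks

def layer_by_depth(edges):
    """Assumed layering rule: a vertex's layer is its longest distance from a source."""
    succ, indeg, vertices = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        vertices.update((u, v))
    depth = {v: 0 for v in vertices if indeg[v] == 0}
    remaining = dict(indeg)          # unprocessed predecessors per vertex
    frontier = list(depth)
    while frontier:                  # Kahn-style topological sweep
        u = frontier.pop()
        for v in succ[u]:
            depth[v] = max(depth.get(v, 0), depth[u] + 1)
            remaining[v] -= 1
            if remaining[v] == 0:
                frontier.append(v)
    layers = defaultdict(set)
    for v, d in depth.items():
        layers[d].add(v)
    return [layers[d] for d in sorted(layers)]

# Toy network: two inputs, two hidden neurons, one output (fully connected).
edges = [("x1", "h1"), ("x1", "h2"), ("x2", "h1"), ("x2", "h2"),
         ("h1", "y"), ("h2", "y")]
sources, internal, sinks = classify_nodes(edges)
layers = layer_by_depth(edges)
```

For this toy graph the three layers coincide with the input, hidden and output neurons; graphs with skip connections may be layered differently by the actual Algorithm 1.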

Multitask Network. For tasks $A$ and $B$ , a multitask network without feedback loops can be represented by an acyclic directed graph $G_{A,B}$ . All paths from the input neurons to the output neurons for task $A$ form a subgraph $\widetilde{G}_{A}$ (see Fig. 2(c)), which is in effect the same as a single-task network. When only task $A$ is performed, only $\widetilde{G}_{A}$ is activated. Subgraph $\widetilde{G}_{B}$ is defined similarly. We also organise vertices of $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ into layers with Algorithm 1. Layer outputs of $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ are denoted as $\widetilde{\mathbf{L}}_{i}^{A}$ and $\widetilde{\mathbf{L}}_{i}^{B}$ . Suppose $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ have $N_{A}$ and $N_{B}$ hidden layers, respectively. We assume $N_{A}\leq N_{B}$ without loss of generality. Then the $i$ -th layer output of $G_{A,B}$ is defined as $\mathbf{L}_{i}^{A,B}=\widetilde{\mathbf{L}}_{i}^{A}\cup\widetilde{\mathbf{L}}_{i}^{B}$ with $i=0,\cdots,N_{A}$ . As shown in Fig. 2(b), $\mathbf{L}_{i}^{A,B}$ consists of three sets of neurons: $\mathbf{L}^{\prime A}_{i}$ , $\mathbf{L}^{\prime B}_{i}$ and $\mathbf{L}^{\prime A,B}_{i}$ .

Remarks. The above definitions have two benefits. (i) The computation cost of a neural network is an increasing function of the size of the graph, i.e., the number of edges plus vertices. Reducing the computation cost of the network thus becomes a matter of removing edges or vertices in the graph. (ii) For a single-task network with $N_{A}$ hidden layers, its layer outputs form a Markov chain: $\mathbf{Y}^{A}\to\mathbf{L}_{0}^{A}\to\cdots\to\mathbf{L}_{N_{A}+1}^{A}$ . All layer outputs $\mathbf{L}_{i}^{A,B}$ in a multitask network also form a Markov chain. The Markov property allows an information-theoretic analysis of neural networks (Saxe et al., 2018; Tishby and Zaslavsky, 2015).

3.2. Problem Definition

Given two single-task networks $G_{A}$ and $G_{B}$ pre-trained for tasks $A$ and $B$ , we aim to construct a multitask network $G_{A,B}$ such that pruning on $G_{A,B}$ can minimise the number of vertices and edges in $G_{A,B}$ , $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ while preserving inference accuracy on $A$ and $B$ . To ensure minimal computation of any subset of tasks, we need to minimise the number of vertices and edges in any subgraph. For two tasks, $G_{A,B}$ corresponds to running tasks $A$ and $B$ concurrently; $\widetilde{G}_{A}$ ( $\widetilde{G}_{B}$ ) corresponds to running task $A$ ( $B$ ) only. Next, we show the difficulty of optimising all subgraphs simultaneously.

4. Theoretical Understanding

This section presents a theoretical understanding of the challenges of pruning a multitask network and identifies conditions under which minimising the computation cost of all task combinations via pruning becomes feasible (Theorem 3). Proofs are in Appendix A.

4.1. Why Pruning a Single-task Network Works

Pruning a single-task network reduces the computation cost of a neural network while retaining task inference accuracy by suppressing redundancy in the network (Deng et al., 2020; Sze et al., 2017). From an information-theoretic perspective (Saxe et al., 2018; Tishby and Zaslavsky, 2015), since the layer outputs form a Markov chain, the inference accuracy for a given task $A$ is positively correlated with the task-related information transmitted through the network at each layer, measured by $I(\mathbf{L}^{A}_{i};\mathbf{Y}^{A})$ . All other information is irrelevant to the task. Hence the redundancy within a single-task network can be defined as follows.

Definition 1.

For the $i$ -th layer in the single-task neural network $G_{A}$ , the redundancy of the layer is defined as $\mathcal{R}_{A}(\mathbf{L}^{A}_{i})=\sum_{L^{A}_{i,j}\in\mathbf{L}^{A}_{i}}H(L^{A}_{i,j})-I(\mathbf{L}^{A}_{i};\mathbf{Y}^{A})$ .

$\sum_{L^{A}_{i,j}\in\mathbf{L}^{A}_{i}}H(L^{A}_{i,j})$ measures the maximal amount of information the layer can express. $I(\mathbf{L}^{A}_{i};\mathbf{Y}^{A})$ measures the amount of task $A$ related information in the layer output. By definition, $\mathcal{R}_{A}(\mathbf{L}^{A}_{i})\geq 0$ .
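To make the definition concrete, the sketch below evaluates $\mathcal{R}_{A}$ on a toy layer with discrete outputs. It uses plug-in (frequency-count) estimates of entropy and mutual information, which is an assumption made here for illustration: this excerpt does not say how these quantities are estimated in practice, and real activations would first need to be discretised. All function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def layer_redundancy(layer_outputs, labels):
    """R(L) = sum_j H(L_j) - I(L;Y) for discrete layer outputs.

    layer_outputs: list of tuples, one tuple of neuron values per sample.
    labels: list of task labels, one per sample.
    """
    per_neuron = zip(*layer_outputs)  # transpose: one sequence per neuron
    capacity = sum(entropy(list(neuron)) for neuron in per_neuron)
    return capacity - mutual_information(layer_outputs, labels)

# Toy layer: neuron 0 copies the label, neuron 1 is an independent fair bit.
labels = [0, 0, 1, 1]
layer = [(0, 0), (0, 1), (1, 0), (1, 1)]
redundancy = layer_redundancy(layer, labels)
```

Here the layer can express two bits but only one bit is task-related, so the redundancy is one bit: the second neuron is exactly what a pruning method should remove.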

Remarks. $\sum_{L^{A}_{i,j}\in\mathbf{L}^{A}_{i}}H(L^{A}_{i,j})$ is positively correlated to the number of vertices and incoming edges of the $i$ -th layer. Therefore, in a well trained network where $I(\mathbf{L}^{A}_{i};\mathbf{Y}^{A})$ can no longer increase, the computation cost can be minimised by reducing $\mathcal{R}_{A}(\mathbf{L}^{A}_{i})$ .

Accordingly, pruning a single-task network can be formalised as an optimisation problem

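$$\text{minimise}\quad\sum_{i=1}^{N_{A}+1}\Big(\mathcal{R}_{A}(\mathbf{L}^{A}_{i})-\xi_{i}\cdot I(\mathbf{L}^{A}_{i};\mathbf{Y}^{A})\Big)\tag{1}$$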

where $\xi_{i}>0$ controls the trade-off between inference accuracy and computation cost.

Remarks. Existing pruning methods implicitly assume a single-task network. That is, they are all designed to solve optimisation problem (1), even though the concrete strategies vary. We now show the problems that occur when these pruning methods are applied to a multitask network.

4.2. Why Pruning a Multitask Network Fails

As mentioned in Sec. 3.2, we aim to minimise the computation cost of any subset of tasks, which is a multi-objective optimisation problem. As we will show below, existing network pruning methods are unable to handle these objectives simultaneously.

We first define redundancy when performing two tasks at the same time, analogously to Definition 1.

Definition 2.

For a multitask network $G_{A,B}$ , the redundancy of its $i$ -th layer is $\mathcal{R}_{A,B}(\mathbf{L}^{A,B}_{i})=\sum_{L^{A,B}_{i,j}\in\mathbf{L}^{A,B}_{i}}H(L^{A,B}_{i,j})-I(\mathbf{L}^{A,B}_{i};\mathbf{Y}^{A},\mathbf{Y}^{B})$ .

Following the above definitions of redundancy, our objective in Sec. 3.2 is equivalent to minimising the redundancy in $G_{A,B}$ as well as in its two subgraphs $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ , which leads to the following three-objective optimisation (still, we assume $N_{A}\leq N_{B}$ w.l.o.g.):

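$$\begin{aligned}\text{minimise}\quad&\sum_{i=1}^{N_{A}+1}\Big(\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})-\tilde{\xi}^{A}_{i}\cdot I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})\Big),\\&\sum_{i=1}^{N_{B}+1}\Big(\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})-\tilde{\xi}^{B}_{i}\cdot I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})\Big),\\&\sum_{i=1}^{N_{A}}\Big(\mathcal{R}_{A,B}(\mathbf{L}_{i}^{A,B})-\xi^{A}_{i}\cdot I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})-\xi^{B}_{i}\cdot I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})\Big)\end{aligned}\tag{2}$$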

Reducing $\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})$ , $\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})$ and $\mathcal{R}_{A,B}(\mathbf{L}_{i}^{A,B})$ decreases the number of vertices and edges in $\widetilde{G}_{A}$ , $\widetilde{G}_{B}$ and $G_{A,B}$ , respectively. $\xi^{A}_{i},\xi^{B}_{i},\tilde{\xi}^{A}_{i},\tilde{\xi}^{B}_{i}>0$ are parameters to control the trade-off between computation cost and inference accuracy, as well as to balance task $A$ and $B$ .

To solve optimisation problem (2) with prior network pruning methods, we observe two problems.

Problem 1: The first two objectives in (2) may conflict. This is because reducing $\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})$ may decrease $I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})$ (proofs in Appendix A.1). In other words, when pruning subgraph $\widetilde{G}_{B}$ , it is possible that some information related to task A is removed from the shared vertices between $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ . Hence $I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})$ decreases and the inference accuracy of task $A$ deteriorates.

Problem 2: It is unclear how to minimise the third objective in (2). As mentioned in Sec. 4.1, most pruning methods are designed with a single-task network in mind. It is unknown how to apply them to a multitask network $G_{A,B}$ with the architecture in Fig. 2(a).

4.3. When Pruning a Multitask Network Works

The two problems in Sec. 4.2 show that not all multitask networks can be pruned for efficient multitask inference. However, a multitask network can be effectively pruned if it meets the conditions stated by the following theorem.

Theorem 3.

If $\forall\,1\leq i\leq N_{A}$ , the conditions below are satisfied:

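$$\begin{aligned}I(\mathbf{L}_{i}^{\prime A};\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A};\mathbf{Y}^{B})&=0,\\I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}\mid\mathbf{L}_{i}^{\prime A},\mathbf{Y}^{B})&=0,\\I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{B}\mid\mathbf{L}_{i}^{\prime B},\mathbf{Y}^{A})&=0,\end{aligned}\tag{3}$$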

where $I(\mathbf{L}_{i}^{\prime A};\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A};\mathbf{Y}^{B})$ is the co-information (Bell, 2003), then the three-objective optimisation problem (2) can be reduced to two non-conflicting optimisation problems that can be solved independently:

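$$\begin{aligned}\text{minimise}\quad&\sum_{i=1}^{N_{A}+1}\Big(\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})-\tilde{\xi}^{A}_{i}\cdot I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})\Big),\\\text{minimise}\quad&\sum_{i=1}^{N_{B}+1}\Big(\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})-\tilde{\xi}^{B}_{i}\cdot I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})\Big)\end{aligned}\tag{4}$$

Each of the two optimisation problems in (4) is in effect a single-task pruning problem like optimisation problem (1), which can be solved effectively by prior pruning proposals.

Remarks. Theorem 3 provides important guidelines for designing the network merging scheme for our problem in Sec. 3.2. Specifically, if $G_{A}$ and $G_{B}$ can be merged into a multitask network $G_{A,B}$ such that conditions (3) are satisfied, we can simply apply existing network pruning on the two subgraphs $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ to minimise the computation cost when performing any subset of tasks.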
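The first condition of Theorem 3 involves a four-way co-information. As a hedged illustration of what this quantity measures, the sketch below computes co-information for discrete samples via the inclusion-exclusion sum over joint entropies, under which $I(X;Y;Z)=I(X;Y)-I(X;Y\mid Z)$. This sign convention and the plug-in entropy estimate are assumptions made here; the excerpt only cites Bell (2003) for the definition, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def co_information(*variables):
    """Co-information of n discrete variables given as equal-length sample lists.

    Uses I(X1;...;Xn) = -sum over non-empty subsets T of (-1)^|T| * H(T).
    For n = 2 this reduces to the ordinary mutual information.
    """
    total = 0.0
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            joint = list(zip(*subset))           # joint outcomes of the subset
            total -= (-1) ** size * entropy(joint)
    return total

# Three copies of one fair bit share exactly one bit of information.
bit = [0, 1]
shared = co_information(bit, bit, bit)

# Three mutually independent fair bits share nothing.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 1, 1, 0, 0, 1, 1]
z = [0, 1, 0, 1, 0, 1, 0, 1]
independent = co_information(x, y, z)
```

Under this convention, a vanishing co-information between the task-specific neuron groups and the two label sets means the groups carry no information that is common to all four; in general the quantity can also be negative for other variable configurations.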

5. Pruning-Aware Merging

Based on the above analysis, we propose Pruning-Aware Merging (PAM), a novel network merging scheme that constructs a multitask network from pre-trained single task networks. PAM approximately meets the conditions in Theorem 3 such that the merged multitask network can be effectively pruned for efficient multitask inference.

5.1. PAM Workflow

Given two single-task networks $G_{A}$ and $G_{B}$ pre-trained for tasks $A$ and $B$ ($N_{A}\leq N_{B}$), PAM constructs a multitask network $G_{A,B}$ with the steps below (see Fig. 3).

(1) Assign $\mathbf{L}^{A,B}_{0}=\mathbf{X}$, as $G_{A,B}$, $G_{A}$ and $G_{B}$ use the same inputs.

(2) For $i=1,\cdots,N_{A}$, regroup the neurons from $\mathbf{L}_{i}^{A}$ and $\mathbf{L}_{i}^{B}$ into $\mathbf{L}^{\prime A}_{i}$, $\mathbf{L}^{\prime B}_{i}$ and $\mathbf{L}^{\prime A,B}_{i}$ by the regrouping algorithm in Sec. 5.2.

(3) Take over the output layer for task $A$: $\widetilde{\mathbf{L}}^{A}_{N_{A}+1}=\mathbf{L}^{A}_{N_{A}+1}$. For $i=N_{A}+1,\cdots,N_{B}+1$, take over the remaining layers from $G_{B}$: $\widetilde{\mathbf{L}}^{B}_{i}=\mathbf{L}^{B}_{i}$.

(4) Reconnect the neurons as in Fig. 3. If a connection exists before merging, it preserves its original weight. Otherwise it is initialised to zero.

(5) Finetune $G_{A,B}$ on $A$ and $B$ to learn the newly added connections. For the shared connections $\mathbf{L}_{i-1}^{\prime A,B}\to\mathbf{L}_{i}^{\prime A,B}$, the gradients are first calculated separately on $A$ and $B$, and then averaged before the weight update.
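Steps (4) and (5) can be illustrated with a minimal NumPy sketch for one fully connected layer. It is not the authors' implementation. The function names, the group labels and the neuron ordering are hypothetical. The connectivity rule (a neuron feeds a neuron of the next layer iff the target's task set is a subset of the source's task set) is taken from the multi-task extension in Sec. 5.4 and is assumed to match Fig. 3 for two tasks. Summing the two task gradients on non-shared connections is likewise an assumption.

```python
import numpy as np

# Hypothetical group labels: the set of tasks that a regrouped neuron serves.
A, B, AB = frozenset("A"), frozenset("B"), frozenset("AB")


def merge_layer(w_a, w_b, prev_groups, cur_groups):
    """Return (weights, mask) for one merged layer.

    w_a, w_b: (out, in) weight matrices of layer i in G_A and G_B.
    prev_groups, cur_groups: group label of every neuron in the merged
    layers i-1 and i, ordered as [neurons of G_A, neurons of G_B].
    """
    out_a, in_a = w_a.shape
    out_b, in_b = w_b.shape
    weights = np.zeros((out_a + out_b, in_a + in_b))
    weights[:out_a, :in_a] = w_a  # connections that existed in G_A keep their weight
    weights[out_a:, in_a:] = w_b  # connections that existed in G_B keep their weight
    # Assumed rule: neuron k of layer i-1 feeds neuron j of layer i
    # iff group(j) is a subset of group(k). New connections start at zero.
    mask = np.array([[cur <= prev for prev in prev_groups] for cur in cur_groups])
    return weights * mask, mask


def shared_gradient(grad_a, grad_b, prev_groups, cur_groups):
    """Average the task gradients on shared-to-shared connections only."""
    shared = np.array([[cur == AB and prev == AB for prev in prev_groups]
                       for cur in cur_groups])
    # Elsewhere at most one task contributes, so the sum is used (assumption).
    return np.where(shared, 0.5 * (grad_a + grad_b), grad_a + grad_b)
```

The returned mask would be reapplied after every weight update so that disallowed connections stay at zero during finetuning.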

Now the multitask network $G_{A,B}$ is ready to be pruned. From Theorem 3, we can apply network pruning on the two subgraphs $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ independently and achieve a minimal computation cost for all combinations of tasks. However, since we only approximate the conditions in (3), pruning $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ is not perfectly independent in practice. Hence we prune $\widetilde{G}_{A}$ and $\widetilde{G}_{B}$ in an alternating manner to balance between tasks $A$ and $B$.
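The alternating schedule can be pictured as a round-robin loop over tasks. The sketch below is only an illustration: `prune_step` is a hypothetical callback standing in for one step of an existing pruning method on the subgraph of a task, and the stopping rule (a task leaves the rotation once its callback reports that nothing more can be pruned) is an assumption, not a detail given in the text.

```python
from itertools import cycle


def alternate_pruning(tasks, prune_step, max_steps=1000):
    """Prune the per-task subgraphs in round-robin order.

    prune_step(task) performs one pruning step on the subgraph of `task`
    and returns True if something was pruned, False otherwise.
    Returns the sequence of tasks on which a step was actually applied.
    """
    active = list(tasks)
    order = []
    for task in cycle(tasks):
        if not active or len(order) >= max_steps:
            break
        if task not in active:
            continue  # this task has already finished pruning
        if prune_step(task):
            order.append(task)
        else:
            active.remove(task)
    return order
```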

5.2. Regrouping Algorithm

The core of PAM is the regrouping algorithm in the second step in Sec. 5.1. It regroups the neurons from $\mathbf{L}_{i}^{A}$ and $\mathbf{L}_{i}^{B}$ into three sets: $\mathbf{L}^{\prime A}_{i}$ , $\mathbf{L}^{\prime B}_{i}$ and $\mathbf{L}^{\prime A,B}_{i}$ , such that the conditions (3) in Theorem 3 are satisfied. However, it is computation-intensive to estimate the co-information and conditional mutual information in (3) precisely. We rely on the following theorem to approximate the conditions.

Theorem 1.

The conditions in (3) can be achieved by minimising $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{B})$ , $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A})$ , and maximising $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{A})$ , $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{B})$ .

Remarks. $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{B})$ and $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A})$ describe the “misplaced” information, i.e., the information that is useful for one task but contained in neurons that are not connected to the outputs of this task. Such information is therefore redundant and needs to be minimised. $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{A})$ and $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{B})$ measure the “relevant” information, i.e., the information useful for one task and contained in neurons connected to this task. Note that this information may not be simply maximised, because it includes the information that is useful for both tasks. Achieving the conditions in (3) requires simultaneously minimising the “misplaced” information and maximising the “relevant” information. The proof of Theorem 1 is in Sec. A.3.

Based on Theorem 1, we propose an algorithm to regroup the neurons such that conditions (3) are approximately met. It constructs the largest possible sets $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$ from all the neurons in $\mathbf{L}_{i}^{A}$ and $\mathbf{L}_{i}^{B}$ while $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{B})$ and $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A})$ remain close to zero, such that $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{A})$ and $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{B})$ are approximately maximised. To estimate $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{B})$ and $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A})$, we use a Kullback–Leibler-based mutual information upper bound estimator from (Kolchinsky and Tracey, 2017).

Algorithm 2 illustrates the pseudocode to regroup the neurons such that the conditions in Theorem 3 are approximately met. Central in Algorithm 2 is a greedy search in Lines 5-8 and 10-13. In Lines 5-8, we search for the largest possible set of neurons $\mathbf{L}^{\prime A}_{i}$ while $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ remains approximately zero (smaller than a pre-defined threshold $\alpha$), such that $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{A})$ is approximately maximised. Similarly, in Lines 10-13, we approximately maximise $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{B})$ while keeping $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$ close to zero. According to Theorem 1, the conditions in Theorem 3 are approximately met.
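The greedy search described above can be sketched as follows. This is a simplified reading of the prose, not a transcription of Algorithm 2. The names `greedy_exclusive_set`, `regroup`, `mi_about_a` and `mi_about_b` are hypothetical, the estimators are passed in as plain callables rather than implementing the estimator of (Kolchinsky and Tracey, 2017), and building the set for task $A$ first and the set for task $B$ from the remaining neurons is an assumption.

```python
def greedy_exclusive_set(candidates, misplaced_mi, alpha):
    """Greedily grow a task-exclusive neuron set.

    candidates: neuron ids available for this set.
    misplaced_mi: callable mapping a list of neuron ids to an estimate of
        the information these neurons carry about the *other* task.
    alpha: threshold; the set grows only while the estimate stays below it.
    """
    selected, remaining = [], list(candidates)
    while remaining:
        # Pick the neuron whose addition keeps the misplaced information lowest.
        best = min(remaining, key=lambda c: misplaced_mi(selected + [c]))
        if misplaced_mi(selected + [best]) >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


def regroup(neurons, mi_about_b, mi_about_a, alpha):
    """Split neurons into (exclusive to A, exclusive to B, shared)."""
    exclusive_a = greedy_exclusive_set(neurons, mi_about_b, alpha)
    rest = [n for n in neurons if n not in exclusive_a]
    exclusive_b = greedy_exclusive_set(rest, mi_about_a, alpha)
    shared = [n for n in rest if n not in exclusive_b]
    return exclusive_a, exclusive_b, shared
```

Neurons that carry non-negligible information about both tasks end up in neither exclusive set and are therefore treated as shared.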

Practical Issue: How to Estimate Mutual Information. We use a Kullback–Leibler-based mutual information upper bound estimator from (Kolchinsky and Tracey, 2017) to estimate the upper bounds of $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ and $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$. Since the estimated upper bounds are approximate, it is impractical to require them to be exactly zero. Hence, we use a threshold parameter $\alpha$ to keep $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ and $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$ close to zero.

Practical Issue: How to Tune Threshold $\alpha$. The parameter $\alpha$ affects the performance of “PAM & prune”. A larger $\alpha$ results in more neurons in $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$ and fewer shared neurons in $\mathbf{L}^{\prime A,B}_{i}$. In this case, the multitask network after “PAM & prune” performs worse in terms of efficiency when both tasks are executed concurrently, but better when only one task is executed (similar to “baseline 1 & prune”). Conversely, a smaller $\alpha$ results in more shared neurons. In this case, the multitask network after “PAM & prune” performs worse when only one task is executed, but better when both tasks are executed concurrently (similar to “baseline 2 & prune”).

The parameter $\alpha$ can be empirically tuned as follows:

(1) Execute Algorithm 2 with a small $\alpha$.

(2) Increase the value of $\alpha$ slightly and rerun Algorithm 2. Since Lines 5-8 and 10-13 are greedy searches, the results for the smaller $\alpha$ in Step 1 (i.e., the already constructed neuron sets $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$) can be reused, instead of starting with empty sets as in Lines 4 and 9.

(3) Iterate Step 2 until a satisfactory balance among task combinations is reached. In each iteration of Step 2, we can reuse the neuron sets $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$ from the last iteration.
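One way to see why the earlier results can be reused: in a deterministic greedy search, $\alpha$ only decides where the search stops, not the order in which neurons are added. Under that assumption, the greedy order can be recorded once and the set for any $\alpha$ read off as a prefix. The sketch below illustrates this idea. It is an alternative view of the reuse rather than the procedure of Algorithm 2, and `greedy_path`, `set_for_alpha` and `misplaced_mi` are hypothetical names.

```python
def greedy_path(candidates, misplaced_mi):
    """Greedy order of all candidates and the estimate after each addition.

    misplaced_mi: callable mapping a list of neuron ids to an estimate of
        the information these neurons carry about the other task.
    """
    selected, trace, remaining = [], [], list(candidates)
    while remaining:
        best = min(remaining, key=lambda c: misplaced_mi(selected + [c]))
        selected.append(best)
        trace.append(misplaced_mi(selected))
        remaining.remove(best)
    return selected, trace


def set_for_alpha(path, trace, alpha):
    """Neurons kept for a threshold: the prefix before the first crossing."""
    for k, value in enumerate(trace):
        if value >= alpha:
            return path[:k]
    return list(path)
```

Sweeping $\alpha$ then costs one pass over the recorded trace per candidate value, and the sets for increasing $\alpha$ are nested by construction.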

The impact of $\alpha$ is shown in Appendix C.

5.3. Extensions to ResNets

In order to support merging Residual Networks (He et al., 2016), PAM needs to be slightly modified. As illustrated in Fig. 4, the regrouping of the last layer in each residual block happens not directly after the weighted summation, but after the superposition with the shortcut connection and just before the vector is passed as input to the first layer in the next block. This input vector of the first layer in each block is also regrouped using Algorithm 2 and then pruned at a later stage. This special treatment for the last layer in each residual block is consistent with ResNet-compatible pruning methods such as (Molchanov et al., 2019), which can also prune the block outputs just before they are fed into the first layer in the next block.

5.4. Extension to Three or More Tasks

When there are $K\geq 3$ tasks, we define the set of all tasks as $\upsilon=\{t_{1},\cdots,t_{K}\}$. The merged multitask network can be divided into subgraphs $\widetilde{G}_{\tau}$, where $\tau\subseteq\upsilon$ with $\tau\neq\emptyset$ is a nonempty subset of tasks. Each vertex in $\widetilde{G}_{\tau}$ has paths to all the outputs $\widehat{\mathbf{Y}}^{t}$ with $t\in\tau$. When a task combination (i.e., a subset of tasks) $\tau$ is executed, only subgraph $\widetilde{G}_{\tau}$ is activated. Layers in $\widetilde{G}_{\tau}$ are denoted as $\widetilde{\mathbf{L}}_{i}^{\tau}$. The output layer for task combination $\tau$ is denoted as $\widehat{\mathbf{Y}}^{\tau}=\bigcup_{t\in\tau}\widehat{\mathbf{Y}}^{t}$, which is the prediction of ground-truth labels $\mathbf{Y}^{\tau}=\bigcup_{t\in\tau}\mathbf{Y}^{t}$.

Extension of Theorem 3. For any pair of non-overlapping nonempty subsets of tasks $\tau_{A}$ and $\tau_{B}$ ($\tau_{A}\cap\tau_{B}=\emptyset$), define:

[TABLE]

Then Theorem 3 is extended into:

Theorem 2.

If for all $i=1,\cdots,N$ with $N=\min_{t\in\upsilon}N_{t}$, and for any pair of non-overlapping nonempty subsets of tasks $\tau_{A}$ and $\tau_{B}$, the following conditions are satisfied:

[TABLE]

then the computation cost of executing all task combinations can be minimised by the following $K$ non-conflicting optimisation problems that can be solved independently:

[TABLE]

Theorem 2 can be proven by recursively applying Theorem 3.

Extension of PAM. The neuron sets $\mathbf{L}_{i}^{\prime A}$ , $\mathbf{L}_{i}^{\prime B}$ and $\mathbf{L}_{i}^{\prime A,B}$ are extended to:

[TABLE]

Note that neurons in $\mathbf{L}_{i}^{\prime\tau}$ are activated iff any task $t\in\tau$ is executed. Algorithm 2 is accordingly extended to Algorithm 3. In the reconnection step (step 4) of the PAM workflow in Sec. 5.1, we connect $\mathbf{L}_{i-1}^{\prime\tau_{1}}\to\mathbf{L}_{i}^{\prime\tau_{2}}$ iff $\tau_{2}\subseteq\tau_{1}$.

It is worth mentioning that when tasks are highly related, the numbers of neurons in $\mathbf{L}_{i}^{\tau}$ with $1<|\tau|<K$ can be extremely small (as in our experiment on the LFW dataset in Appendix B). Therefore we can simplify Algorithm 3 by fixing $n=1$ and skipping the remaining loops. Every layer in the multitask network merged by the simplified PAM contains only neuron sets $\mathbf{L}_{i}^{t}$ with $t\in\upsilon$ and one shared neuron set $\mathbf{L}_{i}^{\upsilon}$. Shared neurons in $\mathbf{L}_{i}^{\upsilon}$ are always activated, while non-shared neurons in $\mathbf{L}_{i}^{t}$ are activated iff task $t$ is executed.
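The bookkeeping for $K$ tasks can be made concrete with a short sketch. It enumerates the neuron groups (one per nonempty task subset), applies the connection rule $\tau_{2}\subseteq\tau_{1}$ and the activation rule stated above, and lists the groups kept by the simplified PAM. All function names are hypothetical and the task identifiers are placeholders.

```python
from itertools import combinations


def task_groups(tasks):
    """All nonempty subsets of the task set, one neuron group per subset."""
    tasks = list(tasks)
    return [frozenset(c)
            for r in range(1, len(tasks) + 1)
            for c in combinations(tasks, r)]


def is_connected(tau_prev, tau_cur):
    """Group tau_prev in layer i-1 feeds group tau_cur in layer i."""
    return tau_cur <= tau_prev


def active_groups(groups, executed):
    """Groups that are activated when the tasks in `executed` run."""
    executed = frozenset(executed)
    return [g for g in groups if g & executed]


def simplified_groups(tasks):
    """Groups kept by the simplified PAM: one per task plus one shared."""
    tasks = list(tasks)
    return [frozenset([t]) for t in tasks] + [frozenset(tasks)]
```

With these two rules, every group that feeds an active group is itself active, so executing any task subset never requires a neuron that is switched off.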

6. Experiments

We compare different network merging schemes on whether lower computation is achieved when performing any subset of tasks.

6.1. Experiment Settings

Baselines for Network Merging. We compare PAM with two merging schemes.

• Baseline 1. It simply skips network merging in the “merge & prune” framework. Therefore, no multitask network is constructed. As mentioned in Sec. 1, this scheme optimises the pruning of single-task networks.

• Baseline 2. Pre-trained single-task networks are merged as a multitask network by MTZ (He et al., 2018), a state-of-the-art network merging scheme. Applying MTZ in “merge & prune” can minimise the computation cost of a multitask network when all tasks are executed.

Methods for Network Pruning. Since we aim to compare different network merging schemes in the “merge & prune” framework, we apply the same network pruning method on the neural network(s) constructed by different merging schemes. To show that PAM works with different pruning methods, we choose two state-of-the-art structured network pruning methods: one (Dai et al., 2018) uses information theory based metrics (denoted as P1), and the other (Molchanov et al., 2019) uses sensitivity based metrics (denoted as P2).

The pruning methods are applied to the neural network(s) constructed by different merging schemes as follows. For Baseline 1, each single-task network is pruned independently. For the multitask network constructed with Baseline 2 and PAM, we prune every subgraph for each individual task in an alternating manner (e.g., task $A\to B\to C\to A\to B\to\cdots$) in order to balance between tasks. However, only P2 is originally designed to prune a ResNet. Hence we only experiment on ResNets with P2.

Datasets and Single-Task Networks. We define tasks from three datasets: Fashion-MNIST (Xiao et al., 2017), CelebA (Liu et al., 2015), and LFW (Huang et al., 2012). Fashion-MNIST and CelebA each contain two tasks. LFW contains five tasks. We use LeNet-5 (LeCun et al., 1998) as pre-trained single-task networks for tasks derived from Fashion-MNIST, and VGG-16 (Simonyan and Zisserman, 2014) for tasks from CelebA and LFW. We also use ResNet-18 and ResNet-34 (He et al., 2016) as pre-trained single-task networks for CelebA. See Appendix B for more details of dataset setup and the inference accuracy and FLOPs of the pre-trained single-task networks.

Evaluation Metrics. For a given set of tasks, we aim to minimise the computation cost of all task combinations. To assess computation cost independent of hardware, we use the number of floating point operations (FLOPs) as the metric. For fair comparison, the network(s) constructed by different merging schemes are pruned while preserving almost the same inference accuracy. To quantify the performance advantage of PAM over baselines over all task combinations, we adopt the following two single-valued criteria:

• Average Gain. This metric measures the averaged computation cost reduction of “PAM & prune” over “baseline & prune” across all task combinations. For example, given two tasks $A$ and $B$, there are three task combinations: $A$, $B$ and $A\&B$. When executing these task combinations, the FLOPs of the network after “PAM & prune” are $c_{A}^{P}$, $c_{B}^{P}$ and $c_{A,B}^{P}$, respectively. After “baseline 1 & prune”, the FLOPs are $c_{A}^{B1}$, $c_{B}^{B1}$ and $c_{A,B}^{B1}$, respectively. The average gain over baseline 1 is calculated as $\frac{1}{3}(c_{A}^{B1}/c_{A}^{P}+c_{B}^{B1}/c_{B}^{P}+c_{A,B}^{B1}/c_{A,B}^{P})$.

• Peak Gain. This metric measures the maximal computation cost reduction across all task combinations. Using the same example and notations as above, the peak gain over baseline 1 is calculated as $\max\{c_{A}^{B1}/c_{A}^{P},c_{B}^{B1}/c_{B}^{P},c_{A,B}^{B1}/c_{A,B}^{P}\}$.
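Both criteria are simple functions of the per-combination FLOPs ratios, as the following sketch shows. The function name `gains` and the dictionary layout are hypothetical, and the FLOPs values in the usage example are made up for illustration. They are not results from the paper.

```python
def gains(baseline_flops, pam_flops):
    """Return (average gain, peak gain) of PAM over a baseline.

    Both arguments map a task combination (e.g. "A", "B", "A,B") to the
    FLOPs needed to execute it after pruning.
    """
    ratios = [baseline_flops[combo] / pam_flops[combo] for combo in pam_flops]
    return sum(ratios) / len(ratios), max(ratios)


# Illustrative numbers only (in units of 10^6 FLOPs).
baseline = {"A": 4.0, "B": 6.0, "A,B": 10.0}
pam = {"A": 4.0, "B": 3.0, "A,B": 5.0}
average_gain, peak_gain = gains(baseline, pam)  # ratios are 1.0, 2.0 and 2.0
```

Here the average gain is $5/3$ and the peak gain is $2$: the average rewards consistent savings over all combinations, while the peak reports the single best case.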

All experiments are implemented with TensorFlow and conducted on a workstation with an Nvidia RTX 2080 Ti GPU.

6.2. Main Experiment Results

Overall Performance Gain. Fig. 5 shows the average and peak gains of PAM over the two baselines tested with different models (LeNet-5, VGG-16, ResNet-18, ResNet-34), datasets (Fashion-MNIST, CelebA, LFW), and pruning methods (P1, P2). The detailed FLOPs and inference accuracy on task merging (Fashion-MNIST and CelebA) are listed in Table 1, Table 2, Table 3 and Table 4.

Compared with baseline 1, PAM achieves $1.07\times$ to $1.64\times$ average gain and $1.16\times$ to $4.87\times$ peak gain. Compared with baseline 2, PAM achieves $1.51\times$ to $1.69\times$ average gain and $1.56\times$ to $2.01\times$ peak gain. In general, PAM has a significant performance advantage over both baselines across datasets and network architectures.

Effectiveness of PAM. From Fig. 5, the performance gain of PAM varies across baselines and datasets. Such variations in average and peak gains are influenced by how many neurons are shared and how many networks are merged. Fig. 6 shows how many neurons (kernels) are shared after “PAM & prune” on LeNet-5 and VGG-16.

• The more neurons shared, the higher gain PAM has over baseline 1. “Baseline 1 & prune” can effectively reduce the computation cost when only one task is performed. However, when many neurons can be shared (see Fig. 6(b), (c), (e), and (f)), baseline 1 is sub-optimal when multiple tasks are executed simultaneously, as it is unable to reduce computation by sharing neurons. This is why PAM outperforms baseline 1 more on CelebA and LFW.

• The fewer neurons shared, the higher gain PAM has over baseline 2. “Baseline 2 & prune” can effectively reduce the computation cost via neuron sharing when all tasks are performed simultaneously. However, when only a few neurons can be shared (see Fig. 6(a) and (d)), the multitask network merged by baseline 2 cannot shut down the unnecessary neurons when not all tasks are executed, and hence yields sub-optimal computation cost. This is why PAM outperforms baseline 2 more on Fashion-MNIST.

• The more networks merged, the higher gain PAM has over both baselines. As the number of single-task networks (tasks) increases, “PAM & prune” can either share more neurons and yield lower computation than “baseline 1 & prune”, or shut down more unnecessary neurons and yield lower computation than “baseline 2 & prune”. Therefore the performance gain of PAM over baseline 1 on LFW is significantly higher than on CelebA. This is also the reason why the performance gain of PAM over baseline 2 on LFW is not much lower than on CelebA, although on LFW we have the highest degree of sharing.

Takeaways. Although the performance of PAM varies across tasks, it achieves consistently solid advantages over both baselines. We may conclude that it is always preferable to use PAM for efficient multitask inference, regardless of the amount of shareable neurons, of the probability of executing each task combination, of the network architecture, or of the pruning method used after merging.

6.3. Ablation Study

This subsection presents experiments to further understand the effectiveness of PAM.

6.3.1. Impact of Task Relatedness

This study aims to show the impact of task relatedness on the performance gain PAM can achieve. The number of neurons that can be shared among pre-trained networks is related to the relatedness among tasks. An effective network merging scheme should enforce increasing numbers of shared neurons between tasks with the increase of task relatedness.

Settings. We consider the 73 labels in LFW as 73 binary classification tasks, and measure the relatedness between each task pair by $I(\mathbf{Y}^{A};\mathbf{Y}^{B})$ . We then pick four pairs of tasks with $I(\mathbf{Y}^{A};\mathbf{Y}^{B})\approx 0$ , $0.1$ , $0.2$ and $0.5$ bits, train four pairs of single-task VGG-16’s on them, and construct four multitask networks using PAM.
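For binary labels, $I(\mathbf{Y}^{A};\mathbf{Y}^{B})$ in bits can be estimated directly from label co-occurrence frequencies. The sketch below uses a plug-in (empirical frequency) estimator. The text does not state which estimator was used for this measurement, so this choice and the name `label_mutual_information` are assumptions.

```python
import numpy as np


def label_mutual_information(y_a, y_b):
    """Plug-in estimate of I(Y^A; Y^B) in bits for two discrete label vectors."""
    y_a = np.asarray(y_a)
    y_b = np.asarray(y_b)
    mi = 0.0
    for a in np.unique(y_a):
        for b in np.unique(y_b):
            p_ab = np.mean((y_a == a) & (y_b == b))  # joint frequency
            if p_ab > 0:
                p_a = np.mean(y_a == a)
                p_b = np.mean(y_b == b)
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return float(mi)
```

Two identical balanced binary labels give 1 bit, and two independent ones give 0 bits, which brackets the values of roughly 0 to 0.5 bits used for the four task pairs.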

Results. Fig. 7a plots the number of shared neurons in layer f7 of these four multitask networks with different tuning threshold $\alpha$. The multitask networks for task pairs with higher correlation always share more neurons. Hence, PAM can share an increasing number of neurons between tasks with the increase of task relatedness.

6.3.2. Case Study: Task Inclusion

This study aims to validate the effectiveness of PAM in an extreme yet common case of task relatedness where task $B$ is a sub-task of task $A$. Ideally, when the mutual information is precisely estimated and the true largest sets of task-exclusive neurons are selected, PAM should effectively pick out only task-$A$-exclusive neurons.

Settings. We pick 30 labels in LFW as task $A$ and 15 of them as task $B$ . Hence task $A$ includes task $B$ . We train two single-task VGG-16’s on these two tasks separately and then merge them by PAM.

Results. Fig. 7b shows the number of non-shared neurons in $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$ in the last eight layers of the merged network (the previous layers have exclusively shared neurons). Almost no neurons are selected for $\mathbf{L}^{\prime B}_{i}$ by Algorithm 2, validating its effectiveness.

7. Conclusion

In this paper, we investigate network merging schemes for efficient multitask inference. Given a set of single-task networks pre-trained for individual tasks, we aim to construct a multitask network such that applying existing network pruning methods on it can minimise the computation cost when performing any subset of tasks. We theoretically identify the conditions on the multitask network, and design Pruning-Aware Merging (PAM), a heuristic network merging scheme to construct such a multitask network. The merged multitask network can then be effectively pruned by existing network pruning methods. Extensive evaluations show that pruning a multitask network constructed by PAM achieves low computation cost when performing any subset of tasks in the network.

Appendix

Appendix A Proofs

A.1. Proof of Problem 1 in Sec. 4.2

Problem 1 occurs because of the lemma below.

Lemma 0.

Reducing $\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})$ may decrease $I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})$ .

Proof.

We decompose $I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})$ :

[TABLE]

where $I(A;B;C)=I(A;B)-I(A;B|C)$ is the co-information (Bell, 2003). From Definition 1, we have:

[TABLE]

For the last term, we have:

[TABLE]

Hence, $H(\widetilde{\mathbf{L}}^{B}_{i}|\mathbf{Y}^{B})$ includes $I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}|\mathbf{L}^{\prime A}_{i},\mathbf{Y}^{B})$ . Reducing $\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})$ may decrease $I(\widetilde{\mathbf{L}}_{i}^{A};\mathbf{Y}^{A})$ . ∎

A.2. Proof of Theorem 3

Proof.

The proof shows the conditions in Theorem 3 solve (i) Problem 1 in Sec. 4.2 and (ii) Problem 2 in Sec. 4.2.

Solving Problem 1 in Sec. 4.2. From (11) we have the following if $I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}|\mathbf{L}_{i}^{\prime A},\mathbf{Y}^{B})=0$ :

[TABLE]

$\mathbf{L}_{i}^{\prime A}$ is not in $\widetilde{\mathbf{L}}^{B}_{i}$. Hence $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{A})$ is unaffected when $\mathcal{R}_{B}(\widetilde{\mathbf{L}}^{B}_{i})$ is reduced. $I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A};\mathbf{Y}^{B}|\mathbf{L}^{\prime A}_{i})$ is included in $I(\widetilde{\mathbf{L}}^{B}_{i};\mathbf{Y}^{B})$. Thus, with a proper $\tilde{\xi}^{B}_{i}$, minimising $\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})-\tilde{\xi}^{B}_{i}\cdot I(\widetilde{\mathbf{L}}_{i}^{B};\mathbf{Y}^{B})$ will not reduce $I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A};\mathbf{Y}^{B}|\mathbf{L}^{\prime A}_{i})$. The same holds if we swap $A$ and $B$ in the above equations. Consequently, if $I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}|\mathbf{L}_{i}^{\prime A},\mathbf{Y}^{B})=I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{B}|\mathbf{L}_{i}^{\prime B},\mathbf{Y}^{A})=0$, the first two objectives in optimisation problem (2) become non-conflicting.

Solving Problem 2 in Sec. 4.2. We first decompose $\mathcal{R}_{A,B}(\mathbf{L}_{i}^{A,B})$ as in Table 5. Then from (30), we have

[TABLE]

Further,

[TABLE]

This is a loose upper bound. However, since $\mathcal{R}_{A,B}(\mathbf{L}^{A,B})$, $\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})$ and $\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})$ are lower bounded by $0$, it suffices to show that when $I(\mathbf{L}_{i}^{\prime A};\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A};\mathbf{Y}^{B})=0$, minimising $\mathcal{R}_{A}(\widetilde{\mathbf{L}}_{i}^{A})$ and $\mathcal{R}_{B}(\widetilde{\mathbf{L}}_{i}^{B})$ will minimise $\mathcal{R}_{A,B}(\mathbf{L}^{A,B})$.

In summary, when

[TABLE]

the optimisation problem (2) is reduced to two non-conflicting optimisation problems (4). ∎

A.3. Proof of Theorem 1

Proof.

First, for co-information between four random variables, we have from (Bell, 2003):

[TABLE]

Therefore, the first condition in Theorem 3, i.e., $I(\mathbf{L}_{i}^{\prime A};\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A};\mathbf{Y}^{B})=0$, is achieved by minimising $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{B})$ and $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A})$ to $0$.

For the second condition in Theorem 3, i.e., $I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{A}|\mathbf{L}_{i}^{\prime A},\mathbf{Y}^{B})=0$ , we have:

[TABLE]

Given $A$ and $B$, $H(\mathbf{Y}^{A}|\mathbf{Y}^{B})$ is constant. The second condition in Theorem 3 is achieved by minimising $I(\mathbf{L}_{i}^{\prime A};\mathbf{Y}^{B})$ to $0$ and maximising $I(\mathbf{Y}^{A};\mathbf{L}_{i}^{\prime A})$ to $H(\mathbf{Y}^{A}|\mathbf{Y}^{B})$.

The same holds if we swap $A$ and $B$ . The third condition in Theorem 3, i.e., $I(\mathbf{L}_{i}^{\prime A,B};\mathbf{Y}^{B}|\mathbf{L}_{i}^{\prime B},\mathbf{Y}^{A})=0$ , is achieved by minimising $I(\mathbf{L}_{i}^{\prime B};\mathbf{Y}^{A})$ and maximising $I(\mathbf{Y}^{B};\mathbf{L}_{i}^{\prime B})$ . ∎

Appendix B Detailed Dataset Setup

Fashion-MNIST. The Fashion-MNIST dataset (https://github.com/f-rumblefish/Multi-Label-Fashion-MNIST) contains $8000$ training images and $2000$ test images with a resolution of $496\times 124$. Each image has four fashion product images randomly selected from Fashion-MNIST (Xiao et al., 2017). The 10 categories of fashion products are considered as 10 binary classification problems, and we divide them into two groups (5/5) to form tasks $A$ and $B$. On each task we train a LeNet-5, a commonly used architecture for Fashion-MNIST.

CelebA. The CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) contains over $200$ thousand celebrity face images labelled with $40$ attributes. The $40$ attributes are divided into two groups ($20$/$20$) to form tasks $A$ and $B$. The dataset is divided into training and test sets containing 80% and 20% of the samples. The input picture resolution is resized to $72\times 72$. On each task we train slightly modified VGG-16 models, a commonly used single-task network architecture on CelebA. The width of the fully connected layers in VGG-16 is changed to 512. The convolutional layers are initialised with weights pre-trained for imdb-wiki (Rothe et al., 2018), and use the same pre-processing steps.

LFW. The Labeled Faces in the Wild (LFW) dataset (http://vis-www.cs.umass.edu/lfw/) contains over 13,000 face photographs collected from the web. Each face photo is associated with 73 attributes (Kumar et al., 2009). We randomly split the 73 labels in the LFW dataset into four groups with 15 labels each and one group with 13 labels. Each group of labels forms a single task. The dataset is divided into training and test sets containing 80% and 20% of the samples. As with CelebA, the input picture resolution is resized to $72\times 72$. On each task we train slightly modified VGG-16 models, a commonly used single-task network architecture on LFW. The width of the fully connected layers in VGG-16 is changed to 128. The convolutional layers are initialised with weights pre-trained for imdb-wiki (Rothe et al., 2018), and use the same pre-processing steps.

Table 6 summarises the inference accuracy and FLOPs of the pre-trained single-task networks.

Appendix C Visualisation of Algorithm 2

Fig. 8 illustrates two iterations of Lines 19-22 and 24-27 in Algorithm 2 by showing $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ and $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$ against the number of iterations. Here we use the f7 layer of VGG-16 trained and merged for the CelebA dataset as an example. The tuning parameter $\alpha$ is set to infinitely large in order to show all the possible cases of the iterations. From Fig. 8, we can observe three phases:

(1) In the first phase, $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ and $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$ remain small, indicating that the selected $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$ provide little information about the other task.

(2) In the second phase, $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ and $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$ start to increase as it is impossible to add more neurons to $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$ while keeping $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ and $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$ close to zero.

(3) In the third phase, $I(\mathbf{L}^{\prime A}_{i};\mathbf{Y}^{B})$ and $I(\mathbf{L}^{\prime B}_{i};\mathbf{Y}^{A})$ start to saturate as the newly joined neurons contain mostly information already included in existing $\mathbf{L}^{\prime A}_{i}$ and $\mathbf{L}^{\prime B}_{i}$.

In practice, the tuned parameter $\alpha$ remains small, and the iterations in Algorithm 2 as well as Algorithm 3 usually stop at the end of the first phase or the beginning of the second phase.

Bibliography

Bell (2003) Anthony J Bell. 2003. The co-information lattice. In International Workshop on Independent Component Analysis and Blind Signal Separation: ICA. IEEE Press, Piscataway, NJ, USA.
Chou et al. (2018) Yi-Min Chou, Yi-Ming Chan, Jia-Hong Lee, Chih-Yi Chiu, and Chu-Song Chen. 2018. Unifying and merging well-trained deep neural networks for inference stage. In IJCAI. Morgan Kaufmann, Burlington, MA, USA, 2049–2056.
Dai et al. (2018) Bin Dai, Chen Zhu, Baining Guo, and David Wipf. 2018. Compressing neural networks using the variational information bottleneck. In ICML. ACM, New York, NY, USA, 1143–1152.
Deng et al. (2020) Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. 2020. Model compression and hardware acceleration for neural networks: a comprehensive survey. Proc. IEEE 108, 4 (2020), 485–532.
Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Nando De Freitas, et al. 2013. Predicting parameters in deep learning. In NeurIPS. Curran Associates Inc., Red Hook, NY, USA, 2148–2156.
Dong et al. (2017) Xin Dong, Shangyu Chen, and Sinno Pan. 2017. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In NeurIPS. Curran Associates Inc., Red Hook, NY, USA, 4860–4874.
Fang et al. (2018) Biyi Fang, Xiao Zeng, and Mi Zhang. 2018. NestDNN: resource-aware multi-tenant on-device deep learning for continuous mobile vision. In MobiCom. ACM, New York, NY, USA, 115–127.