Go Wide, Then Narrow: Efficient Training of Deep Thin Networks

Denny Zhou; Mao Ye; Chen Chen; Tianjian Meng; Mingxing Tan; Xiaodan Song; Quoc Le; Qiang Liu; Dale Schuurmans

arXiv:2007.00811 · cs.LG · August 18, 2020 · 5 cites

TL;DR

This paper introduces an efficient three-stage training method for deep thin networks, leveraging wide networks for initialization and layerwise imitation, backed by theoretical guarantees and empirical validation.

Contribution

It proposes a novel training approach that improves deep thin network performance by using wide networks for initialization and layerwise imitation, with theoretical analysis and large-scale experiments.

Findings

01. ResNet50 trained with the proposed method outperforms ResNet101.

02. BERT Base trained with the proposed method becomes comparable to BERT Large.

03. A theoretical guarantee is established via neural mean field analysis.

Abstract

To be deployed in production, a deep learning model needs to be both accurate and compact to meet latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretical guarantee. Our method is motivated by model compression. It consists of three stages. First, we sufficiently widen the deep thin network and train it until convergence. Then, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by layerwise imitation, that is, forcing the thin network to mimic the intermediate outputs of the wide network from layer to layer. Finally, we further fine-tune this already well-initialized deep thin network. The theoretical guarantee is established…
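The layerwise imitation stage described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation, and it makes several assumptions: each block maps a scalar to a scalar, so wide and thin blocks differ only in hidden width; the wide network is a randomly initialized stand-in for a trained one; and the thin block at each depth is fit by plain full-batch gradient descent on a mean squared error. All names (`init_block`, `layerwise_imitation`, and so on) are hypothetical. Stage one (training the wide network) and stage three (fine-tuning the thin network on the task loss) are ordinary training loops and are omitted.

```python
import math
import random

def init_block(width, rng):
    """One scalar-to-scalar block with `width` tanh hidden units (toy assumption)."""
    scale = 1.0 / math.sqrt(width)
    return {
        "w": [rng.gauss(0, 1) for _ in range(width)],
        "b": [rng.gauss(0, 1) for _ in range(width)],
        "a": [rng.gauss(0, scale) for _ in range(width)],
    }

def block_forward(block, x):
    """Block output: sum_j a_j * tanh(w_j * x + b_j)."""
    return sum(a * math.tanh(w * x + b)
               for w, b, a in zip(block["w"], block["b"], block["a"]))

def imitation_loss(block, inputs, targets):
    """Mean squared error between the block's outputs and the targets."""
    return sum((block_forward(block, x) - t) ** 2
               for x, t in zip(inputs, targets)) / len(inputs)

def imitation_step(block, inputs, targets, lr):
    """One full-batch gradient descent step on the imitation loss."""
    n, width = len(inputs), len(block["w"])
    gw, gb, ga = [0.0] * width, [0.0] * width, [0.0] * width
    for x, t in zip(inputs, targets):
        err = 2.0 * (block_forward(block, x) - t) / n
        for j in range(width):
            h = math.tanh(block["w"][j] * x + block["b"][j])
            dh = block["a"][j] * (1.0 - h * h)  # d(output)/d(pre-activation)
            ga[j] += err * h
            gb[j] += err * dh
            gw[j] += err * dh * x
    for j in range(width):
        block["w"][j] -= lr * gw[j]
        block["b"][j] -= lr * gb[j]
        block["a"][j] -= lr * ga[j]

def layerwise_imitation(wide, thin, xs, steps=500, lr=0.02):
    """Fit each thin block to the wide network's output at the same depth."""
    wide_in, thin_in, history = list(xs), list(xs), []
    for wide_block, thin_block in zip(wide, thin):
        # Targets are the wide network's intermediate outputs at this depth.
        targets = [block_forward(wide_block, x) for x in wide_in]
        before = imitation_loss(thin_block, thin_in, targets)
        for _ in range(steps):
            imitation_step(thin_block, thin_in, targets, lr)
        after = imitation_loss(thin_block, thin_in, targets)
        history.append((before, after))
        # Each network feeds its own activations to its next block.
        wide_in = targets
        thin_in = [block_forward(thin_block, x) for x in thin_in]
    return history

rng = random.Random(0)
depth, wide_width, thin_width = 3, 32, 4
# Stand-in for a trained wide network (assumption: no stage-one training here).
wide_net = [init_block(wide_width, rng) for _ in range(depth)]
thin_net = [init_block(thin_width, rng) for _ in range(depth)]
xs = [i / 10.0 - 1.0 for i in range(21)]
history = layerwise_imitation(wide_net, thin_net, xs)
for layer, (before, after) in enumerate(history, start=1):
    print(f"layer {layer}: imitation loss {before:.4f} -> {after:.4f}")
```

In this sketch the thin blocks are fit one depth at a time, and each thin block receives the thin network's own earlier activations as input. Whether the layers are fit sequentially or jointly, and how mismatched layer dimensions are handled in real architectures, are design choices this toy example does not settle.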

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Brain Tumor Detection and Classification · Machine Learning and ELM

Methods: Linear Layer · Multi-Head Attention · Layer Normalization · Attention Is All You Need · Dropout · Residual Connection · Attention Dropout · Weight Decay · Softmax · WordPiece