Reducing Transformer Depth on Demand with Structured Dropout

Angela Fan; Edouard Grave; Armand Joulin

arXiv:1909.11556·cs.LG·September 26, 2019·273 cites

TL;DR

This paper introduces LayerDrop, a structured dropout method that regularizes transformer training and allows sub-networks of any depth to be selected at inference time without fine-tuning and with limited impact on performance.

Contribution

The paper presents LayerDrop, a structured dropout technique that allows sub-networks of varying depth to be selected from a single trained transformer, reducing inference cost with limited impact on performance.

Findings

01. Improves the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks.

02. Enables efficient inference with variable model depth.

03. Produces higher-quality small models than training from scratch or distillation.

Abstract

Overparameterized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that…
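As a rough illustration of the idea in the abstract, the sketch below skips whole residual layers at random during training and then selects a shallower sub-network at inference time. It is not the authors' implementation: the scalar "layers" are toy stand-ins for transformer blocks, the names `layerdrop_forward`, `select_subnetwork`, `drop_prob`, and `keep_every` are hypothetical, and keeping every k-th layer is just one simple selection rule assumed here.

```python
import random
from typing import Callable, List, Optional, Sequence

# Toy stand-in (assumption): a "layer" maps a scalar hidden state to a residual update.
Layer = Callable[[float], float]


def layerdrop_forward(
    x: float,
    layers: Sequence[Layer],
    drop_prob: float = 0.0,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> float:
    """Apply residual layers; in training, skip each whole layer with probability drop_prob."""
    rng = rng or random.Random()
    for layer in layers:
        if training and rng.random() < drop_prob:
            continue  # dropped layer: only the identity (residual) path remains
        x = x + layer(x)
    return x


def select_subnetwork(layers: Sequence[Layer], keep_every: int) -> List[Layer]:
    """Keep one layer out of every keep_every (hypothetical rule), with no retraining."""
    return [layer for i, layer in enumerate(layers) if i % keep_every == 0]


# Usage: six toy blocks, a training-style pass with layer dropping,
# then an inference pass through a three-layer sub-network.
blocks = [lambda h, scale=0.1 * (i + 1): scale * h for i in range(6)]
train_out = layerdrop_forward(1.0, blocks, drop_prob=0.5, training=True, rng=random.Random(0))
shallow_out = layerdrop_forward(1.0, select_subnetwork(blocks, keep_every=2))
print(train_out, shallow_out)
```

Because every layer sits on a residual path, removing one leaves the identity mapping in its place, which is why a network trained with random layer skipping can tolerate having layers removed at inference time.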

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Pruning · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · LayerDrop · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing