Depth-Adaptive Transformer

Maha Elbayad; Jiatao Gu; Edouard Grave; Michael Auli

arXiv:1910.10073 · cs.CL · February 18, 2020 · 61 cites

TL;DR

This paper introduces a depth-adaptive Transformer that dynamically adjusts its computation based on input complexity, maintaining accuracy while significantly reducing the number of decoder layers needed.

Contribution

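It trains Transformer models that can make output predictions at different stages of the decoder and investigates ways to predict how much computation a particular sequence requires. Unlike Universal Transformers, which apply the same set of layers iteratively, it applies different layers at every step, so both the amount of computation and the model capacity adapt to the input.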

Findings

01

Achieves comparable translation accuracy with fewer decoder layers.

02

Uses less than a quarter of the decoder layers on IWSLT German-English translation while matching a well-tuned baseline Transformer.

03

Demonstrates effective dynamic depth adjustment in sequence-to-sequence tasks.

Abstract

State-of-the-art sequence-to-sequence models for large-scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation as well as the model capacity. On IWSLT German-English translation our approach matches the accuracy of a well-tuned baseline Transformer while using less than a quarter of the decoder layers.
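
The idea of predicting at different stages of the network can be illustrated with a minimal sketch of confidence-based early exit for a single decoding step. This is not the authors' implementation: the function name `decode_token_with_early_exit`, the fixed `threshold`, the shared `classifier`, and the toy layers are all assumptions made for illustration, and the paper investigates several ways of choosing the exit point that are not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_token_with_early_exit(
    state: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    classifier: Callable[[Vector], Vector],
    threshold: float,
) -> Tuple[int, int]:
    """Apply decoder layers one by one and stop once the prediction is confident.

    Returns (predicted token id, number of layers used). The stopping rule
    (max probability >= threshold) is an assumed, simplified criterion.
    """
    if not layers:
        raise ValueError("layers must be non-empty")
    for depth, layer in enumerate(layers, start=1):
        state = layer(state)  # a different layer at every depth
        probs = softmax(classifier(state))
        best = max(range(len(probs)), key=probs.__getitem__)
        # Exit early when confident, or fall back to the full depth.
        if probs[best] >= threshold or depth == len(layers):
            return best, depth
    raise AssertionError("unreachable")


def make_scale_layer(factor: float) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a decoder layer: scales the hidden state."""
    return lambda state: [factor * x for x in state]


def identity_classifier(state: Vector) -> Vector:
    """Hypothetical output projection: uses the state directly as logits."""
    return list(state)


toy_layers = [make_scale_layer(2.0) for _ in range(4)]

# The toy state grows more peaked with depth, so confidence rises layer by layer.
token, depth = decode_token_with_early_exit(
    [0.5, 0.0], toy_layers, identity_classifier, threshold=0.9
)
print(token, depth)  # prints "0 3": exits after 3 of the 4 layers
```

A lower threshold exits sooner and saves more layers at the risk of a less reliable prediction; a threshold that is never reached simply runs the full stack.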

Taxonomy

Topics: Neural Networks and Applications · Time Series Analysis and Forecasting · Image and Signal Denoising Methods

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Early exiting using confidence measures · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam