Depth-Adaptive Transformer
Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli

TL;DR
This paper introduces a depth-adaptive Transformer that dynamically adjusts its computation based on input complexity, maintaining accuracy while significantly reducing the number of decoder layers needed.
Contribution
It proposes a novel method for depth-adaptive computation in Transformers, enabling variable-depth processing tailored to each input's difficulty.
Findings
Achieves comparable translation accuracy with fewer decoder layers.
Uses less than a quarter of the decoder layers on IWSLT German-English translation.
Demonstrates effective dynamic depth adjustment in sequence-to-sequence tasks.
Abstract
State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence. Unlike dynamic computation in Universal Transformers, which applies the same set of layers iteratively, we apply different layers at every step to adjust both the amount of computation as well as the model capacity. On IWSLT German-English translation our approach matches the accuracy of a well tuned baseline Transformer while using less than a quarter of the decoder layers.
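One way to picture "output predictions at different stages of the network" is confidence-thresholded early exit: after each decoder layer, an output classifier scores the vocabulary, and decoding stops for that token once the top probability clears a threshold. The following is a minimal sketch of that idea, not the authors' implementation. The toy `make_layer` blocks, the single shared `classifier`, the random weights, and the threshold values are all illustrative assumptions; the paper investigates several ways of predicting the required depth.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_layer(dim, rng):
    """Hypothetical stand-in for a decoder block: residual plus tanh of a linear map."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]

    def layer(h):
        update = [math.tanh(v) for v in matvec(weights, h)]
        return [a + b for a, b in zip(h, update)]

    return layer


def decode_token(h, layers, classifier, threshold):
    """Apply layers in order and exit once the top probability reaches the threshold.

    Assumes `layers` is non-empty. Returns (token_id, layers_used).
    """
    probs = None
    depth = 0
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)  # each depth has its own layer, unlike weight-tied iteration
        probs = softmax(matvec(classifier, h))
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, depth


rng = random.Random(0)
dim, vocab, num_layers = 8, 5, 6
layers = [make_layer(dim, rng) for _ in range(num_layers)]
classifier = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
h0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

for threshold in (0.5, 0.9, 1.1):
    token, used = decode_token(h0, layers, classifier, threshold)
    print(f"threshold={threshold}: token={token}, layers used={used}/{num_layers}")
```

A threshold above 1.0 can never be met, so that setting always runs the full stack and behaves like a fixed-depth baseline; lowering the threshold trades confidence for fewer layers per token.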
Taxonomy
Topics: Neural Networks and Applications · Time Series Analysis and Forecasting · Image and Signal Denoising Methods
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Early exiting using confidence measures · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam
