TL;DR
Reordering transformer sublayers so that more self-attention sits near the bottom and more feedforward sits near the top can improve language modeling perplexity at no extra cost in parameters, memory, or training time. The gains do not carry over to every task, as shown on machine translation.
Contribution
Introduces the sandwich transformer, a sublayer ordering that improves language modeling performance, and studies how the arrangement of self-attention and feedforward sublayers affects results.
Findings
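The sandwich transformer improves perplexity on multiple word-level and character-level language modeling benchmarks.
Some randomly ordered transformers outperform the interleaved baseline; the successful ones tend to have more self-attention at the bottom and more feedforward sublayers at the top.
Gains are task-dependent: the sandwich ordering does not guarantee improvements on machine translation.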
Abstract
Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. We propose a new transformer pattern that adheres to this property, the sandwich transformer, and show that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time. However, the sandwich reordering pattern does not guarantee performance gains across every task, as we demonstrate on machine translation…
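To make the orderings described in the abstract concrete, here is a minimal sketch that writes a model as a string of sublayers, with `s` for self-attention and `f` for feedforward. The function names, the parameter `k` (how many sublayers of each kind are pushed to the ends), and the balanced shuffle used for random orderings are illustrative assumptions, not details taken from the text above.

```python
import random


def interleaved(n):
    """Baseline ordering: n (self-attention, feedforward) pairs, e.g. 'sfsfsf'."""
    return "sf" * n


def sandwich(n, k):
    """Hypothetical sandwich ordering: k attention sublayers at the bottom,
    n - k interleaved pairs in the middle, k feedforward sublayers at the top.
    The sublayer counts match the interleaved baseline (n of each kind)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return "s" * k + "sf" * (n - k) + "f" * k


def random_order(n, seed=None):
    """Assumed sampling scheme: shuffle n 's' and n 'f' sublayers uniformly."""
    rng = random.Random(seed)
    sublayers = list("s" * n + "f" * n)
    rng.shuffle(sublayers)
    return "".join(sublayers)


if __name__ == "__main__":
    print(interleaved(4))   # sfsfsfsf
    print(sandwich(4, 2))   # sssfsfff
    print(random_order(4, seed=0))
```

Because every ordering here uses the same number of each sublayer type as the baseline, the parameter count is unchanged; only the order in which the sublayers are stacked differs.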
Taxonomy
Methods: Linear Layer · L1 Regularization · Embedding Dropout · Attention Dropout · Adaptive Masking · Adaptive Span Transformer · Sandwich Transformer · Residual Connection · Dense Connections
