Hierarchical Transformers Are More Efficient Language Models
Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, Henryk Michalewski

TL;DR
This paper introduces Hourglass, a hierarchical Transformer architecture that handles long sequences more efficiently. It sets a new state of the art among Transformer models on ImageNet32 generation and improves language modeling efficiency on enwik8.
Contribution
The paper proposes a novel hierarchical Transformer model, Hourglass, which improves computational efficiency while maintaining or surpassing baseline performance.
Findings
Hourglass outperforms baseline Transformers on ImageNet32.
Hourglass improves language modeling efficiency on enwik8.
An explicit hierarchy, built by downsampling and then upsampling activations, lets the model handle long sequences more efficiently.
Abstract
Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield…
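The downsample-then-upsample idea described in the abstract can be illustrated with a minimal sketch. Average pooling and repeat-based upsampling are used here only as simple stand-ins; the text above does not say which layers the paper selects. The names `downsample_avg`, `upsample_repeat`, `hourglass_step`, and the `middle` callback are hypothetical and not taken from the authors' code.

```python
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]


def downsample_avg(xs: Sequence, k: int) -> Sequence:
    """Shorten a sequence by averaging each group of k consecutive vectors."""
    if k <= 0 or len(xs) % k != 0:
        raise ValueError("sequence length must be a positive multiple of k")
    out = []
    for start in range(0, len(xs), k):
        group = xs[start:start + k]
        # zip(*group) iterates over feature dimensions across the group.
        out.append([sum(col) / k for col in zip(*group)])
    return out


def upsample_repeat(xs: Sequence, k: int) -> Sequence:
    """Restore the original length by repeating each vector k times."""
    return [list(v) for v in xs for _ in range(k)]


def hourglass_step(
    xs: Sequence,
    k: int,
    middle: Callable[[Sequence], Sequence] = lambda seq: seq,
) -> Sequence:
    """Downsample, apply `middle` on the short sequence, upsample, add residual.

    `middle` stands in for the layers that would run on the shortened
    sequence; the identity default is an assumption made for illustration.
    """
    short = downsample_avg(xs, k)
    processed = middle(short)
    restored = upsample_repeat(processed, k)
    # Residual connection: combine full-resolution input with the upsampled path.
    return [[a + b for a, b in zip(x, r)] for x, r in zip(xs, restored)]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(downsample_avg(tokens, 2))
    print(hourglass_step(tokens, 2))
```

Because self-attention cost grows quadratically with sequence length, layers that run on the shortened sequence of length n/k pay roughly 1/k^2 of the full-length attention cost. This sketch also ignores the causal masking that an autoregressive language model would need.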
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Rotary Position Embedding · Locality Sensitive Hashing Attention · Relative Position Encodings · Cosine Annealing · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Dropout
