Hierarchical Transformers Are More Efficient Language Models

Piotr Nawrot; Szymon Tworkowski; Michał Tyrolski; Łukasz Kaiser; Yuhuai Wu; Christian Szegedy; Henryk Michalewski

arXiv:2110.13711 · cs.LG · April 19, 2022

TL;DR

This paper introduces Hourglass, a hierarchical Transformer architecture that handles long sequences more efficiently. It sets a new state of the art for Transformer models on ImageNet32 generation and improves language modeling efficiency on enwik8.

Contribution

The paper proposes Hourglass, a hierarchical Transformer language model that improves on the Transformer baseline at the same amount of computation and can match the baseline's results more efficiently.

Findings

01. Hourglass outperforms baseline Transformers on ImageNet32.

02. Hourglass improves language modeling efficiency on enwik8.

03. Hierarchical design enables better long-sequence handling.

Abstract

Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield…
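
The downsample-then-upsample idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper studies several candidate layers, and the choices here (average pooling, repeat upsampling, a right shift of k - 1 positions to keep the model autoregressive, a residual connection, and every function name) are assumptions made for illustration. The `inner` argument is a hypothetical stand-in for the Transformer layers that would run on the shortened sequence.

```python
from typing import Callable, List, Optional

Vector = List[float]
Sequence = List[Vector]


def shift_right(x: Sequence, n: int) -> Sequence:
    """Prepend n zero vectors and drop the last n, keeping the length unchanged."""
    if n == 0:
        return [row[:] for row in x]
    d = len(x[0])
    return [[0.0] * d for _ in range(n)] + [row[:] for row in x[:-n]]


def downsample_avg(x: Sequence, k: int) -> Sequence:
    """Average non-overlapping groups of k token vectors: length L -> L // k."""
    if len(x) % k != 0:
        raise ValueError("sequence length must be divisible by k")
    d = len(x[0])
    return [
        [sum(x[g + j][c] for j in range(k)) / k for c in range(d)]
        for g in range(0, len(x), k)
    ]


def upsample_repeat(x: Sequence, k: int) -> Sequence:
    """Repeat each shortened vector k times: length L // k -> L."""
    return [row[:] for row in x for _ in range(k)]


def hourglass_block(
    x: Sequence,
    k: int,
    inner: Optional[Callable[[Sequence], Sequence]] = None,
) -> Sequence:
    """Shorten, process, upsample, and add back to the full-resolution input."""
    # Assumed causality fix: after shifting by k - 1, the group that is
    # upsampled onto positions j*k .. j*k + k - 1 only covers inputs <= j*k.
    shifted = shift_right(x, k - 1)
    short = downsample_avg(shifted, k)
    if inner is not None:
        short = inner(short)  # hypothetical layers on the shortened sequence
    up = upsample_repeat(short, k)
    # Residual connection back to the original token-level activations.
    return [[a + b for a, b in zip(row, u)] for row, u in zip(x, up)]


if __name__ == "__main__":
    tokens = [[float(i)] for i in range(8)]
    print(hourglass_block(tokens, k=4))
```

With a shortening factor k, the `inner` layers see a sequence k times shorter, which is where the compute saving in a hierarchical design would come from; the residual connection keeps token-level detail that pooling discards.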

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Rotary Position Embedding · Locality Sensitive Hashing Attention · Relative Position Encodings · Cosine Annealing · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Dropout