ETC: Encoding Long and Structured Inputs in Transformers
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, Li Yang

TL;DR
ETC is a new Transformer architecture that handles long and structured inputs by combining global-local attention, relative position encodings, and CPC pre-training, achieving state-of-the-art results on four natural language datasets.
Contribution
The paper proposes the Extended Transformer Construction (ETC), a novel architecture that scales attention for longer inputs and encodes structured data using global-local attention and CPC pre-training.
Findings
Achieves state-of-the-art results on four NLP datasets.
Effectively encodes long and structured inputs.
Scales attention to longer inputs than standard Transformer architectures through global-local attention.
Abstract
Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs. To scale attention to longer inputs, we introduce a novel global-local attention mechanism between global tokens and regular input tokens. We also show that combining global-local attention with relative position encodings and a Contrastive Predictive Coding (CPC) pre-training objective allows ETC to encode structured inputs. We achieve state-of-the-art results on four natural language datasets requiring long and/or structured inputs.
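The global-local attention pattern named in the abstract can be pictured as an attention mask. The following is a minimal sketch rather than the authors' implementation: it assumes that global tokens attend to every token, that regular input tokens attend to all global tokens, and that regular tokens otherwise see only a fixed-radius window of neighbouring regular tokens. The function names, the `radius` parameter, and the token counts are hypothetical and chosen only for illustration.

```python
import numpy as np

def global_local_mask(n_global, n_long, radius):
    """Boolean mask of shape (n, n) with n = n_global + n_long.

    True means the row token may attend to the column token.
    Global tokens occupy the first n_global positions (an assumption
    of this sketch, not something stated in the abstract).
    """
    n = n_global + n_long
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_global, :] = True   # global tokens attend to every token
    mask[:, :n_global] = True   # every token attends to the global tokens
    idx = np.arange(n_long)
    # regular tokens attend to other regular tokens within `radius` positions
    local = np.abs(idx[:, None] - idx[None, :]) <= radius
    mask[n_global:, n_global:] = local
    return mask

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # every row has at least one allowed position, so the row max is finite
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: 2 global tokens, 6 regular tokens, local radius 1
mask = global_local_mask(n_global=2, n_long=6, radius=1)
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
out = masked_attention(q, k, v, mask)
```

The dense mask here still uses quadratic memory and only illustrates which token pairs interact. Under these assumptions the number of allowed pairs grows roughly linearly in the number of regular tokens when the global token count and the radius are held fixed, which is the property an efficient implementation would exploit.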
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Relative Position Encodings · Global-Local Attention · Extended Transformer Construction · InfoNCE · Contrastive Predictive Coding · Residual Connection · Byte Pair Encoding
