ETC: Encoding Long and Structured Inputs in Transformers

Joshua Ainslie; Santiago Ontanon; Chris Alberti; Vaclav Cvicek; Zachary Fisher; Philip Pham; Anirudh Ravula; Sumit Sanghai; Qifan Wang; Li Yang

arXiv:2004.08483 · cs.LG · October 28, 2020 · 28 cites

TL;DR

ETC is a Transformer architecture that handles longer and structured inputs by combining global-local attention, relative position encodings, and a Contrastive Predictive Coding (CPC) pre-training objective, achieving state-of-the-art results on four natural language datasets.

Contribution

The paper proposes the Extended Transformer Construction (ETC), a novel architecture that scales attention for longer inputs and encodes structured data using global-local attention and CPC pre-training.

Findings

01

Achieves state-of-the-art results on four NLP datasets.

02

Effectively encodes long and structured inputs.

03

Demonstrates improved scalability of Transformer models.

Abstract

Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, Extended Transformer Construction (ETC), that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs. To scale attention to longer inputs, we introduce a novel global-local attention mechanism between global tokens and regular input tokens. We also show that combining global-local attention with relative position encodings and a Contrastive Predictive Coding (CPC) pre-training objective allows ETC to encode structured inputs. We achieve state-of-the-art results on four natural language datasets requiring long and/or structured inputs.
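The global-local attention idea can be illustrated with a small mask-building sketch. The abstract does not spell out the exact attention pattern, so the following rests on assumptions: global tokens attend to everything, regular ("long") input tokens attend to all global tokens, and long tokens attend to each other only within a fixed local radius. The function names `global_local_mask` and `masked_attention` and the parameter `radius` are hypothetical, chosen for illustration. The dense mask below shows only which token pairs may interact; it does not reproduce the memory or compute savings an efficient implementation would need.

```python
import numpy as np


def global_local_mask(n_global: int, n_long: int, radius: int) -> np.ndarray:
    """Boolean mask over the sequence [global tokens | long tokens].

    mask[i, j] is True when token i may attend to token j.
    Assumed pattern (illustrative, not taken from the abstract):
      - global tokens attend to all tokens,
      - long tokens attend to all global tokens,
      - long tokens attend to long tokens at most `radius` positions away.
    """
    n = n_global + n_long
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_global, :] = True            # global -> global and global -> long
    mask[n_global:, :n_global] = True    # long -> global
    pos = np.arange(n_long)
    local = np.abs(pos[:, None] - pos[None, :]) <= radius
    mask[n_global:, n_global:] = local   # long -> long, local window only
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)          # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    mask = global_local_mask(n_global=2, n_long=6, radius=1)
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 4))
    out = masked_attention(x, x, x, mask)
    print(mask.astype(int))
    print(out.shape)
```

Under these assumptions the number of allowed long-to-long pairs grows linearly with input length for a fixed radius, while the global tokens provide a path for information to travel between distant positions.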

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Relative Position Encodings · Global-Local Attention · Extended Transformer Construction · InfoNCE · Contrastive Predictive Coding · Residual Connection · Byte Pair Encoding