LongT5: Efficient Text-To-Text Transformer for Long Sequences

Mandy Guo; Joshua Ainslie; David Uthus; Santiago Ontanon; Jianmo Ni; Yun-Hsuan Sung; Yinfei Yang

arXiv:2112.07916 · cs.CL · May 4, 2022 · 6 cites

TL;DR

LongT5 is a scalable Transformer model that combines long-input attention mechanisms with summarization pre-training. It achieves state-of-the-art results on several summarization tasks and outperforms the original T5 models on question answering tasks.

Contribution

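The paper introduces LongT5, which integrates attention ideas from long-input transformers (ETC) and pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism, Transient Global (TGlobal), for processing long sequences.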

Findings

01. Achieves state-of-the-art results on several summarization tasks.
02. Outperforms the original T5 models on question answering tasks.
03. Introduces the Transient Global (TGlobal) attention mechanism, which requires no additional side-inputs.

Abstract

Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
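
The abstract describes TGlobal only at a high level, so the following is a rough sketch of how a local/global attention pattern without side-inputs could work, not the paper's implementation. It assumes that each token attends to neighbors within a fixed radius plus one "transient" global token per block, and that each global token is built on the fly by summing the token vectors in its block. The names `tglobal_attention`, `key_count`, `local_radius`, and `block_size` are hypothetical. Learned projections, normalization of the global tokens, position biases, and multiple heads are all omitted.

```python
import math


def key_count(n, i, local_radius, block_size):
    """Number of keys token i attends to: its local window plus all global tokens."""
    lo, hi = max(0, i - local_radius), min(n, i + local_radius + 1)
    return (hi - lo) + math.ceil(n / block_size)


def tglobal_attention(x, local_radius=1, block_size=2):
    """Single-head sketch. x is a list of n token vectors (lists of d floats).

    Assumption: queries, keys, and values are the raw vectors (no learned
    projections), which keeps the example short but is not how a real layer works.
    """
    n, d = len(x), len(x[0])

    # Build one global token per block by summing the block's token vectors.
    # They are derived from the input itself, so no extra side-input is needed.
    global_tokens = []
    for start in range(0, n, block_size):
        block = x[start:start + block_size]
        global_tokens.append([sum(tok[j] for tok in block) for j in range(d)])

    outputs = []
    for i in range(n):
        lo, hi = max(0, i - local_radius), min(n, i + local_radius + 1)
        keys = x[lo:hi] + global_tokens  # local neighbors, then global tokens

        # Scaled dot-product scores followed by a numerically stable softmax.
        scores = [sum(a * b for a, b in zip(x[i], k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)

        outputs.append(
            [sum(w * k[j] for w, k in zip(weights, keys)) / total for j in range(d)]
        )
    return outputs
```

Under these assumptions, each token attends to at most `2 * local_radius + 1` local keys plus `ceil(n / block_size)` global keys rather than all `n` tokens, which is what would make longer inputs affordable.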

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Attention Dropout · Adafactor · Dense Connections · Residual Connection · Layer Normalization