LongT5: Efficient Text-To-Text Transformer for Long Sequences
Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang

TL;DR
LongT5 is a scalable Transformer model that combines long-input attention mechanisms and summarization pre-training to achieve state-of-the-art results on summarization and question answering tasks.
Contribution
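The paper introduces LongT5, which extends the scalable T5 architecture with a long-input attention mechanism inspired by ETC (Transient Global, or TGlobal, attention) and a PEGASUS-style summarization pre-training strategy, so that input length and model size can be scaled together.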
Findings
Achieves state-of-the-art results on several summarization tasks.
Outperforms original T5 models on question answering.
Introduces TGlobal attention mechanism without extra side-inputs.
Abstract
Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
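The local/global pattern mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes, following the description in the LongT5 paper rather than anything stated on this page, that each token attends to neighbors within a fixed local radius plus one "transient" global token per fixed-size block, and that a global token is formed by summing the embeddings in its block. The function names, the handling of a partial trailing block (given its own global token here), and the omission of the normalization step are assumptions of this sketch.

```python
from typing import List


def block_sums(embeddings: List[List[float]], block_size: int) -> List[List[float]]:
    """Form one global token per block by summing the token embeddings in it.

    Assumption: a partial trailing block gets its own global token, and the
    normalization applied after summation is left out for brevity.
    """
    global_tokens = []
    for start in range(0, len(embeddings), block_size):
        block = embeddings[start:start + block_size]
        # Sum each embedding dimension across the tokens of this block.
        global_tokens.append([sum(column) for column in zip(*block)])
    return global_tokens


def keys_per_token(seq_len: int, local_radius: int, block_size: int) -> List[int]:
    """Count the keys each token attends to: local window plus all global tokens."""
    num_globals = -(-seq_len // block_size)  # ceiling division
    counts = []
    for i in range(seq_len):
        lo = max(0, i - local_radius)
        hi = min(seq_len - 1, i + local_radius)
        counts.append((hi - lo + 1) + num_globals)
    return counts


if __name__ == "__main__":
    # Three 2-dimensional token embeddings grouped into blocks of two tokens.
    print(block_sums([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], block_size=2))
    # Eight tokens, radius 1, blocks of four: each token sees at most 3 local
    # keys plus 2 global tokens, instead of all 8 keys under full attention.
    print(keys_per_token(seq_len=8, local_radius=1, block_size=4))
```

Because the global tokens are computed from the input itself, no separate side-input is needed, and the number of keys per token grows with the window size plus the number of blocks rather than with the full sequence length.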
Code & Models
- 🤗 google/long-t5-local-base · model · 1.5k dl · ♡ 15
- 🤗 google/long-t5-local-large · model · 71 dl · ♡ 5
- 🤗 google/long-t5-tglobal-base · model · 30k dl · ♡ 49
- 🤗 google/long-t5-tglobal-large · model · 8.5k dl · ♡ 16
- 🤗 Stancld/longt5-tglobal-large-16384-pubmed-3k_steps · model · 1.7k dl · ♡ 22
- 🤗 google/long-t5-tglobal-xl · model · 449 dl · ♡ 24
- 🤗 pszemraj/long-t5-tglobal-base-16384-book-summary · model · 437 dl · ♡ 136
- 🤗 whaleloops/longt5-tglobal-large-16384-pubmed-10k_steps · model · 5 dl · ♡ 2
- 🤗 kmfoda/long-t5-tglobal-xxl · model · 4 dl · ♡ 4
- 🤗 pszemraj/long-t5-tglobal-xl-16384-book-summary · model · 10 dl · ♡ 19
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Attention Dropout · Adafactor · Dense Connections · Residual Connection · Layer Normalization
