UL2: Unifying Language Learning Paradigms

Yi Tay; Mostafa Dehghani; Vinh Q. Tran; Xavier Garcia; Jason Wei; Xuezhi Wang; Hyung Won Chung; Siamak Shakeri; Dara Bahri; Tal Schuster; Huaixiu Steven Zheng; Denny Zhou; Neil Houlsby; Donald Metzler

arXiv:2205.05131·cs.CL·March 1, 2023·97 cites

Open Access · 2 Repos · 10 Models

TL;DR

UL2 introduces a unified pre-training framework for NLP models that combines diverse objectives, enabling strong performance across various tasks, datasets, and learning paradigms, including fine-tuning and in-context learning.

Contribution

The paper proposes Mixture-of-Denoisers (MoD), a novel pre-training objective that unifies multiple paradigms, and demonstrates its effectiveness in scaling models to 20B parameters with state-of-the-art results.
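As a rough illustration of what mixing denoisers can look like in practice, the sketch below samples one of several corruption schemes per training example and prepends a mode tag to the input. Everything concrete here is an assumption made for illustration: the tag strings, the span lengths and corruption rates, the chunk-based masking, and the function names `span_corrupt`, `prefix_lm`, and `make_example` are hypothetical and are not taken from the paper's implementation.

```python
import random

# Hypothetical denoiser configurations; tags and values are illustrative only.
DENOISERS = [
    {"tag": "[R]", "kind": "span", "span_len": 3, "rate": 0.15},  # short spans, light corruption
    {"tag": "[X]", "kind": "span", "span_len": 8, "rate": 0.50},  # long spans, heavy corruption
    {"tag": "[S]", "kind": "prefix"},                             # prefix -> continuation
]


def span_corrupt(tokens, span_len, rate, rng):
    """Mask whole chunks of `span_len` tokens, replacing each with a sentinel."""
    chunks = [tokens[i:i + span_len] for i in range(0, len(tokens), span_len)]
    n_mask = min(len(chunks), max(1, round(rate * len(chunks))))
    masked = set(rng.sample(range(len(chunks)), n_mask))
    inputs, targets = [], []
    sentinel_id = 0
    for idx, chunk in enumerate(chunks):
        if idx in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            sentinel_id += 1
            inputs.append(sentinel)    # placeholder left in the input
            targets.append(sentinel)   # target names the placeholder...
            targets.extend(chunk)      # ...followed by the hidden tokens
        else:
            inputs.extend(chunk)
    return inputs, targets


def prefix_lm(tokens, rng):
    """Split the sequence at a random point: the prefix is input, the rest is target."""
    split = rng.randint(1, len(tokens) - 1)  # requires at least two tokens
    return tokens[:split], tokens[split:]


def make_example(tokens, rng):
    """Sample one denoiser uniformly and build a (mode, inputs, targets) triple."""
    cfg = rng.choice(DENOISERS)
    if cfg["kind"] == "span":
        inputs, targets = span_corrupt(tokens, cfg["span_len"], cfg["rate"], rng)
    else:
        inputs, targets = prefix_lm(tokens, rng)
    return cfg["tag"], [cfg["tag"]] + inputs, targets
```

Calling `make_example(tokens, random.Random(0))` on a token list of length two or more returns the sampled mode tag together with the tagged input and its target. Uniform sampling over the denoisers is a simplification; a real mixture could weight them differently.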

Findings

01. UL2 outperforms T5 and GPT-like models across multiple setups.

02. UL2 20B achieves SOTA on 50 NLP tasks.

03. UL2 excels in zero-shot, one-shot, and reasoning tasks.

Abstract

Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized & unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · UL2 · Linear Layer · Attention Dropout · SentencePiece · Gated Linear Unit · Adam · Cosine Annealing · Byte Pair Encoding · Multi-Head Attention