UL2: Unifying Language Learning Paradigms
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler

TL;DR
UL2 introduces a unified pre-training framework for NLP models that combines diverse objectives, enabling strong performance across various tasks, datasets, and learning paradigms, including fine-tuning and in-context learning.
Contribution
The paper proposes Mixture-of-Denoisers (MoD), a novel pre-training objective that unifies multiple paradigms, and demonstrates its effectiveness in scaling models to 20B parameters with state-of-the-art results.
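As a rough illustration of what a mixture of denoisers could look like in a data pipeline, the sketch below samples one denoising task per example and prepends a mode token to the input. The R/S/X split and the `[R]`, `[S]`, `[X]` tokens follow the UL2 paper's description rather than this summary. The span lengths, corruption rates, uniform sampling, and all function names are illustrative assumptions, not the authors' configuration or code.

```python
import random

# Hypothetical settings for illustration only; these numbers are assumptions,
# not the configuration reported in the paper.
DENOISERS = {
    "[R]": {"mean_span": 3, "corrupt_rate": 0.15},  # regular span corruption
    "[X]": {"mean_span": 12, "corrupt_rate": 0.5},  # extreme: longer spans, higher rate
    "[S]": None,                                    # sequential, prefix-LM style
}


def sentinel(k):
    """T5-style sentinel token marking the k-th masked span."""
    return f"<extra_id_{k}>"


def span_corrupt(tokens, mean_span, corrupt_rate, rng):
    """Mask one fixed-length span per chunk; return (inputs, targets)."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    num_spans = max(1, round(n * corrupt_rate / mean_span))
    chunk = n // num_spans
    inputs, targets = [], []
    for k in range(num_spans):
        lo = k * chunk
        hi = n if k == num_spans - 1 else lo + chunk
        length = max(1, min(mean_span, hi - lo - 1))
        start = rng.randint(lo, hi - length)  # inclusive bounds
        inputs += tokens[lo:start] + [sentinel(k)] + tokens[start + length:hi]
        targets += [sentinel(k)] + tokens[start:start + length]
    return inputs, targets


def prefix_lm(tokens, rng):
    """Keep a random prefix as input; the remaining suffix is the target."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    split = rng.randint(1, len(tokens) - 1)
    return tokens[:split] + [sentinel(0)], [sentinel(0)] + tokens[split:]


def make_example(tokens, rng):
    """Sample a denoiser uniformly (an assumption) and prepend its mode token."""
    mode = rng.choice(list(DENOISERS))
    cfg = DENOISERS[mode]
    if cfg is None:
        inputs, targets = prefix_lm(tokens, rng)
    else:
        inputs, targets = span_corrupt(tokens, rng=rng, **cfg)
    return [mode] + inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    words = "existing pre-trained models are generally geared towards a particular class of problems".split()
    inputs, targets = make_example(words, rng)
    print(" ".join(inputs))
    print(" ".join(targets))
```

Because every example carries its mode token, the same token can be supplied at fine-tuning or inference time to steer the model toward the matching behaviour. The mode switching mentioned in the abstract appears to rely on this.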
Findings
UL2 outperforms T5 and GPT-like models across multiple setups.
UL2 20B achieves state-of-the-art results on 50 supervised NLP tasks.
UL2 excels in zero-shot, one-shot, and reasoning tasks.
Abstract
Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized & unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with…
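The claim that pre-training objectives can be cast as one another can be made concrete with a small sketch. Assuming a T5-style sentinel format, which is an assumption about the encoding rather than something this summary states, a prefix language modeling example is just span corruption with a single span running to the end of the sequence. The helper name `corrupt` is hypothetical.

```python
def corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; end is exclusive.

    `spans` must be sorted and non-overlapping. Returns (inputs, targets).
    """
    inputs, targets, pos = [], [], 0
    for k, (start, end) in enumerate(spans):
        mark = f"<extra_id_{k}>"
        inputs += tokens[pos:start] + [mark]
        targets += [mark] + tokens[start:end]
        pos = end
    inputs += tokens[pos:]
    return inputs, targets


tokens = "the quick brown fox jumps over the lazy dog".split()

# Span corruption: a few short spans scattered through the sequence.
short_inputs, short_targets = corrupt(tokens, [(1, 3), (6, 7)])

# Prefix LM as a special case: one span covering everything after the prefix.
prefix_inputs, prefix_targets = corrupt(tokens, [(4, len(tokens))])

if __name__ == "__main__":
    print(short_inputs, short_targets)
    print(prefix_inputs, prefix_targets)
```

Varying the number of spans and their lengths moves between the two regimes, which is one way to read the abstract's remark about interpolating between objectives.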
Code & Models
- 🤗 google/ul2 (model) · 2.3k dl · ♡ 182
- 🤗 Finnish-NLP/ul2-small-nl16-finnish (model) · 4 dl
- 🤗 Finnish-NLP/ul2-base-nl36-finnish (model) · 2 dl · ♡ 2
- 🤗 Finnish-NLP/ul2-tiny-nl6-finnish (model) · 10 dl
- 🤗 Finnish-NLP/ul2-mini-nl8-finnish (model) · 5 dl · ♡ 1
- 🤗 Finnish-NLP/ul2-small-nl24-finnish (model) · 2 dl · ♡ 1
- 🤗 togethercomputer/GPT-JT-6B-v1 (model) · 894 dl · ♡ 302
- 🤗 iliemihai/GPT-JT-6B-v1-8bit (model) · 10 dl · ♡ 7
- 🤗 ai-forever/FRED-T5-1.7B (model) · 732 dl · ♡ 82
- 🤗 ai-forever/FRED-T5-large (model) · 279 dl · ♡ 28
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · UL2 · Linear Layer · Attention Dropout · SentencePiece · Gated Linear Unit · Adam · Cosine Annealing · Byte Pair Encoding · Multi-Head Attention
