AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, Prem Natarajan

TL;DR
This paper introduces AlexaTM 20B, a large-scale multilingual seq2seq model that excels in few-shot learning, outperforming larger decoder-only models across various NLP tasks including summarization, translation, and multilingual benchmarks.
Contribution
The paper presents AlexaTM 20B, a 20-billion-parameter multilingual seq2seq model that achieves state-of-the-art few-shot results on multiple NLP tasks, showing that a seq2seq architecture can be a more efficient few-shot learner than much larger decoder-only models.
Findings
Achieves SOTA in 1-shot summarization, outperforming the much larger 540B PaLM decoder model, and in 1-shot machine translation on Flores-101.
Outperforms GPT-3 (175B) on SuperGLUE in the zero-shot setting.
Excels on multilingual benchmarks such as XNLI.
Abstract
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on the Flores-101 dataset. We also show that, in the zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and…
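The pre-training mixture mentioned in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the `[CLM]` prefix string, the choice to drop spans without a placeholder, and every default value (`clm_prob`, `drop_rate`, `span_len`, the split fractions) are hypothetical and do not come from this page. Denoising asks the decoder to reconstruct the full text from a corrupted input, while CLM asks it to continue a prefix.

```python
import random

# Hypothetical prefix marking a CLM example; the real token is not given here.
CLM_PREFIX = "[CLM]"


def make_denoising_example(tokens, rng, drop_rate=0.15, span_len=3):
    """Drop random spans from the input; the target is the original sequence."""
    corrupted = []
    i = 0
    while i < len(tokens):
        # Start a dropped span with probability drop_rate / span_len, so that
        # roughly drop_rate of all tokens end up removed.
        if rng.random() < drop_rate / span_len:
            i += span_len
        else:
            corrupted.append(tokens[i])
            i += 1
    return {
        "task": "denoise",
        "encoder_input": corrupted,
        "decoder_target": list(tokens),
    }


def make_clm_example(tokens, rng, min_frac=0.2, max_frac=0.8):
    """Give the encoder a prefix; the decoder must generate the continuation."""
    cut = int(len(tokens) * rng.uniform(min_frac, max_frac))
    cut = min(max(cut, 1), len(tokens) - 1)  # keep both halves non-empty
    return {
        "task": "clm",
        "encoder_input": [CLM_PREFIX] + list(tokens[:cut]),
        "decoder_target": list(tokens[cut:]),
    }


def make_pretraining_example(tokens, rng, clm_prob=0.2):
    """Sample one of the two objectives for a single document (2+ tokens)."""
    if rng.random() < clm_prob:
        return make_clm_example(tokens, rng)
    return make_denoising_example(tokens, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "the model is trained on a mixture of two objectives".split()
    for _ in range(3):
        print(make_pretraining_example(doc, rng))
```

Both objectives use the same encoder-decoder interface, so one model can be trained on the mixture by sampling a task per example.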
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Pathways Language Model · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
