AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

Saleh Soltan; Shankar Ananthakrishnan; Jack FitzGerald; Rahul Gupta; Wael Hamza; Haidar Khan; Charith Peris; Stephen Rawls; Andy Rosenbaum; Anna Rumshisky; Chandana Satya Prakash; Mukund Sridhar; Fabian Triefenbach; Apurv Verma; Gokhan Tur; Prem Natarajan

arXiv:2208.01448·cs.CL·August 4, 2022·38 cites

TL;DR

This paper introduces AlexaTM 20B, a large-scale multilingual seq2seq model that excels in few-shot learning, outperforming larger decoder-only models across various NLP tasks including summarization, translation, and multilingual benchmarks.

Contribution

The paper presents AlexaTM 20B, a 20-billion-parameter multilingual seq2seq model that achieves state-of-the-art results in few-shot learning on multiple NLP tasks, demonstrating that a seq2seq architecture can be a more efficient few-shot learner than much larger decoder-only models.

Findings

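01. Achieves SOTA on 1-shot summarization and on 1-shot machine translation for almost all supported language pairs on Flores-101.

02. Outperforms GPT-3 (175B) on SuperGLUE in the zero-shot setting.

03. Performs strongly on the multilingual XNLI benchmark as well as the English SuperGLUE benchmark.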

Abstract

In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on the Flores-101 dataset. We also show that, in the zero-shot setting, AlexaTM 20B outperforms GPT-3 (175B) on SuperGLUE and…
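To make the phrase "a mixture of denoising and Causal Language Modeling (CLM) tasks" concrete, the following is a minimal sketch of how one raw token sequence could be turned into either kind of seq2seq training example. It is not the authors' pipeline: the function names, the `<mask>` and `[CLM]` marker strings, the per-token masking scheme, and the `mask_rate` and `clm_prob` values are all illustrative assumptions chosen for this example.

```python
import random


def make_denoising_example(tokens, rng, mask_rate=0.15, mask_token="<mask>"):
    """Denoising: corrupt the input; the target is the original sequence.

    mask_rate and mask_token are hypothetical values for illustration.
    """
    corrupted = [mask_token if rng.random() < mask_rate else t for t in tokens]
    return {"task": "denoise", "source": corrupted, "target": list(tokens)}


def make_clm_example(tokens, rng, clm_token="[CLM]"):
    """CLM: the encoder sees a prefix; the decoder must produce the continuation.

    clm_token is a hypothetical marker telling the model which task it is on.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to split into prefix and continuation")
    split = rng.randint(1, len(tokens) - 1)  # both sides are non-empty
    return {
        "task": "clm",
        "source": [clm_token] + list(tokens[:split]),
        "target": list(tokens[split:]),
    }


def make_pretraining_example(tokens, rng, clm_prob=0.2):
    """Pick one of the two objectives at random (clm_prob is an assumed ratio)."""
    if rng.random() < clm_prob:
        return make_clm_example(tokens, rng)
    return make_denoising_example(tokens, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the model learns from both objectives".split()
    for _ in range(3):
        print(make_pretraining_example(sentence, rng))
```

The point of the sketch is the shape of the data: both objectives reduce to (source, target) pairs for the same encoder-decoder model, and the CLM examples are the ones that resemble the continue-the-prompt usage of few-shot inference.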

Code & Models

Repositories

amazon-science/alexa-teacher-models
pytorch

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Pathways Language Model · Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence