Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems

Jack FitzGerald; Shankar Ananthakrishnan; Konstantine Arkoudas; Davide Bernardi; Abhishek Bhagia; Claudio Delli Bovi; Jin Cao; Rakesh Chada; Amit Chauhan; Luoxin Chen; Anurag Dwarakanath; Satyam Dwivedi; Turan Gojayev; Karthik Gopalakrishnan; Thomas Gueudre; Dilek Hakkani-Tur; Wael Hamza; Jonathan Hueser; Kevin Martin Jose; Haidar Khan; Beiye Liu; Jianhua Lu; Alessandro Manzotti; Pradeep Natarajan; Karolina Owczarzak; Gokmen Oz; Enrico Palumbo; Charith Peris; Chandana Satya Prakash; Stephen Rawls; Andy Rosenbaum; Anjali Shenoy; Saleh Soltan; Mukund Harakere Sridhar; Liz Tan; Fabian Triefenbach; Pan Wei; Haiyang Yu; Shuai Zheng; Gokhan Tur; Prem Natarajan

arXiv:2206.07808·cs.CL·June 17, 2022

TL;DR

This paper explores large-scale pretraining and distillation of multi-billion-parameter encoders for natural language understanding, demonstrating improved performance on NLU tasks and virtual assistant systems through in-domain data and model compression.

Contribution

It introduces a large-scale pipeline for pretraining multi-billion-parameter encoders and distilling them into much smaller models for NLU, and reports lower intent classification and slot filling error rates after a second pretraining stage on in-domain data.

Findings

01. Teacher models perform comparably to XLM-R and mT5 on XNLI.

02. In-domain pretraining reduces intent classification and slot filling error rates.

03. Distilled models outperform smaller baselines on NLU tasks and user dissatisfaction metrics.

Abstract

We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared…
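The abstract reports its gains as relative error-rate improvements. The sketch below shows how such a figure is conventionally computed. It is an illustration only: the function name `relative_error_reduction` and the example error rates are hypothetical, not taken from the paper, and the paper's exact metric definitions may differ.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the reduction in error rate as a percentage of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates for illustration only; not results from the paper.
baseline = 0.100   # assumed baseline error rate (10.0%)
improved = 0.095   # assumed error rate after a change (9.5%)

print(f"{relative_error_reduction(baseline, improved):.2f}% relative improvement")
```

Under this convention, a drop from 10.0% to 9.5% is a 0.5-point absolute change but a 5% relative improvement, so relative figures should be read against the size of the baseline error.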

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · XLM-R · Linear Layer · Byte Pair Encoding · Layer Normalization · Inverse Square Root Schedule · Gated Linear Unit · Softmax