Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
Jack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide Bernardi, Abhishek Bhagia, Claudio Delli Bovi, Jin Cao, Rakesh Chada, Amit Chauhan, Luoxin Chen, Anurag Dwarakanath, Satyam Dwivedi, Turan Gojayev, Karthik Gopalakrishnan, Thomas Gueudre, Dilek Hakkani-Tur

TL;DR
This paper explores large-scale pretraining and distillation of multi-billion-parameter encoders for natural language understanding, demonstrating improved performance on NLU tasks and virtual assistant systems through in-domain data and model compression.
Contribution
It introduces a large-scale pretraining and distillation pipeline for multi-billion-parameter encoders tailored to NLU, and shows that gains from in-domain pretraining carry over to much smaller distilled models.
Findings
Teacher models perform comparably to XLM-R and mT5 on XNLI.
A second stage of pretraining on in-domain data improves error rates by 3.86% relative for intent classification and 7.01% relative for slot filling.
Distilled models outperform smaller baselines on NLU tasks and user dissatisfaction metrics.
Abstract
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared…
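The abstract reports its gains as relative (not absolute) error-rate improvements. As a minimal sketch of how such a figure is usually computed, assuming the standard definition (baseline error minus new error, divided by baseline error), the helper below is illustrative only. The function name and the example error rates are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative reduction in error rate, as a percentage.

    Assumes the common definition:
        100 * (baseline_error - new_error) / baseline_error
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates, chosen only to illustrate the arithmetic.
baseline = 0.100   # 10.0% absolute error for a baseline model
improved = 0.095   # 9.5% absolute error for an improved model

print(f"{relative_error_reduction(baseline, improved):.2f}% relative")
```

Under this definition, a half-point absolute drop from a 10% baseline is a 5% relative improvement, which is why relative figures can look large even when absolute error rates are already low.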
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · XLM-R · Linear Layer · Byte Pair Encoding · Layer Normalization · Inverse Square Root Schedule · Gated Linear Unit · Softmax
