Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR

Dongseong Hwang; Khe Chai Sim; Yu Zhang; Trevor Strohman

arXiv:2210.05793 · cs.LG · November 1, 2022 · 1 cites

TL;DR

This paper compares soft and hard target knowledge distillation methods for large-scale RNN-T models in ASR, demonstrating their effectiveness in different scenarios and achieving state-of-the-art results on LibriSpeech.

Contribution

It provides a comprehensive comparison of soft and hard target distillation for RNN-T models and establishes new state-of-the-art results with soft distillation on LibriSpeech.

Findings

01

Hard targets are more effective when the teacher and student architectures differ, such as a large teacher and a small streaming student.

02

Soft targets perform better in self-training scenarios, such as iterative large teacher training.

03

Achieved 8% relative WER improvement on LibriSpeech dev-other.
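As a reminder of how a relative WER figure like the one above is computed, here is a minimal sketch. The function name `relative_wer_improvement` and the two WER values are hypothetical illustrations chosen to give a round number; they are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values in percent, NOT taken from the paper.
baseline = 5.0
improved = 4.6
print(f"relative improvement: {relative_wer_improvement(baseline, improved):.1%}")
```

A relative improvement is measured against the baseline error rate, so the same absolute gain counts for more when the baseline WER is already low.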

Abstract

Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios such as iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8%…
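The difference between soft and hard targets can be illustrated with a minimal sketch over a single output distribution. This is a simplification and an assumption on our part: it does not model the RNN-T alignment lattice or the paper's actual loss, and the function names (`soft_target_loss`, `hard_target_loss`) and the example logits are hypothetical. A soft target matches the teacher's full probability distribution, while a hard target keeps only the teacher's top label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits):
    """KL(teacher || student): the student matches the full teacher distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_target_loss(teacher_logits, student_logits):
    """Cross-entropy against the teacher's argmax, used as a pseudo-label."""
    label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    q = softmax(student_logits)
    return -math.log(q[label])


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.1]
student = [1.0, 1.0, 1.0]
print("soft:", soft_target_loss(teacher, student))
print("hard:", hard_target_loss(teacher, student))
```

In this toy setting the soft loss falls to zero only when the student reproduces the teacher's whole distribution, whereas the hard loss depends only on the probability the student assigns to the teacher's top label.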

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling

Methods: Stochastic Depth · Knowledge Distillation · RandAugment · Dropout · Noisy Student