Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data
Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

TL;DR
This paper introduces a knowledge distillation method in which a non-streaming ASR model acts as a teacher: it transcribes unlabeled audio, and those transcripts are used to train streaming ASR models. The approach reduces word error rate on LibriSpeech and on YouTube data in four languages.
Contribution
It presents a new training approach leveraging non-streaming models for distillation, enabling scalable training on millions of hours of data for streaming ASR.
Findings
Significant WER reduction on LibriSpeech and YouTube data.
Effective across four languages, including French.
Scalable training on up to 3 million hours of audio.
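The findings above are stated in terms of word error rate (WER): the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's scoring pipeline; the function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.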
Abstract
Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models not only on LibriSpeech but also on YouTube data in four languages. For example, in French,…
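The two-stage recipe described in the abstract (teacher transcribes unlabeled audio, student trains on the resulting pairs) can be sketched as follows. This is a schematic illustration under stated assumptions, not the authors' implementation: `pseudo_label`, `distill`, the `Audio` type, and the toy teacher and student are all hypothetical stand-ins, and real RNNT training, batching, and any transcript filtering are omitted.

```python
from typing import Callable, Iterable, List, Tuple

Audio = List[float]  # hypothetical stand-in for a waveform or feature sequence
Transcript = str


def pseudo_label(
    teacher_transcribe: Callable[[Audio], Transcript],
    unlabeled_audio: Iterable[Audio],
) -> List[Tuple[Audio, Transcript]]:
    """Run the non-streaming teacher over unlabeled audio to build training pairs."""
    return [(audio, teacher_transcribe(audio)) for audio in unlabeled_audio]


def distill(
    student_step: Callable[[Audio, Transcript], float],
    pairs: List[Tuple[Audio, Transcript]],
    epochs: int = 1,
) -> List[float]:
    """Train the streaming student on teacher transcripts; return mean loss per epoch."""
    history = []
    for _ in range(epochs):
        # student_step is assumed to apply one update and return that example's loss.
        losses = [student_step(audio, text) for audio, text in pairs]
        history.append(sum(losses) / len(losses) if losses else 0.0)
    return history


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; neither is a real ASR model.
    toy_teacher = lambda audio: "long" if len(audio) > 2 else "short"
    toy_student_step = lambda audio, text: float(len(text))

    pairs = pseudo_label(toy_teacher, [[0.1, 0.2], [0.1, 0.2, 0.3]])
    print(pairs)
    print(distill(toy_student_step, pairs, epochs=2))
```

The point of the split is that the teacher only needs to run inference, so the amount of training data for the student is limited by available audio rather than by human transcription.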
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
