Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition
Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

TL;DR
The paper introduces the Kaizen framework, which employs an EMA-updated teacher model to generate pseudo-labels for semi-supervised speech recognition, leading to significant WER reductions and effective learning with limited supervised data.
Contribution
It presents a novel continuous pseudo-labeling approach using EMA for teacher updates, applicable across different training criteria in semi-supervised speech recognition.
Findings
Over 10% relative word error rate (WER) reduction compared to standard teacher-student training
Effective semi-supervised learning with only 10 hours of supervised data
Closes the gap to fully supervised systems with large unlabeled datasets
Abstract
In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised automatic speech recognition (ASR). The proposed approach uses a teacher model which is updated as the exponential moving average (EMA) of the student model parameters. We demonstrate that it is critical for EMA to be accumulated with full-precision floating point. The Kaizen framework can be seen as a continuous version of the iterative pseudo-labeling approach for semi-supervised training. It is applicable for different training criteria, and in this paper we demonstrate its effectiveness for frame-level hybrid hidden Markov model-deep neural network (HMM-DNN) systems as well as sequence-level Connectionist Temporal Classification (CTC) based models. For large scale real-world unsupervised public videos in UK English and Italian languages the proposed…
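The EMA teacher update described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the decay value 0.999, the single scalar "parameter", the constant student value, and the helper names `ema_update`, `ema_update_fp16`, and `to_fp16` are all assumptions made for illustration. The sketch only shows why an EMA accumulated in half precision can stall, which is one plausible reading of the abstract's remark that full-precision accumulation is critical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def ema_update(teacher, student, decay):
    """Full-precision EMA: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def ema_update_fp16(teacher, student, decay):
    """Same update, but the accumulated teacher is stored in half precision."""
    return [to_fp16(decay * t + (1.0 - decay) * s) for t, s in zip(teacher, student)]


# Hypothetical setup: one parameter, student fixed at 1.0, teacher starts at 0.0.
DECAY = 0.999  # assumed value, not taken from the paper
STEPS = 5000

teacher_full = [0.0]
teacher_half = [0.0]
student = [1.0]

for _ in range(STEPS):
    teacher_full = ema_update(teacher_full, student, DECAY)
    teacher_half = ema_update_fp16(teacher_half, student, DECAY)

full = teacher_full[0]  # keeps approaching the student value
half = teacher_half[0]  # stops moving once each increment rounds away

print(f"full-precision teacher: {full:.4f}")
print(f"half-precision teacher: {half:.4f}")
```

In this toy setting the half-precision teacher stops tracking the student once each per-step increment falls below the rounding granularity of the stored value, while the full-precision teacher continues to converge. In a real system the student parameters would change at every optimizer step and the teacher would be used to produce pseudo-labels for unlabeled audio.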
