Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition
Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong

TL;DR
This paper extends teacher-student learning to large-scale unsupervised domain adaptation for end-to-end speech recognition, introducing adaptive weighting of teacher and ground-truth knowledge to improve performance.
Contribution
It proposes adaptive teacher-student learning that dynamically combines teacher predictions and ground-truth labels for better domain adaptation in end-to-end speech models.
Findings
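T/S learning achieved a 6.3% relative word error rate (WER) reduction over a strong E2E model trained with the same amount of far-field data.
Adaptive T/S (AT/S) learning achieved a 10.3% relative WER reduction over the same baseline.
Domain adaptation used 3400 hours of parallel close-talk and far-field Microsoft Cortana data.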
Abstract
Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and its one-best predictions as decoder guidance. To further improve T/S learning with the help of ground-truth labels, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing either the teacher's soft token posteriors or the one-hot ground-truth label, in AT/S the student always learns from both the teacher and the ground truth, with a pair of adaptive weights assigned to the soft and one-hot labels that quantify the confidence in each knowledge source. The confidence scores are dynamically estimated at…
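The per-step training target described in the abstract can be illustrated with a small sketch. It contrasts plain T/S (the student matches the teacher's soft posteriors) with an AT/S-style mix of soft and one-hot labels. The abstract is truncated before it explains how the confidence scores are computed, so the weighting rule below (agreement_confidence) is a hypothetical stand-in, not the paper's formula. All function names are illustrative, and the second level of transfer (one-best predictions as decoder guidance) is left out.

```python
import math


def cross_entropy(target, student):
    """Cross-entropy between a target distribution and the student's posteriors."""
    eps = 1e-12  # avoids log(0)
    return -sum(t * math.log(s + eps) for t, s in zip(target, student))


def one_hot(index, size):
    """One-hot ground-truth label as a distribution over `size` tokens."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def ts_step_loss(teacher_post, student_post):
    """Plain T/S for one decoder step: teacher posteriors act as soft labels."""
    return cross_entropy(teacher_post, student_post)


def agreement_confidence(teacher_post, hard_label):
    """HYPOTHETICAL weighting rule (not taken from the paper): trust the teacher
    in proportion to the probability it assigns to the ground-truth token."""
    w_teacher = sum(t * h for t, h in zip(teacher_post, hard_label))
    return w_teacher, 1.0 - w_teacher


def ats_step_loss(teacher_post, student_post, label, confidence_fn=agreement_confidence):
    """AT/S-style loss for one decoder step: the student always learns from both
    the soft and one-hot labels, mixed with a pair of adaptive weights."""
    hard = one_hot(label, len(teacher_post))
    w_teacher, w_truth = confidence_fn(teacher_post, hard)
    target = [w_teacher * t + w_truth * h for t, h in zip(teacher_post, hard)]
    return cross_entropy(target, student_post), (w_teacher, w_truth)


# Toy three-token vocabulary; the numbers are made up for illustration.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
ts_loss = ts_step_loss(teacher, student)
ats_loss, (w_teacher, w_truth) = ats_step_loss(teacher, student, label=0)
```

Because the two weights in this sketch sum to one, the mixed target stays a valid probability distribution. When the teacher puts little mass on the correct token, the target shifts toward the one-hot label.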
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
