Advancing Multi-Accented LSTM-CTC Speech Recognition using a Domain Specific Student-Teacher Learning Paradigm
Shahram Ghorbani, Ahmet E. Bulut, John H.L. Hansen

TL;DR
This paper introduces a domain-specific student-teacher learning paradigm for multi-accent speech recognition using LSTM-CTC models, significantly improving accuracy across diverse accents by leveraging aligned accent-specific teachers and knowledge distillation.
Contribution
It proposes a novel multi-accent learning framework with aligned accent-specific teacher models and a student model, achieving substantial CER reduction and effective accent adaptation.
Findings
20.1% relative CER reduction with the proposed method
Aligned accent-specific models improve recognition accuracy
Knowledge distillation enhances accent adaptation performance
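As a reading aid for the headline number above, a relative CER reduction is conventionally computed as the drop in character error rate divided by the baseline rate. The sketch below shows that arithmetic only. The function name and the input values are hypothetical and are not taken from the paper, which this page does not quote in enough detail to recover the underlying absolute rates.

```python
def relative_reduction(baseline_cer, new_cer):
    """Fractional reduction of `new_cer` relative to `baseline_cer`."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical rates chosen purely for illustration, not reported results.
example = relative_reduction(10.0, 7.99)
print(f"{example:.1%}")
```

A relative figure says nothing about the absolute error rates, so the same percentage can correspond to very different baselines.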
Abstract
Non-native speech causes automatic speech recognition systems to degrade in performance. Past strategies to address this challenge have considered model adaptation, accent classification with model selection, alternate pronunciation lexicons, etc. In this study, we consider a recurrent neural network (RNN) with a connectionist temporal classification (CTC) cost function trained on multi-accent English data including US (Native), Indian and Hispanic accents. We exploit dark knowledge from a model trained with the multi-accent data to train student models under the guidance of both a teacher model and the CTC cost of the target transcription. We show that transferring knowledge from a single RNN-CTC trained model toward a student model yields better performance than the stand-alone teacher model. Since the outputs of different trained CTC models are not necessarily aligned, it is not possible to…
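The abstract describes training a student under the guidance of both a teacher model and the CTC cost of the target transcription. One common way to combine two such signals is a weighted sum of the CTC loss and a frame-wise KL divergence between teacher and student posteriors. The sketch below illustrates that general idea in plain Python. It is not the authors' implementation: the function names, the interpolation weight `lam`, the choice of KL divergence, and the toy posteriors are all assumptions made for illustration.

```python
import math


def ctc_nll(probs, target, blank=0):
    """CTC negative log-likelihood of `target` via the forward algorithm.

    probs: list of T frames, each a list of per-symbol probabilities.
    Plain probabilities are used for brevity; real systems work in log space.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    S = len(ext)

    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new

    total = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(total)


def frame_kl(teacher, student):
    """Sum over frames of KL(teacher || student) on per-frame posteriors."""
    return sum(
        p * math.log(p / q)
        for t_frame, s_frame in zip(teacher, student)
        for p, q in zip(t_frame, s_frame)
        if p > 0
    )


def student_loss(student, teacher, target, lam=0.5):
    """Assumed interpolation: (1 - lam) * CTC + lam * frame-wise KL."""
    return (1 - lam) * ctc_nll(student, target) + lam * frame_kl(teacher, student)


# Toy posteriors over {blank=0, 'a'=1} for two frames (hypothetical values).
teacher = [[0.1, 0.9], [0.8, 0.2]]
student = [[0.5, 0.5], [0.5, 0.5]]
loss = student_loss(student, teacher, target=[1], lam=0.5)
```

Because the KL term compares posteriors frame by frame, this sketch only makes sense when teacher and student outputs refer to the same frames, which is why the alignment of CTC outputs matters for the approach.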
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
