Unsupervised Model-based speaker adaptation of end-to-end lattice-free MMI model for speech recognition
Xurong Xie, Xunying Liu, Hui Chen, Hongan Wang

TL;DR
This paper introduces an unsupervised speaker adaptation method for end-to-end lattice-free MMI speech recognition models, based on learning hidden unit contributions (LHUC) and its Bayesian variant (BLHUC). On the Switchboard dataset, it reduces word error rate (WER) by up to 14.7% relative.
Contribution
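The paper proposes an unsupervised model-based adaptation framework for E2E LF-MMI models that estimates LHUC/BLHUC speaker-dependent parameters, systematically compares regularization methods for limiting overfitting, and selects adaptation data using lattice-based confidence scores.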
Findings
BLHUC adaptation reduces WER by up to 14.7% relative.
The proposed method achieves WERs comparable to state-of-the-art hybrid and Conformer systems.
Confidence score-based data selection improves adaptation effectiveness.
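A minimal sketch of confidence-based adaptation data selection follows. It only illustrates the general idea of discarding low-confidence first-pass hypotheses before they are used as supervision. The `Hypothesis` class, the utterance-level granularity, and the 0.7 threshold are all hypothetical; this summary does not state how the paper computes or thresholds its lattice-based confidence scores.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A first-pass recognition result (hypothetical container, not from the paper)."""
    utt_id: str
    text: str
    confidence: float  # assumed utterance-level confidence in [0, 1]


def select_for_adaptation(hyps, threshold=0.7):
    """Keep only hypotheses whose confidence meets the (assumed) threshold."""
    return [h for h in hyps if h.confidence >= threshold]


# Made-up example data, for illustration only.
hyps = [
    Hypothesis("utt1", "hello there", 0.92),
    Hypothesis("utt2", "uh what", 0.41),
    Hypothesis("utt3", "see you later", 0.78),
]

selected = select_for_adaptation(hyps)
print([h.utt_id for h in selected])
```

Filtering this way trades adaptation data quantity for label quality: fewer utterances are used, but the ones that remain are less likely to carry recognition errors into the speaker-dependent parameters.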
Abstract
Modeling speaker variability is a key challenge for automatic speech recognition (ASR) systems. In this paper, learning hidden unit contributions (LHUC) based adaptation techniques with compact speaker-dependent (SD) parameters are used to facilitate both speaker adaptive training (SAT) and unsupervised test-time speaker adaptation for end-to-end (E2E) lattice-free MMI (LF-MMI) models. An unsupervised model-based adaptation framework is proposed to estimate the SD parameters in the E2E paradigm using the LF-MMI and cross-entropy (CE) criteria. Various regularization methods for standard LHUC adaptation, e.g., Bayesian LHUC (BLHUC) adaptation, are systematically investigated on E2E LF-MMI CNN-TDNN and CNN-TDNN-BLSTM models to mitigate the risk of overfitting. Lattice-based confidence score estimation is used for adaptation data selection to reduce the supervision label…
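The core LHUC operation mentioned in the abstract can be sketched as an element-wise rescaling of a hidden layer's outputs by one speaker-dependent parameter per hidden unit. The sketch below assumes the `2 * sigmoid(r)` amplitude function commonly used in the LHUC literature; this summary does not confirm which activation, layers, or dimensions the paper uses, and all values here are made up. BLHUC, which treats the parameters as random variables, is not shown.

```python
import numpy as np


def lhuc_scale(hidden, r):
    """Scale hidden activations by a speaker-dependent amplitude in (0, 2).

    hidden: array of shape (frames, units), outputs of one hidden layer.
    r:      array of shape (units,), the speaker-dependent LHUC parameters.
    """
    amplitude = 2.0 / (1.0 + np.exp(-r))  # 2 * sigmoid(r); assumed activation
    return hidden * amplitude  # broadcasts over the frame axis


# Made-up activations for two frames and three hidden units.
hidden = np.array([[0.5, -1.0, 2.0],
                   [1.5, 0.0, -0.5]])

# r = 0 gives amplitude 1, so the layer behaves as the unadapted model.
r_unadapted = np.zeros(3)

# Hypothetical adapted parameters for one speaker.
r_speaker = np.array([1.0, 0.0, -1.0])

unadapted = lhuc_scale(hidden, r_unadapted)
adapted = lhuc_scale(hidden, r_speaker)
```

Because only one parameter per hidden unit is stored for each speaker, the speaker-dependent parameter set stays compact, which is what makes estimating it from limited and possibly erroneous test-time data feasible at all.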
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
