Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems
Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu

TL;DR
This paper introduces a Bayesian factorised adaptation method for Conformer speech recognition models that separately models speaker and environment variability, leading to improved accuracy and rapid adaptation capabilities.
Contribution
It proposes a novel Bayesian factorised speaker-environment adaptive training and test-time adaptation approach for Conformer ASR models, enhancing robustness to variability.
Findings
Outperforms both the baseline and speaker-label-only adapted Conformers by up to 3.1% absolute (10.4% relative) WER reduction on 300-hr WHAM noise-corrupted Switchboard data
Enables rapid adaptation to unseen speaker-environment conditions
Abstract
Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately modeled using compact hidden output transforms, which are then linearly or hierarchically combined to represent any speaker-environment combination. Bayesian learning is further utilized to model the adaptation parameter uncertainty. Experiments on the 300-hr WHAM noise corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline and speaker label only adapted Conformers by up to 3.1% absolute (10.4% relative) word error rate reductions. Further analysis…
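The abstract's idea of combining separate speaker and environment transforms can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the transforms are taken to be LHUC-style per-dimension scaling vectors, the linear combination is taken to be a weighted sum of the two scalings, the hierarchical combination applies one scaling after the other, and the Bayesian part is reduced to drawing parameters from a diagonal Gaussian. The names `scale`, `adapt_linear`, `adapt_hierarchical`, and `sample_params` are hypothetical.

```python
import numpy as np

def scale(r):
    # Assumed LHUC-style activation: maps raw parameters to a scaling in (0, 2).
    # r = 0 gives a scaling of exactly 1, i.e. no adaptation.
    return 2.0 / (1.0 + np.exp(-r))

def adapt_linear(h, r_spk, r_env, w_spk=0.5, w_env=0.5):
    # Assumed linear combination: weighted sum of speaker and environment scalings.
    return h * (w_spk * scale(r_spk) + w_env * scale(r_env))

def adapt_hierarchical(h, r_spk, r_env):
    # Assumed hierarchical combination: speaker scaling first, then environment.
    return (h * scale(r_spk)) * scale(r_env)

def sample_params(mu, log_sigma, rng):
    # Assumed Bayesian treatment: sample parameters from a diagonal Gaussian
    # with mean mu and standard deviation exp(log_sigma).
    return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.standard_normal((4, 8))   # 4 frames, 8 hidden units (toy sizes)
    r_speaker = sample_params(np.zeros(8), np.full(8, -2.0), rng)
    r_environment = sample_params(np.zeros(8), np.full(8, -2.0), rng)
    print(adapt_linear(hidden, r_speaker, r_environment).shape)
    print(adapt_hierarchical(hidden, r_speaker, r_environment).shape)
```

Because each factor has its own compact parameter vector, a speaker vector and an environment vector estimated separately can in principle be paired for a combination never observed together, which is the property the factorisation is meant to provide.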
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
