Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

Jiajun Deng; Guinan Li; Xurong Xie; Zengrui Jin; Mingyu Cui; Tianzi Wang; Shujie Hu; Mengzhe Geng; Xunying Liu

arXiv:2306.14608·eess.AS·June 27, 2023

Open Access

TL;DR

This paper introduces a Bayesian factorised adaptation method for Conformer speech recognition models that separately models speaker and environment variability, leading to improved accuracy and rapid adaptation capabilities.

Contribution

It proposes a novel Bayesian factorised speaker-environment adaptive training and test-time adaptation approach for Conformer ASR models, enhancing robustness to variability.

Findings

01

Outperforms the baseline and speaker-label-only adapted Conformers by up to 3.1% absolute WER reduction

02

Achieves up to 10.4% relative WER reduction over speaker-label-only adaptation

03

Enables rapid adaptation to unseen speaker-environment conditions
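
The findings above quote both absolute and relative WER reductions. As a reminder of how the two measures relate, here is a minimal sketch; the function names and the WER values in the usage lines are hypothetical illustrations, not figures from the paper.

```python
def absolute_reduction(baseline_wer, adapted_wer):
    # Difference in WER percentage points between the two systems.
    return baseline_wer - adapted_wer


def relative_reduction(baseline_wer, adapted_wer):
    # Absolute reduction expressed as a fraction of the baseline WER.
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper).
baseline, adapted = 20.0, 18.0
print(absolute_reduction(baseline, adapted))   # WER points
print(relative_reduction(baseline, adapted))   # fraction of baseline
```

The same absolute reduction gives a larger relative reduction when the baseline WER is lower, which is why papers often report both.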

Abstract

Rich sources of variability in natural speech present significant challenges to current data-intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test-time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately modeled using compact hidden output transforms, which are then linearly or hierarchically combined to represent any speaker-environment combination. Bayesian learning is further utilized to model the adaptation parameter uncertainty. Experiments on the 300-hour WHAM noise-corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline and speaker-label-only adapted Conformers by up to 3.1% absolute (10.4% relative) word error rate reductions. Further analysis…
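
The abstract does not spell out the form of the hidden output transforms or how they are combined. The sketch below is one possible reading, under stated assumptions: each transform is an element-wise scaling vector applied to a hidden layer output (in the style of learning hidden unit contributions), "linear" combination is a weighted sum of the speaker and environment scaling vectors, "hierarchical" combination applies the two in sequence, and Bayesian uncertainty is approximated by sampling the parameters from a diagonal Gaussian. All function names, the `2 * sigmoid` activation, and the interpolation weights are hypothetical and not taken from the paper.

```python
import math
import random


def scale(r):
    # Assumed activation: maps raw parameters to scaling factors in (0, 2).
    return [2.0 / (1.0 + math.exp(-x)) for x in r]


def combine_linear(h, r_spk, r_env, w_spk=0.5, w_env=0.5):
    # Assumed linear combination: weighted sum of the speaker and
    # environment scaling vectors, applied element-wise to the hidden output.
    s, e = scale(r_spk), scale(r_env)
    return [hi * (w_spk * si + w_env * ei) for hi, si, ei in zip(h, s, e)]


def combine_hierarchical(h, r_spk, r_env):
    # Assumed hierarchical combination: speaker scaling first, then
    # environment scaling applied to the speaker-adapted output.
    s, e = scale(r_spk), scale(r_env)
    return [hi * si * ei for hi, si, ei in zip(h, s, e)]


def sample_params(mu, sigma, rng):
    # Assumed Bayesian treatment: draw parameters from a diagonal Gaussian
    # with mean mu and standard deviation sigma.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Toy usage with made-up values: a 3-dimensional hidden output.
h = [0.2, -1.0, 0.5]
rng = random.Random(0)
r_spk = sample_params([0.0] * 3, [0.1] * 3, rng)
r_env = sample_params([0.0] * 3, [0.1] * 3, rng)
print(combine_linear(h, r_spk, r_env))
print(combine_hierarchical(h, r_spk, r_env))
```

Because the speaker and environment parameters are kept as separate vectors in this sketch, a new speaker-environment pair can be represented by reusing vectors estimated for each factor on its own, which is the property the abstract attributes to factorisation.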

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing