Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments
Dominik Wagner, Ilja Baumann, Sebastian P. Bayerl, Korbinian Riedhammer, Tobias Bocklet

TL;DR
This paper investigates how speaker adaptation with speaker embeddings (x-vectors, ECAPA-TDNN embeddings, and i-vectors) improves the robustness of end-to-end speech recognition systems, especially under noisy conditions, with relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard.
Contribution
It demonstrates that concatenating speaker embeddings such as x-vectors and ECAPA-TDNN embeddings to the acoustic features enhances noise robustness in transformer and wav2vec 2.0 models, with the optimal embedding size varying by dataset and noise level.
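The concatenation described here can be illustrated with a minimal, framework-agnostic sketch. It assumes one utterance-level speaker vector per utterance that is repeated on every frame before the model sees the features. The function name `append_speaker_embedding`, the feature dimensions, and the toy values are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def append_speaker_embedding(
    frames: Sequence[Sequence[float]],
    speaker_embedding: Sequence[float],
) -> List[List[float]]:
    """Append the same speaker vector to every acoustic feature frame.

    frames: T frames, each with D acoustic features (hypothetical layout).
    speaker_embedding: one vector of size E for the whole utterance.
    Returns T frames with D + E values each.
    """
    if not frames:
        raise ValueError("expected at least one feature frame")
    if not speaker_embedding:
        raise ValueError("expected a non-empty speaker embedding")
    # The speaker vector is constant over time, so it is copied onto each frame.
    return [list(frame) + list(speaker_embedding) for frame in frames]


# Toy example: 3 frames of 2 features, plus a 3-dimensional speaker vector.
toy_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_embedding = [1.0, 2.0, 3.0]
augmented = append_speaker_embedding(toy_frames, toy_embedding)
```

In a real system the embedding would come from a pretrained x-vector, ECAPA-TDNN, or i-vector extractor, and the concatenated frames would be passed to the recognizer as its input features.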
Findings
Speaker embeddings improve recognition accuracy under noise.
ECAPA-TDNN and x-vectors outperform i-vectors as speaker representations.
In transformer models, robustness gains grow as more noise is added to the input speech.
Abstract
We analyze the impact of speaker adaptation in end-to-end automatic speech recognition models based on transformers and wav2vec 2.0 under different noise conditions. By including speaker embeddings obtained from x-vector and ECAPA-TDNN systems, as well as i-vectors, we achieve relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard. We show that the proven method of concatenating speaker vectors to the acoustic features and supplying them as auxiliary model inputs remains a viable option to increase the robustness of end-to-end architectures. The effect on transformer models is stronger when more noise is added to the input speech. The most substantial benefits for systems based on wav2vec 2.0 are achieved under moderate or no noise conditions. Both x-vectors and ECAPA-TDNN embeddings outperform i-vectors as speaker representations. The…
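For readers unfamiliar with the metric, a relative word error rate improvement is conventionally the reduction in WER divided by the baseline WER. The sketch below assumes that standard definition; the function name and the input values are hypothetical and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER improvement in percent.

    baseline_wer: WER of the system without speaker adaptation.
    adapted_wer: WER of the system with speaker embeddings added.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer * 100.0


# Hypothetical values for illustration only: a drop from 10.0% to 8.0% WER
# is a 2.0-point absolute change and a 20% relative improvement.
example = relative_wer_improvement(10.0, 8.0)
```

Because the figure is relative, the same percentage can correspond to very different absolute WER changes depending on how high the baseline error rate is.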
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
