Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments

Dominik Wagner; Ilja Baumann; Sebastian P. Bayerl; Korbinian Riedhammer; Tobias Bocklet

arXiv:2211.08774 · cs.SD · December 8, 2023

TL;DR

This paper investigates how speaker adaptation with speaker embeddings improves the robustness of end-to-end speech recognition systems, especially under noisy conditions, and reports relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard.

Contribution

It demonstrates that concatenating speaker embeddings, such as x-vectors and ECAPA-TDNN embeddings, to the acoustic features improves noise robustness in transformer and wav2vec 2.0 models, with the optimal embedding size varying by dataset and noise level.
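The concatenation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one utterance-level speaker vector that is repeated for every frame and appended along the feature axis. The function name `append_speaker_embedding` and the sizes used (200 frames, 80-dimensional features, a 192-dimensional embedding) are hypothetical choices for the example.

```python
import numpy as np


def append_speaker_embedding(features: np.ndarray, speaker_vec: np.ndarray) -> np.ndarray:
    """Append one utterance-level speaker vector to every acoustic frame.

    features:    array of shape (num_frames, feat_dim)
    speaker_vec: array of shape (emb_dim,)
    returns:     array of shape (num_frames, feat_dim + emb_dim)
    """
    if features.ndim != 2 or speaker_vec.ndim != 1:
        raise ValueError("expected 2-D features and a 1-D speaker vector")
    # Repeat the speaker vector once per frame: (num_frames, emb_dim).
    tiled = np.tile(speaker_vec, (features.shape[0], 1))
    # Concatenate along the feature axis so each frame carries the embedding.
    return np.concatenate([features, tiled], axis=1)


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 80))   # stand-in for acoustic features
speaker_vec = rng.standard_normal(192)      # stand-in for a speaker embedding
augmented = append_speaker_embedding(features, speaker_vec)
print(augmented.shape)
```

The augmented array would then be fed to the recognizer in place of the plain acoustic features. How the embedding is extracted, normalized, or projected before concatenation is not specified here and is left out of the sketch.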

Findings

01

Speaker embeddings improve recognition accuracy under noise.

02

ECAPA-TDNN and x-vectors outperform i-vectors as speaker representations.

03

In transformer models, robustness gains grow as more noise is added to the input speech.

Abstract

We analyze the impact of speaker adaptation in end-to-end automatic speech recognition models based on transformers and wav2vec 2.0 under different noise conditions. By including speaker embeddings obtained from x-vector and ECAPA-TDNN systems, as well as i-vectors, we achieve relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard. We show that the proven method of concatenating speaker vectors to the acoustic features and supplying them as auxiliary model inputs remains a viable option to increase the robustness of end-to-end architectures. The effect on transformer models is stronger when more noise is added to the input speech. The most substantial benefits for systems based on wav2vec 2.0 are achieved under moderate or no noise conditions. Both x-vectors and ECAPA-TDNN embeddings outperform i-vectors as speaker representations. The…
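For readers unfamiliar with the metric, a relative word error rate improvement is the WER reduction expressed as a fraction of the baseline WER. The sketch below shows the arithmetic. The helper name and the example values (10.0% baseline, 8.5% adapted) are hypothetical and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER improvement in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical values for illustration only.
improvement = relative_wer_improvement(10.0, 8.5)
print(f"{improvement:.1f}% relative improvement")  # prints "15.0% relative improvement"
```

A relative figure is not an absolute one: in this made-up example the WER drops by 1.5 percentage points, which corresponds to a 15% relative improvement.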

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing