Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments

Sagarika Alavilli; Annesya Banerjee; Gasser Elbanna; Annika Magaro

arXiv:2410.05423·cs.SD·October 10, 2024

Open Access

TL;DR

This paper presents a transformer-based speech recognition model that incorporates speaker identity features to improve robustness against noise and speech distortions, outperforming existing models in adverse conditions.

Contribution

The study introduces a joint speech recognition and speaker identification model that leverages speaker embeddings to enhance robustness in noisy and highly augmented speech environments.

Findings

01. Outperforms Whisper in high-noise conditions
02. Handles highly augmented speech effectively
03. Maintains comparable performance under clean conditions

Abstract

Current state-of-the-art speech recognition models are trained to map acoustic signals into sub-lexical units. While these models demonstrate superior performance, they remain vulnerable to out-of-distribution conditions such as background noise and speech augmentations. In this work, we hypothesize that incorporating speaker representations during speech recognition can enhance model robustness to noise. We developed a transformer-based model that jointly performs speech recognition and speaker identification. Our model utilizes speech embeddings from Whisper and speaker embeddings from ECAPA-TDNN, which are processed jointly to perform both tasks. We show that the joint model performs comparably to Whisper under clean conditions. Notably, the joint model outperforms Whisper in high-noise environments, such as with 8-speaker babble background noise. Furthermore, our joint model excels…
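
To make the idea of processing the two embedding streams jointly more concrete, the following is a minimal sketch and not the authors' architecture. The abstract describes a transformer-based model; this sketch swaps in a single fused linear layer with two output heads so that it stays short and runs with NumPy alone. All dimensions, the concatenation-based fusion, the mean pooling for the speaker head, and the names `joint_forward`, `W_fuse`, `W_asr`, and `W_spk` are assumptions made for illustration. The random arrays stand in for Whisper speech embeddings and an ECAPA-TDNN speaker embedding, which would come from the pretrained models in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
T, D_SPEECH, D_SPK = 50, 512, 192          # frames, speech dim, speaker dim
D_MODEL, N_TOKENS, N_SPEAKERS = 256, 100, 10

def linear(d_in, d_out):
    """Random weights and zero bias, standing in for trained layers."""
    return rng.normal(0.0, d_in ** -0.5, (d_in, d_out)), np.zeros(d_out)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W_fuse, b_fuse = linear(D_SPEECH + D_SPK, D_MODEL)  # shared fusion layer
W_asr, b_asr = linear(D_MODEL, N_TOKENS)            # speech recognition head
W_spk, b_spk = linear(D_MODEL, N_SPEAKERS)          # speaker identification head

def joint_forward(speech_emb, speaker_emb):
    """Fuse frame-level speech features with one utterance-level speaker vector."""
    # Repeat the speaker vector for every frame, then concatenate feature-wise.
    spk_tiled = np.broadcast_to(speaker_emb, (speech_emb.shape[0], speaker_emb.shape[0]))
    fused = np.tanh(np.concatenate([speech_emb, spk_tiled], axis=-1) @ W_fuse + b_fuse)
    token_probs = softmax(fused @ W_asr + b_asr)                 # shape (T, N_TOKENS)
    speaker_probs = softmax(fused.mean(axis=0) @ W_spk + b_spk)  # shape (N_SPEAKERS,)
    return token_probs, speaker_probs

speech_emb = rng.normal(size=(T, D_SPEECH))  # placeholder for Whisper speech embeddings
speaker_emb = rng.normal(size=D_SPK)         # placeholder for an ECAPA-TDNN embedding
token_probs, speaker_probs = joint_forward(speech_emb, speaker_emb)
```

Both heads read from the same fused representation, so a training signal from either task would shape the shared layer. That is one simple way to realize the abstract's hypothesis that speaker information can inform recognition; the paper's model may combine the streams differently.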

Taxonomy

Topics: Speech and Audio Processing