Length- and Noise-aware Training Techniques for Short-utterance Speaker Recognition
Wenda Chen, Jonathan Huang, Tobias Bocklet

TL;DR
This paper introduces length- and noise-aware training techniques, including invariant representation learning, centroid alignment, length variability cost, and a self-attention mechanism, to make short-utterance speaker recognition more robust. It reports relative EER reductions of 7.0% on extremely short utterances and 8.2% on full-duration utterances on the VOiCES far-field corpus.
Contribution
It proposes new training methods, extending invariant representation learning with centroid alignment and a length variability cost, together with a self-attention mechanism, to improve speaker recognition accuracy on short and noisy utterances.
Findings
Achieved 7.0% relative EER reduction on extremely short utterances.
Achieved 8.2% relative EER reduction on full-duration utterances.
Demonstrated effectiveness of proposed techniques on VOiCES far-field corpus.
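The findings above are stated as relative EER reductions, which are easy to confuse with absolute percentage-point drops. The short sketch below shows only the arithmetic; the function name and the EER values are hypothetical and are not taken from the paper.

```python
def relative_eer_reduction(baseline_eer: float, new_eer: float) -> float:
    """Return the relative EER reduction in percent.

    Computed as 100 * (baseline - new) / baseline, so it is measured
    against the baseline error rate, not in absolute percentage points.
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical EERs in percent, chosen only to illustrate the arithmetic.
example = relative_eer_reduction(baseline_eer=10.0, new_eer=9.3)
print(f"{example:.1f}% relative reduction")
```

With these illustrative numbers, an absolute drop of 0.7 percentage points (10.0% to 9.3%) corresponds to a 7.0% relative reduction.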
Abstract
Speaker recognition performance has been greatly improved with the emergence of deep learning. Deep neural networks show the capacity to deal effectively with the impact of noise and reverberation, making them attractive for far-field speaker recognition systems. The x-vector framework is a popular choice for generating speaker embeddings in recent literature due to its robust training mechanism and excellent performance on various test sets. In this paper, we start with early work on adding invariant representation learning (IRL) to the loss function and modify the approach with centroid alignment (CA) and length variability cost (LVC) techniques to further improve robustness in noisy, far-field applications. This work mainly focuses on improvements for short-duration test utterances (1-8s). We also present improved results on long-duration tasks. In addition, this work discusses a…
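The abstract names three terms added to the training loss (IRL, CA, LVC) but this page does not give their formulas. The following is a minimal sketch of how such penalties could be combined, under loud assumptions: every penalty is modelled here as a mean cosine distance between embeddings, and all function names, the weights `w_irl`, `w_ca`, `w_lvc`, and the random data are hypothetical. It should not be read as the authors' formulation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_distance(a, b):
    """Mean of (1 - cosine similarity) over rows; b may broadcast against a."""
    return float(np.mean(1.0 - np.sum(l2_normalize(a) * l2_normalize(b), axis=-1)))


def invariance_penalty(clean_emb, noisy_emb):
    # Assumed IRL-style term: clean and noisy versions of the same
    # utterance should map to nearby embeddings.
    return cosine_distance(clean_emb, noisy_emb)


def centroid_alignment_penalty(emb, labels):
    # Assumed CA-style term: pull each embedding toward the mean
    # embedding (centroid) of its speaker, averaged over speakers.
    speakers = np.unique(labels)
    total = 0.0
    for spk in speakers:
        members = emb[labels == spk]
        centroid = members.mean(axis=0, keepdims=True)
        total += cosine_distance(members, centroid)
    return total / len(speakers)


def length_variability_penalty(full_emb, short_emb):
    # Assumed LVC-style term: a short crop should embed close to the
    # full-length utterance it was cut from.
    return cosine_distance(full_emb, short_emb)


def total_loss(cls_loss, clean, noisy, short, labels,
               w_irl=1.0, w_ca=1.0, w_lvc=1.0):
    """Classification loss plus the three weighted (hypothetical) penalties."""
    return (cls_loss
            + w_irl * invariance_penalty(clean, noisy)
            + w_ca * centroid_alignment_penalty(clean, labels)
            + w_lvc * length_variability_penalty(clean, short))


# Random stand-in embeddings: 4 utterances, 2 speakers, 8 dimensions.
rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8))
noisy = clean + 0.1 * rng.normal(size=(4, 8))
short = clean + 0.2 * rng.normal(size=(4, 8))
labels = np.array([0, 0, 1, 1])
loss = total_loss(1.5, clean, noisy, short, labels)
```

In a real system the embeddings would come from the speaker network and the penalties would be written in a differentiable framework; NumPy is used here only to keep the sketch self-contained.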
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
