Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition

Ruirui Li; Chelsea J.-T. Ju; Zeya Chen; Hongda Mao; Oguz Elibol; Andreas Stolcke

arXiv:2106.10169 · cs.LG · June 21, 2021 · 1 citation

TL;DR

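This paper introduces foenet, a fusion of embeddings network that combines text-dependent (TD) and text-independent (TI) speaker recognition models. It reports higher accuracy than the baseline and score fusion methods it is compared against, and it is designed to remain robust when either the TD or the TI input is missing.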

Contribution

The paper proposes foenet, an architecture that combines joint learning with neural attention so that speaker recognition stays robust when either the TD or the TI input is missing.

Findings

01. foenet outperforms baseline methods in accuracy

02. foenet maintains high performance with incomplete inputs

03. neural attention improves model robustness

Abstract

By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Depending on whether the speech content is constrained or not, either text-dependent (TD) or text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., when either TD or TI input is missing. As a solution, we propose a fusion of embeddings network (foenet) architecture, combining joint learning with neural attention. We compare foenet with four competitive baseline methods on a dataset of voice assistant inputs, and show that it achieves higher accuracy than the baseline and score fusion methods,…
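The abstract does not spell out the architecture, so the following is only an illustrative sketch of what attention-weighted fusion of two embeddings with a possibly missing input could look like. It is not the authors' implementation. The function name `fuse_embeddings`, the projection matrices `w_td` and `w_ti`, the attention `query` vector, and all dimensions are hypothetical, and the weights are random rather than jointly learned.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def fuse_embeddings(td_emb, ti_emb, w_td, w_ti, query):
    """Fuse TD and TI embeddings with attention; either input may be None.

    Assumption: each embedding is projected into a shared space, scored
    against a query vector, and combined with softmax attention weights.
    A missing input is simply left out of the softmax.
    """
    projected = []
    if td_emb is not None:
        projected.append(np.tanh(w_td @ td_emb))
    if ti_emb is not None:
        projected.append(np.tanh(w_ti @ ti_emb))
    if not projected:
        raise ValueError("at least one of td_emb / ti_emb is required")

    stacked = np.stack(projected)   # shape: (n_present, d_shared)
    scores = stacked @ query        # one attention score per present input
    weights = softmax(scores)       # weights over present inputs only
    fused = weights @ stacked       # shape: (d_shared,)
    return fused, weights


# Hypothetical sizes: 6-dim TD embedding, 5-dim TI embedding, 4-dim shared space.
rng = np.random.default_rng(0)
w_td = rng.normal(size=(4, 6))
w_ti = rng.normal(size=(4, 5))
query = rng.normal(size=4)
td = rng.normal(size=6)
ti = rng.normal(size=5)

fused_both, weights_both = fuse_embeddings(td, ti, w_td, w_ti, query)
fused_ti_only, weights_ti_only = fuse_embeddings(None, ti, w_td, w_ti, query)
print(weights_both, weights_ti_only)
```

Because the softmax runs only over the inputs that are present, the fused vector always has the same dimensionality, so a downstream speaker classifier would not need a separate code path when the TD or the TI input is missing.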

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing