Improved Relation Networks for End-to-End Speaker Verification and Identification

Ashutosh Chaubey; Sparsh Sinha; Susmita Ghose

arXiv:2203.17218·eess.AS·July 25, 2022

Open Access

TL;DR

This paper combines improved relation networks with meta-learning for end-to-end speaker verification and few-shot speaker identification, and reports better performance than existing methods on the VoxCeleb, SITW, and VCTK datasets.

Contribution

The paper proposes novel relation network architectures and training regimes for more effective speaker verification and identification, especially in few-shot scenarios.

Findings

1. Outperforms existing methods on the VoxCeleb, SITW, and VCTK datasets.
2. Converges faster with the new training regime.
3. Trains the speaker encoder and the backend model jointly.

Abstract

Speaker identification systems in real-world scenarios are tasked with identifying a speaker amongst a set of enrolled speakers, given just a few samples for each enrolled speaker. This paper demonstrates the effectiveness of meta-learning and relation networks for this use case. We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification. The use of relation networks facilitates joint training of the frontend speaker encoder and the backend model. Inspired by the use of prototypical networks in speaker verification, and to increase the discriminability of the speaker embeddings, we train the model to classify samples in the current episode amongst all speakers present in the training set. Furthermore, we propose a new training regime for faster model convergence by extracting more information from a given meta-learning episode with negligible…
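The episodic setup mentioned in the abstract can be illustrated with a minimal sketch. It samples one N-way K-shot episode and attaches two labels to every utterance: an episode-local label (which of the N episode speakers it belongs to) and a global label (its index among all training speakers), which is what classifying "amongst all speakers present in the training set" would need. The function name `sample_episode`, its parameters, and the toy data are hypothetical and not taken from the paper; the relation network, the encoder, and the loss terms are not shown.

```python
import random


def sample_episode(utts_by_speaker, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode (illustrative sketch, not the authors' code).

    Returns (support, query); each item is (utterance, local_label, global_label).
    """
    # Fixed ordering so every training speaker gets a stable global index.
    speakers = sorted(utts_by_speaker)
    global_id = {spk: i for i, spk in enumerate(speakers)}

    # Pick the N speakers that make up this episode.
    chosen = rng.sample(speakers, n_way)

    support, query = [], []
    for local_id, spk in enumerate(chosen):
        # Draw disjoint support and query utterances for this speaker.
        utts = rng.sample(utts_by_speaker[spk], k_shot + n_query)
        for u in utts[:k_shot]:
            support.append((u, local_id, global_id[spk]))
        for u in utts[k_shot:]:
            query.append((u, local_id, global_id[spk]))
    return support, query


# Toy data (assumed): 10 speakers with 5 utterance IDs each.
toy = {f"spk{s}": [f"spk{s}_utt{u}" for u in range(5)] for s in range(10)}
support, query = sample_episode(
    toy, n_way=3, k_shot=2, n_query=1, rng=random.Random(0)
)
```

In a training loop along these lines, the local labels would drive the episode-level comparison between query and support samples, while the global labels would feed an additional classification objective over all training speakers; how the two are weighted is not specified in the text above.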

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing