Improved Relation Networks for End-to-End Speaker Verification and Identification
Ashutosh Chaubey, Sparsh Sinha, Susmita Ghose

TL;DR
This paper introduces improved relation networks combined with meta-learning for end-to-end speaker verification and few-shot speaker identification, demonstrating superior performance on multiple datasets.
Contribution
The paper proposes novel relation network architectures and training regimes for more effective speaker verification and identification, especially in few-shot scenarios.
Findings
Outperforms existing methods on VoxCeleb, SITW, and VCTK datasets.
Faster convergence with the new training regime.
Effective joint training of encoder and backend model.
Abstract
Speaker identification systems in a real-world scenario are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples for each enrolled speaker. This paper demonstrates the effectiveness of meta-learning and relation networks for this use case. We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification. The use of relation networks facilitates joint training of the frontend speaker encoder and the backend model. Inspired by the use of prototypical networks in speaker verification and to increase the discriminability of the speaker embeddings, we train the model to classify samples in the current episode amongst all speakers present in the training set. Furthermore, we propose a new training regime for faster model convergence by extracting more information from a given meta-learning episode with negligible…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
