Siamese Capsule Network for End-to-End Speaker Recognition In The Wild
Amirhossein Hajavi, Ali Etemad

TL;DR
This paper introduces an end-to-end deep learning model combining thin-ResNet and Siamese capsule networks for robust speaker verification in challenging real-world conditions, outperforming existing methods with less training data.
Contribution
It presents a novel end-to-end speaker verification model using Siamese capsule networks with dynamic routing, demonstrating superior performance and data efficiency.
Findings
The model outperforms state-of-the-art solutions.
Using embeddings taken directly from the front-end's feature aggregation module yields the best results.
The approach requires less training data than comparable models.
Abstract
We propose an end-to-end deep model for speaker verification in the wild. Our model uses thin-ResNet to extract speaker embeddings from utterances, and a Siamese capsule network with dynamic routing as the Back-end to calculate a similarity score between the embeddings. We conduct a series of experiments comparing our model to state-of-the-art solutions, showing that it outperforms all the other models while using substantially less training data. We also perform additional experiments to study the impact of different speaker embeddings on the Siamese capsule network. We show that the best performance is achieved by using embeddings obtained directly from the feature aggregation module of the Front-end and passing them to higher capsules using dynamic routing.
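The back-end described in the abstract passes embeddings to higher capsules through dynamic routing and then scores a pair of utterances. As a rough illustration only, the sketch below implements the generic routing-by-agreement procedure in plain Python. The function names, the toy dimensions, the three routing iterations, and the use of cosine similarity between output capsules are all assumptions made for this example; the summary does not specify the paper's actual layer sizes or scoring function.

```python
import math

def squash(s, eps=1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq + eps)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_routing(u_hat, num_iters=3):
    """Route lower-capsule predictions to higher capsules by agreement.

    u_hat[i][j] is the prediction vector that lower capsule i makes for
    higher capsule j (hypothetical input layout chosen for this sketch).
    Returns one squashed output vector per higher capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    v = []
    for _ in range(num_iters):
        # Coupling coefficients: each lower capsule distributes its vote.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the current output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(p * o for p, o in zip(u_hat[i][j], v[j]))
    return v

def cosine_similarity(caps_a, caps_b, eps=1e-9):
    """Assumed scoring step: cosine similarity of flattened output capsules."""
    a = [x for cap in caps_a for x in cap]
    b = [x for cap in caps_b for x in cap]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b + eps)

# Toy prediction vectors for two utterances: 3 lower capsules, 2 higher capsules.
utt_a = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.8]], [[1.0, 0.2], [0.0, 0.5]]]
utt_b = [[[0.8, 0.1], [0.1, 0.9]], [[1.0, 0.0], [0.0, 0.7]], [[0.9, 0.3], [0.2, 0.6]]]

# Siamese use: the same routing procedure is applied to both branches.
score = cosine_similarity(dynamic_routing(utt_a), dynamic_routing(utt_b))
print(f"similarity score: {score:.3f}")
```

In a Siamese setup both utterances go through identical, weight-shared processing, which is why a single `dynamic_routing` function serves both branches here. A trained model would produce the prediction vectors from learned transformation matrices applied to the front-end embeddings rather than from hand-written toy values.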
Taxonomy
Methods: Capsule Network
