S-vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder
N J Metilda Sagaya Mary, S Umesh, Sandesh V Katta

TL;DR
This paper introduces s-vectors, speaker embeddings derived from a Transformer encoder trained for speaker classification, and reports that they outperform x-vectors. It also proposes TESA, a Transformer-encoder-based speaker verification architecture that outperforms PLDA-based methods.
Contribution
The paper presents s-vectors, speaker embeddings extracted from a Transformer encoder trained for speaker classification, and introduces TESA, a Transformer-encoder-based speaker verification system. The two are reported to improve on x-vectors and PLDA-based verification, respectively.
Findings
S-vectors outperform x-vectors in speaker recognition tasks.
TESA achieves better verification accuracy than PLDA-based methods.
Self-attention attends to all features across the entire utterance, which the authors suggest makes it well suited to capturing speaker characteristics.
Abstract
One of the most popular speaker embeddings is x-vectors, which are obtained from an architecture that gradually builds a larger temporal context with layers. In this paper, we propose to derive speaker embeddings from Transformer's encoder trained for speaker classification. Self-attention, on which Transformer's encoder is built, attends to all the features over the entire utterance and might be more suitable in capturing the speaker characteristics in an utterance. We refer to the speaker embeddings obtained from the proposed speaker classification model as s-vectors to emphasize that they are obtained from an architecture that heavily relies on self-attention. Through experiments, we demonstrate that s-vectors perform better than x-vectors. In addition to the s-vectors, we also propose a new architecture based on Transformer's encoder for speaker verification as a replacement for…
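The idea in the abstract, that self-attention lets every frame attend to every other frame in the utterance before a single speaker embedding is formed, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' architecture: a single attention head, no positional encoding or feed-forward layers, random untrained weights, mean pooling over time, and the hypothetical names `self_attention` and `utterance_embedding`.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(frames, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (illustrative only).

    Each output frame is a weighted sum of the value vectors of ALL frames,
    so the temporal context is the whole utterance in one step.
    """
    Q = [matvec(Wq, f) for f in frames]
    K = [matvec(Wk, f) for f in frames]
    V = [matvec(Wv, f) for f in frames]
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out


def utterance_embedding(frames, Wq, Wk, Wv):
    """Mean-pool attended frames into one fixed-size vector.

    Mean pooling is an assumption here; the source does not say how
    frame-level outputs are aggregated.
    """
    attended = self_attention(frames, Wq, Wk, Wv)
    n = len(attended)
    return [sum(f[d] for f in attended) / n for d in range(len(attended[0]))]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
feat_dim, emb_dim = 4, 3  # toy sizes, chosen arbitrarily
Wq = random_matrix(emb_dim, feat_dim, rng)
Wk = random_matrix(emb_dim, feat_dim, rng)
Wv = random_matrix(emb_dim, feat_dim, rng)

# Two toy "utterances" of different lengths (5 and 12 frames).
short_utt = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(5)]
long_utt = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(12)]

emb_short = utterance_embedding(short_utt, Wq, Wk, Wv)
emb_long = utterance_embedding(long_utt, Wq, Wk, Wv)
print(len(emb_short), len(emb_long))  # both embeddings have emb_dim entries
```

In this sketch, utterances of different lengths map to vectors of the same size, which is the property a speaker embedding needs before it is passed to a verification back-end. In a trained system the weights would come from the speaker classification objective described in the abstract rather than from a random generator.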
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
