S-vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder

N J Metilda Sagaya Mary; S Umesh; Sandesh V Katta

arXiv:2008.04659·eess.AS·December 14, 2021·1 cites

TL;DR

This paper introduces s-vectors, speaker embeddings derived from a Transformer encoder trained for speaker classification, and reports that they outperform x-vectors. It also proposes TESA, a Transformer-encoder-based speaker verification architecture that is reported to outperform PLDA-based methods.

Contribution

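The paper presents s-vectors, a speaker embedding derived from a Transformer encoder trained for speaker classification, and introduces TESA, a Transformer-encoder-based speaker verification system. Both are reported to improve on existing approaches: s-vectors over x-vectors, and TESA over PLDA-based verification.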

Findings

01  S-vectors outperform x-vectors in speaker recognition tasks.

02  TESA achieves better verification accuracy than PLDA-based methods.

03  Self-attention in Transformers effectively captures speaker characteristics.

Abstract

One of the most popular speaker embeddings is x-vectors, which are obtained from an architecture that gradually builds a larger temporal context with layers. In this paper, we propose to derive speaker embeddings from Transformer's encoder trained for speaker classification. Self-attention, on which Transformer's encoder is built, attends to all the features over the entire utterance and might be more suitable in capturing the speaker characteristics in an utterance. We refer to the speaker embeddings obtained from the proposed speaker classification model as s-vectors to emphasize that they are obtained from an architecture that heavily relies on self-attention. Through experiments, we demonstrate that s-vectors perform better than x-vectors. In addition to the s-vectors, we also propose a new architecture based on Transformer's encoder for speaker verification as a replacement for…
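The abstract's point that self-attention "attends to all the features over the entire utterance" can be illustrated with a minimal sketch. This is not the authors' model: it assumes a single attention head, random untrained weights, toy dimensions, and mean pooling over frames to get one utterance-level vector (the pooling step is an assumption, since the abstract does not say how the embedding is extracted). All function and variable names are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all frames.

    Every output frame is a weighted sum over *all* input frames, so the
    temporal context is the whole utterance from the first layer on.
    """
    q = matmul(frames, w_q)
    k = matmul(frames, w_k)
    v = matmul(frames, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose: d x T
    scores = matmul(q, k_t)               # T x T frame-to-frame scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def mean_pool(rows):
    """Average over the time axis (assumed pooling, not from the paper)."""
    t = len(rows)
    return [sum(col) / t for col in zip(*rows)]


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy utterance: 6 frames of 4-dimensional acoustic features (made-up sizes).
rng = random.Random(0)
num_frames, feat_dim, model_dim = 6, 4, 3
frames = random_matrix(rng, num_frames, feat_dim)
w_q = random_matrix(rng, feat_dim, model_dim)
w_k = random_matrix(rng, feat_dim, model_dim)
w_v = random_matrix(rng, feat_dim, model_dim)

context, weights = self_attention(frames, w_q, w_k, w_v)
embedding = mean_pool(context)  # one fixed-size vector per utterance
```

Each row of `weights` has one entry per frame and sums to 1, so every frame draws on the full utterance in a single layer. An x-vector-style architecture, by contrast, widens its temporal context step by step as layers are stacked.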

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing