Residual Information in Deep Speaker Embedding Architectures

Adriana Stan

arXiv:2302.02742·eess.AS·February 7, 2023

TL;DR

This paper evaluates how well deep speaker embeddings disentangle speaker identity from other speech factors, revealing residual information related to recording conditions, content, and duration across recent architectures.

Contribution

The paper introduces a comprehensive analysis method for quantifying residual information in speaker embeddings and applies it to multiple architectures using a large, controlled speech dataset.
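This summary does not say how the residual information is measured, so the following is only a minimal sketch of one generic way to probe for it, not the paper's method. It assumes a nearest-centroid probe: if a simple classifier can predict a non-speaker factor (here a hypothetical binary "recording condition" label) from the embeddings at above-chance accuracy, the embeddings carry residual information about that factor. The toy data, the `leak` parameter and both function names are assumptions made purely for illustration.

```python
import numpy as np


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Nearest-centroid probe: fraction of test embeddings whose closest
    class centroid (fit on the training split) matches the true label."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distances, shape (n_test, n_classes).
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == test_y).mean())


def make_toy_embeddings(seed, leak, n=2000, dim=32, n_speakers=10, n_conditions=2):
    """Synthetic stand-in for real embeddings (hypothetical data):
    a speaker component, plus a condition component scaled by `leak`."""
    rng = np.random.default_rng(seed)
    speakers = rng.integers(0, n_speakers, size=n)
    conditions = rng.integers(0, n_conditions, size=n)
    speaker_dirs = rng.normal(size=(n_speakers, dim))
    condition_dirs = rng.normal(size=(n_conditions, dim))
    noise = 0.1 * rng.normal(size=(n, dim))
    x = speaker_dirs[speakers] + leak * condition_dirs[conditions] + noise
    return x, conditions


def condition_probe(seed, leak):
    """Train the probe on the first half, evaluate on the second half."""
    x, cond = make_toy_embeddings(seed, leak)
    half = len(x) // 2
    return probe_accuracy(x[:half], cond[:half], x[half:], cond[half:])


# leak=0.0: embeddings carry no condition information (ideal disentanglement).
acc_clean = condition_probe(seed=0, leak=0.0)
# leak=1.0: the recording condition shifts every embedding.
acc_leaky = condition_probe(seed=0, leak=1.0)

print(f"probe accuracy, no leak:   {acc_clean:.3f}")
print(f"probe accuracy, with leak: {acc_leaky:.3f}")
```

With real data, the toy generator would be replaced by embeddings extracted from utterances with known condition, content or duration labels, and accuracy would be compared against the chance level for that label.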

Findings

01. High discriminative power of embeddings confirmed

02. Residual information correlates with recording conditions and content

03. Embeddings still contain non-speaker-specific information

Abstract

Speaker embeddings are vector representations extracted from a speech signal that are intended to capture the speaker identity alone. They are commonly used to classify and discriminate between speakers. However, there is no objective measure of how well a speaker embedding disentangles the speaker identity from the other speech characteristics. As a result, the embeddings are far from ideal: they depend heavily on the training corpus and still carry residual information about factors such as the linguistic content, recording conditions or speaking style of the utterance. This paper introduces an analysis of six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures, and in particular, the degree to which they are able to truly…
