Residual Information in Deep Speaker Embedding Architectures
Adriana Stan

TL;DR
This paper evaluates how well deep speaker embeddings disentangle speaker identity from other speech factors, revealing residual information related to recording conditions, content, and duration across recent architectures.
Contribution
It introduces a comprehensive analysis method to quantify residual information in speaker embeddings and assesses multiple architectures using a large, controlled speech dataset.
Findings
High discriminative power of embeddings confirmed
Residual information correlates with recording conditions and content
Embeddings still contain non-speaker-specific information
Abstract
Speaker embeddings are a means of extracting vector representations from a speech signal such that the representation pertains to the speaker identity alone. The embeddings are commonly used to classify and discriminate between different speakers. However, there is no objective measure to evaluate the ability of a speaker embedding to disentangle the speaker identity from the other speech characteristics. As a result, the embeddings are far from ideal: they are highly dependent on the training corpus and still include a degree of residual information pertaining to factors such as linguistic content, recording conditions, or speaking style of the utterance. This paper presents an analysis of six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures, and in particular, the degree to which they are able to truly…
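The notion of residual information can be illustrated with a small probing experiment. The sketch below is not the paper's method, which the truncated abstract does not specify. It assumes synthetic "embeddings" built from a per-speaker vector, a hypothetical recording-condition offset controlled by `leak`, and Gaussian noise. All function names and parameters are illustrative. If a simple nearest-centroid probe can predict the recording condition from embeddings after each speaker's mean has been subtracted, the embeddings carry information beyond speaker identity.

```python
import numpy as np


def nearest_centroid_probe(x, y, train_frac=0.5, seed=0):
    """Held-out accuracy of a nearest-centroid probe that predicts y from x."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_train = int(train_frac * len(x))
    train, test = order[:n_train], order[n_train:]
    classes = np.unique(y[train])
    centroids = np.stack([x[train][y[train] == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test vector to every class centroid.
    dists = ((x[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    predicted = classes[dists.argmin(axis=1)]
    return float((predicted == y[test]).mean())


def make_toy_embeddings(n_speakers=20, per_speaker=40, dim=16, leak=1.0, seed=0):
    """Synthetic embeddings (assumption): speaker vector + condition offset + noise."""
    rng = np.random.default_rng(seed)
    speaker_ids = np.repeat(np.arange(n_speakers), per_speaker)
    condition = rng.integers(0, 2, size=speaker_ids.size)  # two recording conditions
    speaker_vecs = rng.normal(size=(n_speakers, dim))
    leak_dir = rng.normal(size=dim)
    leak_dir /= np.linalg.norm(leak_dir)
    x = speaker_vecs[speaker_ids]
    x = x + leak * np.outer(condition - 0.5, leak_dir)  # residual condition signal
    x = x + 0.1 * rng.normal(size=x.shape)              # utterance-level noise
    return x, speaker_ids, condition


def remove_speaker_means(x, speaker_ids):
    """Subtract each speaker's mean embedding so only within-speaker variation remains."""
    out = x.copy()
    for s in np.unique(speaker_ids):
        mask = speaker_ids == s
        out[mask] -= x[mask].mean(axis=0)
    return out


if __name__ == "__main__":
    for leak in (1.0, 0.0):
        x, spk, cond = make_toy_embeddings(leak=leak)
        speaker_acc = nearest_centroid_probe(x, spk)
        residual_acc = nearest_centroid_probe(remove_speaker_means(x, spk), cond)
        print(f"leak={leak}: speaker probe={speaker_acc:.2f}, "
              f"condition probe={residual_acc:.2f}")
```

In this toy setting, speaker probe accuracy stays high in both runs, while condition probe accuracy falls to roughly chance level when `leak` is zero. That contrast mirrors the summary's point that strong speaker discrimination does not rule out residual non-speaker information.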
