Investigating Speaker Embedding Disentanglement on Natural Read Speech
Michael Kuhlmann, Adrian Meise, Fritz Seebauer, Petra Wagner, and Reinhold Haeb-Umbach

TL;DR
This paper explores how well speaker identity can be separated from other factors in speech representations. It finds that standard disentanglement objectives achieve only limited separation, which enhanced objectives improve somewhat.
Contribution
It introduces a method to quantify speaker disentanglement in speech representations and evaluates the effectiveness of standard objectives in achieving this.
Findings
Disentanglement of speaker embeddings is limited with standard training methods.
Using enhanced objectives can improve speaker disentanglement somewhat.
Identified acoustic features serve as proxies for underlying speech factors.
Abstract
Disentanglement is the task of learning representations that identify and separate factors that explain the variation observed in data. Disentangled representations are useful to increase the generalizability, explainability, and fairness of data-driven models. Little is known about how well such disentanglement works for speech representations. A major challenge when tackling disentanglement for speech representations is that the generative factors underlying the speech signal are unknown. In this work, we investigate to what degree speech representations encoding speaker identity can be disentangled. To quantify disentanglement, we identify acoustic features that are highly speaker-variant and can serve as proxies for the factors of variation underlying speech. We find that disentanglement of the speaker embedding is limited when trained with standard objectives promoting disentanglement…
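To make the idea of a "highly speaker-variant" acoustic feature concrete, here is a minimal sketch of one possible score: the share of a feature's total variance that is explained by speaker identity (between-speaker sum of squares over total sum of squares). This is an illustrative assumption, not the metric used in the paper; the function name `speaker_variance_ratio`, the feature names, and the toy values are all hypothetical.

```python
import numpy as np


def speaker_variance_ratio(values, speaker_ids):
    """Fraction of a feature's total variance explained by speaker identity.

    Hypothetical helper for illustration only; not taken from the paper.
    Returns a value in [0, 1]: near 1 means the feature varies mostly
    between speakers, near 0 means it varies mostly within speakers.
    """
    values = np.asarray(values, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    grand_mean = values.mean()
    total_ss = np.sum((values - grand_mean) ** 2)
    if total_ss == 0.0:
        # A constant feature carries no speaker information.
        return 0.0
    between_ss = 0.0
    for spk in np.unique(speaker_ids):
        group = values[speaker_ids == spk]
        # Weight each speaker's squared mean offset by its utterance count.
        between_ss += group.size * (group.mean() - grand_mean) ** 2
    return float(between_ss / total_ss)


# Toy data (made up): two speakers, three utterances each.
speakers = np.array(["a", "a", "a", "b", "b", "b"])
mean_f0 = np.array([100.0, 102.0, 98.0, 200.0, 202.0, 198.0])  # differs by speaker
noise_like = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])  # same spread for both speakers

ratio_f0 = speaker_variance_ratio(mean_f0, speakers)
ratio_noise = speaker_variance_ratio(noise_like, speakers)
```

In this toy setup the pitch-like feature scores close to 1 and the noise-like feature scores 0, so only the former would be a candidate proxy for a speaker-related factor of variation.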
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
