Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
Giuseppe Ruggiero, Matteo Testa, Jurgen Van de Walle, Luigi Di Caro

TL;DR
This paper introduces Eta-WavLM, a simple linear method that disentangles speaker identity from self-supervised speech representations, improving performance on content-focused tasks such as voice conversion.
Contribution
The paper presents a linear decomposition technique that separates speaker information from SSL speech representations, yielding speaker-independent features without resource-intensive models.
Findings
Achieves effective speaker disentanglement in SSL representations.
Improves voice conversion performance over existing methods.
Maintains content integrity while removing speaker information.
Abstract
Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in speech, ideal representations should disentangle content from unwanted variations like speaker characteristics in the SSL representations. However, removing speaker information often degrades other speech components, and existing methods either fail to fully disentangle speaker identity or require resource-intensive models. In this paper, we propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components, effectively generating speaker disentangled representations. Comprehensive experiments show that our approach achieves speaker independence and as such, when applied to…
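The abstract excerpt does not give the equation itself, so the following is only a minimal sketch of what a linear decomposition of this kind could look like, not the authors' implementation. It assumes each frame-level SSL feature vector can be modeled as a linear function of a per-frame speaker embedding plus a residual, and that the residual is taken as the speaker-independent component. The function names (`fit_speaker_map`, `remove_speaker`), the shapes, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_speaker_map(S, D):
    """Least-squares fit of S ~ D @ A + b (illustrative assumption).

    S: (n_frames, feat_dim) SSL features.
    D: (n_frames, spk_dim) speaker embedding repeated for each frame.
    """
    # Append a column of ones so the bias b is fitted jointly with A.
    D_aug = np.hstack([D, np.ones((D.shape[0], 1))])
    W, *_ = np.linalg.lstsq(D_aug, S, rcond=None)
    return W[:-1], W[-1]  # A: (spk_dim, feat_dim), b: (feat_dim,)

def remove_speaker(S, D, A, b):
    """Subtract the speaker-predictable part; the residual is kept."""
    return S - (D @ A + b)

# Synthetic stand-in data: content plus a linear speaker contribution.
rng = np.random.default_rng(0)
n_frames, feat_dim, spk_dim = 200, 16, 4
D = rng.normal(size=(n_frames, spk_dim))
content = rng.normal(size=(n_frames, feat_dim))
S = content + D @ rng.normal(size=(spk_dim, feat_dim)) + 0.5

A, b = fit_speaker_map(S, D)
eta = remove_speaker(S, D, A, b)
```

By construction of least squares, the residual `eta` is orthogonal to the speaker embedding columns and has zero mean per feature, which is the sense in which this sketch removes linearly predictable speaker information. Whether that matches the paper's exact formulation should be checked against the full text.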
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
