Disentangled Speaker Representation Learning via Mutual Information Minimization
Sung Hwan Mun, Min Hyun Han, Minchan Kim, Dongjune Lee, and Nam Soo Kim

TL;DR
This paper introduces a framework for disentangling speaker-relevant features from speaker-unrelated features using mutual information minimization, improving speaker verification performance especially in domain mismatch scenarios.
Contribution
The paper presents a novel three-stage disentanglement framework utilizing MI minimization with CLUB, and demonstrates its effectiveness through experiments on FFSVC2022 and VoxCeleb datasets.
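For reference, the CLUB bound named above is usually written as below. The symbols are assumptions chosen for this summary rather than the paper's notation: s stands for the speaker-related embedding and u for the speaker-unrelated embedding.

```latex
I_{\mathrm{CLUB}}(s; u)
  = \mathbb{E}_{p(s,u)}\big[\log p(u \mid s)\big]
  - \mathbb{E}_{p(s)}\,\mathbb{E}_{p(u)}\big[\log p(u \mid s)\big]
  \;\ge\; I(s; u)
```

The conditional p(u | s) is generally unknown, so in practice it is replaced by a learned variational approximation q_\theta(u | s), and the resulting estimate is minimized as a training loss.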
Findings
Effective disentanglement of speaker-related and speaker-unrelated features.
Improved speaker verification accuracy after fine-tuning with the framework.
Validation on the FFSVC2022 dataset shows performance gains.
Abstract
The domain mismatch problem caused by speaker-unrelated features has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework to unravel speaker-relevant features from speaker-unrelated features via mutual information (MI) minimization. To minimize the MI between speaker-related and speaker-unrelated features, we adopt the contrastive log-ratio upper bound (CLUB), which provides an upper bound on MI. Our framework is constructed in a 3-stage structure. First, in the front-end encoder, input speech is encoded into a shared initial embedding. Next, in the decoupling block, the shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is conducted by MI minimization in the last stage. Experiments on Far-Field Speaker Verification Challenge 2022 (FFSVC2022) demonstrate…
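The MI-minimization stage described in the abstract can be illustrated with a minimal sketch of the sampled CLUB estimate. This is not the authors' implementation: the function name `club_upper_bound`, the diagonal-Gaussian form of the variational conditional q(u | s), and the use of NumPy are all assumptions made for illustration. Here `mu` and `logvar` stand for the mean and log-variance that a hypothetical variational network would predict from each speaker-related embedding, and `u` holds the paired speaker-unrelated embeddings.

```python
import numpy as np


def club_upper_bound(mu, logvar, u):
    """Sampled CLUB estimate of I(s; u) under a diagonal-Gaussian q(u | s).

    mu, logvar: arrays of shape (N, D), the mean and log-variance predicted
        from each speaker-related embedding s_i (hypothetical network output).
    u: array of shape (N, D), the paired speaker-unrelated embeddings u_i.
    """
    inv_var = np.exp(-logvar)  # (N, D)

    # Paired term: log q(u_i | s_i). The log-variance and 2*pi terms depend
    # only on i, so they cancel against the unpaired term and are omitted.
    positive = -0.5 * np.sum((u - mu) ** 2 * inv_var, axis=1)  # (N,)

    # Unpaired term: average over j of log q(u_j | s_i).
    diff = u[None, :, :] - mu[:, None, :]  # (N, N, D), diff[i, j] = u_j - mu_i
    negative = -0.5 * np.mean(
        np.sum(diff ** 2 * inv_var[:, None, :], axis=2), axis=1
    )  # (N,)

    # Average gap between paired and unpaired log-likelihoods.
    return float(np.mean(positive - negative))
```

When q ignores its input (for example, `mu` is the same for every sample), the paired and unpaired terms coincide and the estimate is zero. When `mu` tracks `u` closely, the estimate is positive. In a training loop this quantity would be added to the speaker loss and minimized, while q itself is fitted separately by maximizing its log-likelihood on paired samples.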
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
