Disentangled Speaker Representation Learning via Mutual Information Minimization

Sung Hwan Mun, Min Hyun Han, Minchan Kim, Dongjune Lee, and Nam Soo Kim

arXiv:2208.08012 · eess.AS · October 13, 2022

Open Access

TL;DR

This paper introduces a framework for disentangling speaker-relevant features from speaker-unrelated features using mutual information minimization, improving speaker verification performance especially in domain mismatch scenarios.

Contribution

The paper presents a novel three-stage disentanglement framework utilizing MI minimization with CLUB, and demonstrates its effectiveness through experiments on FFSVC2022 and VoxCeleb datasets.

Findings

01

Effective disentanglement of speaker-related and unrelated features.

02

Improved speaker verification accuracy after fine-tuning with the framework.

03

Validation on FFSVC2022 dataset shows performance gains.

Abstract

The domain mismatch problem caused by speaker-unrelated features has been a major topic in speaker recognition. In this paper, we propose an explicit disentanglement framework to unravel speaker-relevant features from speaker-unrelated features via mutual information (MI) minimization. To minimize the MI between speaker-related and speaker-unrelated features, we adopt the contrastive log-ratio upper bound (CLUB), which provides an upper bound on MI. Our framework has a 3-stage structure. First, in the front-end encoder, input speech is encoded into a shared initial embedding. Next, in the decoupling block, the shared initial embedding is split into separate speaker-related and speaker-unrelated embeddings. Finally, disentanglement is conducted by MI minimization in the last stage. Experiments on the Far-Field Speaker Verification Challenge 2022 (FFSVC2022) demonstrate…
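To make the CLUB term in the abstract concrete, here is a minimal sketch of the sampled CLUB estimate, mean log q(y_i | x_i) minus the mean of log q(y_j | x_i) over all pairs (i, j). It is not the authors' implementation. It assumes x stands for the speaker-related embedding and y for the speaker-unrelated one, and it uses a fixed, hypothetical diagonal Gaussian `identity_q` in place of the trained variational network that CLUB normally relies on. The function names and the toy data are illustrative only.

```python
import math
import random


def gaussian_log_prob(y, mean, log_var):
    """Log density of a diagonal Gaussian q(y | x) evaluated at vector y."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (yi - m) ** 2 / math.exp(lv))
        for yi, m, lv in zip(y, mean, log_var)
    )


def club_upper_bound(xs, ys, q_params):
    """Sampled CLUB estimate over paired samples (xs[i], ys[i]).

    q_params(x) returns (mean, log_var) of the approximation q(y | x).
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        mean, log_var = q_params(xs[i])
        # Positive term: log-likelihood of the true partner y_i.
        positive = gaussian_log_prob(ys[i], mean, log_var)
        # Negative term: average log-likelihood over every y_j in the batch.
        negative = sum(gaussian_log_prob(y, mean, log_var) for y in ys) / n
        total += positive - negative
    return total / n


def identity_q(x):
    # Hypothetical fixed approximation q(y | x) = N(y; x, I).
    # In practice this would be a network trained to fit p(y | x).
    return x, [0.0] * len(x)


# Toy data (an assumption for illustration, not from the paper).
rng = random.Random(0)
dim, n = 2, 400
xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
ys_dependent = [[xi + 0.1 * rng.gauss(0, 1) for xi in x] for x in xs]
ys_independent = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

mi_dependent = club_upper_bound(xs, ys_dependent, identity_q)
mi_independent = club_upper_bound(xs, ys_independent, identity_q)
print(f"dependent pairs:   {mi_dependent:.3f}")
print(f"independent pairs: {mi_independent:.3f}")
```

The estimate is large when y carries information about x and close to zero when the two are independent, which is why driving it down encourages the two embeddings to separate. It is only guaranteed to upper-bound the true MI when q matches the true conditional p(y | x), so a crude fixed q like the one above should be read as a demonstration of the formula rather than a reliable bound.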

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing