Discrete Unit based Masking for Improving Disentanglement in Voice Conversion

Philip H. Lee; Ismail Rasim Ulgen; Berrak Sisman

arXiv:2409.11560 · eess.AS · September 19, 2024

TL;DR

This paper introduces an input masking technique that improves speaker disentanglement in voice conversion by reducing the phonetic dependency of speaker features, which improves conversion quality across multiple VC models.

Contribution

The paper proposes a discrete unit masking mechanism at the input level that improves disentanglement and is compatible with any encoder-decoder voice conversion framework.

Findings

01. 44% relative improvement in objective intelligibility
02. Enhanced disentanglement in attention-based VC methods
03. Applicable to multiple VC frameworks

Abstract

Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. Commonly, VC methods use an encoder-decoder architecture, where disentangling the speaker's identity from linguistic information is crucial. However, the disentanglement approaches used in these methods are limited as the speaker features depend on the phonetic content of the utterance, compromising disentanglement. This dependency is amplified with attention-based methods. To address this, we introduce a novel masking mechanism in the input before speaker encoding, masking certain discrete speech units that correspond highly with phoneme classes. Our work aims to reduce the phonetic dependency of speaker features by restricting access to some phonetic information. Furthermore, since our approach is at the input level, it is applicable to any encoder-decoder based VC framework. Our…
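The input-level masking idea described in the abstract can be illustrated with a minimal sketch. It assumes each input frame has already been assigned a discrete unit ID (for example by clustering frame-level features) and that a set of units to hide has been chosen; how the paper derives the units, selects which ones to mask, and what value it substitutes is not stated in this excerpt. The function name `mask_frames_by_unit`, the `fill_value` parameter, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Set


def mask_frames_by_unit(
    frames: Sequence[Sequence[float]],
    unit_ids: Sequence[int],
    masked_units: Set[int],
    fill_value: float = 0.0,
) -> List[List[float]]:
    """Return a copy of `frames` in which every frame whose discrete unit
    is in `masked_units` is overwritten with `fill_value`.

    Hypothetical helper: the masking value and unit selection are assumptions,
    not details taken from the paper.
    """
    if len(frames) != len(unit_ids):
        raise ValueError("frames and unit_ids must have the same length")

    masked: List[List[float]] = []
    for frame, unit in zip(frames, unit_ids):
        if unit in masked_units:
            # Hide this frame from the speaker encoder.
            masked.append([fill_value] * len(frame))
        else:
            # Keep the frame unchanged (copied so the input is not mutated).
            masked.append(list(frame))
    return masked


# Toy example: 4 frames of 2-dimensional features with made-up unit IDs.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
unit_ids = [5, 2, 5, 9]

# Mask every frame assigned to unit 5 before it would reach a speaker encoder.
masked = mask_frames_by_unit(frames, unit_ids, masked_units={5})
print(masked)  # [[0.0, 0.0], [0.3, 0.4], [0.0, 0.0], [0.7, 0.8]]
```

Because the masking happens on the input rather than inside the model, the masked frames could in principle be passed to the speaker encoder of any encoder-decoder VC system without changing its architecture, which is the compatibility property the abstract claims.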

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis