Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

Yan Rong; Li Liu

arXiv:2409.00700·cs.SD·September 5, 2024

TL;DR

This paper introduces ID-FaceVC, a novel face-based voice conversion method that effectively disentangles speaker identity from content, allowing high-quality, controllable voice synthesis from facial images, audio, or text inputs.

Contribution

The paper proposes a new identity-disentanglement framework with contrastive learning and mutual information modules, enabling improved voice conversion and controllable speech generation.

Findings

01 Achieves state-of-the-art performance in voice conversion metrics

02 Effectively disentangles speaker identity from content

03 Supports controllable speech generation with emotional tone and speed

Abstract

Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that are well aligned with the speaker's voice identity information, and (2) inadequate decoupling of content and speaker identity information in the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC). More precisely, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. In addition, unlike prior work, our method can accept either audio or text…
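The abstract names contrastive learning as the mechanism for aligning facial features with voice identity, but this summary does not give the loss. As a minimal sketch, assuming a standard InfoNCE objective over paired face and voice embeddings (an illustrative stand-in, not the IAQ-CL module itself), the idea could look like the following. The function names `l2_normalize` and `info_nce`, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(face_embs, voice_embs, temperature=0.1):
    """Average InfoNCE loss where face i is the positive match for voice i.

    Assumption: both lists have the same length, and row i of each list
    comes from the same speaker. Every other voice in the batch acts as
    a negative for face i.
    """
    faces = [l2_normalize(f) for f in face_embs]
    voices = [l2_normalize(v) for v in voice_embs]

    total = 0.0
    for i, face in enumerate(faces):
        # Temperature-scaled cosine similarity of this face to every voice.
        logits = [
            sum(a * b for a, b in zip(face, voice)) / temperature
            for voice in voices
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of picking the matching voice.
        total += log_denom - logits[i]
    return total / len(faces)


if __name__ == "__main__":
    faces = [[1.0, 0.0], [0.0, 1.0]]
    aligned_voices = [[1.0, 0.0], [0.0, 1.0]]
    swapped_voices = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", info_nce(faces, aligned_voices))
    print("swapped loss:", info_nce(faces, swapped_voices))
```

With the toy data, the loss is close to zero when each face embedding points the same way as its own speaker's voice embedding, and much larger when the pairs are swapped. Minimizing this kind of objective pulls matching face and voice embeddings together, which is the general behavior a face-to-voice alignment module would need.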

Taxonomy

Topics: Face recognition and analysis · Speech Recognition and Synthesis

Methods: Contrastive Learning