Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

Zheng-Yan Sheng; Yang Ai; Yan-Nian Chen; Zhen-Hua Ling

arXiv:2309.09470 · cs.SD · September 19, 2023 · 1 citation

Open Access

TL;DR

This paper introduces zero-shot face-driven voice conversion (zero-shot FaceVC): converting the voice characteristics of an utterance to those of an unseen target speaker using only a single face image of that speaker, with no target speech needed at inference. The proposed method relies on a memory-based face-voice alignment module.

Contribution

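The paper proposes a memory-based face-voice alignment module, in which memory slots bridge the face and voice modalities, and a mixed supervision strategy that mitigates the inconsistency between training and inference. Speaker-independent content representations are obtained by transferring knowledge from a pretrained zero-shot voice conversion model.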

Findings

01. Outperforms existing methods on zero-shot FaceVC tasks

02. Achieves high homogeneity and diversity in voice conversion

03. Demonstrates effectiveness through extensive subjective and objective evaluations

Abstract

This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims at converting the voice characteristics of an utterance from any source speaker to a newly coming target speaker, solely relying on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. This method leverages a memory-based face-voice alignment module, in which slots act as the bridge to align these two modalities, allowing for the capture of voice characteristics from face images. A mixed supervision strategy is also introduced to mitigate the long-standing issue of the inconsistency between training and inference phases for voice conversion tasks. To obtain speaker-independent content-related representations, we transfer the knowledge from a pretrained zero-shot voice conversion model to our zero-shot…
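The abstract says that memory slots bridge the face and voice modalities but does not spell out how the slots are addressed. As a rough illustration only, the sketch below shows one generic key-value memory read: a face embedding is compared against face-side slots, and the resulting weights mix the paired voice-side slots into a recalled voice embedding. Everything here is an assumption for illustration (the function name `recall_voice_embedding`, cosine-similarity addressing, the `temperature` parameter, the dimensions, and the random slot values); it is not the authors' architecture.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def recall_voice_embedding(face_emb, face_slots, voice_slots, temperature=0.1):
    """Hypothetical memory read: address with the face, read out from the voice side.

    face_slots[i] and voice_slots[i] are assumed to be paired slots.
    """
    # One addressing weight per slot, from face-to-slot similarity.
    weights = softmax([cosine(face_emb, k) / temperature for k in face_slots])
    # Recalled voice embedding is the weighted sum of the voice-side slots.
    dim = len(voice_slots[0])
    recalled = [sum(w * v[d] for w, v in zip(weights, voice_slots)) for d in range(dim)]
    return recalled, weights


# Toy data: random slots stand in for learned parameters (assumption).
random.seed(0)
NUM_SLOTS, FACE_DIM, VOICE_DIM = 4, 8, 6
face_slots = [[random.gauss(0, 1) for _ in range(FACE_DIM)] for _ in range(NUM_SLOTS)]
voice_slots = [[random.gauss(0, 1) for _ in range(VOICE_DIM)] for _ in range(NUM_SLOTS)]
face_emb = [random.gauss(0, 1) for _ in range(FACE_DIM)]

recalled, weights = recall_voice_embedding(face_emb, face_slots, voice_slots)
```

In a sketch like this, only a face embedding is needed at read time, which is consistent with the zero-shot setting the abstract describes; how the slots are actually trained and aligned is left to the paper itself.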

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis

Methods: ALIGN