Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Hui Li; Hongyu Wang; Zhijin Chen; Bohan Sun; Bo Li

arXiv:2405.15093·eess.AS·September 10, 2024

Open Access

TL;DR

This paper introduces RASVC, a zero-shot, high-fidelity singing voice conversion model. It combines multi-decoupling feature constraints, which improve the capture of vocal detail, with a multi-stream inverse STFT (MS-iSTFT), which speeds up processing, and reports results competitive with current state-of-the-art models.

Contribution

The paper presents a novel flow-based singing voice conversion model with multi-decoupling features and multi-stream inverse STFT for enhanced fidelity and efficiency.

Findings

01 High fidelity and naturalness in singing voice conversion.

02 Enhanced processing speed through MS-iSTFT.

03 Competitive results with current state-of-the-art models.

Abstract

Singing voice conversion transforms a source singing voice into a target singing voice while preserving the content. Flow-based models can already perform voice conversion, but they struggle to extract latent variables effectively in singing voice conversion, which is more rhythmically rich and emotionally expressive, and they also process audio inefficiently. In this paper, we propose RASVC, a high-fidelity flow-based model built on multi-decoupling feature constraints, which improves the capture of vocal details by integrating multiple latent attribute encoders. We also use a multi-stream inverse short-time Fourier transform (MS-iSTFT) to speed up processing by skipping some complicated decoder steps. We compare the synthesized singing voice with other models along multiple dimensions, and our proposed model is…
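The abstract does not spell out how MS-iSTFT is implemented. As a rough illustration of the underlying building block only, the sketch below reconstructs a waveform from a magnitude and phase spectrogram with a plain inverse STFT (inverse FFT per frame plus windowed overlap-add). The function names `stft` and `istft`, the Hann window, and the values `n_fft=16` and `hop=4` are assumptions chosen for this example, not details from the paper. In an iSTFT-based decoder a network would predict the magnitude and phase, and a multi-stream variant would apply this step to several streams; here they are simply taken from a forward STFT of a test tone.

```python
import numpy as np

def stft(x, n_fft=16, hop=4):
    """Forward STFT with a periodic Hann window (illustrative helper)."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(magnitude, phase, n_fft=16, hop=4):
    """Inverse STFT: inverse FFT of each frame, then windowed overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]
    spec = magnitude * np.exp(1j * phase)            # rebuild complex frames
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        start = i * hop
        signal[start:start + n_fft] += frames[i]     # overlap-add
        norm[start:start + n_fft] += window ** 2     # window compensation
    return signal / np.maximum(norm, 1e-8)

# Round trip on a synthetic tone standing in for decoder output.
t = np.arange(64)
x = np.sin(2 * np.pi * 3 * t / 64)
spec = stft(x)
y = istft(np.abs(spec), np.angle(spec))
```

Away from the first sample, where the window is zero, the reconstruction `y` matches `x` up to floating-point error. The appeal of this step in a synthesis model is that it is a fixed, cheap transform, so it can stand in for part of a learned upsampling stack.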

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings