LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling

Yubo Huang; Xin Lai; Muyang Ye; Anran Zhu; Zixi Wang; Jingzehua Xu; Shuai Zhang; Zhiyuan Zhou; Weijie Niu

arXiv:2409.08583·cs.SD·January 22, 2025

Open Access

TL;DR

LHQ-SVC is a lightweight diffusion-based singing voice conversion model that produces high-quality output with reduced computational requirements, making it suitable for CPU deployment.

Contribution

The paper presents LHQ-SVC, a novel, efficient SVC model that balances high audio quality with low resource consumption, advancing practical applications.

Findings

01. Maintains competitive voice conversion quality.

02. Significantly improves processing speed and efficiency.

03. Optimized for CPU execution with parallel computing.

Abstract

Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion (VC), enabling the transformation of one singer's voice into another while preserving musical elements such as melody, rhythm, and timbre. Traditional SVC methods have limitations in terms of audio quality, data requirements, and computational complexity. In this paper, we propose LHQ-SVC, a lightweight, CPU-compatible model based on the SVC framework and diffusion model, designed to reduce model size and computational demand without sacrificing performance. We incorporate features to improve inference quality, and optimize for CPU execution by using performance tuning tools and parallel computing frameworks. Our experiments demonstrate that LHQ-SVC maintains competitive performance, with significant improvements in processing speed and efficiency across different devices. The results suggest that…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings