PPG-based singing voice conversion with adversarial representation learning
Zhonghao Li, Benlai Tang, Xiang Yin, Yuan Wan, Ling Xu, Chen Shen, Zejun Ma

TL;DR
This paper introduces an end-to-end PPG-based singing voice conversion model that uses adversarial and mel-regressive modules to improve naturalness, melody, and voice similarity, and that remains robust to noisy source inputs.
Contribution
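It presents an architecture with two separate encoders, one that encodes PPGs as content and one that compresses mel spectrograms to supply acoustic and musical information, together with an adversarial singer confusion module and a mel-regressive representation learning module.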
Findings
Significant improvement in naturalness, melody, and voice similarity over baselines.
Robustness of the method to noisy source inputs.
Effective use of adversarial and mel-regressive modules in singing voice conversion.
Abstract
Singing voice conversion (SVC) aims to convert the voice of one singer to that of other singers while keeping the singing content and melody. Building on recent voice conversion work, we propose a novel model to steadily convert songs while keeping their naturalness and intonation. We build an end-to-end architecture, taking phonetic posteriorgrams (PPGs) as inputs and generating mel spectrograms. Specifically, we implement two separate encoders: one encodes PPGs as content, and the other compresses mel spectrograms to supply acoustic and musical information. To improve the performance on timbre and melody, an adversarial singer confusion module and a mel-regressive representation learning module are designed for the model. Objective and subjective experiments are conducted on our private Chinese singing corpus. Compared with the baselines, our methods can significantly improve the…
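The data flow described in the abstract can be illustrated with a small, hypothetical sketch. It is not the authors' implementation: the single linear layers, the dimensions, and the names `convert`, `init_params`, and `singer_confusion_loss` are all assumptions made for illustration. In particular, the confusion loss shown here (cross-entropy against a uniform distribution over singers) is only one common way to build an adversarial confusion objective; the summary does not state the paper's exact formulation. The sketch also leaves out the mel-regressive module and the way the target singer identity is injected, since neither is described here.

```python
import math
import random


def linear(frames, weights):
    """Frame-wise linear map: (T x d_in) frames times (d_in x d_out) weights."""
    return [[sum(x * w for x, w in zip(frame, col)) for col in zip(*weights)]
            for frame in frames]


def rand_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(d_ppg, n_mels, d_content=6, d_bottleneck=2, seed=0):
    """Hypothetical parameters; all sizes are illustrative assumptions."""
    rng = random.Random(seed)
    return {
        "content_enc": rand_matrix(d_ppg, d_content, rng),    # PPG -> content
        "mel_enc": rand_matrix(n_mels, d_bottleneck, rng),     # mel -> compressed
        "decoder": rand_matrix(d_content + d_bottleneck, n_mels, rng),
    }


def convert(ppg, mel, params):
    """Encode PPGs and mels separately, join per frame, decode to mel frames."""
    content = linear(ppg, params["content_enc"])     # T x d_content
    acoustic = linear(mel, params["mel_enc"])         # T x d_bottleneck
    joint = [c + a for c, a in zip(content, acoustic)]  # per-frame concatenation
    return linear(joint, params["decoder"])           # T x n_mels


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def singer_confusion_loss(singer_logits):
    """Assumed confusion objective: cross-entropy between a uniform target
    and the singer classifier's prediction. It reaches its minimum, log(n),
    when the classifier cannot tell the n singers apart."""
    probs = softmax(singer_logits)
    return -sum(math.log(p) for p in probs) / len(probs)


# Toy usage with random inputs: 5 frames, 8-dim PPGs, 4 mel bins.
rng = random.Random(1)
ppg = [[rng.random() for _ in range(8)] for _ in range(5)]
mel = [[rng.random() for _ in range(4)] for _ in range(5)]
predicted_mel = convert(ppg, mel, init_params(d_ppg=8, n_mels=4))
```

Under this assumed objective, the encoder would be trained to push `singer_confusion_loss` toward its minimum while a separate singer classifier is trained with the usual classification loss, so the encoded representation carries as little singer identity as possible.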
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
