YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance

Junjie Zheng; Chunbo Hao; Guobin Ma; Xiaoyu Zhang; Gongyu Chen; Chaofan Ding; Zihao Chen; Lei Xie

arXiv:2512.04779 · cs.SD · December 5, 2025

Open Access

TL;DR

This paper introduces YingMusic-Singer, a zero-shot singing voice synthesis framework that synthesizes singing with arbitrary lyrics and melodies without manual annotations, using a diffusion transformer architecture and melody extraction guided by a teacher model.

Contribution

The paper presents a novel zero-shot SVS method that eliminates the need for phoneme alignment and manual melody annotations, leveraging a diffusion transformer and a melody extraction module guided by a teacher model.
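The idea of a teacher model guiding a melody extractor can be illustrated with a small sketch. This is not the authors' implementation: the function names, the cosine feature distance, the softmax-normalized similarity rows, the KL term, and the `weight` parameter are all assumptions chosen for illustration. The sketch only shows one plausible way a student's frame-level melody features could be pulled towards a teacher's, together with a term that matches the two models' frame-to-frame similarity distributions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distribution(frames):
    """Row-wise softmax over the frame-to-frame cosine similarity matrix."""
    return [softmax([cosine(f, g) for g in frames]) for f in frames]

def teacher_guided_loss(student, teacher, weight=1.0):
    """Hypothetical loss (an assumption, not the paper's formula):
    mean frame-wise cosine distance to the teacher features, plus a KL term
    that pulls the student's similarity distribution towards the teacher's."""
    n = len(student)
    feature_term = sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / n
    p = similarity_distribution(teacher)  # target distribution
    q = similarity_distribution(student)  # student distribution
    kl_term = sum(
        pi * math.log(pi / qi)
        for p_row, q_row in zip(p, q)
        for pi, qi in zip(p_row, q_row)
    ) / n
    return feature_term + weight * kl_term

# Toy frame-level features (made-up numbers, three frames of dimension two).
teacher_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_frames = [[1.0, 0.2], [0.1, 1.0], [0.5, 1.0]]
loss = teacher_guided_loss(student_frames, teacher_frames)
```

In this toy setup the loss is close to zero when the student reproduces the teacher's features exactly and grows as the two diverge. In a real system both feature sequences would come from neural networks and the loss would be minimized by gradient descent.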

Findings

1. Outperforms existing methods in objective and subjective evaluations.
2. Effective in zero-shot and lyric adaptation scenarios.
3. Maintains high audio quality without manual annotations.

Abstract

Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody, without relying on phoneme-level alignment. Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module that derives melody representations directly from reference audio. To ensure robust melody encoding, we employ a teacher model to guide the optimization of the melody extractor, alongside an implicit alignment mechanism that enforces similarity distribution constraints for improved melodic stability and coherence. Additionally, we refine duration…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Music Technology and Sound Studies