SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Jiale Qian; Hao Meng; Tian Zheng; Pengcheng Zhu; Haopeng Lin; Yuhang Dai; Hanke Xie; Wenxiao Cao; Ruixuan Shang; Jun Wu; Hongmei Liu; Hanlin Wen; Jian Zhao; Zhonglin Jiang; Yong Chen; Shunshun Yin; Ming Tao; Jianguo Wei; Lei Xie; Xinsheng Wang

arXiv:2602.07803 · eess.AS · March 17, 2026

Open Access · 3 Models

TL;DR

SoulX-Singer is a high-quality, open-source singing voice synthesis system that excels in zero-shot generalization, supports multiple languages and musical controls, and is accompanied by a dedicated benchmark for evaluation.

Contribution

The paper introduces SoulX-Singer, a novel SVS system with practical deployment features, multilingual support, and a new benchmark for zero-shot evaluation.

Findings

01. Achieves state-of-the-art synthesis quality across Mandarin Chinese, English, and Cantonese.
02. Supports controllable singing generation conditioned on symbolic musical scores (MIDI) or melodic representations.
03. Provides a dedicated benchmark for systematic zero-shot evaluation.

Abstract

While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in…
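The abstract names two conditioning modes: a symbolic score (MIDI) and a melodic representation. As a rough illustration of how the two relate, the sketch below expands a note sequence into a frame-level pitch (F0) contour using the standard equal-temperament mapping. It is not SoulX-Singer's input format or code: the `Note` class, the `score_to_f0` helper, the use of `None` for rests, and the 50 frames-per-second rate are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Note:
    """Hypothetical score event; not the system's actual input schema."""
    midi_pitch: Optional[int]  # MIDI note number (69 = A4); None marks a rest
    duration_s: float          # note length in seconds


def midi_to_hz(midi_pitch: int) -> float:
    """Equal-temperament mapping with A4 (MIDI note 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def score_to_f0(notes: List[Note], frame_rate_hz: float = 50.0) -> List[float]:
    """Expand notes into a flat per-frame F0 contour; 0.0 marks silence.

    A real sung melody would also carry vibrato and pitch glides, which a
    symbolic score does not encode.
    """
    contour: List[float] = []
    for note in notes:
        n_frames = round(note.duration_s * frame_rate_hz)
        f0 = 0.0 if note.midi_pitch is None else midi_to_hz(note.midi_pitch)
        contour.extend([f0] * n_frames)
    return contour


# A4 for 0.5 s, a 0.1 s rest, then A5 for 0.2 s.
score = [Note(69, 0.5), Note(None, 0.1), Note(81, 0.2)]
f0 = score_to_f0(score)
```

The score form is compact and easy to edit, while the contour form can carry expressive pitch detail extracted from a reference recording, which is one plausible reason a system would accept either.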

Taxonomy

Topics: Speech Recognition and Synthesis · Music Technology and Sound Studies · Music and Audio Processing