Adapting Speech Language Model to Singing Voice Synthesis

Yiwen Zhao; Jiatong Shi; Jinchuan Tian; Yuxun Tang; Jiarui Hai; Jionghao Han; Shinji Watanabe

arXiv:2512.14657·cs.SD·December 17, 2025

TL;DR

This paper demonstrates how a large pre-trained speech language model can be adapted to singing voice synthesis using a small synthetic dataset, achieving results competitive with specialized SVS models.

Contribution

The work introduces a four-stage recipe for adapting a 1.7B-parameter, TTS-pretrained speech language model to singing voice synthesis, using only 135 hours of synthetic singing data.

Findings

01

The adapted model generalizes well to singing voice synthesis.

02

Achieves performance comparable to specialized SVS models.

03

Uses only 135 hours of synthetic singing data (the ACE-Opencpop corpus).

Abstract

Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B-parameter, TTS-pretrained SLM for singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Building upon ESPNet-SpeechLM, our recipe involves the following procedure: (1) tokenization of music score conditions and singing waveforms, (2) multi-stream language model token prediction, (3) conditional flow matching-based mel-spectrogram generation, and (4) a mel-to-wave vocoder. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete…
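The four-stage procedure listed in the abstract can be pictured as a simple data flow. The following is a minimal sketch of that flow only, not the authors' implementation and not the ESPNet-SpeechLM API. Every name in it (`Note`, `tokenize_score`, `predict_token_streams`, `tokens_to_mel`, `vocode`, `synthesize`) and every constant (frame rate, number of streams, vocabulary size, mel bins, hop size) is a hypothetical placeholder, and each model is replaced by a stub that only produces outputs of a plausible shape.

```python
import random
from dataclasses import dataclass
from typing import List

# All constants below are illustrative assumptions, not values from the paper.
FRAME_RATE = 50    # token frames per second
N_STREAMS = 4      # parallel token streams predicted per frame
VOCAB_SIZE = 1024  # token vocabulary size per stream
N_MELS = 80        # mel-spectrogram bins
HOP = 256          # waveform samples produced per mel frame


@dataclass
class Note:
    lyric: str         # syllable sung on this note
    midi_pitch: int    # MIDI note number
    duration_s: float  # note length in seconds


def tokenize_score(notes: List[Note]) -> List[str]:
    """Stage 1 (inference side): flatten the music score into condition tokens.
    Tokenizing singing waveforms is only needed for training and is omitted."""
    tokens: List[str] = []
    for n in notes:
        frames = round(n.duration_s * FRAME_RATE)
        tokens += [f"<lyric:{n.lyric}>", f"<pitch:{n.midi_pitch}>", f"<dur:{frames}>"]
    return tokens


def predict_token_streams(cond: List[str], n_frames: int,
                          rng: random.Random) -> List[List[int]]:
    """Stage 2 stub: a real system would run the multi-stream language model
    conditioned on `cond`; here we draw random token ids of the right shape."""
    return [[rng.randrange(VOCAB_SIZE) for _ in range(N_STREAMS)]
            for _ in range(n_frames)]


def tokens_to_mel(frames: List[List[int]], rng: random.Random) -> List[List[float]]:
    """Stage 3 stub: stands in for the conditional flow matching generator,
    returning one N_MELS-dimensional vector per token frame."""
    return [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in frames]


def vocode(mel: List[List[float]]) -> List[float]:
    """Stage 4 stub: stands in for the mel-to-wave vocoder (outputs silence)."""
    return [0.0] * (len(mel) * HOP)


def synthesize(notes: List[Note], seed: int = 0) -> List[float]:
    """Chain the four stages: score -> tokens -> streams -> mel -> waveform."""
    rng = random.Random(seed)
    cond = tokenize_score(notes)
    n_frames = sum(round(n.duration_s * FRAME_RATE) for n in notes)
    streams = predict_token_streams(cond, n_frames, rng)
    mel = tokens_to_mel(streams, rng)
    return vocode(mel)


if __name__ == "__main__":
    score = [Note("ni", 60, 0.5), Note("hao", 62, 1.0)]
    wav = synthesize(score)
    print(len(tokenize_score(score)), "condition tokens,", len(wav), "samples")
```

The point of the sketch is the interface between stages: the language model sees only discrete condition tokens and emits discrete token frames, while the continuous mel-spectrogram and waveform are produced by separate downstream modules.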


Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Music and Audio Processing